AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:<20070126111613.34a43121@ripper.onstor.net>
CFG:
PT:0
S:andy.sharp@onstor.com
RQ:
SSV:onstor-exch02.onstor.net
NSV:
SSH:
R:<chris.vandever@onstor.com>,<mark.farabaugh@onstor.com>,<brian.stark@onstor.com>
MAID:1
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#imap/andys@onstor.net@onstor-exch02.onstor.net/INBOX	0	BB375AF679D4A34E9CA8DFA650E2B04E0138C429@onstor-exch02.onstor.net
X-Sylpheed-End-Special-Headers: 1
Date: Fri, 26 Jan 2007 11:16:55 -0800
From: Andrew Sharp <andy.sharp@onstor.com>
To: "Chris Vandever" <chris.vandever@onstor.com>
Cc: "Mark Farabaugh" <mark.farabaugh@onstor.com>, Brian Stark
 <brian.stark@onstor.com>
Subject: Re: CF Issue
Message-ID: <20070126111655.258ed186@ripper.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E0138C429@onstor-exch02.onstor.net>
References: <BB375AF679D4A34E9CA8DFA650E2B04E022FA0E3@onstor-exch02.onstor.net>
	<BB375AF679D4A34E9CA8DFA650E2B04E0138C429@onstor-exch02.onstor.net>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Hi Guys,

Sorry folks, I've been in and out sick this week.

We need to see the output on the console and/or the /var/log/messages
file, but the 2 seconds I've spent looking at the info in this email
thread suggest that the contents of the flash are incorrect/corrupted,
not that the hardware itself is failing.

There is a bug in the OpenBSD kernel where the kernel has problems
keeping the buffer cache in sync when using mmap and actively swapping
pages from the same process (including text pages of a shared
library).  This was causing problems with our upgrade because our
upgrade process was running the system out of memory, but that doesn't
sound like what's happening here.

I think Chris is on the right track: get one of these cards from the
manufacturer and do a system compare on it as Chris mentioned below.

Cheers,

a

On Thu, 25 Jan 2007 16:10:10 -0800 "Chris Vandever"
<chris.vandever@onstor.com> wrote:

> Andy, I'm not sure we need to know about the error handling for CF
> errors, so I think you're off the hook here, although any insights
> would still be appreciated.
> 
> Okay, I think I have the picture now, so correct me if I'm wrong.
> *	We DON'T think it's a problem with the flash failing because
> we can rewrite the flash with new images and it then works (the
> "re-programming" part you mentioned).
> *	We think the problem is persistent with the flash because it
> repeats after rebooting (the bit about our contract manufacturer
> seeing the problem, shipping us the flash, and then us seeing it
> here).
> *	You mentioned having "validated" the flash.  Does that mean
> you did a "system compare -s" between the flash and the release
> supposedly installed on it?  If not, what do you mean?
> 
> There was a similar problem that was fixed in R1.3.2.0 (change list
> #19519).  It was a transient problem: pm got bad data from BSD,
> erroneously thought timekeeper had died, and restarted it.  The
> restarted timekeeper failed, terminating pm, and pm killed all of its
> children except the original timekeeper.  But your data isn't
> consistent with that problem, and you're running code that should have
> the fix in it.
> 
> I'll need the following info:
> *	How did the contract manufacturer get the image on the flash?
> Did it ever work, or did it fail immediately?
> *	If you haven't already done a "system compare -s" against the
> appropriate R1.3.3.2 build I would suggest doing so.  It is entirely
> possible this is a known problem where the system upgrade process
> resulted in one or more files being corrupted.  I would guess this is
> most likely the problem, although you didn't mention anyone doing a
> system upgrade.  A few files are expected not to compare, so send me
> the output.
> *	If the compare looks good, then I'll need the elog (level
> info) from a fresh boot that exhibits the problem.  Please include
> the syslog as well just in case.
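
For anyone who wants to see the idea behind a compare, here is a rough
sketch in Python.  It is NOT how "system compare -s" is implemented (I
am not describing its internals here); it only illustrates comparing a
known-good tree against what is on the flash while skipping files that
are expected to differ.  The names tree_digests, compare_trees and
expected_diffs are made up for this example.

```python
import hashlib
import os


def tree_digests(root):
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in chunks so large images do not need to fit in memory.
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
            digests[os.path.relpath(path, root)] = h.hexdigest()
    return digests


def compare_trees(reference_root, flash_root, expected_diffs=()):
    """Report files that are missing, extra, or changed relative to the reference.

    expected_diffs is a hypothetical collection of relative paths that are
    allowed to differ (for example per-machine configuration).
    """
    ref = tree_digests(reference_root)
    cur = tree_digests(flash_root)
    report = {"missing": [], "extra": [], "changed": []}
    for rel in sorted(set(ref) | set(cur)):
        if rel in expected_diffs:
            continue
        if rel not in cur:
            report["missing"].append(rel)
        elif rel not in ref:
            report["extra"].append(rel)
        elif ref[rel] != cur[rel]:
            report["changed"].append(rel)
    return report
```

Anything that lands in "changed" and is not on the expected list is a
candidate for a corrupted file.
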
> 
> I sit on the last aisle between the Maui and Oahu conference rooms,
> closer to Oahu, across from Bill Nadzam if that helps.
> 
> ChrisV
> 
> _____________________________________________
> From: Mark Farabaugh 
> Sent: Thursday, January 25, 2007 3:32 PM
> To: Chris Vandever
> Cc: Andy Sharp
> Subject: RE: CF Issue
> 
> Chris, Andy,
> I'm relatively new here at ONStor so I'm not totally up on the
> Bobcat. Our contract manufacturer has had multiple failures with the
> symptom below.
> We have validated and re-programmed a flash with this symptom and it
> came out working fine.  We do not believe this to be a part issue.
> What could cause this symptom?  Could this be a corruption issue
> because of an improper power down?  1.3.3.2 issue?  I have this flash
> if you want to try it.
> Also, please introduce yourselves, it's better to put faces with
> names.
> Regards, Mark
> 
> _____________________________________________
> From: Chris Vandever 
> Sent: Thursday, January 25, 2007 11:37 AM
> To: Mark Farabaugh
> Cc: Andy Sharp
> Subject: FW: CF Issue
> 
> When the system boots it starts all of the OS processes for BSD and
> then starts a program called pm (process manager).  pm will start all
> of the other ONStor processes based on the contents of the
> /usr/local/agile/etc/pmtab file.  This file tells it what processes to
> start and in what order.  When pm starts a process it waits for a
> signal from that process to indicate it has finished its
> initialization.  Once it gets that signal pm will then start the next
> program in the list.
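
The start-up sequencing described above can be sketched roughly as
follows.  This is an illustration only, not pm's actual code: the real
pm reads pmtab (whose format is not shown in this thread) and waits for
a signal from each process, whereas this sketch assumes each child
prints a line "ready" on stdout when it has finished initializing.  The
function name start_in_order and the "ready" convention are made up for
the example, and select() on a pipe assumes a Unix-like system.

```python
import select
import subprocess


def start_in_order(commands, timeout=5.0):
    """Start each (name, argv) in turn; wait for readiness before the next one.

    Assumption for this sketch: a child reports readiness by writing the
    line "ready" to its stdout.
    """
    started = []
    for name, argv in commands:
        proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
        # Wait until the child writes something (or closes stdout), or time out.
        readable, _, _ = select.select([proc.stdout], [], [], timeout)
        line = proc.stdout.readline() if readable else b""
        if line.strip() != b"ready":
            proc.kill()
            proc.wait()
            raise RuntimeError(f"{name} did not signal readiness")
        started.append((name, proc))
    return started
```

The point is that a process that never reports in (such as one stuck
in disk wait) holds up everything listed after it, which matches only
the first few processes showing up in the ps output.
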
> 
> In your case pm has started the following programs:
> 
> 		PID   TT  STAT    TIME    COMMAND
> 		30799 ??  S       0:02.26 /usr/local/agile/bin/elog
> 		19755 ??  I       0:00.02 /usr/local/agile/bin/registryMgr
> 		16903 ??  D       0:00.05 (ncmd)
> 
> A state of 'D' indicates that ncmd is in a disk I/O wait state.  I
> would have expected the disk request to time out, fail, and cause
> ncmd to die, resulting in pm trying to restart it.  This should be an
> infinite loop unless we are eventually able to read what we need.  I
> would also expect the CF error would be logged to the syslog (not
> elog) in /var/log/messages*, although we may not be able to write the
> log.  If this isn't happening, then we may need to get Andy Sharp
> involved to find out what BSD and the CF driver are doing.  However,
> I'm not sure what you're expecting to try to do.  If the CF is
> failing there's not really anything we can do about it that I know
> of, although Andy would have more insight.  You could try copying the
> files in /usr/local/agile/bin and /usr/local/agile/lib to try to
> identify what files are on failing parts of the flash, but again, I'm
> not sure what purpose it would serve.
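
If someone does want to try the "read every file" approach, a minimal
sketch might look like the one below.  It is only an illustration: the
function name find_unreadable is made up, and it assumes a failing read
comes back as an error.  If a read hangs in disk wait the way ncmd
appears to, this script would hang at the same file, which is itself a
useful data point.

```python
import os


def find_unreadable(roots, chunk_size=65536):
    """Read every file under the given directories and collect read failures."""
    failures = []
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "rb") as f:
                        # Read the whole file so every block is touched.
                        while f.read(chunk_size):
                            pass
                except OSError as exc:
                    failures.append((path, exc.strerror))
    return failures
```

Usage would be something like
find_unreadable(["/usr/local/agile/bin", "/usr/local/agile/lib"]),
which returns a list of (path, error message) pairs.
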
> 
> ChrisV
> 
> _____________________________________________
> From: Mark Farabaugh 
> Sent: Thursday, January 25, 2007 9:03 AM
> To: Chris Vandever
> Cc: Brian Stark
> Subject: FW: CF Issue
> 
> Chris,
> Brian Stark suggested that I follow up with you on an issue.  Our
> Contract Manufacturer is seeing several compact flash failures where
> system commands will not execute (see below).  Brian asked that I run
> a ps -ax to look at what processes are running.  I have a failing
> flash available.  The output of the ps -ax is attached.
> 
> Regards, Mark
> 
> abcd diag> system show chassis
> Timed out waiting for response
> % Command failure.
> abcd diag> system show version
> Timed out waiting for response
> % Command failure.
> abcd diag> system show temperature
> Timed out waiting for response
> % Command failure.
> abcd diag>
> abcd diag>
> abcd diag> ps -ax
> % Unknown command/option.
> abcd diag> system reboot
> Are you sure ? [y|n] : y
> system reboot
> Are you sure ? [y|n] : y
> nfxsh_send: Unable to open rmc session to eventd_rmc, error -20.
> 
>  << File: abcd diag.doc >> 
> 
> 
> 
> 
> _____________________________________________
> From: Brian Stark 
> Sent: Wednesday, January 24, 2007 2:27 PM
> To: Mark Farabaugh
> Subject: RE: CF Issue
> 
> Mark,
> 
> The system commands are timing out because chassisd is not running.
> At this point, you'll need to get someone in the software group
> involved to help understand why this is happening.  I think that
> Chris Vandever would be the person to first talk with since she's
> helped Abdallah with similar issues in the past.
> 
> 
> Brian
> 
> 
> 	_____________________________________________ 
> 	From: 	Mark Farabaugh  
> 	Sent:	Wednesday, January 24, 2007 1:09 PM
> 	To:	Brian Stark
> 	Subject:	CF Issue
> 
> 	Brian,
> 	I received another CF where system commands time out.  I
> 	believe you asked me to run a ps -ax when I saw this.  Attached
> 	are the outputs.  Let me know what you want me to do.
> 
> 	Regards, Mark
> 
> 	 << File: abcd diag.doc >> 
