X-MimeOLE: Produced By Microsoft Exchange V6.5
Received: by onstor-exch02.onstor.net 
	id <01C74181.ECA54F33@onstor-exch02.onstor.net>; Fri, 26 Jan 2007 11:41:08 -0800
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Content-class: urn:content-classes:message
Subject: RE: CF Issue
Date: Fri, 26 Jan 2007 11:41:07 -0800
Message-ID: <BB375AF679D4A34E9CA8DFA650E2B04E0239213A@onstor-exch02.onstor.net>
In-Reply-To: <20070126111655.258ed186@ripper.onstor.net>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: CF Issue
Thread-Index: AcdBfor7hCCMN2i7SmyGplvoUnNIhQAAHyFw
From: "Brian Stark" <brian.stark@onstor.com>
To: "Andy Sharp" <andy.sharp@onstor.com>,
	"Chris Vandever" <chris.vandever@onstor.com>
Cc: "Mark Farabaugh" <mark.farabaugh@onstor.com>

Andy and Chris,

Thanks a lot for your feedback.  As to how the released sw images are
programmed onto a CF, we have an external programming house that we give
a gold master CF to, and they then create copies with a programmer.
Mark has also done system compares on some flashes, and the only diffs
have been in the expected files.  I'm not sure he's done a system
compare on this particular CF, though, so I'll let him chime in here.
Finally, none of these suspect flashes had a system upgrade run against
them -- they were created on the external programming equipment, with
maybe a few created by 'system copy all'.
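
As a rough illustration of the kind of check discussed here (compare
every file on a flash against a known-good release and ignore the files
that are expected to differ), here is a minimal Python sketch.  It is
not how 'system compare' is implemented; the function names, the use of
SHA-256, and the idea of passing in an expected-differences list are
all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so large images do not have to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_trees(reference: Path, candidate: Path, expected_diffs=frozenset()):
    """Compare each file under `reference` with the same path under `candidate`.

    `expected_diffs` holds relative paths (hypothetical, e.g. logs or
    per-unit config) that are allowed to differ and are skipped.
    Returns a list of (relative_path, reason) for unexpected differences.
    """
    problems = []
    for ref_file in sorted(p for p in reference.rglob("*") if p.is_file()):
        rel = ref_file.relative_to(reference).as_posix()
        if rel in expected_diffs:
            continue
        cand_file = candidate / rel
        if not cand_file.is_file():
            problems.append((rel, "missing"))
        elif file_digest(ref_file) != file_digest(cand_file):
            problems.append((rel, "content differs"))
    return problems
```

Run against a mounted copy of the gold master and a mounted suspect
flash, an empty result would mean the only differences are in the
expected files; anything else points at the specific files to look at.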

I'll throw a bit more info out here as food for thought.  It seems that
the problems that ACT has seen recently may have only been on the RoHS
Bobcats with the TI1520 CF controller.  We're not yet clear if the
problems were on the first power-up, but on almost all of the
flashes, the filesystem is dirty and has to be cleaned.  This would
suggest that the flashes have been previously used and weren't properly
shut down, which has been a persistent problem at ACT.

Mark is now running tests on non-RoHS Bobcats with the old Intel 6729 CF
controller as well as the RoHS Bobcats to see if we can narrow it down
to something on the motherboard.  We're not sure there's a dependency on
the motherboard (and then presumably the CF controller), but ACT has
seen so many failures that it's making us a bit nervous.  We're also
trying to determine if the problems are isolated to a particular CF
vendor.


Brian



> -----Original Message-----
> From: Andrew Sharp [mailto:andy.sharp@onstor.com]=20
> Sent: Friday, January 26, 2007 11:17 AM
> To: Chris Vandever
> Cc: Mark Farabaugh; Brian Stark
> Subject: Re: CF Issue
>=20
> Hi Guys,
>=20
> Sorry folksez, I've been in and out sick this week.
>=20
> We need to see the output on the console and/or the=20
> /var/log/messages file, but the 2 seconds I've spent looking=20
> at the info in this email thread suggests that what's on the=20
> flash is incorrect/corrupted, not the hardware itself.
>=20
> There is a bug in the OpenBSD kernel where the kernel has=20
> problems keeping the buffer cache in sync when using mmap and=20
> actively swapping pages from the same process (including text=20
> pages of a shared library).  This was causing problems with=20
> our upgrade because our upgrade process was running the=20
> system out of memory, but it doesn't sound like what's happening here.
>=20
> I think Chris is on the right track: get one of these cards=20
> from the manufacturer and do a system compare on it as Chris=20
> mentioned below.
>=20
> Cheers,
>=20
> a
>=20
> On Thu, 25 Jan 2007 16:10:10 -0800 "Chris Vandever"
> <chris.vandever@onstor.com> wrote:
>=20
> > Andy, I'm not sure we need to know about the error handling for CF=20
> > errors, so I think you're off the hook here, although any insights=20
> > would still be appreciated.
> >=20
> > Okay, I think I have the picture now, so correct me if I'm wrong.
> > *	We DON'T think it's a problem with the flash failing because
> > we can rewrite the flash with new images and it then works (the=20
> > "re-programming" part you mentioned).
> > *	We think the problem is persistent with the flash because it
> > repeats after rebooting (the bit about our contract manufacturer=20
> > seeing the problem, shipping us the flash, and then us seeing it=20
> > here).
> > *	You mentioned having "validated" the flash.  Does that mean
> > you did a "system compare -s" between the flash and the release=20
> > supposedly installed on it?  If not, what do you mean?
> >=20
> > There was a similar transient problem, fixed in R1.3.2.0 (change list
> > #19519), where pm got bad data from BSD and erroneously thought=20
> > timekeeper had died and restarted it.  The restarted timekeeper=20
> > failed, terminating pm, and pm killed all of its children except=20
> > the original timekeeper.  But your data isn't consistent with that=20
> > problem, and you're running code that should have the fix in it.
> >=20
> > I'll need the following info:
> > *	How did the contract manufacturer get the image on the flash?
> > Did it ever work, or did it fail immediately?
> > *	If you haven't already done a "system compare -s" against the
> > appropriate R1.3.3.2 build I would suggest doing so.  It is entirely=20
> > possible this is a known problem where the system upgrade process=20
> > resulted in one or more files being corrupted.  I would guess this is=20
> > most likely the problem, although you didn't mention anyone doing a=20
> > system upgrade.  A few files are expected not to compare, so send me=20
> > the output.
> > *	If the compare looks good, then I'll need the elog (level
> > info) from a fresh boot that exhibits the problem.  Please include the=20
> > syslog as well just in case.
> >=20
> > I sit on the last aisle between the Maui and Oahu conference rooms,=20
> > closer to Oahu, across from Bill Nadzam if that helps.
> >=20
> > ChrisV
> >=20
> > _____________________________________________
> > From: Mark Farabaugh
> > Sent: Thursday, January 25, 2007 3:32 PM
> > To: Chris Vandever
> > Cc: Andy Sharp
> > Subject: RE: CF Issue
> >=20
> > Chris, Andy,
> > I'm relatively new here at ONStor so I'm not totally up on the=20
> > Bobcat. Our contract manufacturer has had multiple failures with the=20
> > symptom below.
> > We have validated and re-programmed a flash with this symptom and it=20
> > came out working fine.  We do not believe this to be a part issue.
> > What could cause this symptom?  Could this be a corruption issue=20
> > because of an improper power down?  1.3.3.2 issue?  I have this flash=20
> > if you want to try it.
> > Also, please introduce yourselves, it's better to put faces with=20
> > names.
> > Regards, Mark
> >=20
> > _____________________________________________
> > From: Chris Vandever
> > Sent: Thursday, January 25, 2007 11:37 AM
> > To: Mark Farabaugh
> > Cc: Andy Sharp
> > Subject: FW: CF Issue
> >=20
> > When the system boots it starts all of the OS processes for BSD and=20
> > then starts a program called pm (process manager).  pm will start all=20
> > of the other ONStor processes based on the contents of the=20
> > /usr/local/agile/etc/pmtab file.  This file tells it what processes to=20
> > start and in what order.  When pm starts a process it waits for a=20
> > signal from that process to indicate it has finished its=20
> > initialization.  Once it gets that signal pm will then start the next=20
> > program in the list.
> >=20
> > In your case pm has started the following programs:
> >=20
> > 		PID   TT  STAT    TIME    COMMAND
> > 		30799 ??  S       0:02.26 /usr/local/agile/bin/elog
> > 		19755 ??  I       0:00.02 /usr/local/agile/bin/registryMgr
> > 		16903 ??  D       0:00.05 (ncmd)
> >=20
> > A state of 'D' indicates that ncmd is in a disk I/O wait state.  I=20
> > would have expected the disk request to time out, fail, and cause ncmd=20
> > to die, resulting in pm trying to restart it.  This should be an=20
> > infinite loop unless we are eventually able to read what we need.  I=20
> > would also expect the CF error would be logged to the syslog (not
> > elog) in /var/log/messages*, although we may not be able to write the=20
> > log.  If this isn't happening, then we may need to get Andy Sharp=20
> > involved to find out what BSD and the CF driver are doing.  However,=20
> > I'm not sure what you're expecting to try to do.  If the CF is failing=20
> > there's not really anything we can do about it that I know of,=20
> > although Andy would have more insight.  You could try copying the=20
> > files in /usr/local/agile/bin and /usr/local/agile/lib to try to=20
> > identify what files are on failing parts of the flash, but again, I'm=20
> > not sure what purpose it would serve.
> >=20
> > ChrisV
> >=20
> > _____________________________________________
> > From: Mark Farabaugh
> > Sent: Thursday, January 25, 2007 9:03 AM
> > To: Chris Vandever
> > Cc: Brian Stark
> > Subject: FW: CF Issue
> >=20
> > Chris,
> > Brian Stark suggested that I follow up with you on an issue.  Our=20
> > Contract Manufacturer is seeing several compact flash failures where=20
> > system commands will not execute (see below).  Brian asked that I run=20
> > a ps -ax to look at what processes are running.  I have a failing=20
> > flash available.  The output of the ps -ax is attached.
> >=20
> > Regards, Mark
> >=20
> > abcd diag> system show chassis
> > Timed out waiting for response
> > % Command failure.
> > abcd diag> system show version
> > Timed out waiting for response
> > % Command failure.
> > abcd diag> system show temperature
> > Timed out waiting for response
> > % Command failure.
> > abcd diag>
> > abcd diag>
> > abcd diag> ps -ax
> > % Unknown command/option.
> > abcd diag> system reboot
> > Are you sure ? [y|n] : y
> > system reboot
> > Are you sure ? [y|n] : y
> > nfxsh_send: Unable to open rmc session to eventd_rmc, error -20.
> >=20
> >  << File: abcd diag.doc >>
> >=20
> >=20
> >=20
> >=20
> > _____________________________________________
> > From: Brian Stark
> > Sent: Wednesday, January 24, 2007 2:27 PM
> > To: Mark Farabaugh
> > Subject: RE: CF Issue
> >=20
> > Mark,
> >=20
> > The system commands are timing out because chassisd is not running.
> > At this point, you'll need to get someone in the software group=20
> > involved to help understand why this is happening.  I think that Chris=20
> > Vandever would be the person to first talk with since she's helped=20
> > Abdallah with similar issues in the past.
> >=20
> >=20
> > Brian
> >=20
> >=20
> > 	_____________________________________________=20
> > 	From: 	Mark Farabaugh =20
> > 	Sent:	Wednesday, January 24, 2007 1:09 PM
> > 	To:	Brian Stark
> > 	Subject:	CF Issue
> >=20
> > 	Brian,
> > 	I received another CF where system commands time out.  I believe you=20
> > 	asked to run a ps -ax when I saw this.  Attached are the outputs.=20
> > 	Let me know what you want me to do.
> >=20
> > 	Regards, Mark
> >=20
> > 	 << File: abcd diag.doc >>
>=20
