AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:
CFG:
PT:0
S:andy.sharp@lsi.com
RQ:
SSV:mhbs.lsil.com
NSV:
SSH:
R:<Brian.Stark@lsi.com>,<abdallah.harb@lsi.com>
MAID:2
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#imap/LSI/INBOX	0	E1EC65251D4B3D46BBC0AAA3C0629222A91A6968@cosmail02.lsi.com
X-Sylpheed-End-Special-Headers: 1
Date: Thu, 1 Oct 2009 11:03:11 -0700
From: Andrew Sharp <andy.sharp@lsi.com>
To: "Stark, Brian" <Brian.Stark@lsi.com>
Cc: Abdallah Harb <abdallah.harb@lsi.com>
Subject: Re: Cougar Qual Build
Message-ID: <20091001110311.637b047e@ripper.onstor.net>
In-Reply-To: <E1EC65251D4B3D46BBC0AAA3C0629222A91A6968@cosmail02.lsi.com>
References: <7E274F61DBC8AB44998A0DF6841FCFBA0138A20DCD@cosmail03.lsi.com>
	<7E274F61DBC8AB44998A0DF6841FCFBA0138A20DD3@cosmail03.lsi.com>
	<7E274F61DBC8AB44998A0DF6841FCFBA0138A20DD5@cosmail03.lsi.com>
	<D248F4BF20F45C438875F08125B9CC500102F4C162@cosmail03.lsi.com>
	<7E274F61DBC8AB44998A0DF6841FCFBA0138A20DD6@cosmail03.lsi.com>
	<7E274F61DBC8AB44998A0DF6841FCFBA0138A20DD9@cosmail03.lsi.com>
	<7E274F61DBC8AB44998A0DF6841FCFBA0138A20DDA@cosmail03.lsi.com>
	<7E274F61DBC8AB44998A0DF6841FCFBA0138A20DDC@cosmail03.lsi.com>
	<7E274F61DBC8AB44998A0DF6841FCFBA0138A20DDD@cosmail03.lsi.com>
	<7E274F61DBC8AB44998A0DF6841FCFBA019A328132@cosmail03.lsi.com>
	<E1EC65251D4B3D46BBC0AAA3C0629222A9133885@cosmail02.lsi.com>
	<20090930142356.0acbe33c@ripper.onstor.net>
	<E1EC65251D4B3D46BBC0AAA3C0629222A91A6931@cosmail02.lsi.com>
	<20090930150436.4e26d5ec@ripper.onstor.net>
	<E1EC65251D4B3D46BBC0AAA3C0629222A91A6968@cosmail02.lsi.com>
Organization: LSI
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

On Wed, 30 Sep 2009 16:21:35 -0600 "Stark, Brian" <Brian.Stark@lsi.com>
wrote:

> Since Abdallah has a system that's doing this now, can you head on
> down to the lab and try to collect the info that's needed?

A's system has nothing going on that would explain the situation, hence
I'm investigating the possibility of a bug in the driver or chassisd.


> -----Original Message-----
> From: Andrew Sharp [mailto:andy.sharp@lsi.com] 
> Sent: Wednesday, September 30, 2009 3:05 PM
> To: Stark, Brian
> Cc: Harb, Abdallah; Fong, Rendell
> Subject: Re: Cougar Qual Build
> 
> On Wed, 30 Sep 2009 15:45:17 -0600 "Stark, Brian"
> <Brian.Stark@lsi.com> wrote:
> 
> > Andy,
> > 
> > Why is it happening on only some systems?
> > 
> > 
> > Brian
> 
> No clue what the actual problem is, hence I couldn't tell you.  The
> message is triggered by one or more of our daemons gobbling up all
> memory and trying to get more while running in an infinite loop.
> OpenBSD would just crash, whereas Linux soldiers on for a while trying
> to keep things going in case the culprits recant their evil ways.
> 
> The way to track this down is to figure out what process is out of
> control by running top when it is happening.  Then trigger a core of
> that process.  I've tried to communicate this several times to
> Joachim, but often he's in a situation where he feels doing something
> like that would not be the best PR move.
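
For whoever is at the console when it happens, here is a minimal sketch
of the two steps above (find the runaway process, then trigger a core).
It assumes a Linux /proc filesystem and a Python interpreter on the box,
neither of which is confirmed anywhere in this thread. The function
names are made up for illustration, and picking the process with the
largest resident set is only a stand-in for eyeballing top. A core file
will only appear if the core size limit (ulimit -c) is non-zero and the
daemon does not catch SIGABRT.

```python
import os
import signal


def parse_status(text):
    """Return (name, rss_kb) parsed from the text of /proc/<pid>/status."""
    name, rss_kb = "?", 0
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key == "Name":
            name = value.strip()
        elif key == "VmRSS":
            # Value looks like "   20480 kB"; kernel threads have no VmRSS line.
            rss_kb = int(value.split()[0])
    return name, rss_kb


def biggest_process(proc_root="/proc"):
    """Return (rss_kb, pid, name) for the largest-RSS process, or None."""
    if not os.path.isdir(proc_root):
        return None
    best = None
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                name, rss_kb = parse_status(f.read())
        except OSError:
            continue  # process exited (or is unreadable) mid-scan
        if best is None or rss_kb > best[0]:
            best = (rss_kb, int(entry), name)
    return best


def trigger_core(pid):
    """Send SIGABRT, whose default action is to terminate and dump core."""
    os.kill(pid, signal.SIGABRT)


if __name__ == "__main__":
    found = biggest_process()
    if found:
        print("largest RSS: pid %d (%s), %d kB" % (found[1], found[2], found[0]))
        # Only after confirming this really is the runaway daemon:
        # trigger_core(found[1])
```

The kill is left commented out on purpose: the script only reports its
candidate, and a person decides whether to take the core.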
> 
> 
> > -----Original Message-----
> > From: Andrew Sharp [mailto:andy.sharp@lsi.com] 
> > Sent: Wednesday, September 30, 2009 2:24 PM
> > To: Stark, Brian
> > Cc: Harb, Abdallah; Fong, Rendell
> > Subject: Re: Cougar Qual Build
> > 
> > It's a software issue, not a hardware issue.
> > 
> > On Wed, 30 Sep 2009 12:02:35 -0600 "Stark, Brian"
> > <Brian.Stark@lsi.com> wrote:
> > 
> > > Andy and Rendell,
> > > 
> > > We've seen reports from the field about SiByte watchdog messages,
> > > and it looks like Abdallah has a system that's doing the same
> > > thing in the lab.  See the email thread below.
> > > 
> > > Please take a look at this system with him to see if we can
> > > determine why the messages are happening.  Given that we're
> > > launching a build of 100, this is high priority.
> > > 
> > > 
> > > Thanks,
> > > Brian
> > > 
> > > 
> > > -----Original Message-----
> > > From: Harb, Abdallah 
> > > Sent: Wednesday, September 30, 2009 10:58 AM
> > > To: Stark, Brian
> > > Subject: RE: Cougar Qual Build 
> > > 
> > > Brian,
> > > 
> > > > Do you have any suggestions regarding the disposition of this
> > > > board that still shows the SiByte watchdog messages? "SiByte
> > > Watchdog in danger of initiating system reset in 0.9 seconds"
> > > 
> > > It seems that this message appears on certain boards but not all
> > > > the boards. Currently, I have two boards running in a 2U chassis:
> > > > one board is continuously showing this message, whereas the 2nd
> > > > board has been running ddtest with no issues for over a week. For
> > > > the time being, I can keep it running.
> > > 
> > > Thanks,
> > > Abdallah
> > > 
> > > ________________________________________
> > > From: Harb, Abdallah
> > > Sent: Sunday, September 20, 2009 5:28 PM
> > > > To: Harb, Abdallah; Negus, Paul; Stark, Brian; Rosario, Eugene;
> > > > Fuller, Joe
> > > > Cc: Quilici, Lauren; Clary, Sue
> > > Subject: RE: Cougar Qual Build
> > > 
> > > Sunday's Qual Update:
> > > 
> > > 3rd board has successfully completed the 100 hrs of ddtest inside
> > > the ESS chamber. I already stopped the ESS chamber this afternoon.
> > > > Now, both the 3rd & 4th boards can be moved to finished goods.
> > > 
> > > 5th board still requires further verification to understand the
> > > "SiByte Watchdog" messages. I will continue to work on it this
> > > Monday.
> > > 
> > > Have a good weekend!
> > > Abdallah
> > > 
> > > 
> > > 
> > > 
> > > 
> > > ________________________________________
> > > From: Harb, Abdallah
> > > Sent: Saturday, September 19, 2009 4:26 PM
> > > > To: Harb, Abdallah; Negus, Paul; Stark, Brian; Rosario, Eugene;
> > > > Fuller, Joe
> > > > Cc: Quilici, Lauren; Clary, Sue
> > > Subject: RE: Cougar Qual Build
> > > 
> > > Saturday's Qual Update:
> > > 
> > > > 3rd & 4th boards are still running ddtest in the ESS chamber with
> > > no issues. The test will continue to run until tomorrow.
> > > 
> > > Have a nice day!
> > > Abdallah
> > > 
> > > 
> > > ________________________________________
> > > From: Harb, Abdallah
> > > Sent: Friday, September 18, 2009 10:21 AM
> > > > To: Harb, Abdallah; Negus, Paul; Stark, Brian; Rosario, Eugene;
> > > > Fuller, Joe
> > > > Cc: Quilici, Lauren; Clary, Sue
> > > Subject: RE: Cougar Qual Build
> > > 
> > > Friday's Qual Update:
> > > 
> > > 3rd board is still running ddtest in the ESS chamber with no
> > > > issues. As of 10:00 am this morning, 63 hrs have elapsed. In
> > > > order to complete the 100 hrs as planned, the unit needs to run
> > > > until 3:00 am on Sunday. I propose to keep the chamber running
> > > > throughout the weekend. I can come in on Saturday and Sunday to
> > > > check on the status, and then end the test on Sunday morning.
> > > 
> > > Paul,
> > > > Please advise whether you're OK with this suggestion.
> > > > Otherwise, I can stop the chamber this evening, keep the test
> > > > running, and then re-run the chamber on Monday for an additional
> > > > 30 hrs.
> > > 
> > > 
> > > > 4th board is still running ddtest in the chamber with no issues.
> > > The board is considered "passed" by now. However, I'll keep it
> > > running in the chassis until I have another board that can replace
> > > it.
> > > 
> > > Due to Wayne's visit yesterday, I did not get a chance to work on
> > > the issue that was reported on the 5th board. I will work on it
> > > today.
> > > 
> > > Regards,
> > > Abdallah
> > > 
> > > 
> > > 
> > > ________________________________________
> > > From: Harb, Abdallah
> > > Sent: Thursday, September 17, 2009 10:25 AM
> > > > To: Harb, Abdallah; Negus, Paul; Stark, Brian; Rosario, Eugene;
> > > > Fuller, Joe
> > > > Cc: Quilici, Lauren; Clary, Sue
> > > Subject: RE: Cougar Qual Build
> > > 
> > > Thursday's Qual Update:
> > > 
> > > 3rd board is still running ddtest in the ESS chamber with no
> > > issues. It has passed the first 24hrs of its 100 hrs testing.
> > > It will continue running as planned.
> > > 
> > > 4th board passed ddtest in the ESS chamber overnight.
> > > I'll keep it running until the end of the day.
> > > 
> > > 5th board is showing the following watchdog messages and then
> > > rebooting during sysdvt at room temp. "SiByte Watchdog in danger
> > > of initiating system reset in 1.0 seconds" I'm suspecting an
> > > issue with the chassis. I'll try to work on it today.
> > > 
> > > P.S. Wayne from LSI will be here today.
> > > 
> > > Thanks,
> > > Abdallah
> > > 
> > > ________________________________________
> > > From: Harb, Abdallah
> > > Sent: Wednesday, September 16, 2009 11:33 AM
> > > To: Negus, Paul; Stark, Brian; Rosario, Eugene; Fuller, Joe
> > > Cc: Quilici, Lauren; Clary, Sue
> > > Subject: RE: Cougar Qual Build
> > > 
> > > Paul,
> > > 
> > > Unfortunately, the DIMMs don't appear to be loose.
> > > Initially, I suspected a DIMM failure, but in the process of
> > > eliminating the failure, the same modules worked OK. Other than a
> > > > lack of pin contact, I can't think of a possible root cause of
> > > > the failure. I guess we'll see how these 2 boards finish the
> > > > remaining test steps. We definitely need to keep an eye on
> > > > future builds for any similar failure that might point towards
> > > > either a process issue or unreliable connectors.
> > > 
> > > Regards,
> > > Abdallah
> > > 
> > > 
> > > ________________________________________
> > > From: Negus, Paul
> > > Sent: Wednesday, September 16, 2009 10:33 AM
> > > To: Harb, Abdallah; Stark, Brian; Rosario, Eugene; Fuller, Joe
> > > Cc: Quilici, Lauren; Clary, Sue
> > > Subject: RE: Cougar Qual Build
> > > 
> > > Thanks Abdallah.  This is much better news.
> > > 
> > > > Have we got any theories on why the DIMMs appear to be getting
> > > loose?
> > > 
> > > Paul
> > > 
> > > -----Original Message-----
> > > From: Harb, Abdallah
> > > Sent: Wednesday, September 16, 2009 10:28 AM
> > > To: Harb, Abdallah; Stark, Brian; Negus, Paul; Rosario, Eugene
> > > Subject: RE: Cougar Qual Build
> > > 
> > > Wednesday's Qual Test Update:
> > > 
> > > Last night, the 3rd board started ddtest in the ESS chamber for
> > > 100 hrs. It should run continuously until Saturday night. As of
> > > this morning the board is running OK.
> > > 
> > > 4th & 5th boards have passed functional after re-seating the DIMM
> > > modules.
> > > 
> > > The 4th board passed sysdvt under temp cycling last night.
> > > > Today, it will run ddtest. If it passes, then it can be moved to
> > > FG. Finally, the 5th board will be placed in the same chassis and
> > > run the same tests.
> > > 
> > > Regards,
> > > Abdallah
> > > 
> > > 
> > > 
> > > ________________________________________
> > > From: Harb, Abdallah
> > > Sent: Tuesday, September 15, 2009 9:44 AM
> > > To: Harb, Abdallah; Stark, Brian; Negus, Paul; Rosario, Eugene
> > > Subject: RE: Cougar Qual Build
> > > 
> > > Here's Tuesday's Qual Test Update:
> > > 
> > > > 1- The first board that passed ddtest under temp was given to
> > > >    Brian for engineering review yesterday.
> > > > 2- Second board passed ddtest under temp overnight. Will be used
> > > >    for the cross sectional measurement.
> > > > 3- Third board passed sysdvt under temp overnight. This morning
> > > >    will start ddtest and keep it running for 100 hrs under
> > > >    temperature cycling.
> > > > 4- The 4th & 5th board are still under functional debug.
> > > 
> > > Regards,
> > > Abdallah
> > > 
> > > ________________________________________
> > > From: Harb, Abdallah
> > > Sent: Monday, September 14, 2009 9:42 AM
> > > To: Stark, Brian; Negus, Paul; Rosario, Eugene
> > > Subject: Cougar Qual Build
> > > 
> > > > Here's an update on the qualification testing:
> > > 
> > > The Cougar board that was allocated to engineering for the
> > > qualification testing has passed ddtest at room temp over the
> > > weekend. The board is ready for Brian in the HW lab.
> > > 
> > > Last Friday, I was not able to restart the ddtest on the failing
> > > board (the board allocated for the cross section). Today, I'll
> > > reinstall the OS on the clients and possibly reconfigure the
> > > storage, and then start the test once again.
> > > 
> > > I will also assemble the remaining three Cougar boards, and start
> > > the functional and system test.
> > > 
> > > Regards,
> > > Abdallah
