AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:
CFG:
PT:0
S:andy.sharp@lsi.com
RQ:
SSV:mhbs.lsil.com
NSV:
SSH:
R:<Brian.Stark@lsi.com>,<Abdallah.Harb@lsi.com>,<Rendell.Fong@lsi.com>
MAID:2
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#imap/LSI/INBOX	0	E1EC65251D4B3D46BBC0AAA3C0629222A91A6931@cosmail02.lsi.com
X-Sylpheed-End-Special-Headers: 1
Date: Wed, 30 Sep 2009 15:04:36 -0700
From: Andrew Sharp <andy.sharp@lsi.com>
To: "Stark, Brian" <Brian.Stark@lsi.com>
Cc: "Harb, Abdallah" <Abdallah.Harb@lsi.com>, "Fong, Rendell"
 <Rendell.Fong@lsi.com>
Subject: Re: Cougar Qual Build
Message-ID: <20090930150436.4e26d5ec@ripper.onstor.net>
In-Reply-To: <E1EC65251D4B3D46BBC0AAA3C0629222A91A6931@cosmail02.lsi.com>
References: <7E274F61DBC8AB44998A0DF6841FCFBA0138A20DCD@cosmail03.lsi.com>
	<7E274F61DBC8AB44998A0DF6841FCFBA0138A20DD3@cosmail03.lsi.com>
	<7E274F61DBC8AB44998A0DF6841FCFBA0138A20DD5@cosmail03.lsi.com>
	<D248F4BF20F45C438875F08125B9CC500102F4C162@cosmail03.lsi.com>
	<7E274F61DBC8AB44998A0DF6841FCFBA0138A20DD6@cosmail03.lsi.com>
	<7E274F61DBC8AB44998A0DF6841FCFBA0138A20DD9@cosmail03.lsi.com>
	<7E274F61DBC8AB44998A0DF6841FCFBA0138A20DDA@cosmail03.lsi.com>
	<7E274F61DBC8AB44998A0DF6841FCFBA0138A20DDC@cosmail03.lsi.com>
	<7E274F61DBC8AB44998A0DF6841FCFBA0138A20DDD@cosmail03.lsi.com>
	<7E274F61DBC8AB44998A0DF6841FCFBA019A328132@cosmail03.lsi.com>
	<E1EC65251D4B3D46BBC0AAA3C0629222A9133885@cosmail02.lsi.com>
	<20090930142356.0acbe33c@ripper.onstor.net>
	<E1EC65251D4B3D46BBC0AAA3C0629222A91A6931@cosmail02.lsi.com>
Organization: LSI
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

On Wed, 30 Sep 2009 15:45:17 -0600 "Stark, Brian" <Brian.Stark@lsi.com>
wrote:

> Andy,
> 
> Why is it happening on only some systems?
> 
> 
> Brian

I don't yet know what the actual problem is, so I can't tell you.  The
message is triggered when one or more of our daemons gobbles up all
memory and keeps trying to get more while running in an infinite loop.
OpenBSD would just crash, whereas Linux soldiers on for a while trying
to keep things going in case the culprits recant their evil ways.

The way to track this down is to figure out which process is out of
control by running top while it is happening, then trigger a core dump
of that process.  I've tried to communicate this several times to
Joachim, but often he's in a situation where he feels doing something
like that would not be the best PR move.
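
For whoever is at the console when it happens, here is a rough sketch
of those two steps in Python.  It is not something we ship.  It assumes
the Linux /proc layout, that the runaway daemon is the one with the
largest resident set (top sorted by memory shows the same thing), and
that core dumps are enabled on the box (ulimit -c and the kernel
core_pattern).  The function names are made up for illustration.  Note
that SIGABRT kills the process as it dumps core, and a daemon that
catches SIGABRT won't dump at all.

```python
import os
import signal


def status_field(text, key):
    """Return the value of one 'Key:\tvalue' line from /proc/<pid>/status."""
    for line in text.splitlines():
        if line.startswith(key + ":"):
            return line.split(":", 1)[1].strip()
    return None


def rss_kib(text):
    """Resident set size in KiB; 0 if there is no VmRSS line (kernel threads)."""
    value = status_field(text, "VmRSS")
    return int(value.split()[0]) if value else 0


def biggest_process(proc_root="/proc"):
    """Return (pid, name, rss_kib) for the process using the most memory."""
    if not os.path.isdir(proc_root):
        return None
    best = None
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                text = f.read()
        except OSError:
            continue  # process exited while we were scanning
        rss = rss_kib(text)
        if best is None or rss > best[2]:
            best = (int(entry), status_field(text, "Name") or "?", rss)
    return best


def force_core(pid):
    """Send SIGABRT, whose default action is to terminate and dump core."""
    os.kill(pid, signal.SIGABRT)


if __name__ == "__main__":
    hog = biggest_process()
    if hog:
        print("largest RSS: pid=%d name=%s rss=%d KiB" % hog)
        # Uncomment only once you are sure this is the culprit:
        # force_core(hog[0])
```

If killing the daemon is not acceptable at that moment, gcore (part of
gdb) can write a core of a running process without terminating it,
provided gdb is installed on the system.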


> -----Original Message-----
> From: Andrew Sharp [mailto:andy.sharp@lsi.com] 
> Sent: Wednesday, September 30, 2009 2:24 PM
> To: Stark, Brian
> Cc: Harb, Abdallah; Fong, Rendell
> Subject: Re: Cougar Qual Build
> 
> It's a software issue, not a hardware issue.
> 
> On Wed, 30 Sep 2009 12:02:35 -0600 "Stark, Brian"
> <Brian.Stark@lsi.com> wrote:
> 
> > Andy and Rendell,
> > 
> > We've seen reports from the field about SiByte watchdog messages,
> > and it looks like Abdallah has a system that's doing the same thing
> > in the lab.  See the email thread below.
> > 
> > Please take a look at this system with him to see if we can
> > determine why the messages are happening.  Given that we're
> > launching a build of 100, this is high priority.
> > 
> > 
> > Thanks,
> > Brian
> > 
> > 
> > -----Original Message-----
> > From: Harb, Abdallah 
> > Sent: Wednesday, September 30, 2009 10:58 AM
> > To: Stark, Brian
> > Subject: RE: Cougar Qual Build 
> > 
> > Brian,
> > 
> > Do you have any suggestions regarding the disposition of this board
> > that still shows the SiByte watchdog messages? "SiByte Watchdog in
> > danger of initiating system reset in 0.9 seconds"
> > 
> > It seems that this message appears on certain boards but not all of
> > them. Currently, I have two boards running in a 2U chassis. One of
> > the boards is continuously showing this message, whereas the 2nd
> > board has been running ddtest with no issues for over a week. For
> > the time being, I can keep it running.
> > 
> > Thanks,
> > Abdallah
> > 
> > ________________________________________
> > From: Harb, Abdallah
> > Sent: Sunday, September 20, 2009 5:28 PM
> > To: Harb, Abdallah; Negus, Paul; Stark, Brian; Rosario, Eugene;
> > Fuller, Joe
> > Cc: Quilici, Lauren; Clary, Sue
> > Subject: RE: Cougar Qual Build
> > 
> > Sunday's Qual Update:
> > 
> > 3rd board has successfully completed the 100 hrs of ddtest inside
> > the ESS chamber. I already stopped the ESS chamber this afternoon.
> > Now, both the 3rd & 4th board can be moved to finished goods.
> > 
> > 5th board still requires further verification to understand the
> > "SiByte Watchdog" messages. I will continue to work on it this
> > Monday.
> > 
> > Have a good weekend!
> > Abdallah
> > 
> > 
> > 
> > 
> > 
> > ________________________________________
> > From: Harb, Abdallah
> > Sent: Saturday, September 19, 2009 4:26 PM
> > To: Harb, Abdallah; Negus, Paul; Stark, Brian; Rosario, Eugene;
> > Fuller, Joe
> > Cc: Quilici, Lauren; Clary, Sue
> > Subject: RE: Cougar Qual Build
> > 
> > Saturday's Qual Update:
> > 
> > 3rd & 4th board are still running ddtest in the ESS chamber with no
> > issues. The test will continue to run until tomorrow.
> > 
> > Have a nice day!
> > Abdallah
> > 
> > 
> > ________________________________________
> > From: Harb, Abdallah
> > Sent: Friday, September 18, 2009 10:21 AM
> > To: Harb, Abdallah; Negus, Paul; Stark, Brian; Rosario, Eugene;
> > Fuller, Joe
> > Cc: Quilici, Lauren; Clary, Sue
> > Subject: RE: Cougar Qual Build
> > 
> > Friday's Qual Update:
> > 
> > 3rd board is still running ddtest in the ESS chamber with no issues.
> > As of 10:00 am this morning, the actual number of hours that have
> > passed thus far is 63. In order to complete the 100 hrs as
> > planned, the unit needs to run until 3:00 am on Sunday. I propose to
> > keep the chamber running throughout the weekend. I can come in on
> > Saturday and Sunday to check on the status, and then end the test on
> > Sunday morning.
> > 
> > Paul,
> > Please advise if you're OK with this suggestion.
> > Otherwise, I can stop the chamber this evening, keep the test
> > running, and then re-run the chamber on Monday for an additional 30
> > hrs.
> > 
> > 
> > 4th board is still running ddtest in the chamber with no issues.
> > The board is considered "passed" by now. However, I'll keep it
> > running in the chassis until I have another board that can replace
> > it.
> > 
> > Due to Wayne's visit yesterday, I did not get a chance to work on
> > the issue that was reported on the 5th board. I will work on it
> > today.
> > 
> > Regards,
> > Abdallah
> > 
> > 
> > 
> > ________________________________________
> > From: Harb, Abdallah
> > Sent: Thursday, September 17, 2009 10:25 AM
> > To: Harb, Abdallah; Negus, Paul; Stark, Brian; Rosario, Eugene;
> > Fuller, Joe
> > Cc: Quilici, Lauren; Clary, Sue
> > Subject: RE: Cougar Qual Build
> > 
> > Thursday's Qual Update:
> > 
> > 3rd board is still running ddtest in the ESS chamber with no issues.
> > It has passed the first 24hrs of its 100 hrs testing.
> > It will continue running as planned.
> > 
> > 4th board passed ddtest in the ESS chamber overnight.
> > I'll keep it running until the end of the day.
> > 
> > 5th board is showing the following watchdog message and then
> > rebooting during sysdvt at room temp:
> > "SiByte Watchdog in danger of initiating system reset in 1.0 seconds"
> > I suspect an issue with the chassis. I'll try to work on it today.
> > 
> > P.S. Wayne from LSI will be here today.
> > 
> > Thanks,
> > Abdallah
> > 
> > ________________________________________
> > From: Harb, Abdallah
> > Sent: Wednesday, September 16, 2009 11:33 AM
> > To: Negus, Paul; Stark, Brian; Rosario, Eugene; Fuller, Joe
> > Cc: Quilici, Lauren; Clary, Sue
> > Subject: RE: Cougar Qual Build
> > 
> > Paul,
> > 
> > Unfortunately, the DIMMs don't appear to be loose.
> > Initially, I suspected a DIMM failure, but in the process of
> > eliminating the failure, the same modules worked OK. Other than a
> > lack of pin contact, I can't think of a possible root cause of the
> > failure. I guess we'll see how these 2 boards finish the remaining
> > test steps. We definitely need to keep an eye on future builds for
> > any similar failure that might point towards either a process
> > problem or unreliable connectors.
> > 
> > Regards,
> > Abdallah
> > 
> > 
> > ________________________________________
> > From: Negus, Paul
> > Sent: Wednesday, September 16, 2009 10:33 AM
> > To: Harb, Abdallah; Stark, Brian; Rosario, Eugene; Fuller, Joe
> > Cc: Quilici, Lauren; Clary, Sue
> > Subject: RE: Cougar Qual Build
> > 
> > Thanks Abdallah.  This is much better news.
> > 
> > Have we got any theories on why the DIMMs appear to be getting
> > loose?
> > 
> > Paul
> > 
> > -----Original Message-----
> > From: Harb, Abdallah
> > Sent: Wednesday, September 16, 2009 10:28 AM
> > To: Harb, Abdallah; Stark, Brian; Negus, Paul; Rosario, Eugene
> > Subject: RE: Cougar Qual Build
> > 
> > Wednesday's Qual Test Update:
> > 
> > Last night, the 3rd board started ddtest in the ESS chamber for 100
> > hrs. It should run continuously until Saturday night. As of this
> > morning the board is running OK.
> > 
> > 4th & 5th boards have passed functional after re-seating the DIMM
> > modules.
> > 
> > The 4th board passed sysdvt under temp cycling last night.
> > Today, it will run ddtest. If it passes, then it can be moved to
> > FG. Finally, the 5th board will be placed in the same chassis and
> > run the same tests.
> > 
> > Regards,
> > Abdallah
> > 
> > 
> > 
> > ________________________________________
> > From: Harb, Abdallah
> > Sent: Tuesday, September 15, 2009 9:44 AM
> > To: Harb, Abdallah; Stark, Brian; Negus, Paul; Rosario, Eugene
> > Subject: RE: Cougar Qual Build
> > 
> > Here's Tuesday's Qual Test Update:
> > 
> > 1- The first board that passed ddtest under temp was given to Brian
> >    for engineering review yesterday.
> > 2- Second board passed ddtest under temp overnight. Will be used for
> >    the cross sectional measurement.
> > 3- Third board passed sysdvt under temp overnight. This morning will
> >    start ddtest and keep it running for 100 hrs under temperature
> >    cycling.
> > 4- The 4th & 5th board are still under functional debug.
> > 
> > Regards,
> > Abdallah
> > 
> > ________________________________________
> > From: Harb, Abdallah
> > Sent: Monday, September 14, 2009 9:42 AM
> > To: Stark, Brian; Negus, Paul; Rosario, Eugene
> > Subject: Cougar Qual Build
> > 
> > Here's an update on the qualification testing:
> > 
> > The Cougar board that was allocated to engineering for the
> > qualification testing has passed ddtest at room temp over the
> > weekend. The board is ready for Brian in the HW lab.
> > 
> > Last Friday, I was not able to restart the ddtest on the failing
> > board (the board allocated for the cross section). Today, I'll
> > reinstall the OS on the clients and possibly reconfigure the
> > storage, and then start the test once again.
> > 
> > I will also assemble the remaining three Cougar boards, and start
> > the functional and system test.
> > 
> > Regards,
> > Abdallah
