AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:
CFG:
PT:0
S:andy.sharp@lsi.com
RQ:
SSV:mhbs.lsil.com
NSV:
SSH:
R:<Abdallah.Harb@lsi.com>,<Brian.Stark@lsi.com>
MAID:2
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#imap/LSI/INBOX	0	27AEC73CFDE2EA41849ACAC11A0B39D5CD30141D@cosmail03.lsi.com
X-Sylpheed-End-Special-Headers: 1
Date: Thu, 4 Mar 2010 23:28:49 -0800
From: Andrew Sharp <andy.sharp@lsi.com>
To: "Harb, Abdallah" <Abdallah.Harb@lsi.com>
Cc: "Stark, Brian" <Brian.Stark@lsi.com>
Subject: Re: SiByte Watchdog messages
Message-ID: <20100304232849.173a17d3@ripper.onstor.net>
In-Reply-To: <27AEC73CFDE2EA41849ACAC11A0B39D5CD30141D@cosmail03.lsi.com>
References: <27AEC73CFDE2EA41849ACAC11A0B39D504032A7F@cosmail03.lsi.com>
	<E1EC65251D4B3D46BBC0AAA3C0629222B239293E@cosmail02.lsi.com>
	<27AEC73CFDE2EA41849ACAC11A0B39D5CD3013BF@cosmail03.lsi.com>
	<2E4A140D742C3B4E911151A30C39CFE10DDA1244@cosmail03.lsi.com>
	<27AEC73CFDE2EA41849ACAC11A0B39D5CD3013FA@cosmail03.lsi.com>
	<20100301164756.67bb91f9@ripper.onstor.net>
	<27AEC73CFDE2EA41849ACAC11A0B39D5CD30140B@cosmail03.lsi.com>
	<20100301190016.6b8edf57@ripper.onstor.net>
	<27AEC73CFDE2EA41849ACAC11A0B39D5CD301419@cosmail03.lsi.com>
	<E1EC65251D4B3D46BBC0AAA3C0629222B281D633@cosmail02.lsi.com>
	<20100303181843.38b7ae5c@ripper.onstor.net>
	<27AEC73CFDE2EA41849ACAC11A0B39D5CD30141D@cosmail03.lsi.com>
Organization: LSI
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Arg.  So I start writing the test program.  That's when I realize that
the new version of the driver has been updated to handle the 1480, and
hence /dev/watchdog is now /dev/watchdog0, and chassisd in r402rel has
not been updated to the new file name, so it has not been handling the
watchdog on the system when my kernel was running.

Which is a problem.

Because the system should have rebooted since nothing was "petting" the
watchdog.

chassisd just puts out a little elog message that it can't set the
watchdog timer, and merrily goes about its way.  I guess.

And so I added a symlink from watchdog to watchdog0, and bingo, the
messages immediately start ~:^(
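
For reference, a minimal sketch of the kind of lookup that would avoid
needing the symlink: try the new node name first, fall back to the old
one, and fail loudly if neither opens.  This is Python purely for
illustration, not chassisd's actual code; the helper names are made up,
and it assumes the driver follows the usual Linux watchdog convention
that any write to the device counts as a keepalive.

```python
import os

# Candidate device nodes, new name first. /dev/watchdog0 is the name the
# updated driver exposes; /dev/watchdog is the old name.
CANDIDATES = ("/dev/watchdog0", "/dev/watchdog")


def open_watchdog(candidates=CANDIDATES):
    """Return (fd, path) for the first candidate that opens.

    Raises RuntimeError if none can be opened, instead of logging and
    carrying on with nothing petting the watchdog.
    Note: opening a real watchdog node normally arms the timer.
    """
    errors = []
    for path in candidates:
        try:
            return os.open(path, os.O_WRONLY), path
        except OSError as exc:
            errors.append(f"{path}: {exc.strerror}")
    raise RuntimeError("no watchdog device could be opened: " + "; ".join(errors))


def pet(fd):
    """Pet the watchdog (assumption: any write resets the timer)."""
    os.write(fd, b"\0")
```

A caller would do open_watchdog() once at startup and then call pet(fd)
on an interval shorter than the watchdog timeout.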

So now I'm trying to figure out why the system doesn't immediately
reboot...


On Wed, 3 Mar 2010 19:40:17 -0700 "Harb, Abdallah"
<Abdallah.Harb@lsi.com> wrote:

> Once done... I can test the fixed kernel using the other Cougar 2U in
> the HW lab and the Cougar 1U at Venture.
> ________________________________________
> From: Andrew Sharp [andy.sharp@lsi.com]
> Sent: Wednesday, March 03, 2010 6:18 PM
> To: Stark, Brian
> Cc: Harb, Abdallah
> Subject: Re: SiByte Watchdog messages
> 
> I am writing a test program to test a couple of things before
> declaring victory.
> 
> On Wed, 3 Mar 2010 18:55:51 -0700 "Stark, Brian" <Brian.Stark@lsi.com>
> wrote:
> 
> > Does this mean we may have a fix?
> >
> >
> > -----Original Message-----
> > From: Harb, Abdallah
> > Sent: Wednesday, March 03, 2010 10:56 AM
> > To: Sharp, Andy
> > Cc: Sharp, Andy; Stark, Brian
> > Subject: RE: SiByte Watchdog messages
> >
> > I booted from the bottom CF and then the SiByte messages started
> > showing up again. I released both consoles, you can re-connect to
> > it.
> > ________________________________________
> > From: Andrew Sharp [andy.sharp@lsi.com]
> > Sent: Monday, March 01, 2010 7:00 PM
> > To: Harb, Abdallah
> > Cc: Sharp, Andy; Stark, Brian
> > Subject: Re: SiByte Watchdog messages
> >
> > On Mon, 1 Mar 2010 18:07:02 -0700 "Harb, Abdallah"
> > <Abdallah.Harb@lsi.com> wrote:
> >
> > > Are you looking at both blades? or just one?
> >
> > Both.
> >
> > > I'm at Venture this afternoon.
> > > If you're unable to reproduce the failure by tomorrow morning,
> > > then I'll do the trick to get it to fail.
> >
> > I eagerly await the trick.
> >
> >
> > >
> > > ________________________________________
> > > From: Andrew Sharp [andy.sharp@lsi.com]
> > > Sent: Monday, March 01, 2010 4:47 PM
> > > To: Harb, Abdallah
> > > Cc: Stark, Brian
> > > Subject: Re: SiByte Watchdog messages
> > >
> > > So what's the trick to getting it to do its thing?  I logged in
> > > and it wasn't putting out that message, but I forgot to check if
> > > chassisd was running before I installed my kernel and rebooted.
> > > So far I've rebooted 3 times and nothing.
> > >
> > >
> > > On Thu, 18 Feb 2010 12:36:05 -0700 "Harb, Abdallah"
> > > <Abdallah.Harb@lsi.com> wrote:
> > >
> > > > Andy,
> > > >
> > > > I was told that you'll be helping us debug the SiByte
> > > > watchdog messages. The following are the connections to a
> > > > Cougar unit in the HW lab that is constantly showing the
> > > > failure.
> > > >
> > > > Power Sentry: 10.0.20.15 port# 2.
> > > > Top board console: 10.0.20.11 2002
> > > > Top board IP address: 10.0.20.102
> > > > Bottom board console: 10.0.20.11 2001
> > > > Bottom board IP address: 10.0.20.101
> > > >
> > > > I also have another Cougar 2U in the HW lab that shows the same
> > > > failure. Let me know if you need access to this 2nd unit as
> > > > well.
> > > >
> > > > Thanks,
> > > > Abdallah
> > > >
> > > > ________________________________________
> > > > From: Harb, Abdallah
> > > > Sent: Friday, February 12, 2010 6:39 PM
> > > > To: Stark, Brian; Fong, Rendell
> > > > Subject: SiByte Watchdog messages
> > > >
> > > > Good evening,
> > > >
> > > > This is a follow up to our conversation regarding the SiByte
> > > > watchdog messages that we had yesterday. Today, I tried to
> > > > characterize the failure using a good chassis, a good Mezzanine
> > > > board, and two suspected motherboards. At the end of the day, I
> > > > had so many pages of experiment notes, but unfortunately, it's
> > > > hard to draw any meaningful conclusion out of it. In a
> > > > nutshell, slot location seems to be irrelevant to triggering
> > > > the failure, as does ejecting or inserting a motherboard in the
> > > > chassis.
> > > >
> > > > Next week, I will continue with this experiment using another
> > > > failed unit from Venture. I hope that I will have more
> > > > meaningful results than the ones below.
> > > >
> > > > Here's a summary of my experiment that I conducted today:
> > > > I used a known good chassis, a known good Mezzanine card, and
> > > > two suspected motherboards.
> > > >
> > > > Experiment #1
> > > > Top slot - board IP: 10.0.20.101
> > > > Bottom slot - board IP: 10.0.20.102
> > > >
> > > > When both boards were inserted:
> > > > * Top board came up and showed continuous SiByte messages.
> > > > * Bottom board came up OK, but showed only one SiByte message
> > > > (1.0 sec).
> > > >
> > > > When bottom board was ejected:
> > > > * Top board came up OK, No messages.
> > > >
> > > > When top board was ejected:
> > > > * Bottom board came up OK, No messages.
> > > >
> > > > Experiment #2
> > > > (Swapped slot locations)
> > > > Top slot - board IP: 10.0.20.102
> > > > Bottom slot - board IP: 10.0.20.101
> > > >
> > > > When both boards were inserted:
> > > > * Top board came up OK, but showed only one SiByte message
> > > > (1.0 sec).
> > > > * Bottom board came up OK, No messages.
> > > > When this step was repeated:
> > > > * Top board came up OK, No messages.
> > > > * Bottom board came up and showed continuous SiByte messages
> > > > (0.9 sec).
> > > >
> > > > When bottom board was ejected:
> > > > * Top board came up OK, No messages.
> > > >
> > > > When top board was ejected:
> > > > * Bottom board came up OK, No messages.
> > > >
> > > > Please don't conclude that the failure only follows the board
> > > > with IP 10.0.20.101, because the other board also reported
> > > > continuous SiByte messages during another set of experiments.
> > > >
> > > > Regards,
> > > > Abdallah
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Harb, Abdallah
> > > > Sent: Tuesday, January 19, 2010 7:23 PM
> > > > To: Stark, Brian
> > > > Subject: RE: SiByte Watchdog messages
> > > >
> > > > Brian,
> > > >
> > > > Unfortunately, I didn't get a chance to work on it back in
> > > > October, but I worked on it today. I tried all of the following
> > > > debug tests listed in your email, and here are my findings:
> > > >
> > > > Q: Do the messages go away if sysdvt is halted?
> > > > A: No.
> > > >
> > > > Q: If the motherboards are swapped, do the SiByte messages
> > > > follow the board, stay on the same slot, or go away?
> > > > A: The SiByte messages follow the board.
> > > >
> > > > Q: If the mezzanine board is swapped out, do the SiByte messages
> > > > go away?
> > > > A: Yes, the SiByte messages go away.
> > > >
> > > > Q: If the motherboard showing the problem is moved to another
> > > > chassis, do the SiByte messages go away?
> > > > A: Yes, the SiByte messages go away.
> > > >
> > > > The mezzanine board seems to be the source of the failure, but
> > > > the question is why the SiByte messages would show on only one
> > > > motherboard and not on any other board. And once they show on
> > > > one motherboard, they always follow that specific motherboard
> > > > regardless of its slot number in the chassis, as long as the
> > > > suspect mezzanine board is used.
> > > >
> > > > Tomorrow morning, I will be at Venture, and then at ONStor in
> > > > the afternoon. Please let me know if there's anything else that
> > > > I should try.
> > > >
> > > > Regards,
> > > > Abdallah
> > > >
> > > >
