AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:
CFG:
PT:0
S:andy.sharp@lsi.com
RQ:
SSV:mhbs.lsil.com
NSV:
SSH:
R:<Abdallah.Harb@lsi.com>,<Brian.Stark@lsi.com>
MAID:2
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#imap/LSI/INBOX	0	27AEC73CFDE2EA41849ACAC11A0B39D5CD30144B@cosmail03.lsi.com
X-Sylpheed-End-Special-Headers: 1
Date: Tue, 16 Mar 2010 02:03:59 -0700
From: Andrew Sharp <andy.sharp@lsi.com>
To: "Harb, Abdallah" <Abdallah.Harb@lsi.com>
Cc: "Stark, Brian" <Brian.Stark@lsi.com>
Subject: Re: SiByte Watchdog messages
Message-ID: <20100316020359.0e93b5b1@ripper.onstor.net>
In-Reply-To: <27AEC73CFDE2EA41849ACAC11A0B39D5CD30144B@cosmail03.lsi.com>
References: <27AEC73CFDE2EA41849ACAC11A0B39D504032A7F@cosmail03.lsi.com>
	<E1EC65251D4B3D46BBC0AAA3C0629222B239293E@cosmail02.lsi.com>
	<27AEC73CFDE2EA41849ACAC11A0B39D5CD3013BF@cosmail03.lsi.com>
	<2E4A140D742C3B4E911151A30C39CFE10DDA1244@cosmail03.lsi.com>
	<27AEC73CFDE2EA41849ACAC11A0B39D5CD3013FA@cosmail03.lsi.com>
	<20100301164756.67bb91f9@ripper.onstor.net>
	<27AEC73CFDE2EA41849ACAC11A0B39D5CD30140B@cosmail03.lsi.com>
	<20100301190016.6b8edf57@ripper.onstor.net>
	<27AEC73CFDE2EA41849ACAC11A0B39D5CD301419@cosmail03.lsi.com>
	<E1EC65251D4B3D46BBC0AAA3C0629222B281D633@cosmail02.lsi.com>
	<20100303181843.38b7ae5c@ripper.onstor.net>
	<27AEC73CFDE2EA41849ACAC11A0B39D5CD30141D@cosmail03.lsi.com>
	<20100315181318.3ac93d29@ripper.onstor.net>
	<27AEC73CFDE2EA41849ACAC11A0B39D5CD30144B@cosmail03.lsi.com>
Organization: LSI
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Uh, this is what I see on 10.0.20.101:

Cougar:~# ls -lrt /boot
total 32384
-rwxr-xr-x  1 root root 6169888 Jul  6  2009 fp_cg.bin*
-rwxr-xr-x  1 root root 7819312 Jul  6  2009 txrx_cg.bin*
drwxr-xr-x 24 root root    4096 Jul  6  2009 ../
-rw-r--r--  1 root root 3079776 Jul  7  2009 vmlinux.bin,orig
-rwxr-xr-x  1 root root 4654308 Jul  7  2009 vmlinux.32*
-rw-r--r--  1 root root 2070463 Jul  7  2009 System.map
-rw-r--r--  1 root root 3096264 Mar  2 00:53 vmlinux.bin,wdog-testing
-rw-r--r--  1 root root 3096320 Mar 15 22:24 vmlinux.cougar.wd-special.bin
drwxr-xr-x  2 root root    4096 Mar 15 22:24 ./
-rw-r--r--  1 root root 3096320 Mar 16 00:47 vmlinux.bin
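
Note that vmlinux.bin and vmlinux.cougar.wd-special.bin are the same
size (3096320 bytes) in that listing, so they may well be the same image
under two names. Size alone doesn't prove it, though. A minimal sketch
for checking, assuming cmp is available on the blade (same_image is just
a hypothetical helper name, not something already on the system):

```shell
# Hypothetical helper: succeeds only if the two files are byte-for-byte
# identical. cmp -s prints nothing and reports the result via exit status.
same_image() {
    cmp -s "$1" "$2" 2>/dev/null
}

# Paths taken from the listing above; adjust if your blade differs.
if same_image /boot/vmlinux.bin /boot/vmlinux.cougar.wd-special.bin; then
    echo "same image"
else
    echo "different images (or a file is missing)"
fi
```

Running the same check against the copy on the other unit would show
whether the file survived the transfer intact.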



On Mon, 15 Mar 2010 20:37:17 -0600 "Harb, Abdallah"
<Abdallah.Harb@lsi.com> wrote:

> Andy,
> 
> I copied this image "/boot/vmlinux.bin" from 10.0.20.101 into both
> blades of the 2nd Cougar 2U unit that I have here in the HW lab. The
> same messages are still showing on both blades after the copy and
> reboot.
> 
> I noticed that "vmlinux.bin" has a date stamp of March/1/10.
> However, "vmlinux.cougar.wd-special.bin" has a date stamp of
> March/15/10. Which one of the two images do you want me to try?
> 
> Thanks,
> Abdallah
> 
> 
> ________________________________________
> From: Andrew Sharp [andy.sharp@lsi.com]
> Sent: Monday, March 15, 2010 6:13 PM
> To: Harb, Abdallah
> Cc: Stark, Brian
> Subject: Re: SiByte Watchdog messages
> 
> OK, it looks like I got it.  I found a bunch of crazy bugs in the
> driver.  I'm afraid to go back and see how many are in the shipping
> version, but obviously some of them are, because that's what causes
> the repeating message on these "special" systems.
> 
> Abdallah, you can take the /boot/vmlinux.bin off either of the blades and
> try it on that machine at Venture.  The one on 10.0.20.101 has all the
> debug messages removed so you're not likely to get any extraneous
> messages on the console if you use that one.  All the repeating sibyte
> messages should be history.
> 
> Let me know if you have any questions.
> 
> I'll prepare a changelist for submitting.
> 
> Cheers,
> 
> a
> 
> On Wed, 3 Mar 2010 19:40:17 -0700 "Harb, Abdallah"
> <Abdallah.Harb@lsi.com> wrote:
> 
> > Once done... I can test the fixed kernel using the other Cougar 2U
> > in the HW lab and the Cougar 1U at Venture.
> > ________________________________________
> > From: Andrew Sharp [andy.sharp@lsi.com]
> > Sent: Wednesday, March 03, 2010 6:18 PM
> > To: Stark, Brian
> > Cc: Harb, Abdallah
> > Subject: Re: SiByte Watchdog messages
> >
> > I am writing a test program to test a couple of things before
> > declaring victory.
> >
> > On Wed, 3 Mar 2010 18:55:51 -0700 "Stark, Brian"
> > <Brian.Stark@lsi.com> wrote:
> >
> > > Does this mean we may have a fix?
> > >
> > >
> > > -----Original Message-----
> > > From: Harb, Abdallah
> > > Sent: Wednesday, March 03, 2010 10:56 AM
> > > To: Sharp, Andy
> > > Cc: Sharp, Andy; Stark, Brian
> > > Subject: RE: SiByte Watchdog messages
> > >
> > > I booted from the bottom CF and then the SiByte messages started
> > > > showing up again. I released both consoles; you can re-connect to
> > > > it.
> > > > ________________________________________
> > > From: Andrew Sharp [andy.sharp@lsi.com]
> > > Sent: Monday, March 01, 2010 7:00 PM
> > > To: Harb, Abdallah
> > > Cc: Sharp, Andy; Stark, Brian
> > > Subject: Re: SiByte Watchdog messages
> > >
> > > On Mon, 1 Mar 2010 18:07:02 -0700 "Harb, Abdallah"
> > > <Abdallah.Harb@lsi.com> wrote:
> > >
> > > > Are you looking at both blades? or just one?
> > >
> > > Both.
> > >
> > > > I'm at Venture this afternoon.
> > > > > If you're unable to reproduce the failure by tomorrow morning,
> > > > > then I'll do the trick to get it to fail.
> > >
> > > I eagerly await the trick.
> > >
> > >
> > > >
> > > > ________________________________________
> > > > From: Andrew Sharp [andy.sharp@lsi.com]
> > > > Sent: Monday, March 01, 2010 4:47 PM
> > > > To: Harb, Abdallah
> > > > Cc: Stark, Brian
> > > > Subject: Re: SiByte Watchdog messages
> > > >
> > > > > So what's the trick to getting it to do its thing?  I logged in
> > > > and it wasn't putting out that message, but I forgot to check if
> > > > chassisd was running before I installed my kernel and rebooted.
> > > > So far I've rebooted 3 times and nothing.
> > > >
> > > >
> > > > On Thu, 18 Feb 2010 12:36:05 -0700 "Harb, Abdallah"
> > > > <Abdallah.Harb@lsi.com> wrote:
> > > >
> > > > > Andy,
> > > > >
> > > > > > I was told that you'll be helping us debug the SiByte
> > > > > watchdog messages. The following are the connections to a
> > > > > Cougar unit in the HW lab that is constantly showing the
> > > > > failure.
> > > > >
> > > > > Power Sentry: 10.0.20.15 port# 2.
> > > > > Top board console: 10.0.20.11 2002
> > > > > Top board IP address: 10.0.20.102
> > > > > Bottom board console: 10.0.20.11 2001
> > > > > Bottom board IP address: 10.0.20.101
> > > > >
> > > > > I also have another Cougar 2U in the HW lab that shows the
> > > > > same failure. Let me know if you need access to this 2nd unit
> > > > > as well.
> > > > >
> > > > > Thanks,
> > > > > Abdallah
> > > > >
> > > > > ________________________________________
> > > > > From: Harb, Abdallah
> > > > > Sent: Friday, February 12, 2010 6:39 PM
> > > > > To: Stark, Brian; Fong, Rendell
> > > > > Subject: SiByte Watchdog messages
> > > > >
> > > > > Good evening,
> > > > >
> > > > > > This is a follow-up to our conversation regarding the SiByte
> > > > > > watchdog messages that we had yesterday. Today, I tried to
> > > > > > characterize the failure using a good chassis, a good
> > > > > > Mezzanine board, and two suspected motherboards. At the end
> > > > > > of the day, I had many pages of experiment notes, but
> > > > > > unfortunately, it's hard to draw any meaningful conclusion
> > > > > > from them. In a nutshell, slot location seems to be
> > > > > > irrelevant to triggering the failure, as does ejecting or
> > > > > > inserting a motherboard from the chassis.
> > > > >
> > > > > Next week, I will continue with this experiment using another
> > > > > > failed unit from Venture. I hope that I will have more
> > > > > > meaningful results than the ones below.
> > > > >
> > > > > Here's a summary of my experiment that I conducted today:
> > > > > I used a known good chassis, a known good Mezzanine card, and
> > > > > two suspected motherboards.
> > > > >
> > > > > Experiment #1
> > > > > Top slot - board IP: 10.0.20.101
> > > > > Bottom slot - board IP: 10.0.20.102
> > > > >
> > > > > When both boards were inserted:
> > > > > * Top board came up and showed continuous SiByte messages.
> > > > > * Bottom board came up OK, but showed only one SiByte message
> > > > > (1.0 sec).
> > > > >
> > > > > When bottom board was ejected:
> > > > > * Top board came up OK, No messages.
> > > > >
> > > > > When top board was ejected:
> > > > > * Bottom board came up OK, No messages.
> > > > >
> > > > > Experiment #2
> > > > > (Swapped slot locations)
> > > > > Top slot - board IP: 10.0.20.102
> > > > > Bottom slot - board IP: 10.0.20.101
> > > > >
> > > > > When both boards were inserted:
> > > > > > * Top board came up OK, but showed only one SiByte message
> > > > > > (1.0 sec).
> > > > > > * Bottom board came up OK, No messages.
> > > > > >
> > > > > > When this step was repeated:
> > > > > * Top board came up OK, No messages.
> > > > > * Bottom board came up and showed continuous SiByte messages
> > > > > (0.9 sec).
> > > > >
> > > > > When bottom board was ejected:
> > > > > * Top board came up OK, No messages.
> > > > >
> > > > > When top board was ejected:
> > > > > * Bottom board came up OK, No messages.
> > > > >
> > > > > > Please don't conclude that the failure only follows the board
> > > > > > with IP 10.0.20.101, because the other board also reported
> > > > > > continuous SiByte messages during another set of experiments.
> > > > >
> > > > > Regards,
> > > > > Abdallah
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: Harb, Abdallah
> > > > > Sent: Tuesday, January 19, 2010 7:23 PM
> > > > > To: Stark, Brian
> > > > > Subject: RE: SiByte Watchdog messages
> > > > >
> > > > > Brian,
> > > > >
> > > > > Unfortunately, I didn't get a chance to work on it back in
> > > > > October, but I worked on it today. I tried all of the
> > > > > following debug tests listed in your email, and here are my
> > > > > findings:
> > > > >
> > > > > Q: Do the messages go away if sysdvt is halted?
> > > > > A: No.
> > > > >
> > > > > > Q: If the motherboards are swapped, do the SiByte messages
> > > > > > follow the board, stay on the same slot, or go away?
> > > > > > A: The SiByte messages follow the board.
> > > > > >
> > > > > > Q: If the mezzanine board is swapped out, do the SiByte
> > > > > > messages go away?
> > > > > > A: Yes, the SiByte messages go away.
> > > > > >
> > > > > > Q: If the motherboard showing the problem is moved to another
> > > > > > chassis, do the SiByte messages go away?
> > > > > > A: Yes, the SiByte messages go away.
> > > > >
> > > > > > The mezzanine board seems to be the source of the failure, but
> > > > > > the question is why the SiByte messages would show on only one
> > > > > > motherboard and not on any other board. And if they show on one
> > > > > > motherboard, they always follow that specific motherboard
> > > > > > regardless of its slot number in the chassis, as long as the
> > > > > > suspect mezzanine board is used.
> > > > >
> > > > > > Tomorrow morning, I will be at Venture, and then at ONStor in
> > > > > > the afternoon. Please let me know if there's anything else
> > > > > > that I should try.
> > > > >
> > > > > Regards,
> > > > > Abdallah
> > > > >
> > > > >
