AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:<20081217003526.3916fcde@ripper.onstor.net>
CFG:
PT:0
S:andy.sharp@onstor.com
RQ:
SSV:exch1.onstor.net
NSV:
SSH:
R:<brian.stark@onstor.com>
MAID:1
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#imap/andys@onstor.net@exch1.onstor.net/INBOX	0	2779531E7C760D4491C96305019FEEB51762E4F2E9@exch1.onstor.net
X-Sylpheed-End-Special-Headers: 1
Date: Wed, 17 Dec 2008 00:36:22 -0800
From: Andrew Sharp <andy.sharp@onstor.com>
To: Brian Stark <brian.stark@onstor.com>
Subject: Re: Defect  TED00026033 (Fujitsu Cougar Eval) Cougar 3510 rebooted
 with ECC/Bus Error exception on FP
Message-ID: <20081217003622.36d33ddb@ripper.onstor.net>
In-Reply-To: <2779531E7C760D4491C96305019FEEB51762E4F2E9@exch1.onstor.net>
References: <ONSTOR-EXCH01zMcC64000061ab@onstor-exch01.onstor.net>
	<2779531E7C760D4491C96305019FEEB51762E4F2E9@exch1.onstor.net>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

The SSC definitely didn't crash first, but the same bug might have
caused both crashes, i.e., they both may have barfed over the bogus
virtual address of 120 or 180 or whatever it was.  The FP does not
directly access SSC memory, but it does formulate messages that are
encapsulated by TXRX code into management bus messages, so it could have
been the same memory.  We allocate memory for the FP and the SSC to
communicate over the mgmt-bus, but those rings are never used.  Doh.

On Tue, 16 Dec 2008 22:53:52 -0800 Brian Stark <brian.stark@onstor.com>
wrote:

> Do you think the kernel oops happened after the FP crashed?  It's
> important to understand the sequence of events.
> 
> If the kernel oops happened first, that may explain the bus error on
> the FP.  If the FP tried to access the SSC over PCI, the transaction
> probably would not have completed if the kernel had crashed, which
> would then cause the bus error on the FP.  Then again, I'm not even
> sure if the FP directly accesses the SSC -- that may be done by only
> the TXRX.
> 
> I've asked Irie san some follow-up questions in another email, but
> it's curious that he mentioned the oops happened after the removal of
> a fan module.  
> 
> 
> 
> -----Original Message-----
> From: Andy Sharp 
> Sent: Tuesday, December 16, 2008 4:30 PM
> To: Andy Sharp
> Cc: Shin Irie; Brian Stark; Maxim Kozlovsky
> Subject: Defect TED00026033 (Fujitsu Cougar Eval) Cougar 3510
> rebooted with ECC/Bus Error exception on FP
> 
> Headline: (Fujitsu Cougar Eval) Cougar 3510 rebooted with ECC/Bus
> Error exception on FP id: TED00026033
> Note_Entry: 
> This might be a hardware failure.  The kernel oops is a down-stream
> effect of getting a null pointer handed across the management bus
> device, which might be from getting bogus data from a memory read on
> the TXRX/FP side, I don't know.  Adding Brian to CC list to be sure
> that my assessment of the hardware problem is correct.
> 
> State: Opened
> history:
> 33781396  Dec 16 2008  4:13PM  shini  Submit  no_value  Opened
> 33781397  Dec 16 2008  4:15PM  shini  Modify  Opened  Opened
> 33781400  12/16/2008 16:29:31 PM  andrews  Modify  Opened  Opened
> company_name: 
> 
