AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:<20080813164412.5a940bef@ripper.onstor.net>
CFG:
PT:0
S:andy.sharp@onstor.com
RQ:
SSV:onstor-exch02.onstor.net
NSV:
SSH:
R:<deepak.veliath@onstor.com>,<warren.gale@onstor.com>,<brian.stark@onstor.com>,<jonathan.goldick@onstor.com>,<maxim.kozlovsky@onstor.com>,<amit.bothra@onstor.com>
MAID:1
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#imap/andys@onstor.net@onstor-exch02.onstor.net/INBOX	0	BB375AF679D4A34E9CA8DFA650E2B04E0B473E27@onstor-exch02.onstor.net
X-Sylpheed-End-Special-Headers: 1
Date: Wed, 13 Aug 2008 16:44:45 -0700
From: Andrew Sharp <andy.sharp@onstor.com>
To: "Deepak Veliath" <deepak.veliath@onstor.com>
Cc: "Warren Gale" <warren.gale@onstor.com>, "Brian Stark"
 <brian.stark@onstor.com>, Jonathan Goldick <jonathan.goldick@onstor.com>,
 Maxim Kozlovsky <maxim.kozlovsky@onstor.com>, Amit Bothra
 <amit.bothra@onstor.com>
Subject: Re: Regd the ECC exception handler
Message-ID: <20080813164445.5b3b6d7e@ripper.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E0B473E27@onstor-exch02.onstor.net>
References: <BB375AF679D4A34E9CA8DFA650E2B04E0B473E27@onstor-exch02.onstor.net>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Forwarding to a few more folks.

On Wed, 13 Aug 2008 16:35:15 -0700 "Deepak Veliath"
<deepak.veliath@onstor.com> wrote:

> Hello Warren,
> The ECC exception handler installed on the TxRx seems to be the
> routine ecc_error_exc().  In it the condition on which we panic reads:
> 	berr_mie_cnts = *((volatile uint64 *)PHYS_TO_K1(A_BUS_MEM_IO_ERRORS));
> 	if( (berr_mie_cnts & 0x00FF0000) == 0) {
> 		signextend((address_t *)&istsrc_p);
> 		ecc_bus_error_process(sr, cause, epc, istsrc_p, *istsrc_p);
> 		panic("%s error\n", (cause & CAUSE_IP3) ? "Uncorrectable ECC" : "BUS");
> 	}
> 
> From the SiByte manuals for both the 1250 and the 1480, it would seem
> that the check on berr_mie_cnts above should instead read
> (berr_mie_cnts & 0x00FFFF00) != 0.  Am I missing something?
> 
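
To make the difference between the two conditions concrete, here is a
minimal sketch in plain C.  The helper names current_check(),
proposed_check() and self_check() are hypothetical, and the sample
values are made up for illustration.  Which error counters the masked
bits correspond to should be taken from the SiByte manuals, not from
this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Condition as quoted from ecc_error_exc(): true (panic) when
 * bits 23:16 of the register value are all zero. */
static int current_check(uint64_t berr_mie_cnts)
{
    return (berr_mie_cnts & 0x00FF0000) == 0;
}

/* Condition as proposed in the mail: true (panic) when any of
 * bits 23:8 of the register value is set. */
static int proposed_check(uint64_t berr_mie_cnts)
{
    return (berr_mie_cnts & 0x00FFFF00) != 0;
}

/* Illustrative values only, not readings from real hardware. */
static void self_check(void)
{
    /* Nothing set in bits 23:8: current check panics, proposed does not. */
    assert(current_check(0x00000001) && !proposed_check(0x00000001));
    /* Only bits 15:8 set: both checks panic. */
    assert(current_check(0x00000100) && proposed_check(0x00000100));
    /* Bits 23:16 set: current check returns quietly, proposed panics. */
    assert(!current_check(0x00010000) && proposed_check(0x00010000));
}
```

The last case is the one relevant to the theory further down: with a
nonzero value in bits 23:16, the quoted condition is false and the
handler returns without panicking.
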
> The context for this query is a series of defects (TED00024773) in
> which TxRx panics on one of the Cougar-SuperSoak systems end with the
> TxRx core file not being dumped, because the non-panicking TxRx CPU
> has not responded to the cross-processor interrupt signalling a panic
> condition.  Eventually this non-responding CPU panics with an NMI
> from a watchdog timer expiry.  A few seconds later the FP0 panics
> from a watchdog timer expiry NMI as well, while waiting for the
> non-responding CPU to acknowledge.  After that the panicking TxRx CPU
> panics again when its watchdog timer expires.
> 
> The three CPUs keep panicking from NMIs until the node is rebooted.
> 
> In some instances where the non-responding CPU was able to write a
> crashdump before the one that won the cross-processor panic broadcast
> race, I see that the ECC interrupt (IP3) is set in the CP0 Cause
> register.  My theory is that one (or perhaps both) of the CPUs is
> receiving the ECC interrupt and, since the indication is not disabled
> and the handler returns without panicking, the interrupt is received
> again as soon as interrupts are re-enabled.  This goes on until the
> watchdog timer expires.
> 
> You seem to have put the check quoted above into the code for
> ecc_error_exc() as part of defect 17390 (change 22593).
> 
> Do let me know.
> 
> Thank you,
> veliath
> 
