X-MimeOLE: Produced By Microsoft Exchange V6.5
Received: by onstor-exch02.onstor.net 
	id <01C7665D.634115F8@onstor-exch02.onstor.net>; Wed, 14 Mar 2007 10:22:48 -0700
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01C7665D.634115F8"
Content-class: urn:content-classes:message
Subject: RE: False ECC errors
Date: Wed, 14 Mar 2007 10:22:48 -0700
Message-ID: <BB375AF679D4A34E9CA8DFA650E2B04E02D8F399@onstor-exch02.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E02D8F37D@onstor-exch02.onstor.net>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: False ECC errors
Thread-Index: AcdmWBVGnOpHDdxeREy2hMOjobI4YwAAW/HAAACJicAAAA0y8A==
References: <BB375AF679D4A34E9CA8DFA650E2B04E02D8F343@onstor-exch02.onstor.net> <BB375AF679D4A34E9CA8DFA650E2B04E02D8F379@onstor-exch02.onstor.net> <BB375AF679D4A34E9CA8DFA650E2B04E02D8F37D@onstor-exch02.onstor.net>
From: "Brian Stark" <brian.stark@onstor.com>
To: "Jonathan Goldick" <jonathan.goldick@onstor.com>,
	"Andy Sharp" <andy.sharp@onstor.com>

This is a multi-part message in MIME format.

------_=_NextPart_001_01C7665D.634115F8
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

OK, after having more morning coffee and waking up a bit more: this was
the issue prior to 1.3.3.10, the release where Warren put in changes to
fix some bogus ECC error reporting in the crashdump.  Basically, a DBE
on one core, usually caused by a bad pointer access, would then lead to
a bogus ECC error on another core.  Because of some other previous
changes to crashdumps, the bogus ECC error then overwrote the DBE in
/var/crash, leading everyone to believe that the hardware had real ECC
errors.

With 1.3.3.10 and, I think, 2.2.2, the overwriting doesn't happen and
the SiByte counters are stored within the crashdump.  If uncorrectable
ECC errors are shown in the crashdump, they are thought to be real,
since the count is pulled straight out of the SiByte.

The concern would be that the ECC counter is not correct or can be
incremented when other crashes occur.  We don't have any evidence this
is happening, but it's possible.  I'll feel much better if we find ECC
errors on the Facebook system that is still under test.


Brian


> _____________________________________________
> From: 	Jonathan Goldick
> Sent:	Wednesday, March 14, 2007 10:11 AM
> To:	Brian Stark; Andy Sharp
> Subject:	RE: False ECC errors
>
> It wasn't Andy, but I don't remember who else it could have been.
> Will try to track down where the idea started :-)
>
>
> _____________________________________________
> From: Brian Stark
> Sent: Wednesday, March 14, 2007 10:10 AM
> To: Jonathan Goldick; Andy Sharp
> Subject: RE: False ECC errors
>
> Wow, I haven't heard anything about this.  For ECC errors on the
> SiByte, we are looking at the uncorrectable error counter on the
> SiByte itself.  Does this have anything to do with an invalid pointer
> access?  Can this counter be incremented for a reason other than a
> real ECC error?
>
> This is definitely something we need to get to the bottom of.  We got
> the system back from Facebook that reported several ECC errors that
> were thought to be real because of the SiByte counter, but we have yet
> to find anything wrong with it in the hardware lab.  The tests we are
> running are designed to specifically tickle ECC errors, and we've yet
> to see a system that experienced ECC errors in normal op and then
> didn't have them with this test.
>
> I'm starting to worry that this counter is either wrong or that
> environmental influences at some customer sites are causing real ECC
> errors.  Obviously, neither of these is good.
>
>
> Brian
>
>
> 	_____________________________________________
> 	From: 	Jonathan Goldick
> 	Sent:	Wednesday, March 14, 2007 9:45 AM
> 	To:	Andy Sharp
> 	Cc:	Brian Stark
> 	Subject:	False ECC errors
>
> 	Andy,
>
> 	I seem to remember you mentioning that we report an ECC error
> when in reality this is an invalid pointer access.  Please confirm
> since we are still RMA'ing boxes for ECC errors that may not be real.
>
> 	Thanks,
>
> 	Jonathan

------_=_NextPart_001_01C7665D.634115F8--
