X-MimeOLE: Produced By Microsoft Exchange V6.5
Received: by onstor-exch02.onstor.net 
	id <01C85868.F9FD6411@onstor-exch02.onstor.net>; Wed, 16 Jan 2008 10:55:27 -0700
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01C85868.F9FD6411"
Content-class: urn:content-classes:message
Subject: RE: TXRX optimized build
Date: Wed, 16 Jan 2008 10:55:27 -0700
Message-ID: <BB375AF679D4A34E9CA8DFA650E2B04E07AE21C2@onstor-exch02.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E07AE21A6@onstor-exch02.onstor.net>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: TXRX optimized build
Thread-Index: AchYZbKYKkj9SJsmTz2INn+nOUJNagAADpvgAAByNxA=
References: <BB375AF679D4A34E9CA8DFA650E2B04E07AE2184@onstor-exch02.onstor.net> <BB375AF679D4A34E9CA8DFA650E2B04E07AE21A6@onstor-exch02.onstor.net>
From: "Brian Stark" <brian.stark@onstor.com>
To: "Maxim Kozlovsky" <maxim.kozlovsky@onstor.com>,
	"Rick Lund" <rick.lund@onstor.com>
Cc: "Andy Sharp" <andy.sharp@onstor.com>

This is a multi-part message in MIME format.

------_=_NextPart_001_01C85868.F9FD6411
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

Max,

The bus error on the TXRX side appears to be a master abort, which
shouldn't happen just because the SSC is busy with something else.  If
the SSC is busy and can't complete the PCI access, it will retry the
transaction on the PCI side.  A master abort happens when nothing
responds to the address put out by the master (TXRX driving
0x6000.0000).  This could happen if the PCI bus on the 1125 was going
through some type of reset or reinit sequence.  We could also be dealing
with a chip erratum, which we're looking into now.

By the way, we've ruled out the PCI parity error that I mentioned
yesterday at lunch.  We're not seeing parity errors, so that's not
causing the bus error.


Brian


> _____________________________________________=20
> From: 	Maxim Kozlovsky =20
> Sent:	Wednesday, January 16, 2008 9:45 AM
> To:	Rick Lund
> Cc:	Brian Stark; Andy Sharp
> Subject:	RE: TXRX optimized build
>=20
> rcon_init_done and rconInit() are not connected; in fact, I should take
> out rconInit() from cougar. It is for the older rcon interface,
> which we don't need anymore.
>=20
> Looking at the consoles, it can be seen that the crash on txrx
> happens when Linux is already starting to run the user processes,
> so the Linux pci setup must be completed. I had the same crash in the
> debug build when the system was doing some work and it happened really
> late, which excludes the rcon_init and linux pci configuration. It is
> probably something else, but I'll take the workaround for now. My
> guess is that in the optimized build, during initialization, we
> are creating dirty data faster than it can be flushed to memory,
> so the memory controller and bus are busy when the messaging
> initialization happens and this somehow affects the pci accesses. By
> adding the delay you allow the dirty data to be flushed and this
> allows us to get through the initialization process. This of course is
> pure theory.
>=20
>=20
> _____________________________________________
> From: Rick Lund=20
> Sent: Wednesday, January 16, 2008 9:32 AM
> To: Maxim Kozlovsky
> Cc: Brian Stark; Andy Sharp
> Subject: TXRX optimized build
>=20
> Max,
>     I've worked around the TXRX optimized problem with a couple
> different changes.  One is adding a 5 second delay before the ring
> init in the mgmtBus_init code which is getting the bus error.  The
> second is moving the "rcon_init_done =3D 1" line in test.c to after =
the
> rconInit() call later in the startup sequence.  That one sounds like a
> more logical fix; not setting the flag which says we've initialized
> rcon until after the rconInit() function.  But I don't think that's
> the real problem, since it appears to work in the debug build and
> presumably on Bobcat.
>     My best guess is a conflict with the TXRX using the PCI bus to
> access SSC memory during the SSC's startup.  Is Linux disabling the
> memory accesses from the PCI bus during startup?  Something like that?
> And by slightly changing the timing of our use of SSC memory by the
> TXRX, we delay long enough for the Linux PCI init to have been
> completed.
>=20
> -Rick

------_=_NextPart_001_01C85868.F9FD6411
Content-Type: text/html;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV=3D"Content-Type" CONTENT=3D"text/html; =
charset=3Dus-ascii">
<META NAME=3D"Generator" CONTENT=3D"MS Exchange Server version =
6.5.7653.38">
<TITLE>RE: TXRX optimized build</TITLE>
</HEAD>
<BODY>
<!-- Converted from text/rtf format -->

<P><FONT COLOR=3D"#0000FF" SIZE=3D2 FACE=3D"Arial">Max,</FONT>
</P>

<P><FONT COLOR=3D"#0000FF" SIZE=3D2 FACE=3D"Arial">The bus error on the =
TXRX side appears to be a master abort, which shouldn't happen just =
because the SSC is busy with something else.&nbsp; If the SSC is busy =
and can't complete the PCI access, it will retry the transaction on =
the PCI side.&nbsp; A master abort happens when nothing responds to the =
address put out by the master (TXRX driving 0x6000.0000).&nbsp; This =
could happen if the PCI bus on the 1125 was going through some type of =
reset or reinit sequence.&nbsp; We could also be dealing with a chip =
erratum, which we're looking into now.</FONT></P>

<P><FONT COLOR=3D"#0000FF" SIZE=3D2 FACE=3D"Arial">By the way, we've =
ruled out the PCI parity error that I mentioned yesterday at =
lunch.&nbsp; We're not seeing parity errors, so that's not causing the =
bus error.</FONT></P>
<BR>

<P><FONT COLOR=3D"#0000FF" SIZE=3D2 FACE=3D"Arial">Brian</FONT>
</P>
<BR>
<UL>
<P><FONT SIZE=3D1 =
FACE=3D"Tahoma">_____________________________________________ </FONT>

<BR><B><FONT SIZE=3D1 FACE=3D"Tahoma">From: &nbsp;</FONT></B> <FONT =
SIZE=3D1 FACE=3D"Tahoma">Maxim Kozlovsky&nbsp; </FONT>

<BR><B><FONT SIZE=3D1 FACE=3D"Tahoma">Sent:&nbsp;&nbsp;</FONT></B> <FONT =
SIZE=3D1 FACE=3D"Tahoma">Wednesday, January 16, 2008 9:45 AM</FONT>

<BR><B><FONT SIZE=3D1 =
FACE=3D"Tahoma">To:&nbsp;&nbsp;&nbsp;&nbsp;</FONT></B> <FONT SIZE=3D1 =
FACE=3D"Tahoma">Rick Lund</FONT>

<BR><B><FONT SIZE=3D1 =
FACE=3D"Tahoma">Cc:&nbsp;&nbsp;&nbsp;&nbsp;</FONT></B> <FONT SIZE=3D1 =
FACE=3D"Tahoma">Brian Stark; Andy Sharp</FONT>

<BR><B><FONT SIZE=3D1 =
FACE=3D"Tahoma">Subject:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</FONT>=
</B> <FONT SIZE=3D1 FACE=3D"Tahoma">RE: TXRX optimized build</FONT>
</P>

<P><FONT COLOR=3D"#000080" SIZE=3D2 FACE=3D"Arial">rcon_init_done and =
rconInit() are not connected; in fact, I should take out rconInit() =
from cougar. It is for the older rcon interface, which we don&#8217;t =
need anymore.</FONT></P>

<P><FONT COLOR=3D"#000080" SIZE=3D2 FACE=3D"Arial">Looking at the =
consoles, it can be seen that the crash on txrx happens when Linux is =
already starting to run the user processes, so the Linux pci setup must =
be completed. I had the same crash in the debug build when the system =
was doing some work and it happened really late, which excludes the =
rcon_init and linux pci configuration. It is probably something else, =
but I&#8217;ll take the workaround for now. My guess is that in the =
optimized build, during initialization, we are creating dirty data =
faster than it can be flushed to memory, so the memory controller =
and bus are busy when the messaging initialization happens and this =
somehow affects the pci accesses. By adding the delay you allow the =
dirty data to be flushed and this allows us to get through the =
initialization process. This of course is pure theory.</FONT></P>
<BR>

<P><FONT SIZE=3D2 =
FACE=3D"Tahoma">_____________________________________________<BR>
</FONT><B><FONT SIZE=3D2 FACE=3D"Tahoma">From:</FONT></B><FONT SIZE=3D2 =
FACE=3D"Tahoma"> Rick Lund<BR>
</FONT><B><FONT SIZE=3D2 FACE=3D"Tahoma">Sent:</FONT></B><FONT SIZE=3D2 =
FACE=3D"Tahoma"> Wednesday, January 16, 2008 9:32 AM<BR>
</FONT><B><FONT SIZE=3D2 FACE=3D"Tahoma">To:</FONT></B><FONT SIZE=3D2 =
FACE=3D"Tahoma"> Maxim Kozlovsky<BR>
</FONT><B><FONT SIZE=3D2 FACE=3D"Tahoma">Cc:</FONT></B><FONT SIZE=3D2 =
FACE=3D"Tahoma"> Brian Stark; Andy Sharp<BR>
</FONT><B><FONT SIZE=3D2 FACE=3D"Tahoma">Subject:</FONT></B><FONT =
SIZE=3D2 FACE=3D"Tahoma"> TXRX optimized build</FONT>
</P>

<P><FONT SIZE=3D2 FACE=3D"Arial">Max,</FONT>

<BR><FONT SIZE=3D2 FACE=3D"Arial">&nbsp;&nbsp;&nbsp; I&#8217;ve worked =
around the TXRX optimized problem with a couple different changes.&nbsp; =
One is adding a 5 second delay before the ring init in the mgmtBus_init =
code which is getting the bus error.&nbsp; The second is moving the =
&#8220;rcon_init_done =3D 1&#8221; line in test.c to after the =
rconInit() call later in the startup sequence.&nbsp; That one sounds =
like a more logical fix; not setting the flag which says we&#8217;ve =
initialized rcon until after the rconInit() function.&nbsp; But I =
don&#8217;t think that&#8217;s the real problem, since it appears to =
work in the debug build and presumably on Bobcat.</FONT></P>

<P><FONT SIZE=3D2 FACE=3D"Arial">&nbsp;&nbsp;&nbsp; My best guess is a =
conflict with the TXRX using the PCI bus to access SSC memory during the =
SSC&#8217;s startup.&nbsp; Is Linux disabling the memory accesses from =
the PCI bus during startup?&nbsp; Something like that?&nbsp; And by =
slightly changing the timing of our use of SSC memory by the TXRX, we =
delay long enough for the Linux PCI init to have been =
completed.</FONT></P>

<P><FONT SIZE=3D2 FACE=3D"Arial">-Rick</FONT>
</P>
</UL>
</BODY>
</HTML>
------_=_NextPart_001_01C85868.F9FD6411--
