X-MimeOLE: Produced By Microsoft Exchange V6.5
Received: by onstor-exch02.onstor.net 
	id <01C89066.297C55B0@onstor-exch02.onstor.net>; Thu, 27 Mar 2008 16:56:23 -0700
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01C89066.297C55B0"
Content-class: urn:content-classes:message
Subject: Watchdog NMI
Date: Thu, 27 Mar 2008 16:56:23 -0700
Message-ID: <BB375AF679D4A34E9CA8DFA650E2B04E091AACCA@onstor-exch02.onstor.net>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Watchdog NMI
Thread-Index: AciQZikzyQjtUr3mSjSM8LxLysQbDA==
From: "Brian Stark" <brian.stark@onstor.com>
To: "Andy Sharp" <andy.sharp@onstor.com>,
	"Warren Gale" <warren.gale@onstor.com>

This is a multi-part message in MIME format.

------_=_NextPart_001_01C89066.297C55B0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

Guys,

Caeli asked me to look into an escalation where the box just up and
reboots for no apparent reason.  This is further complicated by the fact
that there are no crashdumps after the reboot.

While looking at the settings for the Dallas RTC, which also contains the
watchdog function, I discovered that the RTC is set to generate a board
reset rather than an interrupt when the watchdog expires.  A board reset
has essentially the same effect as power cycling.  I think the RTC should
instead be configured to generate the interrupt, which will then assert
the NMI to the RM9000 and give us a nice watchdog crashdump.  Here's what
happens when I set the RTC 0x78 register to 0xb2 and then force a
watchdog timeout from the FC console:

RCON-FC0:1 > mm -b bf000078 b2; mm -b bf000060 1; mm -b bf000068 0
RCON-FC0:2 >

Exception Cause =3D Watchdog Timeout/NMI
ERREPC:   0x830006a8
RA:       0x83000690
SR:       0x502f01
Autoreboot set "on", system reset
Rebooting...

With the 0xb3 setting that BSD is currently using, the box just reboots
with no indication.
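
To make the difference between the two settings concrete, here is a
minimal C sketch.  It is not taken from the PROM or BSD sources, and the
macro and function names are made up for illustration; only the values
0xb2 and 0xb3 come from the text above.  It shows that the two settings
differ in a single bit, which, judging by the behavior described here,
appears to select reset versus interrupt.

```c
#include <stdint.h>

/* Hypothetical names; only the two values come from this message. */
#define RTC_CTRL_IRQ   0xb2u  /* watchdog raises interrupt -> NMI */
#define RTC_CTRL_RESET 0xb3u  /* watchdog resets the board */

/* Return the bits that differ between two register values. */
static uint8_t rtc_diff_bits(uint8_t a, uint8_t b)
{
    return (uint8_t)(a ^ b);
}

/* Return the value with the bits in mask cleared. */
static uint8_t rtc_clear_bits(uint8_t value, uint8_t mask)
{
    return (uint8_t)(value & (uint8_t)~mask);
}
```

rtc_diff_bits(RTC_CTRL_IRQ, RTC_CTRL_RESET) yields 0x01, and clearing
that bit in 0xb3 with rtc_clear_bits() yields 0xb2, so switching between
the two behaviors leaves the other seven bits of the register untouched.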

Questions:

- Is there any reason why we aren't using 0xb2 as the setting for
register 0x78?  It seems like we've been through this before, so maybe
there's a previous changelist that describes why it was set to 0xb3.  I
swear it used to be the same as the PROM, which is 0xb2. =20

- Is there a minimum PROM version required to store the NMI crashdump
so that it gets exported to /var/crash/0.0 when the system boots back
up?

As a test at the customer site, I'm going to recommend that we modify
the value in register 0x78 by hand to generate the interrupt.  That way,
we'll see whether the watchdog is timing out and, if so, we may be able
to figure out why from the crashdump.


Thanks,
Brian




------_=_NextPart_001_01C89066.297C55B0--
