AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:<20080327171958.40525910@ripper.onstor.net>
CFG:
PT:0
S:andy.sharp@onstor.com
RQ:
SSV:onstor-exch02.onstor.net
NSV:
SSH:
R:<brian.stark@onstor.com>
MAID:1
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#imap/andys@onstor.net@onstor-exch02.onstor.net/INBOX	0	BB375AF679D4A34E9CA8DFA650E2B04E091AACCA@onstor-exch02.onstor.net
X-Sylpheed-End-Special-Headers: 1
Date: Thu, 27 Mar 2008 17:20:02 -0700
From: Andrew Sharp <andy.sharp@onstor.com>
To: "Brian Stark" <brian.stark@onstor.com>
Subject: Re: Watchdog NMI
Message-ID: <20080327172002.14c9e1e4@ripper.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E091AACCA@onstor-exch02.onstor.net>
References: <BB375AF679D4A34E9CA8DFA650E2B04E091AACCA@onstor-exch02.onstor.net>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Sounds like a good plan to me.

Cheers,

a

On Thu, 27 Mar 2008 16:56:23 -0700 "Brian Stark"
<brian.stark@onstor.com> wrote:

> Guys,
> 
> Caeli asked me to look into an escalation where the box just up and
> reboots for no apparent reason.  This is further complicated by the
> fact that there are no crashdumps after the reboot.
> 
> While looking at the settings for the Dallas RTC that also contains
> the watchdog function, I discovered that the RTC is set to generate a
> board reset rather than an interrupt when it expires.  This will
> essentially have the same effect as power cycling.  I think the RTC
> should instead be configured to generate the interrupt, which will
> then assert the NMI to the RM9000 and give us a nice watchdog
> crashdump.  Here's what happens when I set the RTC 0x78 register to
> 0xb2 and then force a watchdog timeout from the FC console:
> 
> RCON-FC0:1 > mm -b bf000078 b2; mm -b bf000060 1; mm -b bf000068 0
> RCON-FC0:2 >
> 
> Exception Cause = Watchdog Timeout/NMI
> ERREPC:   0x830006a8
> RA:       0x83000690
> SR:       0x502f01
> Autoreboot set "on", system reset
> Rebooting...
> 
> With the 0xb3 setting that BSD is currently using, the box just
> reboots with no indication.
> 
> Questions:
> 
> - Is there any reason why we aren't using 0xb2 as the setting for
> register 0x78?  It seems like we've been through this before, so maybe
> there's a previous changelist that describes why it was set to 0xb3.
> I swear it used to be the same as the PROM, which is 0xb2.  
> 
> - Is there a minimum PROM version required to store the NMI
> crashdump, which then gets exported to /var/crash/0.0 when the
> system boots back up?
> 
> As a test at the customer site, I'm going to recommend that we modify
> the value in register 0x78 by hand to generate the interrupt.  That
> way, we'll see whether the watchdog is timing out and, if so, maybe
> we'll be able to figure out why from the crashdump.
> 
> 
> Thanks,
> Brian
> 
> 
> 
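
For reference, the two register 0x78 values discussed above (0xb2 from the
PROM, 0xb3 from BSD) differ in a single bit. The sketch below only does the
arithmetic and rebuilds the console line quoted in the message; the helper
names are hypothetical. The message does not name the bit. On Dallas parts
such as the DS1511, a watchdog steering bit in that position selects between
interrupt and reset, but treating this board's RTC that way is an assumption.

```python
# Minimal sketch: compare the two candidate values for RTC register 0x78.
# The values and the console command come from the quoted message; the
# function names are hypothetical and exist only for this illustration.

PROM_VALUE = 0xB2  # per the message: watchdog expiry raises the interrupt/NMI
BSD_VALUE = 0xB3   # per the message: watchdog expiry resets the board


def differing_bits(a: int, b: int) -> list:
    """Return the bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [bit for bit in range(8) if diff & (1 << bit)]


def monitor_commands(ctrl_value: int) -> str:
    """Rebuild the console line from the message for a given 0x78 value."""
    return f"mm -b bf000078 {ctrl_value:x}; mm -b bf000060 1; mm -b bf000068 0"


if __name__ == "__main__":
    print(f"PROM {PROM_VALUE:#04x} = {PROM_VALUE:08b}")
    print(f"BSD  {BSD_VALUE:#04x} = {BSD_VALUE:08b}")
    print("differing bit positions:", differing_bits(PROM_VALUE, BSD_VALUE))
    print(monitor_commands(PROM_VALUE))
```

If the steering-bit assumption holds, the by-hand change at the customer
site amounts to clearing that one bit and leaving the rest of the register
as it is.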
