AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:<20080214183521.2564ff04@ripper.onstor.net>
CFG:
PT:0
S:andy.sharp@onstor.com
RQ:
SSV:onstor-exch02.onstor.net
NSV:
SSH:
R:<raj.kumar@onstor.com>,<mike.lee@onstor.com>,<sandrine.boulanger@onstor.com>,<dl-Cougar@onstor.com>
MAID:1
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#imap/andys@onstor.net@onstor-exch02.onstor.net/INBOX	0	BB375AF679D4A34E9CA8DFA650E2B04E0856E83B@onstor-exch02.onstor.net
X-Sylpheed-End-Special-Headers: 1
Date: Thu, 14 Feb 2008 18:37:35 -0800
From: Andrew Sharp <andy.sharp@onstor.com>
To: "Raj Kumar" <raj.kumar@onstor.com>
Cc: "Mike Lee" <mike.lee@onstor.com>, "Sandrine Boulanger"
 <sandrine.boulanger@onstor.com>, "dl-Cougar" <dl-Cougar@onstor.com>
Subject: Re: g4r6
Message-ID: <20080214183735.23b60bec@ripper.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E0856E83B@onstor-exch02.onstor.net>
References: <BB375AF679D4A34E9CA8DFA650E2B04E07A8D9AB@onstor-exch02.onstor.net>
	<BB375AF679D4A34E9CA8DFA650E2B04E0856E83B@onstor-exch02.onstor.net>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

On Thu, 14 Feb 2008 17:57:13 -0800 "Raj Kumar" <raj.kumar@onstor.com>
wrote:

> Was running EEK and looks like Linux rebooted during EEK. Just before
> the reboot I see "SiByte Watchdog in danger of initiating system
> reset in 3.6 seconds" message. Is this the same as what Mike is seeing?

Keep in mind that the warning message from the watchdog device driver
is the messenger, not the message.  Don't kill the messenger, in other
words.  Somehow the filer got locked up or otherwise hosed, and the
watchdog rebooted it.  The question is, what caused it to get into
that state?

In Mike's case he had a bug where some process was looping rather
tightly and nothing else could get a chance to run, including the
process that has to tickle the watchdog device.  Or at least
everything else ran so slowly that the watchdog driver started issuing
that warning message.
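
To make the mechanism concrete, here is a toy model in Python of a
watchdog that has to be tickled periodically.  It is only a sketch:
the class, the function names, and the timeout and warning values are
all made up for illustration, and it is not the SiByte driver's code.
It just shows how starving the tickling process leads first to the
warning and then to a reset.

```python
# Toy model of a watchdog timer; NOT the SiByte driver's code.
# All names and timing values below are hypothetical.

class Watchdog:
    def __init__(self, timeout, warn_margin):
        self.timeout = timeout          # seconds without a tickle before reset
        self.warn_margin = warn_margin  # start warning when this little time is left
        self.last_tickle = 0.0

    def tickle(self, now):
        """Called by the keepalive process whenever it manages to run."""
        self.last_tickle = now

    def status(self, now):
        remaining = self.timeout - (now - self.last_tickle)
        if remaining <= 0:
            return "reset"
        if remaining <= self.warn_margin:
            return "warning: reset in %.1f seconds" % remaining
        return "ok"


def simulate(tickle_times, end, timeout=10.0, warn_margin=4.0, step=1.0):
    """Step simulated time; tickles happen only at the given times."""
    wd = Watchdog(timeout, warn_margin)
    pending = sorted(tickle_times)
    log = []
    t = 0.0
    while t <= end:
        # Deliver any tickles that were due by now.
        while pending and pending[0] <= t:
            wd.tickle(pending.pop(0))
        state = wd.status(t)
        log.append((t, state))
        if state == "reset":
            break                       # the box reboots here
        t += step
    return log


if __name__ == "__main__":
    # Keepalive process runs at t=0, 2, 4 and is then starved.
    for t, state in simulate([0.0, 2.0, 4.0], end=30.0):
        print(t, state)
```

With regular tickles the status never leaves "ok".  Once the tickles
stop, the model starts warning and eventually resets, which is the
pattern to look for: the warning tells you something else is hogging
the system, not that the watchdog itself is broken.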

> Feb 13 19:51:44 g12r10 : 1:4:efs:NOTICE: 16428: FS: g12r10-vs2-vol1
> 0x749000000bd - eek - req - g12r10-vs2-vol1: inode 17930387 quota tree
> id mismatched, EXPECTED 0x0 GOT 0x1
> Feb 13 19:51:44 g12r10 : 1:3:efs:NOTICE: 16429: FS: g12r10-vs2-vol1
> 0x749000000bd - eek - req - g12r10-vs2-vol1: inode 17930388 quota tree
> id mismatched, EXPECTED 0x0 GOT 0x1
> Feb 13 19:51:44 g12r10 : 1:3:efs:NOTICE: 16430: FS: g12r10-vs2-vol1
> 0x7490Feb 13 19:51:44 g12r10 : 1:Feb 13 19:51:44 Feb 13 19:51:4Feb 13
> 19:51:44 Feb 13 19:51:44 g12r10 : 1Feb 13 19:51:44Feb 13 19:51:44
> g12r1Feb 13 19:51:44 g12r10 : 1:4Feb 13 19:51:44Feb 13 19:51:4Feb 13
> 19:51:44 g1Feb 13 19:51:44 g12r10 : Feb 13 19:51:4Feb 13 19:51:44Feb
> 13 19:51:4FeINIT: Sending processes the TERM signal16446: FS:
> g12r10-vs2-vol1 0x749000000bd - eek - req -
> SiByte Watchdog in danger of initiating system reset in 3.6 seconds
> Stopping deferred execution scheduler: atd.
> Stopping periodic command scheduler: crond.
> Stopping automounter: done.
> Stopping MTA: exim4_listener.
> * ALERT: exim paniclog /var/log/exim4/paniclog has non-zero size, mail
> system possibly broken
> Stopping internet superserver: inetd.
> Stopping OpenBSD Secure Shell server: sshd.
> Stopping NTP server: ntpd.
> Saving the system clock..
> Stopping NFS common utilities: statd.
> Stopping kernel log daemon: klogdSiByte Watchdog in danger of
> initiating system reset in 3.6 seconds
> .
> Stopping system log daemon: syslogd.
> Stopping ONStor services:/onstor/bin/emrscron -r
> /onstor/bin/emrscron: line 432: 15480 Killed                  ( ps
> axww | awk '/support.sh/ || /socat/ || /emrscron/ {if
> ($1 !~ /^'$$'$/) {print $1}}' | xargs kill -9 2>&1 ) >/dev/null
> .
> Asking all remaining processes to terminate...done.
> Killing all remaining processes...done.
> Deconfiguring network interfaces...done.
> Cleaning up ifupdown....
> Unmounting temporary filesystems...done.
> Deactivating swap...done.
> Unmounting local filesystems...done.
> Will now restart.
> Restarting system.
> 
> 
> 
> PowerOn Self Test........OK
> 
> Initializing System......please wait
> 
> 
> 
> 
> 
> PMON [SSC,EL,FP,64]
> ONStor Inc. PROM_SIBYTE_CG : Cougar-prom-1.0.3 : Fri Jan 11 12:30:31
> 2008
> CPU type SB1125.  Rev 35  600 MHz
> module: SSC, Slot 0, CPU 0
> Memory size 512 MB.
> Icache size  32 KB, 32/line (4 way)
> Dcache size  32 KB, 32/line (4 way)
> Scache size 256 KB, 32/line (4 way)
> debug IP addr = 10.2.10.12
> debug IP mask = 255.255.0.0
> 
> 
> Initializing Autoloader, hit control-E to bypass
> ........................................................................
> ........
> 
> Type ctrl-e to stop autoload.
> Waiting for SSC to enter autoload init state...done.
>  ext2_load_file /dev/sda1/boot/vmlinux.bin at location
> ffffffff82000000 disk model: CF 1GB
> disk geometry: cylinders=2044 heads=16 sectors=63
> Type ctrl-e to stop autoload.
> Waiting for TXRX to enter autoload init state...done.
>  ext2_load_file /dev/sda1/boot/txrx_cg.bin at location 42000000
> disk model: CF 1GB
> disk geometry: cylinders=2044 heads=16 sectors=63
> Type ctrl-e to stop autoload.
> Waiting for FP to enter autoload init state...done.
>  ext2_load_file /dev/sda1/boot/fp_cg.bin at location 44000000
> disk model: CF 1GB
> disk geometry: cylinders=2044 heads=16 sectors=63
>  do_bsd_launch argc = 3 argv[3] = ip=none
> 
> env[0] = 0xffffffff80b7bed0:.cpuclock=4894967296.
> env[1] = 0xffffffff80b7bf20:.memsize=512.
> env[2] = 0xffffffff80b7bf70:.osloadoptions=mAt.
> env[3] = 0xffffffff80b7bfc0:.boot=cold.
> env[4] = 0xffffffff80b7c010:.busclock=600.
> env[5] = 0xffffffff80b7c060:.ipaddr=10.2.10.12.
> env[6] = 0xffffffff80b7c0b0:.netmask=255.255.0.0.
> env[7] = 0xffffffff80b7c100:.macaddr0=.00:07:34:07:49:00.
> env[8] = 0xffffffff80b7c150:.macaddr1=.00:07:34:07:49:01.
> env[9] = 0xffffffff80b7c1a0:.bootdev=/dev/sda1.
>  Load options and params for [g]
>   Address 0xffffffff82000000 argc = 3
>    argv [0] = g
>    argv [1] = root=/dev/sda1
>    argv [2] = ip=none
>  pointer to Prom Util routines = 0x0
>  Command should be  (addr)(argc, argv, env_strings,
> ptr_prom_util_routines)
> 
> 
> Linux version 2.6.22-cg (build@k3.onstor.lab) (gcc version 4.1.2
> 20061115 (prerelease) (Debian 4.1.1-21)) #1 Wed Feb 6 16:08:22 PST
> 2008 Booting Linux kernel...Mips64 Cougar
> cougar_pmon_init: argc=3, arg=ffffffff80bf4230, env=ffffffff80b7be50
> prom_init: env[0] = 'cpuclock=4894967296'
> prom_init: env[1] = 'memsize=512'
> prom_init: env[2] = 'osloadoptions=mAt'
> prom_init: env[3] = 'boot=cold'
> prom_init: env[4] = 'busclock=600'
> prom_init: env[5] = 'ipaddr=10.2.10.12'
> prom_init: env[6] = 'netmask=255.255.0.0'
> prom_init: env[7] = 'macaddr0=00:07:34:07:49:00'
> prom_init: env[8] = 'macaddr1=00:07:34:07:49:01'
> prom_init: env[9] = 'bootdev=/dev/sda1'
> CPU revision is: 00040103
> FPU revision is: 000f0103
> Broadcom SiByte BCM1125H A4 @ 600 MHz (SB1 rev 3)
> Board type: ONStor Cougar
> This kernel optimized for ONStor Cougar board without CFE
> Determined physical RAM map:
>  memory: 0000000002000000 @ 0000000000000000 (ROM data)
>  memory: 000000000e000000 @ 0000000002000000 (usable)
>  memory: 000000000f000000 @ 0000000080000000 (usable)
>  memory: 0000000001000000 @ 000000008f000000 (reserved)
> Wasting 458752 bytes for tracking 8192 unused pages
> Built 1 zonelists.  Total pages: 577720
> 
> -----Original Message-----
> From: Mike Lee 
> Sent: Tuesday, February 12, 2008 7:10 PM
> To: Sandrine Boulanger; Andy Sharp
> Cc: dl-Cougar
> Subject: RE: g4r6
> 
> Thanks to everyone who replied.
> Actually, Larry helped me figure out the problem.
> I had added an extra trace statement in the management bus driver, in
> function mgmtBus_rxPacket().
> The problem goes away when I remove that trace statement.  
> -Mike
> -----Original Message-----
> From: Sandrine Boulanger 
> Sent: Tuesday, February 12, 2008 6:58 PM
> To: Andy Sharp; Mike Lee
> Cc: dl-Cougar
> Subject: RE: g4r6
> 
> 
> We've seen that on other systems too, a defect is already filed.
> 
> -----Original Message-----
> From: Andy Sharp 
> Sent: Tuesday, February 12, 2008 5:32 PM
> To: Mike Lee
> Cc: dl-Cougar
> Subject: Re: g4r6
> 
> On Tue, 12 Feb 2008 16:53:50 -0800 "Mike Lee" <mike.lee@onstor.com>
> wrote:
> 
> > Guys:
> > 
> > I'm seeing g4r6 constantly displaying the following messages, and I
> > cannot log into the filer (via the console or ssh).  Would anyone
> > know what I can do to revive it?
> > 
> > Thanks.
> > -Mike
> > 
> > 
> > 
> > g4r6 login: SiByte Watchdog in danger of initiating system reset in
> > 8.3 seconds SiByte Watchdog in danger of initiating system reset in
> > 8.3 seconds SiByte Watchdog in danger of initiating system reset in
> > 8.3 seconds SiByte Watchdog in danger of initiating system reset in
> > 8.3 seconds
> 
> Does it have a reset switch?
> 
> That is just a message from the watchdog driver which indicates that
> something is hosing something bad enough that chassisd isn't able to
> get enough execution time to reset the watchdog before this message
> goes off.  Under normal circumstances, it wouldn't even be close.
