X-MimeOLE: Produced By Microsoft Exchange V6.5
Received: by onstor-exch02.onstor.net 
	id <01C87023.E6740FA4@onstor-exch02.onstor.net>; Fri, 15 Feb 2008 15:41:27 -0700
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Content-class: urn:content-classes:message
Subject: RE: g4r6
Date: Fri, 15 Feb 2008 15:41:26 -0700
Message-ID: <BB375AF679D4A34E9CA8DFA650E2B04E086273CB@onstor-exch02.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E0856E863@onstor-exch02.onstor.net>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: g4r6
Thread-Index: Achve7mNQS7G6cwVSQq7i0T+5tePTAAAEQRAACn4VvA=
From: "Raj Kumar" <raj.kumar@onstor.com>
To: "Maxim Kozlovsky" <maxim.kozlovsky@onstor.com>,
	"Andy Sharp" <andy.sharp@onstor.com>
Cc: "Tim Gardner" <tim.gardner@onstor.com>

id: TED00022384
Headline: Cougar Soak: Linux rebooted during EEK: SiByte Watchdog in
danger of initiating system reset in 3.6 seconds

-----Original Message-----
From: Maxim Kozlovsky
Sent: Thursday, February 14, 2008 6:43 PM
To: Andy Sharp; Raj Kumar
Cc: Mike Lee; Sandrine Boulanger; dl-Cougar
Subject: RE: g4r6

This does not look like a watchdog reboot; something actually ran the
complete shutdown sequence. (The message is copied again below for easier
reference):

>> SiByte Watchdog in danger of initiating system reset in 3.6 seconds
>> Stopping deferred execution scheduler: atd.
>> Stopping periodic command scheduler: crond.
>> Stopping automounter: done.
>> Stopping MTA: exim4_listener.
>> * ALERT: exim paniclog /var/log/exim4/paniclog has non-zero size, mail system possibly broken
>> Stopping internet superserver: inetd.
>> Stopping OpenBSD Secure Shell server: sshd.
>> Stopping NTP server: ntpd.
>> Saving the system clock..
>> Stopping NFS common utilities: statd.
>> Stopping kernel log daemon: klogdSiByte Watchdog in danger of
>> initiating system reset in 3.6 seconds
>> .
>> Stopping system log daemon: syslogd.
>> Stopping ONStor services:/onstor/bin/emrscron -r
>> /onstor/bin/emrscron: line 432: 15480 Killed                  ( ps
>> axww | awk '/support.sh/ || /socat/ || /emrscron/ {if
>> ($1 !~ /^'$$'$/) {print $1}}' | xargs kill -9 2>&1 ) >/dev/null
>> .
>> Asking all remaining processes to terminate...done.
>> Killing all remaining processes...done.
>> Deconfiguring network interfaces...done.
>> Cleaning up ifupdown....
>> Unmounting temporary filesystems...done.
>> Deactivating swap...done.
>> Unmounting local filesystems...done.
>> Will now restart.
>> Restarting system.
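
One way to make the distinction above mechanical is to scan a captured console log for the init-script shutdown lines. The following is a minimal sketch, not a tool used on the filer: the function name `classify_reboot` and the choice of marker strings are assumptions made for illustration, and the markers are simply copied from the log quoted in this thread. It expects unwrapped console text (one message per line), not the mail-wrapped copy.

```python
import re

# Lines that only appear when the init scripts run a full shutdown.
# The choice of markers is an assumption based on the log in this thread.
SHUTDOWN_MARKERS = (
    "Asking all remaining processes to terminate",
    "Unmounting local filesystems",
    "Will now restart.",
)

# Warning text printed by the watchdog driver, with the remaining seconds.
WATCHDOG_WARNING = re.compile(
    r"SiByte Watchdog in danger of initiating system reset in "
    r"(\d+(?:\.\d+)?) seconds"
)


def classify_reboot(console_text):
    """Return (kind, warnings) for a captured console log.

    kind is "orderly shutdown" when every shutdown marker is present,
    "possible watchdog reset" when only watchdog warnings are seen,
    and "unknown" otherwise. warnings lists the remaining seconds
    reported by each watchdog warning, in order of appearance.
    """
    warnings = [float(s) for s in WATCHDOG_WARNING.findall(console_text)]
    if all(marker in console_text for marker in SHUTDOWN_MARKERS):
        kind = "orderly shutdown"
    elif warnings:
        kind = "possible watchdog reset"
    else:
        kind = "unknown"
    return kind, warnings
```

A log that contains both the warnings and the full shutdown sequence, like the one above, is classified as an orderly shutdown with the warnings reported alongside it.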

>-----Original Message-----
>From: Andy Sharp
>Sent: Thursday, February 14, 2008 6:38 PM
>To: Raj Kumar
>Cc: Mike Lee; Sandrine Boulanger; dl-Cougar
>Subject: Re: g4r6
>
>On Thu, 14 Feb 2008 17:57:13 -0800 "Raj Kumar" <raj.kumar@onstor.com>
>wrote:
>
>> I was running EEK and it looks like Linux rebooted during the run. Just
>> before the reboot I see the message "SiByte Watchdog in danger of
>> initiating system reset in 3.6 seconds". Is this the same as what Mike
>> is seeing?
>
>Keep in mind that the warning message(s) from the watchdog device
>driver(s) is the messenger, not the message.  Don't kill the messenger,
>in other words.  Somehow the filer got locked up or otherwise hosed,
>and the watchdog rebooted it.  The question is, what caused it to get
>into that state?
>
>In Mike's case he had a bug where some process was looping rather
>tightly and nothing else could get a chance to run, including the
>process that has to tickle the watchdog device.  Or at least it made
>everything so slow that the watchdog device was issuing that warning
>message.
>
>> Feb 13 19:51:44 g12r10 : 1:4:efs:NOTICE: 16428: FS: g12r10-vs2-vol1
>> 0x749000000bd - eek - req - g12r10-vs2-vol1: inode 17930387 quota tree
>> id mismatched, EXPECTED 0x0 GOT 0x1
>> Feb 13 19:51:44 g12r10 : 1:3:efs:NOTICE: 16429: FS: g12r10-vs2-vol1
>> 0x749000000bd - eek - req - g12r10-vs2-vol1: inode 17930388 quota tree
>> id mismatched, EXPECTED 0x0 GOT 0x1
>> Feb 13 19:51:44 g12r10 : 1:3:efs:NOTICE: 16430: FS: g12r10-vs2-vol1
>> 0x7490Feb 13 19:51:44 g12r10 : 1:Feb 13 19:51:44 Feb 13 19:51:4Feb 13
>> 19:51:44 Feb 13 19:51:44 g12r10 : 1Feb 13 19:51:44Feb 13 19:51:44
>> g12r1Feb 13 19:51:44 g12r10 : 1:4Feb 13 19:51:44Feb 13 19:51:4Feb 13
>> 19:51:44 g1Feb 13 19:51:44 g12r10 : Feb 13 19:51:4Feb 13 19:51:44Feb
>> 13 19:51:4FeINIT: Sending processes the TERM signal16446: FS:
>> g12r10-vs2-vol1 0x749000000bd - eek - req -
>> SiByte Watchdog in danger of initiating system reset in 3.6 seconds
>> Stopping deferred execution scheduler: atd.
>> Stopping periodic command scheduler: crond.
>> Stopping automounter: done.
>> Stopping MTA: exim4_listener.
>> * ALERT: exim paniclog /var/log/exim4/paniclog has non-zero size, mail system possibly broken
>> Stopping internet superserver: inetd.
>> Stopping OpenBSD Secure Shell server: sshd.
>> Stopping NTP server: ntpd.
>> Saving the system clock..
>> Stopping NFS common utilities: statd.
>> Stopping kernel log daemon: klogdSiByte Watchdog in danger of
>> initiating system reset in 3.6 seconds
>> .
>> Stopping system log daemon: syslogd.
>> Stopping ONStor services:/onstor/bin/emrscron -r
>> /onstor/bin/emrscron: line 432: 15480 Killed                  ( ps
>> axww | awk '/support.sh/ || /socat/ || /emrscron/ {if
>> ($1 !~ /^'$$'$/) {print $1}}' | xargs kill -9 2>&1 ) >/dev/null
>> .
>> Asking all remaining processes to terminate...done.
>> Killing all remaining processes...done.
>> Deconfiguring network interfaces...done.
>> Cleaning up ifupdown....
>> Unmounting temporary filesystems...done.
>> Deactivating swap...done.
>> Unmounting local filesystems...done.
>> Will now restart.
>> Restarting system.
>>
>>
>>
>> PowerOn Self Test........OK
>>
>> Initializing System......please wait
>>
>>
>>
>>
>>
>> PMON [SSC,EL,FP,64]
>> ONStor Inc. PROM_SIBYTE_CG : Cougar-prom-1.0.3 : Fri Jan 11 12:30:31
>> 2008
>> CPU type SB1125.  Rev 35  600 MHz
>> module: SSC, Slot 0, CPU 0
>> Memory size 512 MB.
>> Icache size  32 KB, 32/line (4 way)
>> Dcache size  32 KB, 32/line (4 way)
>> Scache size 256 KB, 32/line (4 way)
>> debug IP addr = 10.2.10.12
>> debug IP mask = 255.255.0.0
>>
>>
>> Initializing Autoloader, hit control-E to bypass
>>
>> ................................................................................
>>
>> Type ctrl-e to stop autoload.
>> Waiting for SSC to enter autoload init state...done.
>>  ext2_load_file /dev/sda1/boot/vmlinux.bin at location
>> ffffffff82000000 disk model: CF 1GB
>> disk geometry: cylinders=2044 heads=16 sectors=63
>> Type ctrl-e to stop autoload.
>> Waiting for TXRX to enter autoload init state...done.
>>  ext2_load_file /dev/sda1/boot/txrx_cg.bin at location 42000000
>> disk model: CF 1GB
>> disk geometry: cylinders=2044 heads=16 sectors=63
>> Type ctrl-e to stop autoload.
>> Waiting for FP to enter autoload init state...done.
>>  ext2_load_file /dev/sda1/boot/fp_cg.bin at location 44000000
>> disk model: CF 1GB
>> disk geometry: cylinders=2044 heads=16 sectors=63
>>  do_bsd_launch argc = 3 argv[3] = ip=none
>>
>> env[0] = 0xffffffff80b7bed0:.cpuclock=4894967296.
>> env[1] = 0xffffffff80b7bf20:.memsize=512.
>> env[2] = 0xffffffff80b7bf70:.osloadoptions=mAt.
>> env[3] = 0xffffffff80b7bfc0:.boot=cold.
>> env[4] = 0xffffffff80b7c010:.busclock=600.
>> env[5] = 0xffffffff80b7c060:.ipaddr=10.2.10.12.
>> env[6] = 0xffffffff80b7c0b0:.netmask=255.255.0.0.
>> env[7] = 0xffffffff80b7c100:.macaddr0=.00:07:34:07:49:00.
>> env[8] = 0xffffffff80b7c150:.macaddr1=.00:07:34:07:49:01.
>> env[9] = 0xffffffff80b7c1a0:.bootdev=/dev/sda1.
>>  Load options and params for [g]
>>   Address 0xffffffff82000000 argc = 3
>>    argv [0] = g
>>    argv [1] = root=/dev/sda1
>>    argv [2] = ip=none
>>  pointer to Prom Util routines = 0x0
>>  Command should be  (addr)(argc, argv, env_strings,
>> ptr_prom_util_routines)
>>
>>
>> Linux version 2.6.22-cg (build@k3.onstor.lab) (gcc version 4.1.2
>> 20061115 (prerelease) (Debian 4.1.1-21)) #1 Wed Feb 6 16:08:22 PST
>> 2008 Booting Linux kernel...Mips64 Cougar
>> cougar_pmon_init: argc=3, arg=ffffffff80bf4230, env=ffffffff80b7be50
>> prom_init: env[0] = 'cpuclock=4894967296'
>> prom_init: env[1] = 'memsize=512'
>> prom_init: env[2] = 'osloadoptions=mAt'
>> prom_init: env[3] = 'boot=cold'
>> prom_init: env[4] = 'busclock=600'
>> prom_init: env[5] = 'ipaddr=10.2.10.12'
>> prom_init: env[6] = 'netmask=255.255.0.0'
>> prom_init: env[7] = 'macaddr0=00:07:34:07:49:00'
>> prom_init: env[8] = 'macaddr1=00:07:34:07:49:01'
>> prom_init: env[9] = 'bootdev=/dev/sda1'
>> CPU revision is: 00040103
>> FPU revision is: 000f0103
>> Broadcom SiByte BCM1125H A4 @ 600 MHz (SB1 rev 3)
>> Board type: ONStor Cougar
>> This kernel optimized for ONStor Cougar board without CFE
>> Determined physical RAM map:
>>  memory: 0000000002000000 @ 0000000000000000 (ROM data)
>>  memory: 000000000e000000 @ 0000000002000000 (usable)
>>  memory: 000000000f000000 @ 0000000080000000 (usable)
>>  memory: 0000000001000000 @ 000000008f000000 (reserved)
>> Wasting 458752 bytes for tracking 8192 unused pages
>> Built 1 zonelists.  Total pages: 577720
>>
>> -----Original Message-----
>> From: Mike Lee
>> Sent: Tuesday, February 12, 2008 7:10 PM
>> To: Sandrine Boulanger; Andy Sharp
>> Cc: dl-Cougar
>> Subject: RE: g4r6
>>
>> Thanks to everyone who replied.
>> Actually, Larry helped me figure out the problem.
>> I had added an extra trace statement in the management bus driver, in
>> function mgmtBus_rxPacket().
>> The problem goes away when I remove that trace statement.
>> -Mike
>> -----Original Message-----
>> From: Sandrine Boulanger
>> Sent: Tuesday, February 12, 2008 6:58 PM
>> To: Andy Sharp; Mike Lee
>> Cc: dl-Cougar
>> Subject: RE: g4r6
>>
>>
>> We've seen that on other systems too, a defect is already filed.
>>
>> -----Original Message-----
>> From: Andy Sharp
>> Sent: Tuesday, February 12, 2008 5:32 PM
>> To: Mike Lee
>> Cc: dl-Cougar
>> Subject: Re: g4r6
>>
>> On Tue, 12 Feb 2008 16:53:50 -0800 "Mike Lee" <mike.lee@onstor.com>
>> wrote:
>>
>> > Guys:
>> >
>> > I'm seeing g4r6 constantly displaying the following messages, and I
>> > cannot log into the filer (via the console or ssh).  Would anyone
>> > know what I can do to revive it?
>> >
>> > Thanks.
>> > -Mike
>> >
>> >
>> >
>> > g4r6 login: SiByte Watchdog in danger of initiating system reset in 8.3 seconds
>> > SiByte Watchdog in danger of initiating system reset in 8.3 seconds
>> > SiByte Watchdog in danger of initiating system reset in 8.3 seconds
>> > SiByte Watchdog in danger of initiating system reset in 8.3 seconds
>>
>> Does it have a reset switch?
>>
>> That is just a message from the watchdog driver which indicates that
>> something is hosing something bad enough that chassisd isn't able to
>> get enough execution time to reset the watchdog before this message
>> goes off.  Under normal circumstances, it wouldn't even be close.
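
The mechanism described in this thread (a process such as chassisd must tickle the watchdog often enough, a warning appears when it falls behind, and a reset follows if it never catches up) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function `simulate_watchdog`, its parameters, and the timeout and warning values used below are hypothetical. The thread does not state the real SiByte driver's thresholds or how chassisd schedules its tickles.

```python
def simulate_watchdog(tickle_times, timeout, warn_remaining, end_time):
    """Simulate a countdown watchdog and return a list of (time, event).

    The watchdog is armed at t=0 and restarts its countdown of `timeout`
    seconds on every tickle. A "warning" is recorded when the remaining
    time falls below `warn_remaining` before the next tickle arrives, and
    a "reset" is recorded (ending the simulation) when the countdown
    reaches zero. All values are hypothetical, not taken from the driver.
    """
    events = []
    last = 0  # time of the most recent tickle (or arming)
    for t in sorted(tickle_times) + [end_time]:
        gap = t - last
        # The tickler fell far enough behind for the warning to fire.
        if gap > timeout - warn_remaining:
            events.append((last + timeout - warn_remaining, "warning"))
        # The tickler never ran in time: the hardware resets the system.
        if gap >= timeout:
            events.append((last + timeout, "reset"))
            break
        last = t
    return events


# Healthy system: a tickle every second, so nothing is reported.
healthy = simulate_watchdog(range(1, 10), timeout=10, warn_remaining=4, end_time=10)

# Starved tickler: a long gap causes a warning, then the tickler stops
# running altogether and the watchdog resets the system.
starved = simulate_watchdog([1, 2, 9], timeout=10, warn_remaining=4, end_time=20)
```

In the starved case the first warning corresponds to the situation in Mike's report (the tickler is slow but still running), while the final reset corresponds to the tickler getting no execution time at all.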
