X-MimeOLE: Produced By Microsoft Exchange V6.5
Received: by onstor-exch02.onstor.net 
	id <01C86F76.E376C9CE@onstor-exch02.onstor.net>; Thu, 14 Feb 2008 19:02:59 -0700
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Content-class: urn:content-classes:message
Subject: RE: g4r6
Date: Thu, 14 Feb 2008 19:02:59 -0700
Message-ID: <BB375AF679D4A34E9CA8DFA650E2B04E07A8D9C5@onstor-exch02.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E0856E83B@onstor-exch02.onstor.net>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: g4r6
Thread-Index: Acht4D8CjPhMB5NfRNOxR6hpT3n+PQAC+OHwAAA7m0AAYjezUAAABA+w
From: "Mike Lee" <mike.lee@onstor.com>
To: "Raj Kumar" <raj.kumar@onstor.com>,
	"Sandrine Boulanger" <sandrine.boulanger@onstor.com>,
	"Andy Sharp" <andy.sharp@onstor.com>
Cc: "dl-Cougar" <dl-Cougar@onstor.com>

Raj:
Actually, it is a bit different...
The error I had seen was caused by a trace statement I added in the
management bus driver code.
I am guessing that it changed the timing in the system, which led to the
watchdog timer warning.
The problem went away after I removed the trace statement.
My filer did not reboot by itself, but it was locked in a tight loop
somewhere such that I could not log in (either via ssh or via the console).
-Mike

-----Original Message-----
From: Raj Kumar
Sent: Thursday, February 14, 2008 5:57 PM
To: Mike Lee; Sandrine Boulanger; Andy Sharp
Cc: dl-Cougar
Subject: RE: g4r6


I was running EEK and it looks like Linux rebooted during EEK. Just before
the reboot I see the message "SiByte Watchdog in danger of initiating
system reset in 3.6 seconds". Is this the same as what Mike is seeing?

Feb 13 19:51:44 g12r10 : 1:4:efs:NOTICE: 16428: FS: g12r10-vs2-vol1
0x749000000bd - eek - req - g12r10-vs2-vol1: inode 17930387 quota tree
id mismatched, EXPECTED 0x0 GOT 0x1
Feb 13 19:51:44 g12r10 : 1:3:efs:NOTICE: 16429: FS: g12r10-vs2-vol1
0x749000000bd - eek - req - g12r10-vs2-vol1: inode 17930388 quota tree
id mismatched, EXPECTED 0x0 GOT 0x1
Feb 13 19:51:44 g12r10 : 1:3:efs:NOTICE: 16430: FS: g12r10-vs2-vol1
0x7490Feb 13 19:51:44 g12r10 : 1:Feb 13 19:51:44 Feb 13 19:51:4Feb 13
19:51:44 Feb 13 19:51:44 g12r10 : 1Feb 13 19:51:44Feb 13 19:51:44
g12r1Feb 13 19:51:44 g12r10 : 1:4Feb 13 19:51:44Feb 13 19:51:4Feb 13
19:51:44 g1Feb 13 19:51:44 g12r10 : Feb 13 19:51:4Feb 13 19:51:44Feb 13
19:51:4FeINIT: Sending processes the TERM signal16446: FS:
g12r10-vs2-vol1 0x749000000bd - eek - req -
SiByte Watchdog in danger of initiating system reset in 3.6 seconds
Stopping deferred execution scheduler: atd.
Stopping periodic command scheduler: crond.
Stopping automounter: done.
Stopping MTA: exim4_listener.
* ALERT: exim paniclog /var/log/exim4/paniclog has non-zero size, mail
system possibly broken
Stopping internet superserver: inetd.
Stopping OpenBSD Secure Shell server: sshd.
Stopping NTP server: ntpd.
Saving the system clock..
Stopping NFS common utilities: statd.
Stopping kernel log daemon: klogdSiByte Watchdog in danger of initiating
system reset in 3.6 seconds
.
Stopping system log daemon: syslogd.
Stopping ONStor services:/onstor/bin/emrscron -r
/onstor/bin/emrscron: line 432: 15480 Killed                  ( ps axww
| awk '/support.sh/ || /socat/ || /emrscron/ {if ($1 !~ /^'$$'$/) {print
$1}}' | xargs kill -9 2>&1 ) >/dev/null
.
Asking all remaining processes to terminate...done.
Killing all remaining processes...done.
Deconfiguring network interfaces...done.
Cleaning up ifupdown....
Unmounting temporary filesystems...done.
Deactivating swap...done.
Unmounting local filesystems...done.
Will now restart.
Restarting system.



PowerOn Self Test........OK

Initializing System......please wait





PMON [SSC,EL,FP,64]
ONStor Inc. PROM_SIBYTE_CG : Cougar-prom-1.0.3 : Fri Jan 11 12:30:31
2008
CPU type SB1125.  Rev 35  600 MHz
module: SSC, Slot 0, CPU 0
Memory size 512 MB.
Icache size  32 KB, 32/line (4 way)
Dcache size  32 KB, 32/line (4 way)
Scache size 256 KB, 32/line (4 way)
debug IP addr = 10.2.10.12
debug IP mask = 255.255.0.0


Initializing Autoloader, hit control-E to bypass
........................................................................
........

Type ctrl-e to stop autoload.
Waiting for SSC to enter autoload init state...done.
 ext2_load_file /dev/sda1/boot/vmlinux.bin at location ffffffff82000000
disk model: CF 1GB
disk geometry: cylinders=2044 heads=16 sectors=63
Type ctrl-e to stop autoload.
Waiting for TXRX to enter autoload init state...done.
 ext2_load_file /dev/sda1/boot/txrx_cg.bin at location 42000000
disk model: CF 1GB
disk geometry: cylinders=2044 heads=16 sectors=63
Type ctrl-e to stop autoload.
Waiting for FP to enter autoload init state...done.
 ext2_load_file /dev/sda1/boot/fp_cg.bin at location 44000000
disk model: CF 1GB
disk geometry: cylinders=2044 heads=16 sectors=63
 do_bsd_launch argc = 3 argv[3] = ip=none

env[0] = 0xffffffff80b7bed0:.cpuclock=4894967296.
env[1] = 0xffffffff80b7bf20:.memsize=512.
env[2] = 0xffffffff80b7bf70:.osloadoptions=mAt.
env[3] = 0xffffffff80b7bfc0:.boot=cold.
env[4] = 0xffffffff80b7c010:.busclock=600.
env[5] = 0xffffffff80b7c060:.ipaddr=10.2.10.12.
env[6] = 0xffffffff80b7c0b0:.netmask=255.255.0.0.
env[7] = 0xffffffff80b7c100:.macaddr0=.00:07:34:07:49:00.
env[8] = 0xffffffff80b7c150:.macaddr1=.00:07:34:07:49:01.
env[9] = 0xffffffff80b7c1a0:.bootdev=/dev/sda1.
 Load options and params for [g]
  Address 0xffffffff82000000 argc = 3
   argv [0] = g
   argv [1] = root=/dev/sda1
   argv [2] = ip=none
 pointer to Prom Util routines = 0x0
 Command should be  (addr)(argc, argv, env_strings,
ptr_prom_util_routines)


Linux version 2.6.22-cg (build@k3.onstor.lab) (gcc version 4.1.2
20061115 (prerelease) (Debian 4.1.1-21)) #1 Wed Feb 6 16:08:22 PST 2008
Booting Linux kernel...Mips64 Cougar
cougar_pmon_init: argc=3, arg=ffffffff80bf4230, env=ffffffff80b7be50
prom_init: env[0] = 'cpuclock=4894967296'
prom_init: env[1] = 'memsize=512'
prom_init: env[2] = 'osloadoptions=mAt'
prom_init: env[3] = 'boot=cold'
prom_init: env[4] = 'busclock=600'
prom_init: env[5] = 'ipaddr=10.2.10.12'
prom_init: env[6] = 'netmask=255.255.0.0'
prom_init: env[7] = 'macaddr0=00:07:34:07:49:00'
prom_init: env[8] = 'macaddr1=00:07:34:07:49:01'
prom_init: env[9] = 'bootdev=/dev/sda1'
CPU revision is: 00040103
FPU revision is: 000f0103
Broadcom SiByte BCM1125H A4 @ 600 MHz (SB1 rev 3)
Board type: ONStor Cougar
This kernel optimized for ONStor Cougar board without CFE
Determined physical RAM map:
 memory: 0000000002000000 @ 0000000000000000 (ROM data)
 memory: 000000000e000000 @ 0000000002000000 (usable)
 memory: 000000000f000000 @ 0000000080000000 (usable)
 memory: 0000000001000000 @ 000000008f000000 (reserved)
Wasting 458752 bytes for tracking 8192 unused pages
Built 1 zonelists.  Total pages: 577720

-----Original Message-----
From: Mike Lee
Sent: Tuesday, February 12, 2008 7:10 PM
To: Sandrine Boulanger; Andy Sharp
Cc: dl-Cougar
Subject: RE: g4r6

Thanks to everyone who replied.
Actually, Larry helped me figure out the problem.
I had added an extra trace statement in the management bus driver, in
function mgmtBus_rxPacket().
The problem goes away when I remove that trace statement.
-Mike
-----Original Message-----
From: Sandrine Boulanger
Sent: Tuesday, February 12, 2008 6:58 PM
To: Andy Sharp; Mike Lee
Cc: dl-Cougar
Subject: RE: g4r6


We've seen that on other systems too, a defect is already filed.

-----Original Message-----
From: Andy Sharp
Sent: Tuesday, February 12, 2008 5:32 PM
To: Mike Lee
Cc: dl-Cougar
Subject: Re: g4r6

On Tue, 12 Feb 2008 16:53:50 -0800 "Mike Lee" <mike.lee@onstor.com>
wrote:

> Guys:
>
> I'm seeing g4r6 constantly displaying the following messages, and I
> cannot log into the filer (via the console or ssh).  Would anyone
> know what I can do to revive it?
>
> Thanks.
> -Mike
>
>
>
> g4r6 login: SiByte Watchdog in danger of initiating system reset in
> 8.3 seconds SiByte Watchdog in danger of initiating system reset in
> 8.3 seconds SiByte Watchdog in danger of initiating system reset in
> 8.3 seconds SiByte Watchdog in danger of initiating system reset in
> 8.3 seconds

Does it have a reset switch?

That is just a message from the watchdog driver which indicates that
something is hosing something bad enough that chassisd isn't able to
get enough execution time to reset the watchdog before this message
goes off.  Under normal circumstances, it wouldn't even be close.
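
As a rough illustration of the mechanism described above (a toy model only,
not the SiByte watchdog driver or chassisd code), the sketch below shows a
countdown that a daemon must keep restarting. The class name, the 10-second
timeout, and the 5-second warning threshold are all made-up values for the
example; the real driver's settings are not given in this thread.

```python
class SimulatedWatchdog:
    """Toy watchdog countdown. All names and numbers here are hypothetical."""

    def __init__(self, timeout_s, warn_below_s):
        self.timeout_s = timeout_s        # seconds allowed between keepalives
        self.warn_below_s = warn_below_s  # warn when less than this remains
        self.last_pet = 0.0               # time of the most recent keepalive

    def pet(self, now):
        """Keepalive from the daemon: restart the countdown at time `now`."""
        self.last_pet = now

    def check(self, now):
        """Return the watchdog state at time `now` (in seconds)."""
        remaining = self.timeout_s - (now - self.last_pet)
        if remaining <= 0:
            return "reset"
        if remaining < self.warn_below_s:
            return "warning: reset in %.1f seconds" % remaining
        return "ok"


# A daemon that gets scheduled on time never lets the countdown run low.
# A starved daemon pets late, so the warning appears first, then the reset.
wd = SimulatedWatchdog(timeout_s=10.0, warn_below_s=5.0)
wd.pet(0.0)
states = [wd.check(t) for t in (1.0, 6.4, 10.0)]
```

In this model, anything that delays the keepalive (for example, extra work
in a hot code path) moves the system from "ok" toward the warning, which
would be consistent with the timing change suspected in this thread.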
