AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:<20081216100345.010fc246@ripper.onstor.net>
CFG:
PT:0
S:andy.sharp@onstor.com
RQ:
SSV:exch1.onstor.net
NSV:
SSH:
R:<sandrine.boulanger@onstor.com>,<escal@onstor.com>,<timothy.swenson@onstor.com>
MAID:1
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#imap/andys@onstor.net@exch1.onstor.net/INBOX	0	2779531E7C760D4491C96305019FEEB51762F49A40@exch1.onstor.net
X-Sylpheed-End-Special-Headers: 1
Date: Tue, 16 Dec 2008 10:04:03 -0800
From: Andrew Sharp <andy.sharp@onstor.com>
To: Sandrine Boulanger <sandrine.boulanger@onstor.com>
Cc: dl-Escalation <escal@onstor.com>, Timothy Swenson
 <timothy.swenson@onstor.com>
Subject: Re: Defect  Yes TED00025943 [11084 - Onstor] Unexplained reboot of
 mktg3 Onstor
Message-ID: <20081216100403.7308a0eb@ripper.onstor.net>
In-Reply-To: <2779531E7C760D4491C96305019FEEB51762F49A40@exch1.onstor.net>
References: <ONSTOR-EXCH01NJDZbd0000617e@onstor-exch01.onstor.net>
	<2779531E7C760D4491C96305019FEEB51762F49A40@exch1.onstor.net>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Don't tell me, I don't care ~:^)  Put it in the bug report!

Cheers,

a

On Tue, 16 Dec 2008 09:41:14 -0800 Sandrine Boulanger
<sandrine.boulanger@onstor.com> wrote:

> The spm_pollAddDevice happens continuously on PCC. The devices are
> tapes that are not being used.
> 
> -----Original Message-----
> From: Andy Sharp 
> Sent: Tuesday, December 16, 2008 9:39 AM
> To: dl-Escalation
> Cc: Andy Sharp; Timothy Swenson
> Subject: Defect Yes TED00025943 [11084 - Onstor] Unexplained reboot
> of mktg3 Onstor
> 
> company_name: Onstor
> id: TED00025943
> Headline: [11084 - Onstor] Unexplained reboot of mktg3
> State: Opened
> Note_Entry: 
> The system rebooted because the TXRX/FP was down.
> Here are some of the interesting entries before it rebooted:
> 
> Dec 16 07:39:48 Dogfood : 0:0:spm:INFO: spm_pollAddDevice: Found dev[IBM_1110082899_0] wwn[0x0] lun[0] type[1] sid[fd6865ef] size[0]
> Dec 16 07:39:48 Dogfood : 0:0:spm:INFO: spm_pollAddDevice: Found dev[IBM_0000013042911000_1] wwn[0x0] lun[0] type[2] sid[184c3a20f] size[0]
> Dec 16 07:39:48 Dogfood : 0:0:spm:INFO: spm_pollAddDevice: Found dev[IBM_1110082921_0] wwn[0x0] lun[0] type[1] sid[191a9ae6] size[0]
> Dec 16 07:40:04 Dogfood : 0:0:snmpd:NOTICE: getEnvInfo: Failed to get PS/Fan info - rc=0
> Dec 16 07:40:32 Dogfood last message repeated 2 times
> Dec 16 07:40:48 Dogfood : 0:0:spm:INFO: spm_pollAddDevice: Found dev[IBM_1110082899_0] wwn[0x0] lun[0] type[1] sid[fd6865ef] size[0]
> Dec 16 07:40:48 Dogfood : 0:0:spm:INFO: spm_pollAddDevice: Found dev[IBM_0000013042911000_1] wwn[0x0] lun[0] type[2] sid[184c3a20f] size[0]
> Dec 16 07:40:48 Dogfood : 0:0:spm:INFO: spm_pollAddDevice: Found dev[IBM_1110082921_0] wwn[0x0] lun[0] type[1] sid[191a9ae6] size[0]
> Dec 16 07:41:38 Dogfood : 0:0:snmpd:NOTICE: getEnvInfo: Failed to get PS/Fan info - rc=0
> Dec 16 07:41:39 Dogfood : 0:0:cluster2:INFO: cluster_clientSendRmcRpc: Error sending rpc to clusterrpc, flags 820a, name , rc -19, retrying...
> Dec 16 07:41:39 Dogfood : 0:0:cluster2:INFO: cluster_clientSendRmcRpc: Retry worked to clusterrpc, flags 8e02, name
> Dec 16 07:41:39 Dogfood : 0:0:cluster2:INFO: cluster_clientSendRmcRpc: Error sending rpc to clusterrpc, flags 820a, name , rc -19, retrying...
> Dec 16 07:41:39 Dogfood : 0:0:cluster2:INFO: cluster_clientSendRmcRpc: Retry worked to clusterrpc, flags 8e02, name
> 
> It looks like it found some LUNs and then there were some RMC errors
> involved in cluster communications.
> 
> Then some CPU down events:
> 
> Dec 16 07:44:59 Dogfood : 0:0:snmpd:NOTICE: getEnvInfo: Failed to get PS/Fan info - rc=0
> Dec 16 07:45:01 Dogfood : 0:0:snmpd:NOTICE: getEnvInfo: Failed to get PS/Fan info - rc=0
> Dec 16 07:45:48 Dogfood : 0:0:spm:INFO: spm_pollAddDevice: Found dev[IBM_1110082899_0] wwn[0x0] lun[0] type[1] sid[fd6865ef] size[0]
> Dec 16 07:45:48 Dogfood : 0:0:spm:INFO: spm_pollAddDevice: Found dev[IBM_0000013042911000_1] wwn[0x0] lun[0] type[2] sid[184c3a20f] size[0]
> Dec 16 07:45:48 Dogfood : 0:0:spm:INFO: spm_pollAddDevice: Found dev[IBM_1110082921_0] wwn[0x0] lun[0] type[1] sid[191a9ae6] size[0]
> Dec 16 07:45:55 Dogfood : 0:0:eventd:WARNING: Process-EVENT CPU: Slot 1, CPU 0, State Down
> Dec 16 07:45:55 Dogfood : 0:0:eventd:WARNING: Process-EVENT CPU: Slot 1, CPU 1, State Down
> Dec 16 07:45:55 Dogfood : 0:0:eventd:WARNING: Process-EVENT CPU: Slot 1, CPU 2, State Down
> Dec 16 07:45:55 Dogfood : 0:0:eventd:WARNING: Process-EVENT CPU: Slot 1, CPU 3, State Down
> Dec 16 07:45:55 Dogfood : 0:0:eventd:WARNING: Process-EVENT CPU: Slot 1, CPU 4, State Down
> Dec 16 07:45:55 Dogfood : 0:0:eventd:WARNING: Process-EVENT CPU: Slot 1, CPU 5, State Down
> Dec 16 07:45:55 Dogfood : 0:0:sanm:ERROR: SANM: FP NIM down. Aborting all mirror sessions.
> Dec 16 07:45:55 Dogfood : 0:0:sanm:ERROR: SANM: FP NIM down. Aborting all mirror sessions.
> Dec 16 07:45:55 Dogfood : 0:0:vtm:NOTICE: Vtm_ProcEventMsg: NFX_EVENT_CPU, state 2, slot 1, cpu 0
> Dec 16 07:45:56 Dogfood : 0:0:vtm:INFO: vtm_sendCardStateMsg: Sending card DOWN to Dogfood
> Dec 16 07:45:56 Dogfood : 0:0:vtm:NOTICE: Vtm_ProcEventMsg: NFX_EVENT_CPU, state 2, slot 1, cpu 1
> Dec 16 07:45:56 Dogfood : 0:0:vtm:NOTICE: Vtm_ProcEventMsg: NFX_EVENT_CPU, state 2, slot 1, cpu 2
> Dec 16 07:45:56 Dogfood : 0:0:vtm:NOTICE: Vtm_ProcEventMsg: NFX_EVENT_CPU, state 2, slot 1, cpu 3
> Dec 16 07:45:56 Dogfood : 0:0:auth_agent:INFO: uninitCifsd: Uninitializing CIFS daemon
> Dec 16 07:45:56 Dogfood : 0:0:vtm:NOTICE: Vtm_ProcEventMsg: NFX_EVENT_CPU, state 2, slot 1, cpu 4
> Dec 16 07:45:56 Dogfood : 0:0:vtm:NOTICE: Vtm_ProcEventMsg: NFX_EVENT_CPU, state 2, slot 1, cpu 5
> Dec 16 07:45:56 Dogfood : 0:0:vtm:INFO: Vtm_ProcCardStateMsg: card state DOWN message from Dogfood
> 
> Then all the vsvr's failed over to mktg3, and the system thought
> about things for a while, and then eventually decided to reboot:
> 
> Dec 16 08:09:46 Dogfood : 0:0:ea:WARNING: ea_getFsysProc[2957]: Failed to get run-time info of volId[0x1020000016e]. Just returning whatever data collected, rc[8]
> Dec 16 08:09:46 Dogfood last message repeated 8 times
> Dec 16 08:09:47 Dogfood : 0:0:snmpd:INFO: getVolumeSummary: got rsp status error (0)
> Dec 16 08:09:48 Dogfood : 0:0:spm:INFO: spm_pollAddDevice: Found dev[IBM_1110082899_0] wwn[0x0] lun[0] type[1] sid[fd6865ef] size[0]
> Dec 16 08:09:48 Dogfood : 0:0:spm:INFO: spm_pollAddDevice: Found dev[IBM_0000013042911000_1] wwn[0x0] lun[0] type[2] sid[184c3a20f] size[0]
> Dec 16 08:09:48 Dogfood : 0:0:spm:INFO: spm_pollAddDevice: Found dev[IBM_1110082921_0] wwn[0x0] lun[0] type[1] sid[191a9ae6] size[0]
> Dec 16 08:09:49 Dogfood : 0:0:eventd:CRITICAL: Process-EVENT Node: Name 'local', State Down, Msg 'Node going down for reboot! (chassisd: because FP/TXRX module is down).'
> Dec 16 08:09:49 Dogfood : 0:0:cm:WARNING: TXRX/FP module is down, rebooting system.
> Dec 16 08:09:49 Dogfood : 0:0:spm:NOTICE: spm_ncmNodeEvent: Lost connect for
> Dec 16 08:09:49 Dogfood : 0:0:spm:NOTICE: spm_ncmNodeEvent: disconnected
> Dec 16 08:09:49 Dogfood : 0:0:snmpd:INFO: getVolumeSummary: got rsp status error (0)
> Dec 16 08:09:50 Dogfood : 0:0:ea:ERROR: ea_getRunTimeVolInfo[1881]: Failed to get info for volume[nx-d_buildup_old], rc[8]
> Dec 16 08:09:50 Dogfood : 0:0:ea:ERROR: ea_getRunTimeVolInfo[1881]: Failed to get info for volume[nx-d_buildup_old], rc[8]
> Dec 16 08:09:50 Dogfood : 0:0:ea:WARNING: ea_getFsysProc[2957]: Failed to get run-time info of volId[0x1020000016e]. Just returning whatever data collected, rc[8]
> Dec 16 08:09:50 Dogfood : 0:0:ea:WARNING: ea_getFsysProc[2957]: Failed to get run-time info of volId[0x1020000016e]. Just returning whatever data collected, rc[8]
> Dec 16 08:09:51 Dogfood : 0:0:snmpd:INFO: getVolumeSummary: got rsp status error (0)
> Dec 16 08:09:52 Dogfood : 0:0:ea:ERROR: ea_getRunTimeVolInfo[1881]: Failed to get info for volume[nx-d_buildup_old], rc[8]
> Dec 16 08:09:52 Dogfood : 0:0:ea:WARNING: ea_getFsysProc[2957]: Failed to get run-time info of volId[0x1020000016e]. Just returning whatever data collected, rc[8]
> Dec 16 08:09:52 Dogfood : 0:0:eventd:CRITICAL: Process-EVENT Node: Name 'local', State Down, Msg 'Node going down for reboot! (chassisd: because FP/TXRX module is down).'
> Dec 16 08:09:52 Dogfood : 0:0:spm:NOTICE: spm_ncmNodeEvent: Lost connect for
> Dec 16 08:09:52 Dogfood : 0:0:spm:NOTICE: spm_ncmNodeEvent: disconnected
> Dec 16 08:09:52 Dogfood : 0:0:cm:WARNING: TXRX/FP module is down, rebooting system.
> 
> Release_Project: 4.0.1.0
> List_Incident_ID: 11084
> 
