Date: Fri, 14 Nov 2008 18:54:33 -0800
From: Andrew Sharp <andy.sharp@onstor.com>
To: Sandrine Boulanger <sandrine.boulanger@onstor.com>
Cc: John Rogers <john.rogers@onstor.com>, dl-Cougar Core Team
 <dl-CougarCore@onstor.com>, dl-mightydog-alert
 <dl-mightydog-alert@onstor.com>, Ed Kwan <ed.kwan@onstor.com>
Subject: Re: Status of R4.0.1.0 Submittal 17 on Cougar soak
Message-ID: <20081114185433.7ad93831@ripper.onstor.net>
In-Reply-To: <2779531E7C760D4491C96305019FEEB5175D5BE275@exch1.onstor.net>
References: <2779531E7C760D4491C96305019FEEB5175D5BE217@exch1.onstor.net>
	<2779531E7C760D4491C96305019FEEB5175D5BE275@exch1.onstor.net>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

The frozen messages on g1r8 and g2r8 were because I didn't install the
config file changes on those nodes.  Not sure why I missed them, but
the changes are in place now, and the frozen messages are gone.
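
For reference, here's a quick way to confirm the queues stay clean on
all four nodes (a sketch, assuming root ssh to each node; exiqgrep
with -z -c counts only the frozen messages):

  for n in g1r8 g2r8 g11r10 g12r10; do
      printf '%s: ' "$n"
      ssh root@"$n" exiqgrep -z -c
  done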

On Fri, 14 Nov 2008 18:02:17 -0800 Sandrine Boulanger
<sandrine.boulanger@onstor.com> wrote:

> More changes later...
> Andy installed some new changes and all 4 nodes were rebooted at 2am
> this morning. Since then:
> 
>  *   No hung exim processes
>  *   No cluster2 errors (cluster2:ERROR: cluster_getFilerNameList:
> cannot get cluster rec, code 30)
> 
> But
> 
>  *   Since 9am this morning, the PCC g1r8 is showing cluster2:ERROR:
> main: rcv ncm msg, ea:ERROR: ea_getRunTimeVolInfo, vol show not
> responding (defect 25689)
> 
>  *   G11r10 had a txrx crash this morning, but no core was written:
> "Can't get TXRX cpu private data... core file not saved" (defect 25870)
>  *   A few frozen messages on g2r8
> g2r8:~# ps ax | grep exim
> 28439 ?        Ss     0:00 /usr/sbin/exim4 -bd -q30m
> 30332 pts/0    S+     0:00 grep exim
> g2r8:~# exiqgrep -z -c
> 5 matches out of 5 messages
> g2r8:~# exim -bp
>  5h  1.3K 1L15oo-0005VU-38 <> *** frozen ***
>           root@g2r8
> 
>  4h  1.3K 1L16lc-00012J-PZ <> *** frozen ***
>           root@g2r8
> 
>  3h  1.3K 1L17gL-0004nr-1B <> *** frozen ***
>           root@g2r8
> 
>  2h  1.3K 1L18cM-0000LH-4a <> *** frozen ***
>           root@g2r8
> 
> 53m  1.3K 1L19YJ-0004Gf-AJ <> *** frozen ***
>           root@g2r8
> 
>  *   One frozen message on g1r8
> g1r8:/var/log/onstor# ps ax | grep exim
> 13301 ?        Ss     0:00 /usr/sbin/exim4 -bd -q30m
> 15475 pts/0    S+     0:00 grep exim
> g1r8:/var/log/onstor# exiqgrep -z -c
> 1 matches out of 1 messages
> g1r8:/var/log/onstor# exim -bp
>  8h  1.5K 1L131g-00060L-RT <> *** frozen ***
>           root@g1r8
> 
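> If the root cause is fixed and these frozen messages aren't worth
> delivering, something like this should clear them on each node
> (untested sketch; exiqgrep -z -i prints just the frozen message IDs):
> 
> # remove all frozen messages from the queue
> exiqgrep -z -i | xargs --no-run-if-empty exim -Mrm
> 
> # or force one more delivery attempt instead of removing them
> exiqgrep -z -i | xargs --no-run-if-empty exim -M
> 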
> ________________________________
> From: Sandrine Boulanger
> Sent: Monday, November 10, 2008 5:57 PM
> To: Sandrine Boulanger; John Rogers; dl-Cougar Core Team;
> dl-mightydog-alert
> Cc: Ed Kwan
> Subject: RE: Status of R4.0.1.0 Submittal 17 on Cougar soak
> 
> A few changes later...
> A new change was installed on all 4 nodes of the Cougar soak today.
> After rebooting them all (about an hour ago), so far they are
> behaving: no hung exim processes, autosupport messages are being
> sent, and no cluster2 errors so far (fingers crossed). We'll let this
> run overnight and I'll send an update tomorrow morning.
> 
> ________________________________
> From: Sandrine Boulanger
> Sent: Saturday, November 08, 2008 11:38 AM
> To: Sandrine Boulanger; John Rogers; dl-Cougar Core Team;
> dl-mightydog-alert
> Cc: Ed Kwan
> Subject: RE: Status of R4.0.1.0 Submittal 17 on Cougar soak
> 
> This morning, 3 out of 4 nodes have many hung exim4 processes, and
> one of them is getting "mta queue full" and is no longer sending
> autosupport emails. As Andy recommended, I just updated the
> /etc/hosts file on each node to:
> 
> 127.0.0.1   localhost
> <sc0 ip>    nodename nodename.sc0
> 
> I'm waiting for instructions to proceed further.
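> 
> A quick way to check the new entries took effect (a sketch; run on
> each node with that node's own name):
> 
> getent hosts g2r8      # should print the sc0 ip, not 127.0.0.1
> hostname --fqdn        # should return promptly, without DNS timeouts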
> 
> 
> ________________________________
> From: Sandrine Boulanger
> Sent: Friday, November 07, 2008 5:06 PM
> To: Sandrine Boulanger; John Rogers; dl-Cougar Core Team;
> dl-mightydog-alert
> Cc: Ed Kwan
> Subject: RE: Status of R4.0.1.0 Submittal 17 on Cougar soak
> 
> 2 out of 4 nodes had already been rebooted since we tested crashdump
> panic on them, and looking at the elogs, the cluster errors were
> gone. The latest exim package was installed after sub#17, and
> installing a package does not require a reboot. However, Andy
> suspects there could have been something left over, so I also
> rebooted the other 2 nodes. I'll keep monitoring all 4 nodes.
> 
> PS: Raj, since g12r10 does not see any luns, it kept complaining
> about the core and mgmt volumes. I force-deleted those to clear the
> elog and make monitoring easier. Once we figure out why sp2.0 is
> down on this node, we'll need to re-create them.
> 
> ________________________________
> From: Sandrine Boulanger
> Sent: Friday, November 07, 2008 1:39 PM
> To: John Rogers; dl-Cougar Core Team; dl-mightydog-alert
> Subject: RE: Status of R4.0.1.0 Submittal 17 on Cougar soak
> 
> The FP crash can be ignored: Raj had run a "crashdump panic" to test
> core generation, since on MD it took too long on one node. Core
> generation works fine on the Cougar soak, and it worked on MD too,
> on mktg3.
> 
> ________________________________
> From: John Rogers
> Sent: Friday, November 07, 2008 1:05 PM
> To: Sandrine Boulanger; dl-Cougar Core Team; dl-mightydog-alert
> Subject: Re: Status of R4.0.1.0 Submittal 17 on Cougar soak
> 
> 
> Fantastic news!
> 
> ________________________________
> From: Sandrine Boulanger
> To: Sandrine Boulanger; dl-Cougar Core Team
> Sent: Fri Nov 07 12:26:19 2008
> Subject: RE: Status of R4.0.1.0 Submittal 17 on Cougar soak
> 
> It looks like we reproduced on the Cougar soak, which is running
> sub#17 and the latest exim4 package, behavior similar to what we saw
> on MD.
> 
> On g2r8 - There was an FP crash this morning. One of the CPUs had
> autoreboot off, so it did not restart by itself; I rebooted it.
> I'll see what I can get from the core.
> 
> Nov  7 10:06:16 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> cannot get cluster rec, code 30
> 
> Nov  7 10:12:15 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> no reply bck -1
> 
> Nov  7 10:12:15 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> cannot get cluster rec, code 30
> 
> Nov  7 10:18:15 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> no reply bck -1
> 
> Nov  7 10:18:16 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> cannot get cluster rec, code 30
> 
> Nov  7 10:24:15 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> no reply bck -1
> 
> Nov  7 10:24:15 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> cannot get cluster rec, code 30
> 
> Nov  7 10:30:19 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> no reply bck -1
> 
> Nov  7 10:30:19 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> cannot get cluster rec, code 30
> 
> Nov  7 10:30:31 g2r8 : 0:0:cluster2:INFO: cluster_clientSendRmcRpc:
> Error sending rpc to clusterrpc, flags 820a, name nfxsh-19988, rc
> -19, retrying...
> 
> Nov  7 10:42:15 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> no reply bck -1
> 
> Nov  7 10:42:15 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> cannot get cluster rec, code 30
> 
> Nov  7 10:48:15 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> no reply bck -1
> 
> Nov  7 10:48:15 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> cannot get cluster rec, code 30
> 
> Nov  7 11:00:16 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> no reply bck -1
> 
> Nov  7 11:00:16 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> cannot get cluster rec, code 30
> 
> Nov  7 11:06:16 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> no reply bck -1
> 
> Nov  7 11:06:16 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> cannot get cluster rec, code 30
> 
> Nov  7 11:12:16 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> no reply bck -1
> 
> Nov  7 11:12:16 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> cannot get cluster rec, code 30
> 
> Nov  7 11:18:16 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> no reply bck -1
> 
> Nov  7 11:18:16 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> cannot get cluster rec, code 30
> 
> Nov  7 11:30:13 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> no reply bck -1
> 
> Nov  7 11:30:13 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> cannot get cluster rec, code 30
> 
> Nov  7 11:30:16 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> no reply bck -1
> 
> Nov  7 11:30:16 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> cannot get cluster rec, code 30
> 
> Nov  7 11:41:25 g2r8 : 0:0:sanm:ERROR: SANM: FP NIM down. Aborting
> all mirror sessions.
> 
> Nov  7 11:41:25 g2r8 : 0:0:sanm:ERROR: SANM: FP NIM down. Aborting
> all mirror sessions.
> 
> On g1r8
> 
> Nov  6 16:30:19 g1r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> no reply bck -1
> 
> Nov  6 16:30:19 g1r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> cannot get cluster rec, code 30
> 
> Nov  6 16:30:31 g1r8 : 0:0:cluster2:INFO: cluster_clientSendRmcRpc:
> Error sending rpc to clusterrpc, flags 820a, name nfxsh-12633, rc
> -19, retrying...
> 
> Nov  6 16:31:11 g1r8 : 0:0:snmpd:INFO: getVolumeDetail: got bad rsp
> error (type=8315 volId=0)
> 
> Nov  6 16:31:11 g1r8 : 0:0:snmpd:INFO: getVolumeDetail: got bad rsp
> error (type=8315 volId=0)
> 
> Nov  6 16:31:11 g1r8 : 0:0:snmpd:INFO: getVolumeDetail: got bad rsp
> error (type=8315 volId=0)
> 
> ...
> 
> Nov  7 12:06:16 g1r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> no reply bck -1
> 
> Nov  7 12:06:16 g1r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> cannot get cluster rec, code 30
> 
> Nov  7 12:18:16 g1r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> no reply bck -1
> 
> Nov  7 12:18:16 g1r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> cannot get cluster rec, code 30
> 
> G11r10 volume show is failing, likely because of these ea errors:
> 
> Nov  7 12:20:57 g11r10 : 0:0:ea:ERROR: ea_getRunTimeVolInfo[1881]:
> Failed to get info for volume[g1r8-vs1-vol1], rc[8]
> 
> Nov  7 12:21:07 g11r10 : 0:0:ea:ERROR: ea_getRunTimeVolInfo[1881]:
> Failed to get info for volume[g1r8-vs1-vol1], rc[8]
> 
> Nov  7 12:21:07 g11r10 : 0:0:ea:ERROR: ea_getRunTimeVolInfo[1881]:
> Failed to get info for volume[g1r8-vs1-vol1], rc[8]
> 
> Nov  7 12:21:07 g11r10 : 0:0:ea:ERROR: ea_getRunTimeVolInfo[1881]:
> Failed to get info for volume[g1r8-vs1-vol1], rc[8]
> 
> Nov  7 12:21:17 g11r10 : 0:0:ea:ERROR: ea_getRunTimeVolInfo[1881]:
> Failed to get info for volume[g1r8-vs1-vol1], rc[8]
> 
> Nov  7 12:21:17 g11r10 : 0:0:ea:ERROR: ea_getRunTimeVolInfo[1881]:
> Failed to get info for volume[g1r8-vs1-vol1], rc[8]
> 
> Nov  7 12:21:17 g11r10 : 0:0:ea:ERROR: ea_getRunTimeVolInfo[1881]:
> Failed to get info for volume[g1r8-vs1-vol1], rc[8]
> 
> _____________________________________________
> From: Sandrine Boulanger
> Sent: Thursday, November 06, 2008 5:53 PM
> To: Sandrine Boulanger; dl-Cougar Core Team
> Subject: RE: Status of R4.0.1.0 Submittal 17 on Cougar soak
> 
> I got a new exim4 package from Andy, which is now installed on all
> nodes in the Cougar soak. We'll monitor the queue status and the
> number of exim processes running. I'll send an update tomorrow.
> 
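> For the monitoring, a line like this run periodically on each node
> captures both numbers (a sketch; exim4 -bpc prints the queue depth,
> pgrep -c counts matching processes):
> 
> echo "$(date '+%F %T') queue=$(exim4 -bpc) procs=$(pgrep -c -f /usr/sbin/exim4)"
> 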
> _____________________________________________
> From: Sandrine Boulanger
> Sent: Thursday, November 06, 2008 3:35 PM
> To: dl-Cougar Core Team
> Subject: Status of R4.0.1.0 Submittal 17 on Cougar soak
> 
> The Cougar soak has been upgraded to sub#17. We have increased the
> autosupport report schedule to every 2 minutes. G12r10 had a lot of
> frozen messages in the queue last night, but by this morning
> everything had cleared.
> 
> However, autosupport is no longer working on g11r10:
> 
> Nov  6 15:30:03 g11r10 : 0:0:asd:INFO: Rcvd Generate report request
> APP: (null)
> Nov  6 15:30:03 g11r10 : 0:0:asd:ERROR: mta mail queue full
> 
> g11r10 diag> autosupport generate report
> Report not generated, error 0xffffffff.
> % Command failure.
> 
> g11r10 diag> system show chassis
>  module     cpu         state
> ----------------------------------------------
>  SSC        SSC         UP
>  NFPNIM     TXRX0       UP
>             TXRX1       UP
>             FP0         UP
>             FP1         UP
>             FP2         UP
>             FP3         UP
> ----------------------------------------------
> 
> g11r10 diag> exit
> 
> g11r10:~# exiqgrep -z -c
> 121 matches out of 121 messages
> g11r10:~# exim4 -bpc
> 121
> g11r10:~# ps ax | grep exim
>   953 ?        S      0:00 /usr/sbin/exim4 -q
>   966 ?        S      0:02 /usr/sbin/exim4 -q
>  1261 ?        Ss     0:00 /usr/sbin/exim4 -bd -q30m
> 10474 pts/0    R+     0:00 grep exim
> g11r10:~#
> 
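> If g11r10 needs to be recovered by hand, something like this should
> work (untested sketch; assumes the two stuck one-shot queue runners
> are safe to kill):
> 
> pkill -f '/usr/sbin/exim4 -q$'   # kill the hung "exim4 -q" runners
> exim4 -qff                       # force a delivery attempt on all messages, frozen included
> 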
> _____________________________________________
> From: Larry Scheer
> Sent: Wednesday, November 05, 2008 3:47 PM
> To: dl-QA; dl-hcl-qa; dl-Cougar
> Subject: Build of R4.0.1.0 Submittal 17 is available for acceptance
> tests
> 
> Changes since last submittal
> 
> Branch r401rel
> 
> Change 31060 on 2008/11/05 by andys@ripper 'Integrate changelist
> 31059 from'
> 
> Change 31053 on 2008/11/04 by billn@billn-dev ' Change 31051 by
> billn@billn-de'
> 
> Defects fixed since last submittal
> 
> TED 25710 - [10206 - Onstor] Over 200 Exim processes running
> 
> TED 25761 - HP EVA4400 does not report paths as Primary/Failover
> even though TPGS is active
> 
> Location of images for submittal 17
> 
> R401rel build:
> 
> Source tree is here:
> 
> /n/Build-Trees/R4.0.1.0/EverON-4.0.1.0-110508-sub17
> 
> Images are here:
> 
> Cougar optimized:
> 
> http://10.2.0.21/upgrade/EverON-4.0.1.0CG.tar.gz
> 
> Cougar debug:
> 
> http://10.2.0.21/upgrade/EverON-4.0.1.0CGDBG.tar.gz
> 
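> To pull the optimized image onto a staging box (a sketch, assuming
> http access to the build server):
> 
> wget http://10.2.0.21/upgrade/EverON-4.0.1.0CG.tar.gz
> tar tzf EverON-4.0.1.0CG.tar.gz | head   # sanity-check the archive before upgrading
> 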
