Date: Sun, 16 Nov 2008 18:21:16 -0800
From: Andrew Sharp <andy.sharp@onstor.com>
To: Sandrine Boulanger <sandrine.boulanger@onstor.com>
Cc: John Rogers <john.rogers@onstor.com>, dl-Cougar Core Team
 <dl-CougarCore@onstor.com>, dl-mightydog-alert
 <dl-mightydog-alert@onstor.com>, Ed Kwan <ed.kwan@onstor.com>
Subject: Re: Status of R4.0.1.0 Submittal 17 on Cougar soak
Message-ID: <20081116182116.50c03ae8@ripper.onstor.net>
In-Reply-To: <2779531E7C760D4491C96305019FEEB5175D5BE27A@exch1.onstor.net>
References: <20081114185433.7ad93831@ripper.onstor.net>
	<2779531E7C760D4491C96305019FEEB5175D5BE27A@exch1.onstor.net>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Yes.  As soon as I'm sure they don't suck.

On Fri, 14 Nov 2008 18:55:37 -0800 Sandrine Boulanger
<sandrine.boulanger@onstor.com> wrote:

> So are you going to check in your changes?
> 
> -----Original Message-----
> From: Andy Sharp
> Sent: Friday, November 14, 2008 6:55 PM
> To: Sandrine Boulanger
> Cc: John Rogers; dl-Cougar Core Team; dl-mightydog-alert; Ed Kwan
> Subject: Re: Status of R4.0.1.0 Submittal 17 on Cougar soak
> 
> The frozen messages on g[12]r8 were because I didn't install the
> config file changes on those nodes.  Not sure why, but they're in
> place now and the frozen messages are gone.
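> 
> For reference, a quick way to confirm a node's queue stays clean is
> something like this (both counts should come back 0):
> 
> g2r8:~# exiqgrep -z -c
> g2r8:~# exim4 -bpc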
> 
> On Fri, 14 Nov 2008 18:02:17 -0800 Sandrine Boulanger
> <sandrine.boulanger@onstor.com> wrote:
> 
> > More changes later...
> > Andy installed some new changes and all 4 nodes were rebooted at 2am
> > this morning. Since then:
> >
> >  *   No hung exim processes
> >  *   No cluster2 errors (cluster2:ERROR: cluster_getFilerNameList:
> > cannot get cluster rec, code 30)
> >
> > But
> >
> >  *   Since 9am this morning, the PCC g1r8 is showing cluster2:ERROR:
> > main: rcv ncm msg, ea:ERROR: ea_getRunTimeVolInfo, vol show not
> > responding (defect 25689)
> >
> >  *   G11r10 had a txrx crash this morning, but no core was written:
> > "Can't get TXRX cpu private data... core file not saved" (defect
> > 25870)
> >  *   A few frozen messages on g2r8
> > g2r8:~# ps ax | grep exim
> > 28439 ?        Ss     0:00 /usr/sbin/exim4 -bd -q30m
> > 30332 pts/0    S+     0:00 grep exim
> > g2r8:~# exiqgrep -z -c
> > 5 matches out of 5 messages
> > g2r8:~# exim -bp
> >  5h  1.3K 1L15oo-0005VU-38 <> *** frozen ***
> >           root@g2r8
> >
> >  4h  1.3K 1L16lc-00012J-PZ <> *** frozen ***
> >           root@g2r8
> >
> >  3h  1.3K 1L17gL-0004nr-1B <> *** frozen ***
> >           root@g2r8
> >
> >  2h  1.3K 1L18cM-0000LH-4a <> *** frozen ***
> >           root@g2r8
> >
> > 53m  1.3K 1L19YJ-0004Gf-AJ <> *** frozen ***
> >           root@g2r8
> >
> >  *   One frozen message on g1r8
> > g1r8:/var/log/onstor# ps ax | grep exim
> > 13301 ?        Ss     0:00 /usr/sbin/exim4 -bd -q30m
> > 15475 pts/0    S+     0:00 grep exim
> > g1r8:/var/log/onstor# exiqgrep -z -c
> > 1 matches out of 1 messages
> > g1r8:/var/log/onstor# exim -bp
> >  8h  1.5K 1L131g-00060L-RT <> *** frozen ***
> >           root@g1r8
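> >
> > For the frozen messages listed above, assuming they are stale
> > autosupport mails, something along these lines should either retry
> > or clear them (just a sketch, not yet run on these nodes):
> >
> > g2r8:~# exiqgrep -z -i | xargs -r exim4 -Mt    # thaw all frozen messages
> > g2r8:~# exim4 -qff                             # force a queue run, frozen included
> > g2r8:~# exiqgrep -z -i | xargs -r exim4 -Mrm   # or simply remove them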
> >
> > ________________________________
> > From: Sandrine Boulanger
> > Sent: Monday, November 10, 2008 5:57 PM
> > To: Sandrine Boulanger; John Rogers; dl-Cougar Core Team;
> > dl-mightydog-alert
> > Cc: Ed Kwan
> > Subject: RE: Status of R4.0.1.0 Submittal 17 on Cougar soak
> >
> > A few changes later...
> > A new change has been installed on all 4 nodes of the Cougar soak
> > today. After rebooting them all (about an hour ago), they are
> > behaving so far: no hung exim processes, autosupport messages are
> > being sent, and no cluster2 errors yet (fingers crossed). We'll let
> > this run overnight and I'll send an update tomorrow morning.
> >
> > ________________________________
> > From: Sandrine Boulanger
> > Sent: Saturday, November 08, 2008 11:38 AM
> > To: Sandrine Boulanger; John Rogers; dl-Cougar Core Team;
> > dl-mightydog-alert
> > Cc: Ed Kwan
> > Subject: RE: Status of R4.0.1.0 Submittal 17 on Cougar soak
> >
> > This morning, 3 out of 4 nodes have many hung exim4 processes, and
> > one of them is getting "mta queue full" and is no longer sending
> > autosupport emails. As Andy recommended, I just updated the
> > /etc/hosts file on each node to
> > 127.0.0.1 localhost <sc0 ip> nodename nodename.sc0
> > I'm waiting for instructions before proceeding further.
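> >
> > For reference, the updated /etc/hosts on g2r8 (as an example) should
> > end up looking roughly like this, with each node's actual sc0 IP in
> > place of the placeholder:
> >
> > 127.0.0.1    localhost
> > <sc0 ip>     g2r8 g2r8.sc0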
> >
> >
> > ________________________________
> > From: Sandrine Boulanger
> > Sent: Friday, November 07, 2008 5:06 PM
> > To: Sandrine Boulanger; John Rogers; dl-Cougar Core Team;
> > dl-mightydog-alert
> > Cc: Ed Kwan
> > Subject: RE: Status of R4.0.1.0 Submittal 17 on Cougar soak
> >
> > 2 out of 4 nodes have been rebooted since we tested crashdump panic
> > on those nodes. Looking at elogs, the cluster errors were gone. The
> > latest exim package was installed after sub#17, and installing a
> > package does not require a reboot. However, Andy suspects something
> > could have been left over, so I also rebooted the other 2 nodes.
> > I'll keep monitoring all 4 nodes.
> >
> > PS: Raj, since g12r10 does not see any luns, it kept complaining
> > about the core and mgmt volumes. I force deleted those to clear the
> > elog and be able to monitor things more easily. When we figure out
> > why sp2.0 is down on this node, we'll need to re-create them.
> >
> > ________________________________
> > From: Sandrine Boulanger
> > Sent: Friday, November 07, 2008 1:39 PM
> > To: John Rogers; dl-Cougar Core Team; dl-mightydog-alert
> > Subject: RE: Status of R4.0.1.0 Submittal 17 on Cougar soak
> >
> > The FP crash can be ignored; Raj had run a "crashdump panic" to test
> > core generation, since it took too long on one MD node. Core
> > generation works fine on the Cougar soak, and it also worked on MD,
> > on mktg3.
> >
> > ________________________________
> > From: John Rogers
> > Sent: Friday, November 07, 2008 1:05 PM
> > To: Sandrine Boulanger; dl-Cougar Core Team; dl-mightydog-alert
> > Subject: Re: Status of R4.0.1.0 Submittal 17 on Cougar soak
> >
> >
> > Fantastic news!
> >
> > ________________________________
> > From: Sandrine Boulanger
> > To: Sandrine Boulanger; dl-Cougar Core Team
> > Sent: Fri Nov 07 12:26:19 2008
> > Subject: RE: Status of R4.0.1.0 Submittal 17 on Cougar soak
> >
> > It looks like we reproduced behavior similar to MD on the Cougar
> > soak, which is running sub#17 and the latest exim4 package.
> >
> > On g2r8 - there was an FP crash this morning. One of the CPUs had
> > autoreboot off, so it did not restart by itself; I rebooted it.
> > I'll see what I can get from the core.
> >
> > Nov  7 10:06:16 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> > cannot get cluster rec, code 30
> >
> > Nov  7 10:12:15 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> > no reply bck -1
> >
> > Nov  7 10:12:15 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> > cannot get cluster rec, code 30
> >
> > Nov  7 10:18:15 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> > no reply bck -1
> >
> > Nov  7 10:18:16 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> > cannot get cluster rec, code 30
> >
> > Nov  7 10:24:15 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> > no reply bck -1
> >
> > Nov  7 10:24:15 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> > cannot get cluster rec, code 30
> >
> > Nov  7 10:30:19 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> > no reply bck -1
> >
> > Nov  7 10:30:19 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> > cannot get cluster rec, code 30
> >
> > Nov  7 10:30:31 g2r8 : 0:0:cluster2:INFO: cluster_clientSendRmcRpc:
> > Error sending rpc to clusterrpc, flags 820a, name nfxsh-19988, rc
> > -19, retrying...
> >
> > Nov  7 10:42:15 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> > no reply bck -1
> >
> > Nov  7 10:42:15 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> > cannot get cluster rec, code 30
> >
> > Nov  7 10:48:15 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> > no reply bck -1
> >
> > Nov  7 10:48:15 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> > cannot get cluster rec, code 30
> >
> > Nov  7 11:00:16 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> > no reply bck -1
> >
> > Nov  7 11:00:16 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> > cannot get cluster rec, code 30
> >
> > Nov  7 11:06:16 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> > no reply bck -1
> >
> > Nov  7 11:06:16 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> > cannot get cluster rec, code 30
> >
> > Nov  7 11:12:16 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> > no reply bck -1
> >
> > Nov  7 11:12:16 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> > cannot get cluster rec, code 30
> >
> > Nov  7 11:18:16 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> > no reply bck -1
> >
> > Nov  7 11:18:16 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> > cannot get cluster rec, code 30
> >
> > Nov  7 11:30:13 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> > no reply bck -1
> >
> > Nov  7 11:30:13 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> > cannot get cluster rec, code 30
> >
> > Nov  7 11:30:16 g2r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> > no reply bck -1
> >
> > Nov  7 11:30:16 g2r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> > cannot get cluster rec, code 30
> >
> > Nov  7 11:41:25 g2r8 : 0:0:sanm:ERROR: SANM: FP NIM down. Aborting
> > all mirror sessions.
> >
> > Nov  7 11:41:25 g2r8 : 0:0:sanm:ERROR: SANM: FP NIM down. Aborting
> > all mirror sessions.
> >
> > On g1r8
> >
> > Nov  6 16:30:19 g1r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> > no reply bck -1
> >
> > Nov  6 16:30:19 g1r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> > cannot get cluster rec, code 30
> >
> > Nov  6 16:30:31 g1r8 : 0:0:cluster2:INFO: cluster_clientSendRmcRpc:
> > Error sending rpc to clusterrpc, flags 820a, name nfxsh-12633, rc
> > -19, retrying...
> >
> > Nov  6 16:31:11 g1r8 : 0:0:snmpd:INFO: getVolumeDetail: got bad rsp
> > error (type=8315 volId=0)
> >
> > Nov  6 16:31:11 g1r8 : 0:0:snmpd:INFO: getVolumeDetail: got bad rsp
> > error (type=8315 volId=0)
> >
> > Nov  6 16:31:11 g1r8 : 0:0:snmpd:INFO: getVolumeDetail: got bad rsp
> > error (type=8315 volId=0)
> >
> > ...
> >
> > Nov  7 12:06:16 g1r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> > no reply bck -1
> >
> > Nov  7 12:06:16 g1r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> > cannot get cluster rec, code 30
> >
> > Nov  7 12:18:16 g1r8 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> > no reply bck -1
> >
> > Nov  7 12:18:16 g1r8 : 0:0:cluster2:ERROR: cluster_getFilerNameList:
> > cannot get cluster rec, code 30
> >
> > G1r1 volume show is failing, likely because of those ea errors:
> >
> > Nov  7 12:20:57 g11r10 : 0:0:ea:ERROR: ea_getRunTimeVolInfo[1881]:
> > Failed to get info for volume[g1r8-vs1-vol1], rc[8]
> >
> > Nov  7 12:21:07 g11r10 : 0:0:ea:ERROR: ea_getRunTimeVolInfo[1881]:
> > Failed to get info for volume[g1r8-vs1-vol1], rc[8]
> >
> > Nov  7 12:21:07 g11r10 : 0:0:ea:ERROR: ea_getRunTimeVolInfo[1881]:
> > Failed to get info for volume[g1r8-vs1-vol1], rc[8]
> >
> > Nov  7 12:21:07 g11r10 : 0:0:ea:ERROR: ea_getRunTimeVolInfo[1881]:
> > Failed to get info for volume[g1r8-vs1-vol1], rc[8]
> >
> > Nov  7 12:21:17 g11r10 : 0:0:ea:ERROR: ea_getRunTimeVolInfo[1881]:
> > Failed to get info for volume[g1r8-vs1-vol1], rc[8]
> >
> > Nov  7 12:21:17 g11r10 : 0:0:ea:ERROR: ea_getRunTimeVolInfo[1881]:
> > Failed to get info for volume[g1r8-vs1-vol1], rc[8]
> >
> > Nov  7 12:21:17 g11r10 : 0:0:ea:ERROR: ea_getRunTimeVolInfo[1881]:
> > Failed to get info for volume[g1r8-vs1-vol1], rc[8]
> >
> > _____________________________________________
> > From: Sandrine Boulanger
> > Sent: Thursday, November 06, 2008 5:53 PM
> > To: Sandrine Boulanger; dl-Cougar Core Team
> > Subject: RE: Status of R4.0.1.0 Submittal 17 on Cougar soak
> >
> > I got a new exim4 package from Andy, which is now installed on all
> > nodes in the Cougar soak. We'll monitor the status of the queue and
> > the number of exim processes running. I'll send an update tomorrow.
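> >
> > For reference, assuming these are the usual Debian-based nodes and
> > the package is a .deb (the file name below is just a placeholder),
> > getting it onto a node is roughly:
> >
> > scp exim4_custom.deb root@g2r8:/tmp/        # placeholder package file name
> > ssh root@g2r8 'dpkg -i /tmp/exim4_custom.deb && /etc/init.d/exim4 restart'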
> >
> > _____________________________________________
> > From: Sandrine Boulanger
> > Sent: Thursday, November 06, 2008 3:35 PM
> > To: dl-Cougar Core Team
> > Subject: Status of R4.0.1.0 Submittal 17 on Cougar soak
> >
> > Cougar soak has been upgraded to sub#17. We have increased the
> > autosupport report schedule to every 2 minutes. G12r10 had a lot of
> > frozen messages in the queue last night, but by this morning
> > everything was cleared.
> >
> > However, autosupport is no longer working on g11r10:
> >
> > Nov  6 15:30:03 g11r10 : 0:0:asd:INFO: Rcvd Generate report request
> > APP: (null)
> >
> > Nov  6 15:30:03 g11r10 : 0:0:asd:ERROR: mta mail queue full
> >
> > g11r10 diag> autosupport generate report
> >
> > Report not generated, error 0xffffffff.
> >
> > % Command failure.
> >
> > g11r10 diag> system show chassis
> >
> >  module     cpu         state
> >
> > ----------------------------------------------
> >
> >  SSC        SSC         UP
> >
> >  NFPNIM     TXRX0       UP
> >
> >             TXRX1       UP
> >
> >             FP0         UP
> >
> >             FP1         UP
> >
> >             FP2         UP
> >
> >             FP3         UP
> >
> > ----------------------------------------------
> >
> > g11r10 diag> exit
> >
> > g11r10:~# exiqgrep -z -c
> >
> > 121 matches out of 121 messages
> >
> > g11r10:~# exim4 -bpc
> >
> > 121
> >
> > g11r10:~# ps ax | grep exim
> >
> >   953 ?        S      0:00 /usr/sbin/exim4 -q
> >
> >   966 ?        S      0:02 /usr/sbin/exim4 -q
> >
> >  1261 ?        Ss     0:00 /usr/sbin/exim4 -bd -q30m
> >
> > 10474 pts/0    R+     0:00 grep exim
> >
> > g11r10:~#
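> >
> > To see why these are freezing, the per-message log of one of them
> > might tell us something, e.g.:
> >
> > g11r10:~# exim4 -Mvl $(exiqgrep -z -i | head -1)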
> >
> > _____________________________________________
> > From: Larry Scheer
> > Sent: Wednesday, November 05, 2008 3:47 PM
> > To: dl-QA; dl-hcl-qa; dl-Cougar
> > Subject: Build of R4.0.1.0 Submittal 17 is available for acceptance
> > tests
> >
> > Changes since last submittal
> >
> > Branch r401rel
> >
> > Change 31060 on 2008/11/05 by andys@ripper 'Integrate changelist
> > 31059 from'
> >
> > Change 31053 on 2008/11/04 by billn@billn-dev ' Change 31051 by
> > billn@billn-de'
> >
> > Defects fixed since last submittal
> >
> > TED 25710 - [10206 - Onstor] Over 200 Exim processes running
> >
> > TED 25761 - HP EVA4400, does not report paths as Primary/Failover
> > even though TPGS is active
> >
> > Location of images for submittal 17
> >
> > R401rel build:
> >
> > Source tree is here:
> >
> > /n/Build-Trees/R4.0.1.0/EverON-4.0.1.0-110508-sub17
> >
> > Images are here:
> >
> > Cougar optimized:
> >
> > http://10.2.0.21/upgrade/EverON-4.0.1.0CG.tar.gz
> >
> > Cougar debug:
> >
> > http://10.2.0.21/upgrade/EverON-4.0.1.0CGDBG.tar.gz
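> >
> > For example, to pull the optimized image onto a node for the upgrade:
> >
> > wget http://10.2.0.21/upgrade/EverON-4.0.1.0CG.tar.gz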
> >
