Date: Tue, 5 Aug 2008 16:01:11 -0700
From: Andrew Sharp <andy.sharp@onstor.com>
To: "Chris Vandever" <chris.vandever@onstor.com>
Cc: "Raj Kumar" <raj.kumar@onstor.com>, "Rendell Fong"
 <rendell.fong@onstor.com>
Subject: Re: #25027 CSS (G12R10) - mgmt vsvr is missing (nfxsh>vsvr show)
 but visible in txrx (rcon 1 1>vsvr show all)
Message-ID: <20080805160111.683f46a0@ripper.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E0AE22A2C@onstor-exch02.onstor.net>
References: <BB375AF679D4A34E9CA8DFA650E2B04E0B319F4B@onstor-exch02.onstor.net>
	<BB375AF679D4A34E9CA8DFA650E2B04E0AE22A2C@onstor-exch02.onstor.net>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Something is quite hose-head on this system.  I suspect the eek is
causing the snap_admin commands to hang, so a bunch of them have been
piling up since about noon.
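
In case anyone wants to poke at them before I get back to it, something
like this should show how long they've been stuck and what they're
blocked on (assuming the usual procps ps on the blade; the grep pattern
is just a guess at how the command shows up in the process list):

  ps -eo pid,stat,wchan:20,etime,args | grep [s]nap

Anything that's been sitting in D state for hours would fit the
hung-on-the-eek theory.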

I killed the extraneous exim processes, some dating back to Aug. 2nd,
some to yesterday, but none from today.  So some part of the
foolishness was cleared up yesterday, I'm thinking.
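
For reference, something like this lists them with start times (standard
procps; the [e] keeps grep from matching itself), and then it's just
kill on the stale PIDs, -9 only if they ignore TERM:

  ps -eo pid,lstart,etime,args | grep [e]xim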

But the snapadmin hanging isn't doing the system any good.  And there
are these constant messages in the elogs:

Aug  5 15:57:15 g12r10 : 0:0:cluster2:ERROR: ClusterCtrl_RecvPingMsg: ping from g11r10(192.168.111.4) for ip 192.168.111.3 did not find local match

That would obviously indicate some kind of configuration error.  I
don't see matching messages on the other blade.
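
If someone wants to chase that, the error says 192.168.111.3 has no
local match, so the quick checks are whether that address is actually
plumbed on this blade and what cluster.conf thinks sc1 is.  The conf
path below is a guess; substitute wherever it actually lives:

  ifconfig -a | grep 192.168.111
  grep -n 192.168.111 /onstor/conf/cluster.conf    # path assumed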

eth1 seems to be at 100BT, so it shouldn't be related to the networking
hardware problem.
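
For the record, either of these should show the negotiated speed,
depending on which tool is on the blade:

  ethtool eth1 | grep -i speed
  mii-tool eth1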


On Tue, 5 Aug 2008 15:44:16 -0700 "Chris Vandever"
<chris.vandever@onstor.com> wrote:

> Thanks, Raj.  I'll do my best not to break the eek, i.e., not to
> cause a reboot.  I may end up killing clustering, however, which
> SHOULD make the problem disappear.  I will try not to do that,
> either.
> 
> ChrisV
> 
> -----Original Message-----
> From: Raj Kumar 
> Sent: Tuesday, August 05, 2008 3:41 PM
> To: Chris Vandever; Rendell Fong
> Cc: Andy Sharp
> Subject: RE: #25027 CSS (G12R10) - mgmt vsvr is missing (nfxsh>vsvr
> show) but visible in txrx (rcon 1 1>vsvr show all)
> 
> Chris, currently the system is open to all. 
> 
> 10.2.10.12
> 
> I have an EEK going. If possible let it run if not no big deal, I can
> restart
> 
> 
> I have an EEK going.  If possible, let it run; if not, no big deal, I
> can restart it.
> 
> -----Original Message-----
> From: Chris Vandever 
> Sent: Tuesday, August 05, 2008 3:38 PM
> To: Rendell Fong; Raj Kumar
> Cc: Andy Sharp
> Subject: RE: #25027 CSS (G12R10) - mgmt vsvr is missing (nfxsh>vsvr
> show) but visible in txrx (rcon 1 1>vsvr show all)
> 
> Max suggested a network trace of lo0, which revealed ncmd sending a
> message to vsd and not getting a response.  However, vsd then sent a
> message to clustering, retrying it, and didn't get a response.
> Sooooo, it looks like it's back to clustering.
> 
> Raj, I'll need access to the system so I can attach debuggers to at
> least one of the cluster_contrl processes.  As I recall the systems
> are only accessible via the 10.3 subnet, so how do I get there?
> 
> ChrisV
> 
> -----Original Message-----
> From: Rendell Fong 
> Sent: Tuesday, August 05, 2008 10:37 AM
> To: Raj Kumar; Chris Vandever; Andy Sharp
> Subject: RE: #25027 CSS (G12R10) - mgmt vsvr is missing (nfxsh>vsvr
> show) but visible in txrx (rcon 1 1>vsvr show all)
> 
> Ok. Thanks Chris.
> 
> Rendell
>  
> 
> > -----Original Message-----
> > From: Raj Kumar
> > Sent: Tuesday, August 05, 2008 10:34 AM
> > To: Chris Vandever; Rendell Fong; Andy Sharp
> > Subject: RE: #25027 CSS (G12R10) - mgmt vsvr is missing (nfxsh>vsvr
> > show) but visible in txrx (rcon 1 1>vsvr show all)
> > 
> > 
> > 
> > -----Original Message-----
> > From: Chris Vandever
> > Sent: Tuesday, August 05, 2008 10:32 AM
> > To: Rendell Fong; Andy Sharp
> > Cc: Raj Kumar
> > Subject: RE: #25027 CSS (G12R10) - mgmt vsvr is missing (nfxsh>vsvr
> > show) but visible in txrx (rcon 1 1>vsvr show all)
> > 
> > I've already looked at it.
> > 
> > "vsvr show" sends a sendAgile message to vsd via ncmd.  No
> > clustering
> is
> > involved.
> > 
> > There are clustering errors because g11r10 was misconfigured when it
> > was added to the cluster, so it has no IP address for sc1 in
> > cluster.conf, when in fact it has an IP addr of 192.168.111.4 for sc1
> > and is using it.
> > [Raj Kumar] Chris, How do I fix this?
> > 
> > There are also clustering errors related to volume vol_mgmt_1936,
> > volId 0x79000000142, because the corresponding mgmt vsvr has no
> > volumes configured in the clusDb, but the volume appears to exist and
> > have a lun label.
> > [Raj Kumar] This is due to a SCSI discovery issue where LUNs are not
> > being discovered.
> > 
> > I have not seen elogs much beyond when the luns came back yesterday.
> > 
> > ChrisV
> > 
> > -----Original Message-----
> > From: Rendell Fong
> > Sent: Tuesday, August 05, 2008 10:21 AM
> > To: Chris Vandever; Andy Sharp
> > Cc: Raj Kumar
> > Subject: RE: #25027 CSS (G12R10) - mgmt vsvr is missing (nfxsh>vsvr
> > show) but visible in txrx (rcon 1 1>vsvr show all)
> > 
> > SGA tries to identify the mgmt vsvr and mgmt vol using the "vsvr show
> > all" command.  I think it should be just "vsvr show".  The first vsvr
> > with VS_MGMT in its name is it.
> > 
> > Right now on g12r10, it only seems to work occasionally.
> > 
> > 
> > g12r10 diag> vsvr show
> > Virtual servers on nas gateway g12r10
> > 
> >  ID  State                             Name
> > ====================================================
> > 5    Enabled                           G12R10-VS1
> > 6    Enabled                           G12R10-VS2
> > g12r10 diag> vsvr show
> > Virtual servers on nas gateway g12r10
> > 
> >  ID  State                             Name
> > ====================================================
> > 1    Enabled                           VS_MGMT_1865
> > 5    Enabled                           G12R10-VS1
> > 6    Enabled                           G12R10-VS2
> > g12r10 diag>
> > 
> > 
> > > -----Original Message-----
> > > From: Chris Vandever
> > > Sent: Monday, August 04, 2008 4:40 PM
> > > To: Andy Sharp
> > > Cc: Raj Kumar
> > > Subject: RE: #25027 CSS (G12R10) - mgmt vsvr is missing (nfxsh>vsvr
> > > show) but visible in txrx (rcon 1 1>vsvr show all)
> > >
> > > Andy, could you look into what support.sh is doing for "system get
> > > all" that causes it to fail?  "vsvr show" is now working, so I don't
> > > know where the script is trying to get its info from, but it seems
> > > to be having trouble seeing the mgmt volume.
> > >
> > > ChrisV
> > >
> > > -----Original Message-----
> > > From: chris.vandever@onstor.com [mailto:chris.vandever@onstor.com]
> > > Sent: Monday, August 04, 2008 4:38 PM
> > > To: Andy Sharp; Raj Kumar
> > > Cc: Raj Kumar; Chris Vandever
> > > Subject: Defect TED00025027 CSS (G12R10) - mgmt vsvr is missing
> > > (nfxsh>vsvr show) but visible in txrx (rcon 1 1>vsvr show all)
> > >
> > > Headline: CSS (G12R10) - mgmt vsvr is missing (nfxsh>vsvr show)
> > > but visible in txrx (rcon 1 1>vsvr show all)
> > > id: TED00025027
> > > Note_Entry: When g12r10 first booted, vsd was unable to see the
> > > volumes for any of its 3 vsvrs.  It was not until more than 2 days
> > > later that it started seeing the volumes.  This was at 12:50:36 on
> > > Aug 4.  This is after the defect was entered, so I suspect that the
> > > reason "vsvr show" fails is because vsd is busy trying to find
> > > volumes for ALL of its vsvrs:
> > >
> > > Aug  2 11:56:02 10.2.10.12 : 0:0:vsd:INFO: vsd_ensureNisFileCoherence[1230] : vs=1, file=71, #masterRec=0
> > > Aug  2 11:56:02 10.2.10.12 : 0:0:vsd:INFO: vsd_ensureNisFileCoherence[1230] : vs=1, file=72, #masterRec=0
> > > Aug  2 11:56:02 10.2.10.12 : 0:0:vsd:INFO: vsd_ensureNisFileCoherence[1230] : vs=1, file=74, #masterRec=0
> > > Aug  2 11:56:02 10.2.10.12 : 0:0:vsd:INFO: vsd_ensureNisFileCoherence[1230] : vs=5, file=71, #masterRec=0
> > > Aug  2 11:56:02 10.2.10.12 : 0:0:vsd:INFO: vsd_ensureNisFileCoherence[1230] : vs=5, file=72, #masterRec=0
> > > Aug  2 11:56:03 10.2.10.12 : 0:0:vsd:INFO: vsd_ensureNisFileCoherence[1230] : vs=5, file=74, #masterRec=0
> > > Aug  2 11:56:03 10.2.10.12 : 0:0:vsd:INFO: vsd_ensureNisFileCoherence[1230] : vs=6, file=71, #masterRec=0
> > > Aug  2 11:56:03 10.2.10.12 : 0:0:vsd:INFO: vsd_ensureNisFileCoherence[1230] : vs=6, file=72, #masterRec=0
> > > Aug  2 11:56:03 10.2.10.12 : 0:0:vsd:INFO: vsd_ensureNisFileCoherence[1230] : vs=6, file=74, #masterRec=0
> > > Aug  2 11:56:03 10.2.10.12 : 0:0:cluster2:ERROR: ClusterCtrl_RecvPingMsg: ping from g11r10(192.168.111.4) for ip 192.168.111.3 did not find local match
> > > Aug  2 11:56:05 10.2.8.1 : 0:0:cluster2:NOTICE: urecovery_Interact: send new file end, code 0
> > > Aug  2 11:56:05 10.2.8.1 : 0:0:cluster2:NOTICE: urecovery_Interact: send new file (version 0x4894af2f:3) in progress, sending to 10.2.10.12, remote version 0x4894ade4:1e
> > > Aug  2 11:56:05 10.2.10.12 : 0:0:cluster2:NOTICE: Ubik: SDISK_SendFile: Synchronize database with server 10.2.8.1, version 0x4894af2f:3
> > > Aug  2 11:56:06 10.2.10.12 : 0:0:cluster2:ERROR: ClusterCtrl_RecvPingMsg: ping from g11r10(192.168.111.4) for ip 192.168.111.3 did not find local match
> > > Aug  2 11:56:08 10.2.10.12 : 0:0:cluster2:NOTICE: Ubik: SDISK_SendFile: Synchronize database with server 10.2.8.1 completed, version 0x4894af2f:3
> > > Aug  2 11:56:08 10.2.10.12 : 0:0:cluster2:INFO: ClusterServ_UpdateState: database synchronized with the new cluster
> > > Aug  2 11:56:08 10.2.10.12 : 0:0:cluster2:INFO: ClusterCtrl_iUpdateState: Sending state CLUSTER_STATE_SYNC_DONE to vtm, pcc 0x0
> > > Aug  2 11:56:08 10.2.8.1 : 0:0:cluster2:NOTICE: urecovery_Interact: send new file end, code 0
> > > Aug  2 11:56:09 10.2.10.12 : 0:0:cluster2:ERROR: ClusterCtrl_RecvPingMsg: ping from g11r10(192.168.111.4) for ip 192.168.111.3 did not find local match
> > > Aug  2 11:56:11 10.2.10.12 : 0:0:cluster2:INFO: ClusterCtrl_GetClusterFilerInfo: pcc already eletected, post up pccname g1r8
> > > Aug  2 11:56:11 10.2.10.12 : 0:0:eventd:DEBUG: > ems_logEvent()
> > > Aug  2 11:56:11 10.2.10.12 : 0:0:eventd:WARNING: Process-EVENT 0.0.0.0: Mgmt Port 0.0.0.0 PCC, State Up
> > > Aug  2 11:56:16 10.2.10.12 : 0:0:eventd:DEBUG: < ems_logEvent()
> > > Aug  2 11:56:16 10.2.10.12 : 0:0:cluster2:ERROR: ClusterCtrl_RecvPingMsg: ping from g11r10(192.168.111.4) for ip 192.168.111.3 did not find local match
> > > Aug  2 11:56:18 10.2.10.12 last message repeated 2 times
> > > Aug  2 11:56:18 10.2.10.12 : 0:0:pm:INFO: /onstor/bin/vsd: finished initialization.
> > > Aug  2 11:56:19 10.2.10.12 : 0:0:cluster2:INFO: ClusterCtrl_ReleaseFiler: called by vtm
> > > Aug  2 11:56:19 10.2.10.12 : 0:0:vtm:DEBUG: vtm_get_filer_config_and_start_vsvr_trans: start collecting failover vsvr, post event count 0
> > > Aug  2 11:56:19 10.2.10.12 : 0:0:vtm:DEBUG: vtm_get_filer_config_and_start_vsvr_trans: end collecting failover vsvr, clusterState 2 (not PCC)
> > > Aug  2 11:56:19 10.2.10.12 : 0:0:vtm:INFO: vtm_sendCardStateMsg: Sending card UP to g1r8
> > > Aug  2 11:56:19 10.2.10.12 : 0:0:pm:INFO: /onstor/bin/vtmd: finished initialization.
> > >
> > > Aug  2 11:56:22 10.2.10.12 : 0:0:vsd:ERROR: vsd_mountVolProc : Aborting mount operation for VS 5; 4 volume(s) owned but only 0 found
> > > Aug  2 11:56:23 10.2.10.12 : 0:0:vsd:ERROR: vsd_mountVolProc : Aborting mount operation for VS 6; 4 volume(s) owned but only 0 found
> > >
> > > Aug  2 11:56:29 10.2.10.12 : 0:0:vsd:INFO: vsd_createVsRunTime: sending 1 shares for VS 1 (1/1 cifs, 0/1 nfs), more to follow yes
> > > Aug  2 11:56:29 10.2.10.12 : 0:0:vsd:INFO: vsd_createVsRunTime: sending 1 shares for VS 1 (1/1 cifs, 1/1 nfs), more to follow no
> > > Aug  2 11:56:29 10.2.10.12 : 0:0:vsd:ERROR: vsd_mountVolProc : Aborting mount operation for VS 1; 1 volume(s) owned but only 0 found
> > >
> > > Aug  4 12:50:36 10.2.10.12 : 0:0:cluster2:INFO: cluster_clientSendRmcRpc: Error sending rpc to clusterrpc, flags 820a, name vsd, rc -19, retrying...
> > > Aug  4 12:50:36 10.2.10.12 : 0:0:cluster2:INFO: cluster_clientSendRmcRpc: Retry worked to clusterrpc, flags 8e02, name vsd
> > > Aug  4 12:50:36 10.2.10.12 : 0:0:vsd:INFO: vsd_ensureNisFileCoherence[1230] : vs=1, file=71, #masterRec=0
> > > Aug  4 12:50:36 10.2.10.12 : 0:0:vsd:INFO: vsd_ensureNisFileCoherence[1230] : vs=1, file=72, #masterRec=0
> > > Aug  4 12:50:36 10.2.10.12 : 0:0:vsd:INFO: vsd_ensureNisFileCoherence[1230] : vs=1, file=74, #masterRec=0
> > >
> > > The symptoms changed once we were able to see the volumes and now
> > > the problem is with "system get all", which executes "support.sh".
> > >
> > > Not a clustering problem, and I'm sure Andy can dig through
> > > support.sh a lot faster than I could.
> > >
> > > State: Opened
> > > history:
> > > 33766361  Aug  4 2008 12:04PM  rajk    Submit  no_value  Opened
> > > 33766363  Aug  4 2008 12:10PM  rajk    Modify  Opened    Opened
> > > 33766364  Aug  4 2008 12:16PM  rajk    Modify  Opened    Opened
> > > 33766390  Aug  4 2008  2:22PM  vikas   Modify  Opened    Opened
> > > 33766408  08/04/2008 16:38:03 PM  chrisv  Modify  Opened  Opened
> 
