X-MimeOLE: Produced By Microsoft Exchange V6.5
Received: by onstor-exch02.onstor.net 
	id <01C85E21.DB0046B4@onstor-exch02.onstor.net>; Wed, 23 Jan 2008 17:41:28 -0700
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Content-class: urn:content-classes:message
Subject: RE: mightydog coffee breaks and sightings
Date: Wed, 23 Jan 2008 17:41:28 -0700
Message-ID: <BB375AF679D4A34E9CA8DFA650E2B04E05C740E6@onstor-exch02.onstor.net>
In-Reply-To: <20080123154323.6d99aa2d@ripper.onstor.net>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: mightydog coffee breaks and sightings
Thread-Index: AcheGcDN5iRh5W9QQaKHMB4mM3iKXQAB+QAQ
From: "Sandrine Boulanger" <sandrine.boulanger@onstor.com>
To: "Andy Sharp" <andy.sharp@onstor.com>

Can you characterize the "going to sleep" part?
Do you see that from NFS or CIFS, and when accessing which shares/folders?
We should take a trace from your client the next time it happens.

-----Original Message-----
From: Andy Sharp
Sent: Wednesday, January 23, 2008 3:43 PM
To: Chris Vandever
Cc: Maxim Kozlovsky; Jonathan Goldick; Jobi Ariyamannil; Amit Bothra;
Sandrine Boulanger; Brian Baker; Brian DeForest
Subject: mightydog coffee breaks and sightings

Sightings.

Yesterday during one of mightydog's coffee breaks, a file appeared in
the directory I was working in (home directory on mightydog) that I did
nothing to create. It had the name of "1493" and a completely strange
mode, causing rsync, which I use constantly when in the compile/edit
cycle, to choke.  I know it just "appeared" in relation to the mightydog
sleepage because I used rsync just before and just after.  It didn't
choke before, and did after.  I just deleted the file and went on about
my business, but the name seems strangely similar to this 1491, so
maybe someone knows something or has had a similar strange experience.

I think I have a screen log of when this happened, if anyone is
interested.  It wasn't Columbus day yesterday, was it?

And what's up with mightydog "going to sleep" for minutes at a time
lately?

Cheers,

a


On Wed, 23 Jan 2008 15:11:05 -0800 "Chris Vandever"
<chris.vandever@onstor.com> wrote:

> There's a shareName record for an NFS share named "vol_mgmt_1491/"
> which doesn't have the shareNfs and shareInfo records that should be
> associated with it.  I'll send instructions on how to delete it.
>
> ChrisV
>
> _____________________________________________
> From: Chris Vandever
> Sent: Wednesday, January 23, 2008 2:12 PM
> To: Shin Irie; dl-cstech
> Subject: RE: cluster DB corruption?
>
> I will check the clusDb and elogs in the zipped file, but in the
> meantime these messages:
>
> 	Jan 23 12:05:53 bobcat1 : 0:0:cluster2:ERROR: sig_timer: contrl rpc timeout, restarting controller
> 	Jan 23 12:05:53 bobcat1 : 0:0:pm:ERROR: pm_sig_handler: /usr/local/agile/bin/cluster_contrl (pid 30290) exited with status 0
>
> indicate a known rmc problem, which results in cluster_contrl exiting.
> The clustering errors after that occur because clustering is restarting.
>
> ChrisV
>
> _____________________________________________
> From: Shin Irie
> Sent: Wednesday, January 23, 2008 2:07 PM
> To: dl-cstech
> Subject: cluster DB corruption?
>
> Hi,
>
> I have a customer whose Bobcat takes a long time to complete nfx
> commands. They also cannot create a share for the management volume;
> it fails with the message "the share already exist", so "system get
> all" cannot be copied.  I only have /var/agile/messages (elog) now.
> The Bobcat is a single-node system running R3.1.0.7.
>  << File: elog_clusdb.zip >>
> The following message is being logged many times.  See the attached
> zip file for the elog and cluster DB.
>
> 	Jan 23 12:04:25 bobcat1 : 0:0:auth_agent:WARNING: nisd for vs 5 exited. Restarting it
>
> These messages started around Jan 23 12:04 (see below).  Several
> cluster error messages are also logged.  The system admins were
> configuring the Bobcat from the CLI and the Web UI at the same time.
> Is this cluster DB corruption?  How can I recover from it?
>
> 	Jan 23 12:04:23 bobcat1 : 1: cmd[0]: vsvr set SNIPER : status[0]
> 	Jan 23 12:04:24 bobcat1 : 0:0:eventd:NOTICE: Process-EVENT Volume: name 'snipe-vol01', Id 0x000005d30000006a, Event 'Online', was offline for roughly 799 sec.
> 	Jan 23 12:04:24 bobcat1 : 0:0:eventd:NOTICE: Process-EVENT IP i/f: IP 192.167.5.1, Port bp0, State Up
> 	Jan 23 12:04:24 bobcat1 : 0:0:eventd:NOTICE: Process-EVENT IP i/f: IP 192.167.5.2, Port bp0, State Up
> 	Jan 23 12:04:25 bobcat1 : 0:0:auth_agent:WARNING: nisd for vs 5 exited. Restarting it
> 	Jan 23 12:04:57 bobcat1 last message repeated 18 times
> 	Jan 23 12:05:52 bobcat1 last message repeated 32 times
> 	Jan 23 12:05:53 bobcat1 : 0:0:cluster2:ERROR: sig_timer: contrl rpc timeout, restarting controller
> 	Jan 23 12:05:53 bobcat1 : 0:0:pm:ERROR: pm_sig_handler: /usr/local/agile/bin/cluster_contrl (pid 30290) exited with status 0
> 	Jan 23 12:05:54 bobcat1 : 0:0:auth_agent:WARNING: nisd for vs 5 exited. Restarting it
> 	Jan 23 12:06:03 bobcat1 last message repeated 5 times
> 	Jan 23 12:06:03 bobcat1 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey: no reply bck -1
> 	Jan 23 12:06:03 bobcat1 : 0:0:cluster2:ERROR: cluster_getFilerNameList: cannot get cluster rec, rcode 30
> 	Jan 23 12:06:03 bobcat1 : 0:0:nfxsh:NOTICE: cmd[9]: vsvr show all : status[11]
> 	Jan 23 12:06:04 bobcat1 : 0:0:cluster2:ERROR: cluster_atomicUpdateRecord: no reply bck -1
> 	Jan 23 12:06:04 bobcat1 : 0:0:cluster2:ERROR: cluster_releaseLock[3956]: Unable to update lock recId 12800, code 30
> 	Jan 23 12:06:04 bobcat1 : 0:0:cluster2:ERROR: cluster_releaseGnsLock[2081]: Can't release GNS read lock, recId 12800, code 30
>
>
> --
> Irie
>
