X-MimeOLE: Produced By Microsoft Exchange V6.5
Received: by onstor-exch02.onstor.net 
	id <01C85E24.E137B832@onstor-exch02.onstor.net>; Wed, 23 Jan 2008 18:03:07 -0700
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Content-class: urn:content-classes:message
Subject: RE: mightydog coffee breaks and sightings
Date: Wed, 23 Jan 2008 18:03:07 -0700
Message-ID: <BB375AF679D4A34E9CA8DFA650E2B04E05C740E8@onstor-exch02.onstor.net>
In-Reply-To: <20080123164649.43f9d70b@ripper.onstor.net>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: mightydog coffee breaks and sightings
Thread-Index: AcheIpo+TJOpi9t7TVmgG+Fx6ciBQwAAjcLw
From: "Sandrine Boulanger" <sandrine.boulanger@onstor.com>
To: "Andy Sharp" <andy.sharp@onstor.com>

Let me know when you run into it; we can check a few things from TXRX on
MD.

-----Original Message-----
From: Andy Sharp
Sent: Wednesday, January 23, 2008 4:47 PM
To: Sandrine Boulanger
Subject: Re: mightydog coffee breaks and sightings

It acts like it has crashed, but recovers after several minutes.  It's
not special just for me, it is happening to other people at the same
time.  There were several yesterday.

It is NFS, on my home directory.  I'm accessing my home directory from
multiple machines at once (my workstation to read/write, and a compile
server to read), so it's not limited to ripper.

I can fire off a trace next time it happens, but I think it's just
going to show no packets at all coming from mightydog.  But we will see.
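For reference, the capture I'd fire off would look something like this.
A sketch only -- the interface name (eth0) is an assumption, as is NFS
running on the standard port 2049; adjust both for the actual client:

```shell
# Sketch of the trace, staged ahead of time.  Assumptions: "eth0" is the
# client interface, "mightydog" resolves to the filer, and NFS is on the
# standard port 2049.
IFACE=eth0
SERVER=mightydog
OUT=/tmp/${SERVER}-nfs.pcap
# -s 0 captures whole packets; the host/port filter keeps only NFS
# traffic to and from the filer, so a long-running capture stays small.
CMD="tcpdump -i $IFACE -s 0 -w $OUT host $SERVER and port 2049"
echo "$CMD"   # dry run: prints the command; run it as root during the next stall
```

Reading the capture back afterwards (tcpdump -r, or Wireshark) would
show whether mightydog stops replying or the requests never leave the
client at all.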


On Wed, 23 Jan 2008 16:41:28 -0800 "Sandrine Boulanger"
<sandrine.boulanger@onstor.com> wrote:

> Can you characterize the "going to sleep" part?
> Do you see that from nfs or cifs? when accessing which share/folders?
> We should take a trace from your client next time it happens.
>
> -----Original Message-----
> From: Andy Sharp
> Sent: Wednesday, January 23, 2008 3:43 PM
> To: Chris Vandever
> Cc: Maxim Kozlovsky; Jonathan Goldick; Jobi Ariyamannil; Amit Bothra;
> Sandrine Boulanger; Brian Baker; Brian DeForest
> Subject: mightydog coffee breaks and sightings
>
> Sightings.
>
> Yesterday during one of mightydog's coffee breaks, a file appeared in
> the directory I was working in (home directory on mightydog) that I
> did nothing to create. It had the name of "1493" and a completely
> strange mode, causing rsync, which I use constantly when in the
> compile/edit cycle, to choke.  I know it just "appeared" in relation
> to the mightydog sleepage because I used rsync just before and just
> after.  It didn't choke before, and did after.  I just deleted the
> file and went on about my business but the name seems too strangely
> similar to this 1491 and well, maybe someone knows something or had a
> similar strange experience.
>
> I think I have a screen log of when this happened, if anyone is
> interested.  It wasn't Columbus Day yesterday, was it?
>
> And what's up with mightydog "going to sleep" for minutes at a time
> lately?
>
> Cheers,
>
> a
>
>
> On Wed, 23 Jan 2008 15:11:05 -0800 "Chris Vandever"
> <chris.vandever@onstor.com> wrote:
>
> > There's a shareName record for an NFS share named "vol_mgmt_1491/"
> > which doesn't have the shareNfs and shareInfo records that should be
> > associated with it.  I'll send instructions on how to delete it.
> >
> > ChrisV
> >
> > _____________________________________________
> > From: Chris Vandever
> > Sent: Wednesday, January 23, 2008 2:12 PM
> > To: Shin Irie; dl-cstech
> > Subject: RE: cluster DB corruption?
> >
> > I will check the clusDb and elogs in the zipped file, but in the
> > meantime these messages:
> >
> > 	Jan 23 12:05:53 bobcat1 : 0:0:cluster2:ERROR: sig_timer: contrl rpc timeout, restarting controller
> > 	Jan 23 12:05:53 bobcat1 : 0:0:pm:ERROR: pm_sig_handler: /usr/local/agile/bin/cluster_contrl (pid 30290) exited with status 0
> >
> > indicate a known rmc problem, resulting in cluster_contrl exiting.
> > The clustering errors after that occur because clustering is
> > restarting.
> >
> > ChrisV
> >
> > _____________________________________________
> > From: Shin Irie
> > Sent: Wednesday, January 23, 2008 2:07 PM
> > To: dl-cstech
> > Subject: cluster DB corruption?
> >
> > Hi,
> >
> > I have a customer whose Bobcat takes a long time to complete nfx
> > commands.  They also cannot create a share for the management volume;
> > it fails with the message "the share already exists", so "system get
> > all" cannot be copied.  I only have /var/agile/messages (elog) now.
> > The Bobcat is a single-node system running R3.1.0.7.
> >  << File: elog_clusdb.zip >>
> > The following message is being logged many times.  See the attached
> > zip file for the elog and cluster DB.
> >
> > 	Jan 23 12:04:25 bobcat1 : 0:0:auth_agent:WARNING: nisd for vs 5 exited. Restarting it
> >
> > These messages started around Jan 23 12:04 (see below).  Several
> > cluster error messages are also logged.  The system admins were
> > configuring the Bobcat from the CLI and the Web UI at the same time.
> > Is this cluster DB corruption?  How can I recover from this?
> >
> > 	Jan 23 12:04:23 bobcat1 : 1: cmd[0]: vsvr set SNIPER : status[0]
> > 	Jan 23 12:04:24 bobcat1 : 0:0:eventd:NOTICE: Process-EVENT Volume: name 'snipe-vol01', Id 0x000005d30000006a, Event 'Online', was offline for roughly 799 sec.
> > 	Jan 23 12:04:24 bobcat1 : 0:0:eventd:NOTICE: Process-EVENT IP i/f: IP 192.167.5.1, Port bp0, State Up
> > 	Jan 23 12:04:24 bobcat1 : 0:0:eventd:NOTICE: Process-EVENT IP i/f: IP 192.167.5.2, Port bp0, State Up
> > 	Jan 23 12:04:25 bobcat1 : 0:0:auth_agent:WARNING: nisd for vs 5 exited. Restarting it
> > 	Jan 23 12:04:57 bobcat1 last message repeated 18 times
> > 	Jan 23 12:05:52 bobcat1 last message repeated 32 times
> > 	Jan 23 12:05:53 bobcat1 : 0:0:cluster2:ERROR: sig_timer: contrl rpc timeout, restarting controller
> > 	Jan 23 12:05:53 bobcat1 : 0:0:pm:ERROR: pm_sig_handler: /usr/local/agile/bin/cluster_contrl (pid 30290) exited with status 0
> > 	Jan 23 12:05:54 bobcat1 : 0:0:auth_agent:WARNING: nisd for vs 5 exited. Restarting it
> > 	Jan 23 12:06:03 bobcat1 last message repeated 5 times
> > 	Jan 23 12:06:03 bobcat1 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey: no reply bck -1
> > 	Jan 23 12:06:03 bobcat1 : 0:0:cluster2:ERROR: cluster_getFilerNameList: cannot get cluster rec, rcode 30
> > 	Jan 23 12:06:03 bobcat1 : 0:0:nfxsh:NOTICE: cmd[9]: vsvr show all : status[11]
> > 	Jan 23 12:06:04 bobcat1 : 0:0:cluster2:ERROR: cluster_atomicUpdateRecord: no reply bck -1
> > 	Jan 23 12:06:04 bobcat1 : 0:0:cluster2:ERROR: cluster_releaseLock[3956]: Unable to update lock recId 12800, code 30
> > 	Jan 23 12:06:04 bobcat1 : 0:0:cluster2:ERROR: cluster_releaseGnsLock[2081]: Can't release GNS read lock, recId 12800, code 30
> >
> >
> > --
> > Irie
> >
