X-MimeOLE: Produced By Microsoft Exchange V6.5
Received: by onstor-exch02.onstor.net 
	id <01C84032.FC80A9A0@onstor-exch02.onstor.net>; Sun, 16 Dec 2007 14:28:30 -0800
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01C84032.FC80A9A0"
Content-class: urn:content-classes:message
Subject: RE: cluster testing on Cougar
Date: Sun, 16 Dec 2007 14:28:30 -0800
Message-ID: <BB375AF679D4A34E9CA8DFA650E2B04E0353B4C3@onstor-exch02.onstor.net>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: cluster testing on Cougar
Thread-Index: Acg/uscj8PT6znZwRJSt4cXl5vB7mAAdz4xv
References: <BB375AF679D4A34E9CA8DFA650E2B04E028FB688@onstor-exch02.onstor.net>
From: "Chris Vandever" <chris.vandever@onstor.com>
To: "Mike Lee" <mike.lee@onstor.com>,
	"dl-Cougar" <dl-Cougar@onstor.com>

This is a multi-part message in MIME format.

------_=_NextPart_001_01C84032.FC80A9A0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

Did cluster_contrl die on either filer?  It looks very similar to a
known problem against SW-RMC where cluster_contrl dies on one filer the
first time it tries to send an RMC message.  It dies in sig_timer() with
an rmc timeout.  PM restarts it, but it and the other node aren't able
to communicate at all after that.

There's a workaround in the defect:  Confirm that both nodes are listed
in /onstor/cluster.conf on both nodes, and if so, then reboot both
nodes.  Things should run fine after that.
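The check above can be sketched as a small shell helper. This is only an
illustration, not from the defect itself: it assumes a POSIX shell, that
each node's name appears as plain text in cluster.conf, and the node
names passed in are placeholders.

```shell
#!/bin/sh
# Sketch of the pre-reboot check described above. Assumes node names
# appear literally in the config file; names and paths are illustrative.
check_cluster_conf() {
  conf=$1
  shift
  for node in "$@"; do
    # grep -q: quiet match; nonzero exit means the node entry is absent
    if ! grep -q "$node" "$conf"; then
      echo "missing $node in $conf"
      return 1
    fi
  done
  echo "both nodes listed in $conf; safe to reboot both"
}
```

Run the same check on each node (e.g. `check_cluster_conf
/onstor/cluster.conf node1 node2`); only reboot once both files list
both nodes.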

If this isn't the problem, then I'll need more context from the elog
(starting from the "cluster add" from nfxsh) and the elog from the other
node as well (just before the other node says it got a restart
(ClusterCtrl_iRestart?) and reboots.)

ChrisV


-----Original Message-----
From: Mike Lee
Sent: Sun 12/16/2007 12:08 AM
To: dl-Cougar
Subject: cluster testing on Cougar

All:
Unfortunately, using the official framework, basic clustering between
two cougar nodes did not work.
After doing "cluster commit", the cluster daemon on the PCC was stuck in
a loop trying to synchronize the cluster DB (excerpted below).  I will
try to investigate, but without Chris' help (she's on vacation), I'm not
certain how far I will get.
-Mike

Dec 16 00:00:44 g7r10 : 0:0:cluster2:ERROR: ClusterContrl_GetRecordWhole: ubik_call failed, recType 2(clusterCfg), code 5376, rc 30
Dec 16 00:00:44 g7r10 : 0:0:cluster2:ERROR: ClusterServ_UpdateState: database is synchronizing, not ready
Dec 16 00:00:45 g7r10 : 0:0:cluster2:ERROR: ClusterContrl_GetRecordWhole: ubik_call failed, recType 2(clusterCfg), code 5376, rc 30
Dec 16 00:00:45 g7r10 : 0:0:cluster2:ERROR: ClusterContrl_GetRecordWhole: ubik_call failed, recType 77(EMRS), code 5376, rc 30


------_=_NextPart_001_01C84032.FC80A9A0--
