Subject: RE: cluster testing on Cougar
Date: Tue, 18 Dec 2007 14:59:10 -0800
From: "Chris Vandever" <chris.vandever@onstor.com>
To: "Mike Lee" <mike.lee@onstor.com>,
	"dl-Cougar" <dl-Cougar@onstor.com>

Glad to hear the clustering problem was a known one rather than a new one.

On "system config reset", the "initial config menu" eventually needs to get converted to the "first time install" code, which was only done for BSD in R98.  I know Larry and Charissa (my apologies if I left out anyone else) spent time making sure that the FTI code in R98 would not break the initial config code in cougar, although it's always possible that subsequent changes broke it.

ChrisV


-----Original Message-----
From: Mike Lee
Sent: Tue 12/18/2007 1:39 AM
To: Chris Vandever; dl-Cougar
Subject: RE: cluster testing on Cougar

Chris:

You sure know your cluster...

Indeed, on the node where "cluster add / cluster commit" were issued, cluster_contrl had crashed with the following stack:

Core was generated by `/usr/local/agile/bin/cluster_contrl -r'.
Program terminated with signal 6, Aborted.
#0  0x2b70cb04 in kill () from /lib/libc.so.6
(gdb) where
#0  0x2b70cb04 in kill () from /lib/libc.so.6
#1  0x2b70e200 in abort () from /lib/libc.so.6
#2  0x00403b9c in sig_timer (num=14) at cluster-contrl-cfg.c:250
#3  0x2b05165c in rmc_timeout_scan () at rmc_api.c:2761
#4  0x2b051390 in rmc_timer_intr (signo=14) at rmc_api.c:2700
#5  <signal handler called>
#6  0x2b7b30bc in select () from /lib/libc.so.6
#7  0x2af3abb0 in IOMGR (dummy=3D0x0) at iomgr.c:601
#8  0x2af3ec3c in Create_Process_Part2 () at lwp.c:768
#9  0x2af3da54 in LWP_CreateProcess (ep=0x2b6e43ec <dl_iterate_phdr+300284>, stacksize=715947072, priority=0,
    parm=0xef940000 <Address 0xef940000 out of bounds>, name=0x2b0001d0 <Address 0x2b0001d0 out of bounds>, pid=0x0)
    at lwp.c:395
#10 0x004bc790 in ?? ()
warning: GDB can't find the start of the function at 0x4bc790.

Rebooting does cure the problem now, and the two filers are now clustered.
I will do some vsvr failover scenarios tomorrow.

Also, as I pointed out to Tim and Larry, "system config reset" does not work on one of the two filers, in that the initial config menus do not get displayed after the automatic reboot.  We will probably need to review the initial-config script for the cause.

Thanks again.

-Mike

All: In other news, there was a frequent FP crash using today's build from top of the dev tree, which hindered my progress a bit.  Jeff helped me out by giving me an updated FP image that worked around the problem.

-----Original Message-----
From: Chris Vandever
Sent: Sun 12/16/2007 2:28 PM
To: Mike Lee; dl-Cougar
Subject: RE: cluster testing on Cougar

Did cluster_contrl die on either filer?  It looks very similar to a known problem against SW-RMC where cluster_contrl dies on one filer the first time it tries to send an RMC message.  It dies in sig_timer() with an rmc timeout.  PM restarts it, but it and the other node aren't able to communicate at all after that.

There's a workaround in the defect:  Confirm that both nodes are listed in /onstor/cluster.conf on both nodes, and if so, then reboot both nodes.  Things should run fine after that.
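
The check above can be sketched as a short shell script.  The one-name-per-line format for /onstor/cluster.conf and the node names are assumptions here, so a temporary sample file stands in for the real one:

```shell
# Hedged sketch of the workaround check; run the same check on BOTH nodes.
# The cluster.conf format (one node name per line) and the node names
# "node-a"/"node-b" are illustrative assumptions, not the real layout.
set -eu
conf=$(mktemp)                        # stand-in for /onstor/cluster.conf
printf 'node-a\nnode-b\n' > "$conf"   # sample contents for illustration

ok=yes
for node in node-a node-b; do
    # each expected node must appear on its own line
    grep -q "^${node}\$" "$conf" || { echo "missing: $node"; ok=no; }
done
[ "$ok" = yes ] && echo "both nodes listed; safe to reboot both"
rm -f "$conf"
```

If either name is missing on either node, fix the membership first rather than rebooting.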

If this isn't the problem, then I'll need more context from the elog (starting from the "cluster add" from nfxsh), and the elog from the other node as well, from just before that node says it got a restart (ClusterCtrl_iRestart?) and reboots.

ChrisV


-----Original Message-----
From: Mike Lee
Sent: Sun 12/16/2007 12:08 AM
To: dl-Cougar
Subject: cluster testing on Cougar

All:
Unfortunately, using the official framework, basic clustering between two cougar nodes did not work.
After doing "cluster commit", the cluster daemon on the PCC was stuck in a loop trying to synchronize the cluster DB (excerpted below).  I will try to investigate, but without Chris' help (she's on vacation), I'm not certain how far I will get.
-Mike

Dec 16 00:00:44 g7r10 : 0:0:cluster2:ERROR: ClusterContrl_GetRecordWhole: ubik_call failed, recType 2(clusterCfg), code 5376, rc 30
Dec 16 00:00:44 g7r10 : 0:0:cluster2:ERROR: ClusterServ_UpdateState: database is synchronizing, not ready
Dec 16 00:00:45 g7r10 : 0:0:cluster2:ERROR: ClusterContrl_GetRecordWhole: ubik_call failed, recType 2(clusterCfg), code 5376, rc 30
Dec 16 00:00:45 g7r10 : 0:0:cluster2:ERROR: ClusterContrl_GetRecordWhole: ubik_call failed, recType 77(EMRS), code 5376, rc 30
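
A quick way to confirm the daemon is stuck in this loop is to count the repeating ubik_call failures in a saved elog excerpt.  A sketch, with the sample lines mirroring the excerpt above and a temporary file standing in for the real elog:

```shell
# Hedged sketch: count repeating ubik_call failures in an elog excerpt.
# Reading from a temp file is illustrative; point grep at the real elog.
log=$(mktemp)
cat > "$log" <<'EOF'
Dec 16 00:00:44 g7r10 : 0:0:cluster2:ERROR: ClusterContrl_GetRecordWhole: ubik_call failed, recType 2(clusterCfg), code 5376, rc 30
Dec 16 00:00:44 g7r10 : 0:0:cluster2:ERROR: ClusterServ_UpdateState: database is synchronizing, not ready
Dec 16 00:00:45 g7r10 : 0:0:cluster2:ERROR: ClusterContrl_GetRecordWhole: ubik_call failed, recType 2(clusterCfg), code 5376, rc 30
EOF
fails=$(grep -c 'ubik_call failed' "$log")   # lines matching the failure
echo "ubik_call failures: $fails"
rm -f "$log"
```

A steadily growing count across consecutive timestamps is the signature of the synchronization loop.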




