X-MimeOLE: Produced By Microsoft Exchange V6.5
Received: by onstor-exch02.onstor.net 
	id <01C88860.F8EA2164@onstor-exch02.onstor.net>; Mon, 17 Mar 2008 11:59:05 -0700
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Content-class: urn:content-classes:message
Subject: RE: strerror inconsistency?
Date: Mon, 17 Mar 2008 11:59:05 -0700
Message-ID: <BB375AF679D4A34E9CA8DFA650E2B04E03E9A6D7@onstor-exch02.onstor.net>
In-Reply-To: <20080317090734.0bda09fb@ripper.onstor.net>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: strerror inconsistency?
Thread-Index: AciISQMzxkeOGmPQSxeHGSf2dS36twAF+5kA
From: "Chris Vandever" <chris.vandever@onstor.com>
To: "Andy Sharp" <andy.sharp@onstor.com>

Yup, did that.  Thanks.

-----Original Message-----
From: Andy Sharp
Sent: Monday, March 17, 2008 9:08 AM
To: Chris Vandever
Subject: Re: strerror inconsistency?

Ahaha, that's a good one.  I don't know what that macro is doing, but
generally I save errno to some other variable in cases like this, just
to be sure.  Try

if (stat < 0) {
	int terr = errno;

	elog(whatever, terr, strerror(terr));
}

and see if that makes any difference.  Remember that potentially any
system call can change the value of errno, so if there are any system
calls anywhere in elog() after the value of errno is copied onto the
stack, then this kind of behavior can result.  That includes the case
where a signal handler runs at just the right moment.

Cheers,

a

On Fri, 14 Mar 2008 13:54:14 -0700 "Chris Vandever"
<chris.vandever@onstor.com> wrote:

> John K. got the following on 2 different bobcat linux nodes supposedly
> running the same release:
>
> Mar 14 13:23:58 g1r5 : 0:0:cluster2:INFO: Cluster_SendMsgSock: sendto
> failed, code 9 (Network is down)
>
> Mar 14 13:24:12 g5r3 : 0:0:cluster2:INFO: Cluster_SendMsgSock: sendto
> failed, code 9 (Host is down)
>
> The code that generates this is:
>
>         errno = 0;
>
>         stat = sendto(sock, msg, msgLen, 0, (struct sockaddr*)&to,
>                       sizeof(to));
>
>         if (stat != msgLen) {
>             /* The send failed, so retry it. */
>             if (stat < 0) {
>                 CLUSTER_INFO(("%s: sendto failed, code %d (%s)\n",
>                               __FUNCTION__,
>                               errno,
>                               strerror(errno)));
>
> So, (1) Why do the two nodes translate an errno of 9 differently?  (2)
> Why do they translate 9 to ENETDOWN or EHOSTDOWN when 9 is EBADF?  I
> can believe that perhaps I need to stash a copy of errno into a local
> variable in case somewhere in the bowels of CLUSTER_INFO() (which maps
> to an elog) we inadvertently step on errno, but I'd kind of expect the
> printed int and string to at least be consistent.  Sometimes I can be
> pretty dense, however...  :-(
>
> ChrisV
>
> ________________________________
>
> From: Chris Vandever
> Sent: Friday, March 14, 2008 1:40 PM
> To: John Keiffer
> Cc: dl-QA
> Subject: RE: Bobcat cluster help?
>
> There's a problem with the networking on the ssc on one or both nodes:
>
> Can each of them ping the other?
>
> (I find it very interesting that one node translates an errno of 9 to
> "Network is down" while the other is "Host is down", both of which are
> wrong.)
>
> ChrisV
>
> ________________________________
>
> From: John Keiffer
> Sent: Friday, March 14, 2008 1:27 PM
> To: Chris Vandever
> Cc: dl-QA
> Subject: Bobcat cluster help?
>
> Hello Chris,
>
> My bobcat cluster is really having problems today. This morning when I
> came in (Pleasanton office), both filers had crashed and were down to
> the point where I had to have them power cycled.
>
> I have since upgraded both Bobcats with the latest build from Larry
> (I'll call it Sub12.v2). Currently, neither filer can do a vsvr
> show...
>
> G5r3 is up (though at one point the FC was in prom_init):
>
> 13:18:15 g5r3 diag> cluster show cluster
>
> Cluster Name: g1r5       Cluster State:   On
>
> NAS Gateways        IP              State   PCC
> ------------------------------------------------------
> g1r5                10.2.1.21       UP      NO
> g5r3                10.2.1.18       UP      YES
>
> G1r5 is up, but it only starts about 11 or 12 onstor processes.
>
> 03/14/08 13:17:51 g1r5 diag> cluster show cluster
>
> Cluster Name:        Cluster State:   Off
>
> NAS Gateways        IP              State   PCC
> ------------------------------------------------------
> g1r5                10.2.1.21       N/A     N/A
> g5r3                10.2.1.18       N/A     N/A
>
> # onstor
>       12
> root     12246  0.0  0.2  2020   476 ??  Ss     1:14PM    0:00.06
> /onstor/bin/sshd
> root     16200  0.0  0.3   500   776 ??  Ss     1:15PM    0:00.40
> /onstor/bin/pm
> root     24192  0.0  0.6   436  1416 ??  S      1:15PM    0:01.18
> /onstor/bin/elog
> root      2737  0.0  0.6   796  1588 ??  S      1:15PM    0:00.88
> /onstor/bin/ncmd
> root     25055  0.0  0.6   496  1564 ??  S      1:15PM    0:00.56
> /onstor/bin/eventd
> root      7116  0.0  0.5   412  1308 ??  S      1:15PM    0:00.17
> /onstor/bin/timekeeper
> root      3309  0.0  0.2   156   468 ??  S      1:15PM    0:00.48
> /onstor/bin/chassisd
> root      3289  0.0  1.2  2888  2968 ??  S      1:15PM    0:03.17
> /onstor/bin/sdm_cfgd
> root     13035  0.0  0.9  1684  2196 ??  S      1:15PM    0:00.36
> /onstor/bin/evm_cfgd
> root     15469  0.0  0.4   516  1152 ??  S<     1:15PM    0:00.11
> /onstor/bin/cluster_server
> root     14965  0.1  0.8  1204  1936 ??  S<     1:15PM    0:00.45
> /onstor/bin/cluster_contrl
> root      8075  0.0  0.6  1520  1496 ??  S<     1:15PM    0:00.26
> /onstor/bin/cluster_contrl
> #
>
> Here are the repeating elogs from g1r5:
>
> Mar 14 13:23:58 g1r5 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> no reply bck -1
> Mar 14 13:23:58 g1r5 : 0:0:nfxsh:NOTICE: cmd[0]: -> EMRS: tried (5
> times) without success to get the EMRS config from nfxsh : status[2]
> Mar 14 13:23:58 g1r5 : 0:0:cluster2:INFO: Cluster_SendMsgSock: sendto
> failed, code 9 (Network is down)
> Mar 14 13:23:58 g1r5 : 0:0:cluster2:ERROR:
> ClusterContrl_GetRecordWhole: ubik_call failed, recType 4(elogCfg),
> code -1, rc 30
> Mar 14 13:23:58 g1r5 : 0:0:cluster2:INFO: cluster_lookup_sess: fail to
> open sess for app cluster2, rc -20
>
> Here are the repeating elogs from g5r3:
>
> Mar 14 13:24:12 g5r3 : 0:0:cluster2:INFO: cluster_clientSendRmcRpc:
> Error sending rpc to cluster2, flags 820a, name tape-driver, rc -20,
> retrying...
> Mar 14 13:24:12 g5r3 : 0:0:cluster2:INFO: Cluster_SendMsgSock: sendto
> failed, code 9 (Host is down)
> Mar 14 13:24:12 g5r3 : 0:0:cluster2:INFO: cluster_clientSendRmcRpc:
> Error sending rpc to cluster2, flags 820a, name vsd, rc -20,
> retrying...
> Mar 14 13:24:13 g5r3 : 0:0:cluster2:INFO: cluster_clientSendRmcRpc:
> Error sending rpc to cluster2, flags 820a, name vtm, rc -20,
> retrying...
> Mar 14 13:24:13 g5r3 : 0:0:cluster2:INFO: Cluster_SendMsgSock: sendto
> failed, code 9 (Host is down)
>
> Thank you,
>
> John Keiffer
