X-MimeOLE: Produced By Microsoft Exchange V6.5
Received: by onstor-exch02.onstor.net 
	id <01C8BA01.F1921A18@onstor-exch02.onstor.net>; Mon, 19 May 2008 15:44:49 -0700
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01C8BA01.F1921A18"
Content-class: urn:content-classes:message
Subject: RE: Is this an NTP problem?
Date: Mon, 19 May 2008 15:44:49 -0700
Message-ID: <BB375AF679D4A34E9CA8DFA650E2B04E03AEAFF7@onstor-exch02.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E03E9A888@onstor-exch02.onstor.net>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Is this an NTP problem?
Thread-Index: Aci5/xI3GTwMB2ylTayyWxDLuQyBEAAASFhQAAASYnAAAEOtoA==
From: "Danqing Jin" <danqing.jin@onstor.com>
To: "Chris Vandever" <chris.vandever@onstor.com>,
	"Rich LaReau" <rich.lareau@onstor.com>,
	"dl-cstech" <dl-cstech@onstor.com>

This is a multi-part message in MIME format.

------_=_NextPart_001_01C8BA01.F1921A18
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

In this particular excerpt, it looks like cifsd died (rather than being
restarted), which typically means it would dump a core file to the flash
card (/var/run).

> _____________________________________________
> From:	Chris Vandever
> Sent:	Monday, May 19, 2008 3:40 PM
> To:	Rich LaReau; dl-cstech
> Subject:	RE: Is this an NTP problem?
>
> Yes, it's a problem with the time getting out of sync.  The difference
> is actually 14 sec, which is enough to completely mess up clustering
> (votes expire, we start PCC election, and may end up in a
> split-brain-like scenario):
>
> bash-3.00$ utc.pl 1211232980
> Mon May 19 21:36:20 2008 GMT
> Mon May 19 14:36:20 2008 Local
> bash-3.00$ utc.pl 1211232994
> Mon May 19 21:36:34 2008 GMT
> Mon May 19 14:36:34 2008 Local
>
> We have a number of defects related to this.  Most of them boil down
> to us spending too much time in the BSD kernel to process packets in a
> timely fashion.  We're definitely not processing incoming clustering
> packets; I don't know what's happening with incoming NTP packets.  We
> see this type of problem when BSD is "busy" -- busy writing a core
> file, busy writing a large file to flash, etc.
>
> Sorry, I don't remember the consensus on kicking it to get it in sync
> again.  If the problem is BSD being "busy", I believe it resyncs itself
> when BSD is done with whatever it was so busy doing.
>
> ChrisV
>
> _____________________________________________
> From: Rich LaReau
> Sent: Monday, May 19, 2008 3:34 PM
> To: Rich LaReau; dl-cstech
> Subject: RE: Is this an NTP problem?
>
> (Resent without the unreadable wrap-around text feature.)
>
> I'm getting these chunks of logs every so often.  Is the whole set
> related to NTP being off by seven seconds?  If so, I should post them
> to our NTP troubleshooting Wiki.
>
> And also, what was the consensus on "kicking" the times so that they
> match again?  Restart ntpd, or delete/add the ntp servers?
>
> Thanks,
> Rich
>
> May 19 15:36:34 fss-pnasgw2 : 0:0:cluster2:INFO: ClusterCtrl_iUpdateState: post pcc up pccname fss-pnasgw2
> May 19 15:36:34 fss-pnasgw2 : 0:0:cluster2:WARNING: remote host 10.210.17.72 (time tic = 1211232980) and local (time tic = 1211232994)times are not synchronized. Please verify NTP server setup.
>
> May 19 15:36:34 fss-pnasgw2 : 0:0:auth_agent:WARNING: cleanCurrentRequest: cifsd exited while processing request with type 10321 1 time, retrying
> May 19 15:36:34 fss-pnasgw2 : 0:0:auth_agent:WARNING: cifsd for vs 3 exited. Restarting it
> May 19 15:36:34 fss-pnasgw2 : 0:0:auth_agent:INFO: authen_restartCifsDaemon: Restarting CIFS daemon
> May 19 15:36:34 fss-pnasgw2 : 0:0:eventd:WARNING: Process-EVENT 0.0.0.0: Mgmt Port 0.0.0.0 PCC, State Up
> May 19 15:36:36 fss-pnasgw2 : 0:0:cluster2:NOTICE: urecovery_Interact: send new file in progress, remote ver 0(0), sending to 10.210.17.72
> May 19 15:36:36 fss-pnasgw2 : 0:0:cluster2:NOTICE: urecovery_Interact: Rx-write bulk error=1
> May 19 15:36:36 fss-pnasgw2 : 0:0:cluster2:NOTICE: urecovery_Interact: rx_EndCall error = 5381
> May 19 15:36:36 fss-pnasgw2 : 0:0:cluster2:NOTICE: urecovery_Interact: send new file end, code 1
> May 19 15:36:37 fss-pnasgw2 : 0:0:cluster2:WARNING: ClusterCtrl_iUpdateState: post pcc down pccname fss-pnasgw2
> May 19 15:36:37 fss-pnasgw2 : 0:0:eventd:CRITICAL: Process-EVENT Node: Name 'fss-pnasgw2', State Down, Msg ''
> May 19 15:36:37 fss-pnasgw2 : 0:0:vtm:INFO: vtm_get_filer_config: fail to get cluster info, try again later
> May 19 15:36:46 fss-pnasgw2 last message repeated 2 times
> May 19 15:36:47 fss-pnasgw2 : 0:0:sscccc:NOTICE: New client 0: addr 143.199.12.120, port 33693; Assigning fd(12)
> May 19 15:36:47 fss-pnasgw2 : 0:0:sscccc:INFO: bad read, bytes=-1 (client 0 socket 12)
> May 19 15:36:47 fss-pnasgw2 : 0:0:sscccc:NOTICE: closing session for client 0
> May 19 15:36:49 fss-pnasgw2 : 0:0:cluster2:INFO: ClusterCtrl_iUpdateState: post pcc up pccname fss-pnasgw1
> May 19 15:36:49 fss-pnasgw2 : 0:0:eventd:WARNING: Process-EVENT 0.0.0.0: Mgmt Port 0.0.0.0 PCC, State Up
>

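The 14-second figure in the thread comes from subtracting the two time tics in the cluster2 WARNING. A minimal Python sketch of that check follows, assuming the message format seen in this excerpt. The helper names clock_skew and to_utc are hypothetical, and this is not the internal utc.pl script, only an illustration of the same arithmetic.

```python
import re
from datetime import datetime, timezone

# Log line copied from the excerpt in this thread, re-joined onto one line.
LINE = ("May 19 15:36:34 fss-pnasgw2 : 0:0:cluster2:WARNING: remote host "
        "10.210.17.72 (time tic = 1211232980) and local (time tic = "
        "1211232994)times are not synchronized. Please verify NTP server setup.")

# Assumed message layout, taken only from the single WARNING shown above.
TIC_RE = re.compile(
    r"remote host (\S+) \(time tic = (\d+)\) and local \(time tic = (\d+)\)"
)


def clock_skew(line):
    """Return (remote_host, remote_tic, local_tic, skew_seconds), or None
    if the line is not a time-sync warning in the assumed format."""
    m = TIC_RE.search(line)
    if m is None:
        return None
    remote, local = int(m.group(2)), int(m.group(3))
    return m.group(1), remote, local, local - remote


def to_utc(tic):
    """Format an epoch-seconds tic the way the utc.pl output reads (GMT line)."""
    stamp = datetime.fromtimestamp(tic, tz=timezone.utc)
    return stamp.strftime("%a %b %d %H:%M:%S %Y GMT")


if __name__ == "__main__":
    host, remote, local, skew = clock_skew(LINE)
    print(f"remote {host}: {to_utc(remote)}")
    print(f"local: {to_utc(local)}")
    print(f"skew: {skew} s")
```

Run against a saved log, the same function could be applied line by line to list every skew warning and its size before posting to the Wiki.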
------_=_NextPart_001_01C8BA01.F1921A18--
