X-MimeOLE: Produced By Microsoft Exchange V6.5
Received: by onstor-exch02.onstor.net 
	id <01C8BA01.49B878B0@onstor-exch02.onstor.net>; Mon, 19 May 2008 15:40:07 -0700
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01C8BA01.49B878B0"
Content-class: urn:content-classes:message
Subject: RE: Is this an NTP problem?
Date: Mon, 19 May 2008 15:40:07 -0700
Message-ID: <BB375AF679D4A34E9CA8DFA650E2B04E03E9A888@onstor-exch02.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E0A0F8472@onstor-exch02.onstor.net>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Is this an NTP problem?
Thread-Index: Aci5/xI3GTwMB2ylTayyWxDLuQyBEAAASFhQAAASYnA=
From: "Chris Vandever" <chris.vandever@onstor.com>
To: "Rich LaReau" <rich.lareau@onstor.com>,
	"dl-cstech" <dl-cstech@onstor.com>

This is a multi-part message in MIME format.

------_=_NextPart_001_01C8BA01.49B878B0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

Yes, it's a problem with the time getting out of sync.  The difference
is actually 14 sec (1211232994 - 1211232980), not seven, which is enough
to completely mess up clustering (votes expire, we start a PCC election,
and may end up in a split-brain-like scenario):

bash-3.00$ utc.pl 1211232980
Mon May 19 21:36:20 2008 GMT
Mon May 19 14:36:20 2008 Local
bash-3.00$ utc.pl 1211232994
Mon May 19 21:36:34 2008 GMT
Mon May 19 14:36:34 2008 Local
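
For anyone without utc.pl handy, here is a minimal Python sketch of the
same conversion plus the subtraction.  The function name describe_tic is
made up for illustration, and it assumes the "time tic" values are plain
Unix epoch seconds, which the utc.pl output above suggests:

```python
from datetime import datetime, timezone


def describe_tic(tic: int) -> str:
    """Render an epoch-seconds 'time tic' as a GMT timestamp string."""
    stamp = datetime.fromtimestamp(tic, tz=timezone.utc)
    return stamp.strftime("%a %b %d %H:%M:%S %Y GMT")


# The two tics reported in the cluster2 WARNING.
remote_tic = 1211232980
local_tic = 1211232994

# Positive skew means the local node's clock is ahead of the remote node's.
skew_seconds = local_tic - remote_tic

print(describe_tic(remote_tic))
print(describe_tic(local_tic))
print(f"skew: {skew_seconds} sec")
```
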

We have a number of defects related to this.  Most of them boil down to
us spending too much time in the BSD kernel to process packets in a
timely fashion.  We're definitely not processing incoming clustering
packets; I don't know what's happening with incoming NTP packets.  We
see this type of problem when BSD is "busy" -- busy writing a core file,
busy writing a large file to flash, etc.

Sorry, I don't remember the consensus on kicking it to get it in sync
again.  If the problem is BSD being "busy", I believe it resyncs itself
once BSD is done with whatever it was so busy doing.

ChrisV

_____________________________________________
From: Rich LaReau=20
Sent: Monday, May 19, 2008 3:34 PM
To: Rich LaReau; dl-cstech
Subject: RE: Is this an NTP problem?


(Resent without the unreadable wrap-around text feature.)


I'm getting these chunks of logs every so often.  Is the whole set
related to NTP being off by seven seconds?   If so, I should post them
to our NTP troubleshooting Wiki.

And also, what was the consensus on "kicking" the times so that they
match again?  Restart ntpd, or delete/add the ntp servers?

Thanks,
Rich


May 19 15:36:34 fss-pnasgw2 : 0:0:cluster2:INFO:
ClusterCtrl_iUpdateState: post pcc up pccname fss-pnasgw2=20
May 19 15:36:34 fss-pnasgw2 : 0:0:cluster2:WARNING: remote host
10.210.17.72 (time tic =3D 1211232980) and local (time tic =3D
1211232994)times are not synchronized. Please verify NTP server setup.=20
May 19 15:36:34 fss-pnasgw2 : 0:0:auth_agent:WARNING:
cleanCurrentRequest: cifsd exited while processing request with type
10321 1 time, retrying=20
May 19 15:36:34 fss-pnasgw2 : 0:0:auth_agent:WARNING: cifsd for vs 3
exited. Restarting it=20
May 19 15:36:34 fss-pnasgw2 : 0:0:auth_agent:INFO:
authen_restartCifsDaemon: Restarting CIFS daemon=20
May 19 15:36:34 fss-pnasgw2 : 0:0:eventd:WARNING: Process-EVENT 0.0.0.0:
Mgmt Port 0.0.0.0 PCC, State Up
May 19 15:36:36 fss-pnasgw2 : 0:0:cluster2:NOTICE: urecovery_Interact:
send new file in progress, remote ver 0(0), sending to 10.210.17.72=20
May 19 15:36:36 fss-pnasgw2 : 0:0:cluster2:NOTICE: urecovery_Interact:
Rx-write bulk error=3D1=20
May 19 15:36:36 fss-pnasgw2 : 0:0:cluster2:NOTICE: urecovery_Interact:
rx_EndCall error =3D 5381=20
May 19 15:36:36 fss-pnasgw2 : 0:0:cluster2:NOTICE: urecovery_Interact:
send new file end, code 1=20
May 19 15:36:37 fss-pnasgw2 : 0:0:cluster2:WARNING:
ClusterCtrl_iUpdateState: post pcc down pccname fss-pnasgw2=20
May 19 15:36:37 fss-pnasgw2 : 0:0:eventd:CRITICAL: Process-EVENT Node:
Name 'fss-pnasgw2', State Down, Msg ''
May 19 15:36:37 fss-pnasgw2 : 0:0:vtm:INFO: vtm_get_filer_config: fail
to get cluster info, try again later=20
May 19 15:36:46 fss-pnasgw2 last message repeated 2 times
May 19 15:36:47 fss-pnasgw2 : 0:0:sscccc:NOTICE: New client 0: addr
143.199.12.120, port 33693; Assigning fd(12)
May 19 15:36:47 fss-pnasgw2 : 0:0:sscccc:INFO: bad read, bytes=3D-1
(client 0 socket 12)
May 19 15:36:47 fss-pnasgw2 : 0:0:sscccc:NOTICE: closing session for
client 0
May 19 15:36:49 fss-pnasgw2 : 0:0:cluster2:INFO:
ClusterCtrl_iUpdateState: post pcc up pccname fss-pnasgw1=20
May 19 15:36:49 fss-pnasgw2 : 0:0:eventd:WARNING: Process-EVENT 0.0.0.0:
Mgmt Port 0.0.0.0 PCC, State Up=20
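
The "times are not synchronized" warning in the log above carries both
tics, so the skew can be pulled straight out of a saved log chunk.  A
rough Python sketch follows; clock_skew is a hypothetical helper, not
part of any product tooling.  It assumes the remote tic is printed
before the local one, as in this message, and it tolerates both "=" and
the quoted-printable "=3D" form seen in mailed copies:

```python
import re

# Matches "time tic = 1211232980" as well as "time tic =3D 1211232980".
TIC_RE = re.compile(r"time tic\s*=(?:3D)?\s*(\d+)")


def clock_skew(log_text: str):
    """Return local minus remote tic from a sync warning, or None if absent."""
    # Mailers wrap long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    tics = [int(t) for t in TIC_RE.findall(flat)]
    if len(tics) < 2:
        return None
    remote, local = tics[0], tics[1]
    return local - remote


sample = """May 19 15:36:34 fss-pnasgw2 : 0:0:cluster2:WARNING: remote host
10.210.17.72 (time tic =3D 1211232980) and local (time tic =3D
1211232994)times are not synchronized. Please verify NTP server setup."""

print(clock_skew(sample))
```

Returning None when fewer than two tics are found keeps unrelated log
chunks from being reported as a zero skew.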



------_=_NextPart_001_01C8BA01.49B878B0--
