X-MimeOLE: Produced By Microsoft Exchange V6.5
Received: by onstor-exch02.onstor.net 
	id <01C8B54F.C45BB7E8@onstor-exch02.onstor.net>; Tue, 13 May 2008 16:19:18 -0700
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01C8B54F.C45BB7E8"
Content-class: urn:content-classes:message
Subject: Qlogic Pause bug
Date: Tue, 13 May 2008 16:19:18 -0700
Message-ID: <BB375AF679D4A34E9CA8DFA650E2B04E08FC5A80@onstor-exch02.onstor.net>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Qlogic Pause bug
Thread-Index: Aci1T8CA8pS8Qa+dRgue8lf/vwXz9g==
From: "Bill Nadzam" <bill.nadzam@onstor.com>
To: "dl-Cougar" <dl-Cougar@onstor.com>

This is a multi-part message in MIME format.

------_=_NextPart_001_01C8B54F.C45BB7E8
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

As we discussed in the Cougar meeting, the SRAM on the QLogic part will
at times cause the QLA2432 to enter a PAUSE state. Sometimes an error is
posted indicating that an SRAM parity error has occurred. Other times no
error condition is posted at all. The latter is of great concern, and
there may be yet another problem behind this one. While working on the
issue, several bugs surfaced in the device-level handler. They stemmed
from differences between the older QLA23xx part used in the legacy
machines and the QLA24xx parts used in the Cougar hardware. The good
news is that it is now possible to detect the unexplained PAUSE
condition, reset the QLogic part, and restart I/O. This takes about 4
seconds of wall-clock time, but it should allow QA to use the machines
they have for testing, at least until the pause problem is resolved IN
HARDWARE, that is.
=20
Amit is testing the change, and if all goes well I will work on getting
it checked into the dev branch.
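
For illustration only, the check for the unexplained PAUSE can be
sketched as below. This is not the ispfc handler code. The bit positions
(RISC interrupt at bit 15, RISC paused at bit 8 of the host status word)
are an assumption borrowed from the open-source Linux qla2xxx driver's
ISP24xx definitions, and treating hccr == 0 as "no error posted" is
inferred from the log that follows. All function names are hypothetical.

```python
# Assumed bit layout of the ISP24xx host status word (r_to_h in the log).
# Taken from the Linux qla2xxx driver; the ispfc driver may differ.
HSRX_RISC_INT = 1 << 15     # RISC is requesting a host interrupt
HSRX_RISC_PAUSED = 1 << 8   # RISC processor is paused


def decode_host_status(r_to_h: int) -> dict:
    """Split a host status word into the fields assumed above."""
    return {
        "risc_int": bool(r_to_h & HSRX_RISC_INT),
        "risc_paused": bool(r_to_h & HSRX_RISC_PAUSED),
        "status_code": r_to_h & 0xFF,   # low byte: interrupt status code
        "upper_word": r_to_h >> 16,     # upper 16 bits: status-specific data
    }


def is_unexplained_pause(r_to_h: int, hccr: int) -> bool:
    """Paused, but nothing posted in HCCR to explain why (assumption)."""
    return bool(r_to_h & HSRX_RISC_PAUSED) and hccr == 0


# r_to_h values reported in the log; hccr was 0x00000000 each time.
for value in (0x53668113, 0x3DFB8113, 0x56128113):
    fields = decode_host_status(value)
    print(f"{value:#010x} paused={fields['risc_paused']} "
          f"unexplained={is_unexplained_pause(value, 0x00000000)}")
```

Under these assumptions, all three logged values have the paused bit
set while hccr reads zero, which is the case the handler now resets on.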

May 13 15:34:40 g14r10 kernel: fp0: 147: ispfc:sp1.0: QLA2432 Unknown
Pause detected. r_to_h[0x53668113] hccr[0x00000000] Resetting Adapter.
May 13 15:34:41 g14r10 kernel: fp0: 148:
ispfc:ISPFC_CS_NSDB_CHANGE,[8015] on port [10800]
May 13 15:34:43 g14r10 kernel: fp0: 149:
ispfc:ISPFC_CS_NSDB_CHANGE,[8014] LoopID_LoginState [ff0006]
May 13 15:34:43 g14r10 kernel: fp0: ispfc:sp1.0: Public point-to-point
connection established: host_port_id 0x10800
May 13 15:34:43 g14r10 kernel: fp0: 150: ispfc: ispfc:sp1.0 Fibrechannel
link now online
May 13 15:34:43 g14r10 kernel: fp0: 151:
ispfc:ISPFC_CS_NSDB_CHANGE,[8014] LoopID_LoginState [20004]
May 13 15:34:44 g14r10 kernel: fp0: 152:
ispfc:ISPFC_CS_NSDB_CHANGE,[8015] on port [10800]
May 13 15:34:47 g14r10 kernel: fp0: 153:
ispfc:ISPFC_CS_NSDB_CHANGE,[8015] on port [10800]
May 13 15:34:51 g14r10 kernel: fp0: 154: ispfc: ispfc:sp1.1 Fibrechannel
link now online
May 13 15:34:51 g14r10 kernel: fp0: 155:
ispfc:ISPFC_CS_NSDB_CHANGE,[8014] LoopID_LoginState [20004]
May 13 15:36:01 g14r10 /USR/SBIN/CRON[3421]: (root) CMD
(/onstor/bin/emrscron -g h_res_stats)
May 13 15:38:07 g14r10 kernel: fp0: 156: ispfc:sp1.0: QLA2432 Unknown
Pause detected. r_to_h[0x3dfb8113] hccr[0x00000000] Resetting Adapter.
May 13 15:38:07 g14r10 kernel: fp0:
May 13 15:38:08 g14r10 kernel: fp0: 157:
ispfc:ISPFC_CS_NSDB_CHANGE,[8015] on port [10800]
May 13 15:38:10 g14r10 kernel: fp0: 158:
ispfc:ISPFC_CS_NSDB_CHANGE,[8014] LoopID_LoginState [ff0006]
May 13 15:38:10 g14r10 kernel: fp0: ispfc:sp1.0: Public point-to-point
connection established: host_port_id 0x10800
May 13 15:38:10 g14r10 kernel: fp0: 159: ispfc: ispfc:sp1.0 Fibrechannel
link now online
May 13 15:38:10 g14r10 kernel: fp0: 160:
ispfc:ISPFC_CS_NSDB_CHANGE,[8014] LoopID_LoginState [20004]
May 13 15:38:11 g14r10 kernel: fp0: 161:
ispfc:ISPFC_CS_NSDB_CHANGE,[8015] on port [10800]
May 13 15:38:12 g14r10 kernel: fp0: 162:
ispfc:ISPFC_CS_NSDB_CHANGE,[8015] on port [10800]
May 13 15:38:16 g14r10 kernel: fp0: 163: ispfc: ispfc:sp1.1 Fibrechannel
link now online
May 13 15:38:16 g14r10 kernel: fp0: 164:
ispfc:ISPFC_CS_NSDB_CHANGE,[8014] LoopID_LoginState [20004]
May 13 15:39:02 g14r10 /USR/SBIN/CRON[3434]: (root) CMD
(/onstor/bin/emrscron -g h_res_stats)
May 13 15:42:01 g14r10 /USR/SBIN/CRON[3445]: (root) CMD
(/onstor/bin/emrscron -g h_res_stats)
May 13 15:45:01 g14r10 /USR/SBIN/CRON[3456]: (root) CMD
(/onstor/bin/emrscron -g h_res_stats)
May 13 15:48:01 g14r10 /USR/SBIN/CRON[3512]: (root) CMD
(/onstor/bin/emrscron -g h_res_stats)
May 13 15:51:01 g14r10 /USR/SBIN/CRON[3523]: (root) CMD
(/onstor/bin/emrscron -g h_res_stats)
May 13 15:53:05 g14r10 kernel: tx1: TXRX1:4 > Data corruption, volume
nfsvol4, file0039, conn_idx=3D39 sequential=3Dno
May 13 15:53:05 g14r10 kernel: tx1:     offset=3D34950016(0X2154B80), =
page
4266, got 0XC02700002154BC0 expected 0XC02700002154B80
May 13 15:53:05 g14r10 kernel: tx1:     Data came from same page in same
file, volume nfsvol4, file0039, offset=3D34950080(0X2154BC0), page 4266
May 13 15:54:01 g14r10 /USR/SBIN/CRON[3534]: (root) CMD
(/onstor/bin/emrscron -g h_res_stats)
May 13 15:54:05 g14r10 kernel: fp0: 165: ispfc:sp1.0: QLA2432 Unknown
Pause detected. r_to_h[0x56128113] hccr[0x00000000] Resetting Adapter.
May 13 15:54:05 g14r10 kernel: fp0:
May 13 15:54:06 g14r10 kernel: fp0: 166:
ispfc:ISPFC_CS_NSDB_CHANGE,[8015] on port [10800]
May 13 15:54:07 g14r10 kernel: fp0: 167:
ispfc:ISPFC_CS_NSDB_CHANGE,[8014] LoopID_LoginState [ff0006]
May 13 15:54:08 g14r10 kernel: fp0: ispfc:sp1.0: Public point-to-point
connection established: host_port_id 0x10800
May 13 15:54:08 g14r10 kernel: fp0: 168: ispfc: ispfc:sp1.0 Fibrechannel
link now online
May 13 15:54:08 g14r10 kernel: fp0: 169:
ispfc:ISPFC_CS_NSDB_CHANGE,[8014] LoopID_LoginState [20004]
May 13 15:54:08 g14r10 kernel: fp0: 170:
ispfc:ISPFC_CS_NSDB_CHANGE,[8015] on port [10800]
May 13 15:54:12 g14r10 kernel: fp0: 171: ispfc: ispfc:sp1.1 Fibrechannel
link now online
May 13 15:54:13 g14r10 kernel: fp0: 172:
ispfc:ISPFC_CS_NSDB_CHANGE,[8014] LoopID_LoginState [20004]
May 13 15:54:13 g14r10 kernel: fp0: 173:
ispfc:ISPFC_CS_NSDB_CHANGE,[8015] on port [10800]
May 13 15:54:17 g14r10 kernel: fp0: 174: ispfc: ispfc:sp1.1 Fibrechannel
link now online
May 13 15:54:17 g14r10 kernel: fp0: 175:
ispfc:ISPFC_CS_NSDB_CHANGE,[8014] LoopID_LoginState [20004]
May 13 15:57:01 g14r10 /USR/SBIN/CRON[3545]: (root) CMD
(/onstor/bin/emrscron -g h_res_stats)
May 13 16:00:01 g14r10 /USR/SBIN/CRON[3556]: (root) CMD
(/onstor/bin/emrscron -g h_res_stats)
May 13 16:03:01 g14r10 /USR/SBIN/CRON[3567]: (root) CMD
(/onstor/bin/emrscron -g h_res_stats)
May 13 16:06:01 g14r10 /USR/SBIN/CRON[3578]: (root) CMD
(/onstor/bin/emrscron -g h_res_stats)
May 13 16:07:45 g14r10 kernel: tx1: NFS Perf: 1 workflows of 60 failed
in 3376.523477 seconds, 0 resource failures, 0 rpc failures, 0 nfs
failures, 1 read corruption failures, avg/workflow 4279659900 usecs
May 13 16:07:45 g14r10 kernel: tx1: ^I0 waits for memory count
May 13 16:07:45 g14r10 kernel: tx1: ^I0 waits for allocatable request
count
May 13 16:09:02 g14r10 /USR/SBIN/CRON[3589]: (root) CMD
(/onstor/bin/emrscron -g h_res_stats)
May 13 16:12:01 g14r10 /USR/SBIN/CRON[3600]: (root) CMD
(/onstor/bin/emrscron -g h_res_stats)
May 13 16:15:01 g14r10 /USR/SBIN/CRON[3611]: (root) CMD
(/onstor/bin/emrscron -g h_res_stats)
May 13 16:17:01 g14r10 /USR/SBIN/CRON[3666]: (root) CMD (   cd / &&
run-parts --report /etc/cron.hourly)
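
To check the recovery time against the log above, the following sketch
pairs each "Unknown Pause detected" entry with the next "link now
online" entry for the same port. The sample lines are copied from the
log with the mail line wrapping undone. The regular expressions and
function names are illustrative, and the year is taken from the Date
header because syslog omits it.

```python
import re
from datetime import datetime

# Pause and link-online entries for sp1.0, copied from the log above.
SAMPLE = [
    "May 13 15:34:40 g14r10 kernel: fp0: 147: ispfc:sp1.0: QLA2432 Unknown "
    "Pause detected. r_to_h[0x53668113] hccr[0x00000000] Resetting Adapter.",
    "May 13 15:34:43 g14r10 kernel: fp0: 150: ispfc: ispfc:sp1.0 "
    "Fibrechannel link now online",
    "May 13 15:38:07 g14r10 kernel: fp0: 156: ispfc:sp1.0: QLA2432 Unknown "
    "Pause detected. r_to_h[0x3dfb8113] hccr[0x00000000] Resetting Adapter.",
    "May 13 15:38:10 g14r10 kernel: fp0: 159: ispfc: ispfc:sp1.0 "
    "Fibrechannel link now online",
    "May 13 15:54:05 g14r10 kernel: fp0: 165: ispfc:sp1.0: QLA2432 Unknown "
    "Pause detected. r_to_h[0x56128113] hccr[0x00000000] Resetting Adapter.",
    "May 13 15:54:08 g14r10 kernel: fp0: 168: ispfc: ispfc:sp1.0 "
    "Fibrechannel link now online",
]

STAMP = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s")
PAUSE = re.compile(r"ispfc:(sp\d+\.\d+): QLA2432 Unknown Pause detected")
ONLINE = re.compile(r"ispfc:(sp\d+\.\d+) Fibrechannel link now online")


def parse_time(line: str, year: int = 2008) -> datetime:
    """Parse the leading syslog timestamp; the year must be supplied."""
    stamp = STAMP.match(line).group(1)
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def recovery_times(lines):
    """Return (port, pause_time, seconds_until_link_online) tuples."""
    pending = {}   # port -> time the pause was detected
    results = []
    for line in lines:
        pause = PAUSE.search(line)
        if pause:
            pending[pause.group(1)] = parse_time(line)
            continue
        online = ONLINE.search(line)
        if online and online.group(1) in pending:
            start = pending.pop(online.group(1))
            elapsed = (parse_time(line) - start).total_seconds()
            results.append((online.group(1), start, elapsed))
    return results


for port, start, elapsed in recovery_times(SAMPLE):
    print(f"{port} paused at {start:%H:%M:%S}, link online after {elapsed:.0f}s")
```

For the three events shown, the gap on sp1.0 comes out at 3 seconds by
syslog timestamps, which only have one-second resolution.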
=20
=20

------_=_NextPart_001_01C8B54F.C45BB7E8
Content-Type: text/html;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=3DContent-Type content=3D"text/html; =
charset=3Dus-ascii">
<META content=3D"MSHTML 6.00.2800.1555" name=3DGENERATOR></HEAD>
<BODY>
<DIV><FONT face=3DArial size=3D2><SPAN class=3D407390823-13052008>As we =
discussed in the Cougar meeting, the SRAM on the QLogic part will at =
times cause the QLA2432 to enter a PAUSE state. </SPAN></FONT><FONT =
face=3DArial size=3D2><SPAN class=3D407390823-13052008>Sometimes an =
error is posted indicating that an SRAM parity error has occurred. =
Other times no error condition is posted at all. The latter is of great =
concern, and there may be yet another problem behind this one. While =
working on the issue, several bugs surfaced in the device-level =
handler. They stemmed from differences between the older QLA23xx part =
used in the legacy machines and the QLA24xx parts used in the Cougar =
hardware. The good news is that it is now possible to detect the =
unexplained PAUSE condition, reset the QLogic part, and restart I/O. =
This takes about&nbsp;4 seconds of wall-clock time, but it should allow =
QA to use the machines they have for testing, at least until the pause =
problem is resolved IN HARDWARE, that is.</SPAN></FONT></DIV>
<DIV><FONT face=3DArial size=3D2><SPAN=20
class=3D407390823-13052008></SPAN></FONT>&nbsp;</DIV>
<DIV><FONT face=3DArial size=3D2><SPAN class=3D407390823-13052008>Amit =
is testing the change, and if all goes well I will work on getting it =
checked into the dev branch.</SPAN></FONT></DIV>
<DIV><FONT face=3DArial size=3D2><SPAN =
class=3D407390823-13052008><BR><STRONG><FONT=20
color=3D#ff0000>May 13 15:34:40 g14r10 kernel: fp0: 147: ispfc:sp1.0: =
QLA2432=20
Unknown Pause detected. r_to_h[0x53668113] hccr[0x00000000] Resetting=20
Adapter.</FONT></STRONG></SPAN></FONT></DIV>
<DIV><FONT face=3DArial size=3D2><SPAN class=3D407390823-13052008>May 13 =
15:34:41=20
g14r10 kernel: fp0: 148: ispfc:ISPFC_CS_NSDB_CHANGE,[8015] on port=20
[10800]<BR>May 13 15:34:43 g14r10 kernel: fp0: 149:=20
ispfc:ISPFC_CS_NSDB_CHANGE,[8014] LoopID_LoginState [ff0006]<BR>May 13 =
15:34:43=20
g14r10 kernel: fp0: ispfc:sp1.0: Public point-to-point connection =
established:=20
host_port_id 0x10800<BR>May 13 15:34:43 g14r10 kernel: fp0: 150: ispfc:=20
ispfc:sp1.0 Fibrechannel link now online<BR>May 13 15:34:43 g14r10 =
kernel: fp0:=20
151: ispfc:ISPFC_CS_NSDB_CHANGE,[8014] LoopID_LoginState [20004]<BR>May =
13=20
15:34:44 g14r10 kernel: fp0: 152: ispfc:ISPFC_CS_NSDB_CHANGE,[8015] on =
port=20
[10800]<BR>May 13 15:34:47 g14r10 kernel: fp0: 153:=20
ispfc:ISPFC_CS_NSDB_CHANGE,[8015] on port [10800]<BR>May 13 15:34:51 =
g14r10=20
kernel: fp0: 154: ispfc: ispfc:sp1.1 Fibrechannel link now online<BR>May =
13=20
15:34:51 g14r10 kernel: fp0: 155: ispfc:ISPFC_CS_NSDB_CHANGE,[8014]=20
LoopID_LoginState [20004]<BR>May 13 15:36:01 g14r10 =
/USR/SBIN/CRON[3421]: (root)=20
CMD (/onstor/bin/emrscron -g h_res_stats)<BR><STRONG><FONT =
color=3D#ff0000>May 13=20
15:38:07 g14r10 kernel: fp0: 156: ispfc:sp1.0: QLA2432 Unknown Pause =
detected.=20
r_to_h[0x3dfb8113] hccr[0x00000000] Resetting =
Adapter.</FONT></STRONG><BR>May 13=20
15:38:07 g14r10 kernel: fp0:<BR>May 13 15:38:08 g14r10 kernel: fp0: 157: =

ispfc:ISPFC_CS_NSDB_CHANGE,[8015] on port [10800]<BR>May 13 15:38:10 =
g14r10=20
kernel: fp0: 158: ispfc:ISPFC_CS_NSDB_CHANGE,[8014] LoopID_LoginState=20
[ff0006]<BR>May 13 15:38:10 g14r10 kernel: fp0: ispfc:sp1.0: Public=20
point-to-point connection established: host_port_id 0x10800<BR>May 13 =
15:38:10=20
g14r10 kernel: fp0: 159: ispfc: ispfc:sp1.0 Fibrechannel link now =
online<BR>May=20
13 15:38:10 g14r10 kernel: fp0: 160: ispfc:ISPFC_CS_NSDB_CHANGE,[8014]=20
LoopID_LoginState [20004]<BR>May 13 15:38:11 g14r10 kernel: fp0: 161:=20
ispfc:ISPFC_CS_NSDB_CHANGE,[8015] on port [10800]<BR>May 13 15:38:12 =
g14r10=20
kernel: fp0: 162: ispfc:ISPFC_CS_NSDB_CHANGE,[8015] on port =
[10800]<BR>May 13=20
15:38:16 g14r10 kernel: fp0: 163: ispfc: ispfc:sp1.1 Fibrechannel link =
now=20
online<BR>May 13 15:38:16 g14r10 kernel: fp0: 164:=20
ispfc:ISPFC_CS_NSDB_CHANGE,[8014] LoopID_LoginState [20004]<BR>May 13 =
15:39:02=20
g14r10 /USR/SBIN/CRON[3434]: (root) CMD (/onstor/bin/emrscron -g=20
h_res_stats)<BR>May 13 15:42:01 g14r10 /USR/SBIN/CRON[3445]: (root) CMD=20
(/onstor/bin/emrscron -g h_res_stats)<BR>May 13 15:45:01 g14r10=20
/USR/SBIN/CRON[3456]: (root) CMD (/onstor/bin/emrscron -g =
h_res_stats)<BR>May 13=20
15:48:01 g14r10 /USR/SBIN/CRON[3512]: (root) CMD (/onstor/bin/emrscron =
-g=20
h_res_stats)<BR>May 13 15:51:01 g14r10 /USR/SBIN/CRON[3523]: (root) CMD=20
(/onstor/bin/emrscron -g h_res_stats)<BR>May 13 15:53:05 g14r10 kernel: =
tx1:=20
TXRX1:4 &gt; Data corruption, volume nfsvol4, file0039, conn_idx=3D39=20
sequential=3Dno<BR>May 13 15:53:05 g14r10 kernel: =
tx1:&nbsp;&nbsp;&nbsp;&nbsp;=20
offset=3D34950016(0X2154B80), page 4266, got 0XC02700002154BC0 expected=20
0XC02700002154B80<BR>May 13 15:53:05 g14r10 kernel: =
tx1:&nbsp;&nbsp;&nbsp;&nbsp;=20
Data came from same page in same file, volume nfsvol4, file0039,=20
offset=3D34950080(0X2154BC0), page 4266<BR>May 13 15:54:01 g14r10=20
/USR/SBIN/CRON[3534]: (root) CMD (/onstor/bin/emrscron -g=20
h_res_stats)<BR><STRONG><FONT color=3D#ff0000>May 13 15:54:05 g14r10 =
kernel: fp0:=20
165: ispfc:sp1.0: QLA2432 Unknown Pause detected. r_to_h[0x56128113]=20
hccr[0x00000000] Resetting Adapter.</FONT></STRONG><BR>May 13 15:54:05 =
g14r10=20
kernel: fp0:<BR>May 13 15:54:06 g14r10 kernel: fp0: 166:=20
ispfc:ISPFC_CS_NSDB_CHANGE,[8015] on port [10800]<BR>May 13 15:54:07 =
g14r10=20
kernel: fp0: 167: ispfc:ISPFC_CS_NSDB_CHANGE,[8014] LoopID_LoginState=20
[ff0006]<BR>May 13 15:54:08 g14r10 kernel: fp0: ispfc:sp1.0: Public=20
point-to-point connection established: host_port_id 0x10800<BR>May 13 =
15:54:08=20
g14r10 kernel: fp0: 168: ispfc: ispfc:sp1.0 Fibrechannel link now =
online<BR>May=20
13 15:54:08 g14r10 kernel: fp0: 169: ispfc:ISPFC_CS_NSDB_CHANGE,[8014]=20
LoopID_LoginState [20004]<BR>May 13 15:54:08 g14r10 kernel: fp0: 170:=20
ispfc:ISPFC_CS_NSDB_CHANGE,[8015] on port [10800]<BR>May 13 15:54:12 =
g14r10=20
kernel: fp0: 171: ispfc: ispfc:sp1.1 Fibrechannel link now online<BR>May =
13=20
15:54:13 g14r10 kernel: fp0: 172: ispfc:ISPFC_CS_NSDB_CHANGE,[8014]=20
LoopID_LoginState [20004]<BR>May 13 15:54:13 g14r10 kernel: fp0: 173:=20
ispfc:ISPFC_CS_NSDB_CHANGE,[8015] on port [10800]<BR>May 13 15:54:17 =
g14r10=20
kernel: fp0: 174: ispfc: ispfc:sp1.1 Fibrechannel link now online<BR>May =
13=20
15:54:17 g14r10 kernel: fp0: 175: ispfc:ISPFC_CS_NSDB_CHANGE,[8014]=20
LoopID_LoginState [20004]<BR>May 13 15:57:01 g14r10 =
/USR/SBIN/CRON[3545]: (root)=20
CMD (/onstor/bin/emrscron -g h_res_stats)<BR>May 13 16:00:01 g14r10=20
/USR/SBIN/CRON[3556]: (root) CMD (/onstor/bin/emrscron -g =
h_res_stats)<BR>May 13=20
16:03:01 g14r10 /USR/SBIN/CRON[3567]: (root) CMD (/onstor/bin/emrscron =
-g=20
h_res_stats)<BR>May 13 16:06:01 g14r10 /USR/SBIN/CRON[3578]: (root) CMD=20
(/onstor/bin/emrscron -g h_res_stats)<BR>May 13 16:07:45 g14r10 kernel: =
tx1: NFS=20
Perf: 1 workflows of 60 failed in 3376.523477 seconds, 0 resource =
failures, 0=20
rpc failures, 0 nfs failures, 1 read corruption failures, avg/workflow=20
4279659900 usecs<BR>May 13 16:07:45 g14r10 kernel: tx1: ^I0 waits for =
memory=20
count<BR>May 13 16:07:45 g14r10 kernel: tx1: ^I0 waits for allocatable =
request=20
count<BR>May 13 16:09:02 g14r10 /USR/SBIN/CRON[3589]: (root) CMD=20
(/onstor/bin/emrscron -g h_res_stats)<BR>May 13 16:12:01 g14r10=20
/USR/SBIN/CRON[3600]: (root) CMD (/onstor/bin/emrscron -g =
h_res_stats)<BR>May 13=20
16:15:01 g14r10 /USR/SBIN/CRON[3611]: (root) CMD (/onstor/bin/emrscron =
-g=20
h_res_stats)<BR>May 13 16:17:01 g14r10 /USR/SBIN/CRON[3666]: (root) CMD=20
(&nbsp;&nbsp; cd / &amp;&amp; run-parts --report=20
/etc/cron.hourly)</SPAN></FONT></DIV>
<DIV>&nbsp;</DIV>
<DIV><FONT face=3DArial size=3D2><SPAN=20
class=3D407390823-13052008></SPAN></FONT>&nbsp;</DIV></BODY></HTML>

------_=_NextPart_001_01C8B54F.C45BB7E8--
