X-MimeOLE: Produced By Microsoft Exchange V6.5
Received: by onstor-exch02.onstor.net 
	id <01C7B434.159EA20E@onstor-exch02.onstor.net>; Thu, 21 Jun 2007 10:43:39 -0800
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01C7B434.159EA20E"
Content-class: urn:content-classes:message
Subject: RE: An Opportunity for Improving Sequential Write Performance
Date: Thu, 21 Jun 2007 10:43:39 -0800
Message-ID: <BB375AF679D4A34E9CA8DFA650E2B04E0443A292@onstor-exch02.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E02FB258A@onstor-exch02.onstor.net>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: An Opportunity for Improving Sequential Write Performance
Thread-Index: Ace0JZTNmUy91HXbRCGfEJsglkOemAABgSHcAAHuWaA=
From: "Jay Michlin" <jay.michlin@onstor.com>
To: "Bill Nadzam" <bill.nadzam@onstor.com>,
	"Fay Chong" <fay.chong@onstor.com>
Cc: "Shawn Currin" <shawn.currin@onstor.com>,
	"John Klokkenga" <john.klokkenga@onstor.com>,
	"Paul Hammer" <paul.hammer@onstor.com>,
	"Jerry Lopatin" <jerry.lopatin@onstor.com>,
	"Brian DeForest" <brian.deforest@onstor.com>,
	"Tim Gardner" <tim.gardner@onstor.com>,
	"Narayan Venkat" <narayan.venkat@onstor.com>,
	"Jobi Ariyamannil" <jobi.ariyamannil@onstor.com>,
	"Andy Sharp" <andy.sharp@onstor.com>,
	"Maxim Kozlovsky" <maxim.kozlovsky@onstor.com>,
	"Brian Stark" <brian.stark@onstor.com>

This is a multi-part message in MIME format.

------_=_NextPart_001_01C7B434.159EA20E
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

Hello all,

Fay and I discussed this while he was working on the Xyratex array, and I
committed to carrying this information forward and exploring how to apply
it to our product development. I am copying key people in SW Development
and invite comments on whether this presents an opportunity for any of
the projects we're working on (subject to existing schedule
commitments). If not, I invite comments on the scope of the work and its
priority so we can consider including it in a future project.

Thanks,
jay

________________________________

From: Bill Nadzam
Sent: Thursday, June 21, 2007 10:43 AM
To: Fay Chong; John Klokkenga; Paul Hammer; Jerry Lopatin; Jay Michlin
Cc: Shawn Currin; dl-qualification
Subject: RE: An Opportunity for Improving Sequential Write Performance


Fantastic information.
Good work on the problem and cause.
It sheds light on the slow write performance we have seen on some other
arrays as well.

________________________________

From: Fay Chong
Sent: Thu 6/21/2007 9:59 AM
To: John Klokkenga; Paul Hammer; Jerry Lopatin; Jay Michlin
Cc: Bill Nadzam; Shawn Currin; Fay Chong; dl-qualification
Subject: An Opportunity for Improving Sequential Write Performance



Executive Summary:

Problem Statement: Improving sequential write performance for Xyratex
and arrays in general.

Findings: The ONStor filer issues sequential SCSI write commands when
presented with a sequential write workload by the filer client. The SCSI
write commands are sometimes issued out of order, which causes
performance problems in Xyratex and probably other arrays. If the file
system issued write commands in order, the performance of Xyratex and
other similar arrays would improve. In addition, performance should
improve for most if not all other arrays.

Details:

In the past week there has been an intensive effort by Xyratex and
ONStor to improve the performance of the ONStor filer and Xyratex
storage combination for sequential writes. Sometimes the filer issues
SCSI write commands out of order. The Xyratex controller is not able to
reorder the commands for efficient write-back operation. Here's a
listing of write-back operations for RAID-5, from most to least
efficient:

1.      Full stripe write - Data for a full stripe is received in one or
more commands. Parity is calculated, and the data and parity are written
to the disks.

2.      Full segment, partial stripe write (update write) - Data for a
stripe segment (also called stripe size, chunk size) is received in one
or more commands. New parity is calculated, and the new segment data and
parity are written to the two disks.

3.      Partial segment, partial stripe write (update write) - Data for
part of a segment is received. New parity is calculated, and the new data
and parity are written to the two disks.
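
As a rough illustration of why the update writes cost more, here is a
minimal Python sketch of RAID-5 XOR parity. It is not controller code;
the function names and the toy 4-byte segments are made up for the
example. It shows that a full stripe write needs only the new data,
while an update write needs the old data and old parity as inputs.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def full_stripe_parity(segments):
    """Case 1: parity comes from the new data alone, so no disk reads."""
    return xor_blocks(*segments)


def update_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Cases 2 and 3: read-modify-write, which needs old data and old parity."""
    return xor_blocks(old_parity, old_data, new_data)


# Toy 4+1 stripe with 4-byte segments (illustrative values only).
stripe = [bytes([value] * 4) for value in (1, 2, 4, 8)]
parity = full_stripe_parity(stripe)

# Overwrite one segment and update parity without reading the other three.
new_segment = bytes([0x55] * 4)
updated = update_parity(parity, stripe[2], new_segment)
stripe[2] = new_segment

# The incremental update matches a parity recomputed from the whole stripe.
assert updated == full_stripe_parity(stripe)
```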

Example:

Suppose the filer is connected to a RAID-5 array with 4+1 disks and a
segment size of 64K. Keep in mind that the ONStor file system uses a
block size of 8K for data. There is some coalescing in the filer at the
SCSI layer. In the case of a full stripe write, 256K bytes are received
(4 times 64K), the parity is calculated from the data, and 5 disk
operations are performed to write the data and parity. In general, the
overhead of executing a command is much greater than the data transfer
time. For a full segment write, 64K bytes are received, the new parity is
calculated from the old parity, the old data, and the new data, and the
new data and new parity are written to disk. Obtaining the old parity and
old data requires 2 disk operations, and writing the new data and new
parity requires 2 more, for a total of 4 disk I/O's. The partial segment
write works like the full segment write, except that less data is read
and written. To summarize the cost of writing 256K bytes of data:

*       Full stripe write - 5 disk I/O's
*       Full segment write - 16 disk I/O's (4 segments x 4 I/O's)
*       Partial segment write (8K) - 128 disk I/O's (32 writes x 4 I/O's)
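
The arithmetic behind this summary can be sketched in a few lines of
Python. This is only a model of the example above, assuming aligned
writes, 4 disk I/O's per update write (2 reads plus 2 writes), and no
coalescing or caching in the controller; the names are illustrative.

```python
KIB = 1024
DATA_DISKS = 4          # 4+1 RAID-5 group from the example
SEGMENT = 64 * KIB      # segment size from the example
TOTAL = 256 * KIB       # amount of data written


def full_stripe_ios(total: int) -> int:
    """One write per data disk plus one parity write for each full stripe."""
    stripe_bytes = DATA_DISKS * SEGMENT
    return (total // stripe_bytes) * (DATA_DISKS + 1)


def update_write_ios(total: int, write_size: int) -> int:
    """Each update write reads old data and old parity, then writes both."""
    return (total // write_size) * 4


print("full stripe write: ", full_stripe_ios(TOTAL))
print("full segment write:", update_write_ios(TOTAL, SEGMENT))
print("8K partial write:  ", update_write_ios(TOTAL, 8 * KIB))
```

Under these assumptions, 256K arriving as four separate 64K update
writes costs 4 x 4 disk I/O's, and the same data arriving as thirty-two
8K update writes costs 32 x 4.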

Xyratex specific:

The Xyratex controller code can reorder one command that is out of order,
i.e., coalesce two commands. It uses a full stripe lock, which means that
only one update operation on a stripe can be in progress at a time. The
combination of coalescing only two commands and a full stripe lock means
that a high number of disk operations is required, which results in low
throughput. Xyratex has made some changes to its code and now gets a
consistent 17-20 MB/sec. For comparison, the Saturn array achieves about
80 MB/sec in a similar configuration. Xyratex has direct-attach data that
shows 40 MB/sec for 8K sequential writes with a queue depth of 8. For
128K transfers, the transfer rate is over 200 MB/sec at a queue depth of
8.

More details and other arrays:

Clearly, having the filer process and issue sequential write commands in
order would improve the performance of the Xyratex controller. However, a
larger benefit may come from coalescing more commands at the filer, thus
presenting larger writes to the storage. Coalescing at the filer occurs
opportunistically. If the SCSI layer can issue a command to storage, it
does so with no coalescing taking place. If the storage already has the
maximum number of I/O's issued to it (maxdispatch), the FC code sorts its
queue by LBA and coalesces just before issuing another command. The
maximum I/O size is 128K. If the FC code receives the writes out of
order, the opportunity for coalescing depends on there being a
sufficient number of commands in the queue. If the FC code receives
writes in order, coalescing will occur more often, i.e., whenever
maxdispatch has been reached and there are 2 or more commands in the FC
queue. Given the direct-attach figure for 128K transfers, even the
Xyratex storage could achieve 200 MB/sec.
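
The sort-and-coalesce step described above might look roughly like the
following Python sketch. It is not the FC driver code: the queue
representation, the function name, and the 512-byte block size are
assumptions made for the example. Only the 128K maximum I/O size comes
from the text.

```python
from typing import List, Tuple

BLOCK = 512                 # bytes per LBA (assumed for illustration)
MAX_IO = 128 * 1024         # maximum I/O size from the text

Write = Tuple[int, int]     # (starting LBA, length in blocks)


def sort_and_coalesce(queue: List[Write]) -> List[Write]:
    """Sort queued writes by LBA and merge runs that are exactly adjacent."""
    max_blocks = MAX_IO // BLOCK
    merged: List[Write] = []
    for lba, nblocks in sorted(queue):
        if merged:
            last_lba, last_len = merged[-1]
            adjacent = last_lba + last_len == lba
            if adjacent and last_len + nblocks <= max_blocks:
                merged[-1] = (last_lba, last_len + nblocks)
                continue
        merged.append((lba, nblocks))
    return merged


# Four 8K writes (16 blocks each) queued out of order, with a gap at LBA 48.
queue = [(32, 16), (0, 16), (16, 16), (64, 16)]
issued = sort_and_coalesce(queue)
assert issued == [(0, 48), (64, 16)]
```

With only one or two commands waiting, there is little to merge, which
is why the benefit depends on how many commands are in the queue when
the sort happens.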

Future work:

We should estimate the performance gain from issuing write commands in
order. Command traces have been recorded. Two experiments come to mind:

1.      Using direct-attach storage, the command list could be issued
and the performance noted. The command list could then be sorted by LBA,
issued again, and the performance compared with that of the unsorted
list. This will give an idea of the benefit of just issuing write
commands in order.

2.      Using some assumption about queue depth at the FC, the sorted
list of commands could be aggregated and issued to the storage. This
would show the benefit of in-order commands and coalescing.
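
As a possible first step for experiment 1, a short Python sketch like
the one below could prepare the sorted command list and report how far
from sequential each list is. The trace format (a list of starting LBA
and length pairs) and the sample values are invented for illustration;
the recorded traces may look different.

```python
from typing import List, Tuple

Command = Tuple[int, int]   # (starting LBA, length in blocks), assumed format


def count_breaks(trace: List[Command]) -> int:
    """Count commands that do not start where the previous one ended."""
    breaks = 0
    expected = None
    for lba, nblocks in trace:
        if expected is not None and lba != expected:
            breaks += 1
        expected = lba + nblocks
    return breaks


# Made-up trace of 8K writes (16 blocks each), issued slightly out of order.
trace = [(0, 16), (32, 16), (16, 16), (48, 16), (80, 16), (64, 16)]
in_order = sorted(trace)

print("breaks as recorded:", count_breaks(trace))
print("breaks after sort: ", count_breaks(in_order))
```

Both lists would then be replayed against the storage and the measured
throughput compared.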

Summary:

Working on the Xyratex write throughput problem has been very
beneficial. Having the file system issue sequential write commands in
order would increase performance on arrays such as Xyratex, and would
further increase performance through aggregation into larger write
commands.

Questions, comments, and suggestions welcome.

Thanks

Fay




Fay Chong

Sr. Performance Engineer

ONStor, Inc.

fay.chong@onstor.com

408.376.3130 (w)


------_=_NextPart_001_01C7B434.159EA20E--
