X-MimeOLE: Produced By Microsoft Exchange V6.5
Received: by onstor-exch02.onstor.net 
	id <01C71FC9.AA6BB7FB@onstor-exch02.onstor.net>; Thu, 14 Dec 2006 13:49:01 -0800
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01C71FC9.AA6BB7FB"
Content-class: urn:content-classes:message
Subject: Discussion about 20% Effort on Internal/Improvement Projects, Updated
Date: Thu, 14 Dec 2006 13:49:00 -0800
Message-ID: <BB375AF679D4A34E9CA8DFA650E2B04E01C0ADC1@onstor-exch02.onstor.net>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Discussion about 20% Effort on Internal/Improvement Projects, Updated
thread-index: AccfGcgVq/bkmbtIR8OF5A6mFWjFTw==
From: "Jay Michlin" <jay.michlin@onstor.com>
To: "dl-Software" <dl-software@onstor.com>
Cc: "Maxim Kozlovsky" <maxim.kozlovsky@onstor.com>,
	"Ian Brown" <ian.brown@onstor.com>,
	"Paul Hammer" <paul.hammer@onstor.com>,
	"Brian Stark" <brian.stark@onstor.com>,
	"Jerry Lopatin" <jerry.lopatin@onstor.com>,
	"Jonathan Goldick" <jonathan.goldick@onstor.com>

This is a multi-part message in MIME format.

------_=_NextPart_001_01C71FC9.AA6BB7FB
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

Here is an update incorporating comments from Max, Andy, Wencheng and
Ian. In this version I've tried to group together ideas that are related
or close to each other, and also separate some broad categories (e.g.
tools). I've also cc'd Paul and Brian Stark since some of the
suggestions will be of special interest to them, and Jonathan and Jerry
for their information.
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D
Original introduction: In our staff meeting today (Wednesday) we
brainstormed on projects we might do as part of the Delorean release
(probably for March or April) to attend to areas of our code that we
think need attention. These projects probably won't add features that
are immediately visible to customers, but rather will add strength,
robustness or a more solid foundation to our overall product. The payoff
is long term, but it's work that must go on constantly.
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D

System Hardening and Robustness
* Eliminate sendAgile - should be at the top of the priority order.
There is no need to do the whole thing at once; just pick a piece of
code and make it use RMC.

* Support NFS mount of the EFS volumes from the SSC through the loopback
- should be easy to implement, and it takes care of a number of problems
because it gives the SSC lots of disk storage. Examples: 1) flash
wear-out problem - logs can be stored directly on the management
volumes, so there is no need to write to the flash anymore except during
upgrades and boot-up; 2) upgrades - fixes the failures caused by the
lack of temporary storage; 3) space to write coredumps without
overflowing the /var partition. (Ian's comment: What Max mentioned about
mounting volumes via the loopback interface is KEY going forward for
fixing the upgrade problems we have and for storing lots more data,
which makes life easier in diagnosing problems.)

* Verify the file system log

* eek for clusDB

* Shrink the clusDB

* Rewrite tpl/fp

* Rewrite/clean up/refactor sanmd

Debug Features that Will Make Finding Problems Easier
* Global memory statistics, runtime type information for FP/TXRX/FC - a
must-have if we are to solve memory leaks happening in the field in a
timely manner, instead of working for months on a single problem like
Motorola's buffer leak.

* On the same topic, memory leak detector - find the slow memory leaks
proactively instead of waiting for customers to come up with a workload
that makes slow leaks fast.
+ Andy's comment: What about just linking in a memory allocation
debugging library, like mpatrol? With a little bit of work we were able
to link with this at my last embedded project and it found all kinds of
crazy errors that probably never would have been found. Obviously we
would only use something like this at a "problem" site and never in
normal production code, but it's probably the easiest/closest we will
get to a useful tool given our context.
+ Max's additional comment: I would like to have memory statistics +
type information always on in the production code on the embedded CPUs.
Sometimes you get only one chance to reproduce the problem, and
sometimes customers will be reluctant to install special builds. The
library solution will be great for the SSC. If the performance hit is
not very high, it can even be permanently on, for all or a subset of
executables.
+ Ian's additional comment: http://manju.cs.berkeley.edu/ccured/ I've
played with that before, it's pretty cool, and well written.

* Error traceability - in the current product it is sometimes hard to
trace an error reported by one of the daemons back to the origin of the
problem. Requests originated by clients or created internally should
have a unique identifier that is carried through the RPCs executed on
behalf of the request. The error logs and the CLI logs should include
that identifier so failures can be related to the command that failed.
(Wencheng's comment: I second Max's comments regarding an error tracing
mechanism in the code. Windows has a tracing mechanism called WPP
tracing, which is a very useful debugging tool; it would be very helpful
for troubleshooting if we had a similar tool.)

* Tracing of the internal messages - this will help in diagnosing
problems, performance tuning and understanding how the system works. For
example, the recent discussion about what happens in the system during
failover could have been much shorter if we had been able to look at
the message trace. Make Charissa do a nice GUI to display the traces.

* FC coredump - or maybe wait for Cougar so the problem goes away.

* Allow separate reboot of FP/TxRx/FC on bobcat, make it the same as
cheetah - the compile/run cycle is way too long on bobcat.

Tools/processes
* Parallel make system - the builds take way too much time.

* Automated lab machine setup and initial config: so that you can
request a lab filer, ask for a version, number of vsvrs, and number of
volumes, and have a script go off and set up the lab machine for you
before you start using it.

Testing
* Automated test bots which give nightly to weekly test results for
builds on a web interface, keep an inventory of our tests, and provide a
single interface to run them and to submit new tests for the framework.
The test suite should be able to run tests on any lab filer via a web
interface, and email results to a developer.

* Static build-time code tests: We should invest in some static
build-time code checkers which can look for bugs in our code, beyond
simple gcc compiler warnings and errors.

Features
* Encryption and/or compression for DM-IP

* Vol Create /8 LUNs

* Mirror repair or resynchronize

* Authentication for DM-IP

* Improve upgrade time and upgrade reliability




------_=_NextPart_001_01C71FC9.AA6BB7FB--
