X-MimeOLE: Produced By Microsoft Exchange V6.5
Received: by onstor-exch02.onstor.net 
	id <01C77DED.A3A10E12@onstor-exch02.onstor.net>; Fri, 13 Apr 2007 10:03:20 -0700
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Content-class: urn:content-classes:message
Subject: RE: SW Development Analysis of Proposed Availability Initiative
Date: Fri, 13 Apr 2007 10:03:20 -0700
Message-ID: <BB375AF679D4A34E9CA8DFA650E2B04E0343675C@onstor-exch02.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E03436748@onstor-exch02.onstor.net>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: SW Development Analysis of Proposed Availability Initiative
Thread-Index: Acd9ZJ6Ll53cVhYeRpWdW5VFuzbY+AAcFwkAAAYLD2A=
From: "Jay Michlin" <jay.michlin@onstor.com>
To: "Brian DeForest" <brian.deforest@onstor.com>,
	"Jerry Lopatin" <jerry.lopatin@onstor.com>,
	"Paul Hammer" <paul.hammer@onstor.com>
Cc: "Tim Gardner" <tim.gardner@onstor.com>,
	"Charissa Willard" <charissa.willard@onstor.com>,
	"Maxim Kozlovsky" <maxim.kozlovsky@onstor.com>,
	"Andy Sharp" <andy.sharp@onstor.com>

Brian,

Great minds think alike... you may recall that during our earlier Zonda
planning, rewriting the ssc daemon was one of our projects, and it's
included in the workforce balance. The only reason it's not included in
the "10 items" that Paul, Jerry, and I discussed is that I forgot to put
it on that list.

I also think there is great benefit to be found in collaboration between
SW Development and EMRS. A shared availability objective, in the context
of a disciplined, release-oriented project, could be a nice opportunity
for that collaboration.

jay

-----Original Message-----
From: Brian DeForest
Sent: Friday, April 13, 2007 9:57 AM
To: Jay Michlin; Jerry Lopatin; Paul Hammer
Cc: Tim Gardner; Charissa Willard; Maxim Kozlovsky; Andy Sharp
Subject: RE: SW Development Analysis of Proposed Availability Initiative

There are a few possible components that will cause downtime:

1.  hardware problems:  Bobcat, network/FC connectivity, LAN/SAN
2.  BSD crashes
3.  TXRX, FP processor crashes
4.  SSC daemon crashes
5.  volume offline
6.  other software problems (e.g. daemons crash when /var is 100% full)

Assuming #1 and #2 will not be addressed at all until Cougar, and #5 is
being addressed by file system hardening, then hardening to address #3,
#4, and #6 should be part of the Zonda planning.

#4 should be partially addressed by fixing Coverity-found defects;
however, Coverity will almost certainly not help with #3 in the short
term, since it will not understand our threading model.

For #4, I suggest we need to determine the top 3 (or so) daemons that
need the most hardening.  This can be determined from data in the data
warehouse (e.g. most frequent core dumps), and/or less scientifically by
polling CS for their feedback.  We are presumably already addressing the
one at the top of the list, CIFSD, with the Samba replacement project
and the Samba hardening started in Delorean.  Once we have the list of
the top 3 (or so) to address, we can make all Coverity-found defects in
those daemons top priority.
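
As a minimal sketch of the ranking step, assuming the data warehouse can
export one record per core dump that names the daemon involved: the
record layout, the function name top_crashing_daemons, and every daemon
name other than cifsd are hypothetical and only illustrate the idea.

```python
from collections import Counter


def top_crashing_daemons(core_dump_records, n=3):
    """Return the n daemons with the most core dumps, most frequent first.

    core_dump_records is assumed to be an iterable of dicts with a
    "daemon" key; this layout is hypothetical, not the warehouse schema.
    """
    counts = Counter(record["daemon"] for record in core_dump_records)
    return counts.most_common(n)


# Hypothetical sample data: one dict per core dump.
sample_records = (
    [{"daemon": "cifsd"}] * 4
    + [{"daemon": "daemon_a"}] * 3
    + [{"daemon": "daemon_b"}] * 2
    + [{"daemon": "daemon_c"}] * 1
)

for name, dumps in top_crashing_daemons(sample_records):
    print(f"{name}: {dumps} core dumps")
```

The resulting short list could then be cross-checked against CS
feedback before deciding which Coverity-found defects to prioritize.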

For #3, memory tagging may be a first step; however, I think additional
consideration needs to be given to hardening the embedded code.

For #6, we should give more thought to completing the list.  The /var
full problem is one item that comes to mind, and I'm already working
w/ Ian on ways EMRS could help address it.  There are other related
projects as well, including limiting log file sizes or moving logs to
the mgmt volume, as already listed.

-----Original Message-----
From: Jay Michlin
Sent: Thursday, April 12, 2007 5:43 PM
To: Jerry Lopatin; Paul Hammer
Cc: Tim Gardner; Charissa Willard; Maxim Kozlovsky; Andy Sharp; Brian
DeForest
Subject: SW Development Analysis of Proposed Availability Initiative

Attached Word document.
