X-MimeOLE: Produced By Microsoft Exchange V6.5
Received: by onstor-exch02.onstor.net 
	id <01C71FBA.DE700C17@onstor-exch02.onstor.net>; Thu, 14 Dec 2006 12:03:05 -0800
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Content-class: urn:content-classes:message
Subject: RE: Notes from Our Discussion about 20% Effort on Internal/Improvement Projects
Date: Thu, 14 Dec 2006 12:03:05 -0800
Message-ID: <BB375AF679D4A34E9CA8DFA650E2B04E01C0AD1D@onstor-exch02.onstor.net>
In-Reply-To: <20061214110407.32da6557@ripper.onstor.net>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Notes from Our Discussion about 20% Effort on Internal/Improvement Projects
thread-index: AccfsqFCmv2w0NjtSwajfZ2g0MTldgABP9lQ
From: "Maxim Kozlovsky" <maxim.kozlovsky@onstor.com>
To: "Andy Sharp" <andy.sharp@onstor.com>,
	"Wencheng Chai" <wencheng.chai@onstor.com>
Cc: "Jay Michlin" <jay.michlin@onstor.com>,
	"dl-Software" <dl-software@onstor.com>,
	"Ian Brown" <ian.brown@onstor.com>

I would like to have memory statistics + type information always on in
the production code on the embedded CPUs. Sometimes you get only one
chance to reproduce the problem, and sometimes customers will be
reluctant to install special builds.
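
As a rough illustration of what always-on statistics with type
information could look like, here is a minimal sketch. It is not taken
from the product code: the type tags, the typed_alloc/typed_free names
and the header layout are all hypothetical, and it assumes a
single-threaded allocator (no locking or atomics).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical type tags; the real object types are not listed here. */
enum mem_type {
    MEM_TYPE_BUFFER,
    MEM_TYPE_DESCRIPTOR,
    MEM_TYPE_OTHER,
    MEM_TYPE_COUNT
};

/* Per-type counters that stay enabled in production builds. */
struct mem_stat {
    uint64_t allocs;      /* successful allocations */
    uint64_t frees;       /* frees */
    uint64_t live_bytes;  /* bytes currently outstanding */
};

static struct mem_stat mem_stats[MEM_TYPE_COUNT];

/* Header placed in front of every block so that free can recover the
 * type and size. The max_align_t member keeps the payload aligned. */
union mem_hdr {
    struct {
        size_t size;
        enum mem_type type;
    } info;
    max_align_t align;
};

void *typed_alloc(enum mem_type type, size_t size)
{
    union mem_hdr *hdr;

    assert((unsigned)type < MEM_TYPE_COUNT);
    if (size > SIZE_MAX - sizeof(*hdr))
        return NULL;                      /* size would overflow */
    hdr = malloc(sizeof(*hdr) + size);
    if (hdr == NULL)
        return NULL;
    hdr->info.size = size;
    hdr->info.type = type;
    mem_stats[type].allocs++;
    mem_stats[type].live_bytes += size;
    return hdr + 1;                       /* payload follows the header */
}

void typed_free(void *ptr)
{
    union mem_hdr *hdr;
    struct mem_stat *st;

    if (ptr == NULL)
        return;
    hdr = (union mem_hdr *)ptr - 1;       /* step back to the header */
    st = &mem_stats[hdr->info.type];
    st->frees++;
    st->live_bytes -= hdr->info.size;
    free(hdr);
}

/* Blocks of one type still outstanding. A count that only grows under
 * a steady workload points at a leak of that type. */
uint64_t typed_live_count(enum mem_type type)
{
    assert((unsigned)type < MEM_TYPE_COUNT);
    return mem_stats[type].allocs - mem_stats[type].frees;
}
```

The cost per allocation is one small header plus a few counter
updates, which is the kind of overhead that could plausibly stay on in
the field. A CLI command or a coredump could then dump mem_stats to
show which type is growing.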

The library solution will be great for the SSC. If the performance hit
is not very high, it could even be left on permanently, for all or a
subset of executables.

> -----Original Message-----
> From: Andrew Sharp [mailto:andy.sharp@onstor.com]
> Sent: Thursday, December 14, 2006 11:04 AM
> To: Wencheng Chai
> Cc: Maxim Kozlovsky; Jay Michlin; dl-Software; Ian Brown
> Subject: Re: Notes from Our Discussion about 20% Effort on
> Internal/Improvement Projects
>
> What about just linking in a memory allocation debugging
> library, like mpatrol?  With a little bit of work we were
> able to link with this at my last embedded project, and it
> found all kinds of crazy errors that probably never would
> have been found otherwise.
>
> Obviously we would only use something like this at a
> "problem" site and never in normal production code, but it's
> probably the easiest/closest we will get to a useful tool
> given our context.
>
> Cheers,
>
> a
>
>  On Thu, 14 Dec 2006 10:15:53 -0800 "Wencheng Chai"
> <wencheng.chai@onstor.com> wrote:
>
> >
> > I second Max's comments regarding an error tracing mechanism in
> > the code. Windows has a tracing mechanism called WPP tracing, which
> > is a very useful debugging tool; it would be very helpful for
> > troubleshooting if we had a similar tool.
> >
> > Wencheng
> >
> >
> > -----Original Message-----
> > From: Maxim Kozlovsky
> > Sent: Thursday, December 14, 2006 9:31 AM
> > To: Jay Michlin; dl-Software
> > Cc: Ian Brown
> > Subject: RE: Notes from Our Discussion about 20% Effort on
> > Internal/Improvement Projects
> >
> > Here are some items - if you need more, just ask.
> >
> > Eliminate sendAgile - should be at the top of the priority order.
> > There is no need to do the whole thing at once; just pick a piece
> > of code and make it use RMC.
> >
> > Global memory statistics, runtime type information for FP/TXRX/FC -
> > a must-have to be able to solve memory leak problems in the field
> > in a timely manner, instead of working for months on a single
> > problem like Motorola's buffer leak.
> >
> > Support NFS mount of the EFS volumes from the SSC through the
> > loopback - should be easy to implement, and it takes care of a
> > number of problems because it provides the SSC with lots of disk
> > storage. Examples:
> > 1) Flash wear-out problem - logs can be stored directly on the
> >    management volumes, so there is no need to write to the flash
> >    anymore except during upgrades and boot-up.
> > 2) Upgrades - fixes the failures caused by the lack of temporary
> >    storage.
> > 3) Space to write coredumps without overflowing the /var partition.
> >
> > Memory leak detector - find the slow memory leaks proactively
> > instead of waiting for the customers to come up with the workload
> > that makes slow leaks fast.
> >
> > Parallel make system - the builds take way too much time.
> >
> > Allow separate reboot of FP/TxRx/FC on bobcat, make it the same as
> > cheetah - the compile/run cycle is way too long on bobcat.
> >
> > Error traceability - in the current product it is sometimes hard to
> > trace an error reported by various daemons to the origin of the
> > problem. Requests originated by clients or created internally
> > should have a unique identifier carried through the RPCs executed
> > on behalf of the request. The error logs and the CLI logs should
> > include that identifier so the failures can be related to the
> > command that failed.
> >
> > Tracing of the internal messages - this will help in diagnosing
> > problems, performance tuning and understanding how the system works.
> > For example, the recent discussion about what happens in the system
> > during the failover could have been much shorter if we had been
> > able to look at the message trace. Make Charissa do a nice GUI to
> > display the traces.
> >
> > FC coredump - or maybe wait for Cougar so the problem goes away.
> >
> >
> > > -----Original Message-----
> > > From: Jay Michlin
> > > Sent: Wednesday, December 13, 2006 4:50 PM
> > > To: dl-Software
> > > Cc: Maxim Kozlovsky; Ian Brown
> > > Subject: Notes from Our Discussion about 20% Effort on
> > > Internal/Improvement Projects
> > >
> > > Hello all,
> > >
> > > In our staff meeting today we brainstormed on projects we might do
> > > as part of the Delorean release (probably for March or April) to
> > > attend to areas of our code we think need attention. These
> > > projects likely won't add immediate features that are visible to
> > > customers, but rather will add strength, robustness or a solid
> > > foundation to our overall product. The payoff is long term, but
> > > it's work that must go on constantly.
> > >
> > > Here (in no particular order) are the items we mentioned:
> > >
> > > * Encryption and/or compression for DM-IP
> > > * Improve upgrade time and upgrade reliability
> > > * Vol Create /8 LUNs
> > > * Eliminate SendAgile
> > > * Rewrite tpl/fp
> > > * Rewrite/clean up/refactor sanmd
> > > * eee buffer tagging/descriptor tagging
> > > * Mirror repair or resynchronize
> > > * Verify the file system log
> > > * Authentication for DM-IP
> > > * eek for clusDB
> > > * Shrink the clusDB
> > >
> > > Some of the best ideas come in a second round, after seeing a set
> > > of notes such as these. So please do review them and if you have
> > > thoughts, comments or new ideas, please send them to the entire
> > > list. As we develop the details of Delorean planning in the next 3
> > > weeks, we will try to include some set of these projects along
> > > with the feature development and committed file system hardening.
> > >
> > > jay
> > >
>
