AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:<20080613113009.6ac7a5f5@ripper.onstor.net>
CFG:
PT:0
S:andy.sharp@onstor.com
RQ:
SSV:onstor-exch02.onstor.net
NSV:
SSH:
R:<chris.vandever@onstor.com>,<ian.brown@onstor.com>,<dl-designreview@onstor.com>,<brian.stark@onstor.com>,<warren.gale@onstor.com>
MAID:1
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#mh/Mailbox/design review	0	BB375AF679D4A34E9CA8DFA650E2B04E03E9A8FB@onstor-exch02.onstor.net
X-Sylpheed-End-Special-Headers: 1
Date: Fri, 13 Jun 2008 11:30:37 -0700
From: Andrew Sharp <andy.sharp@onstor.com>
To: "Chris Vandever" <chris.vandever@onstor.com>
Cc: "Ian Brown" <ian.brown@onstor.com>, "dl-Design Review"
 <dl-designreview@onstor.com>, "Brian Stark" <brian.stark@onstor.com>,
 "Warren Gale" <warren.gale@onstor.com>
Subject: Re: Proposed design for new(ish) boot procedure for Cougar
Message-ID: <20080613113037.3f79c040@ripper.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E03E9A8FB@onstor-exch02.onstor.net>
References: <BB375AF679D4A34E9CA8DFA650E2B04E0A6E8AE4@onstor-exch02.onstor.net>
	<BB375AF679D4A34E9CA8DFA650E2B04E03E9A8FB@onstor-exch02.onstor.net>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

On Fri, 13 Jun 2008 11:15:34 -0700 "Chris Vandever"
<chris.vandever@onstor.com> wrote:

> You'll have to try harder than that.  Jobi has to restart his SSC
> daemons because he's actually trying to use his cheetah as a filer.
> However, if you have no clients and only care about the ssc daemons,
> well, that's another story...

What if you have no clients and you don't care about SSC daemons either?

> -----Original Message-----
> From: Maxim Kozlovsky 
> Sent: Friday, June 13, 2008 11:12 AM
> To: Jobi Ariyamannil; Andy Sharp; Ian Brown
> Cc: dl-Design Review; Brian Stark; Warren Gale
> Subject: RE: Proposed design for new(ish) boot procedure for Cougar
> 
> Oh well. This must be a part of the conspiracy to make Chris give up
> her Cheetah. 
> 
> >-----Original Message-----
> >From: Jobi Ariyamannil
> >Sent: Friday, June 13, 2008 10:57 AM
> >To: Maxim Kozlovsky; Andy Sharp; Ian Brown
> >Cc: dl-Design Review; Brian Stark; Warren Gale
> >Subject: RE: Proposed design for new(ish) boot procedure for Cougar
> >
> >This does not work on cheetah anymore.
> >We need to manually restart a bunch of SSC daemons after resetting
> >the fp.
> >
> >-----Original Message-----
> >From: Maxim Kozlovsky
> >Sent: Friday, June 13, 2008 9:28 AM
> >To: Andy Sharp; Ian Brown
> >Cc: dl-Design Review; Brian Stark; Warren Gale
> >Subject: RE: Proposed design for new(ish) boot procedure for Cougar
> >
> >
> >
> >>-----Original Message-----
> >>From: Andy Sharp
> >>Sent: Thursday, June 12, 2008 8:29 PM
> >>To: Ian Brown
> >>Cc: dl-Design Review; Brian Stark; Warren Gale
> >>Subject: Re: Proposed design for new(ish) boot procedure for Cougar
> >>
> >>On Thu, 12 Jun 2008 18:34:00 -0700 Ian Brown <ian.brown@onstor.com>
> >>wrote:
> >>
> >>> In production, for the Cheetah, we have always rebooted the entire
> >>> box.  There were some daemons that relied on boot up order, thus
> >>> I'd guess that you would need to restart the daemons in phase 1 if
> >>> you're going to just bounce an embedded core.
> >>
> >>That's good to know.  What little I know about Cheetah operation
> >>would likely fall into the "Lore" category.
> >>
> >>Phase I is still rebooting the whole box.  Depending on the results
> >>of testing, Phase II may never see the light of day. ~:^)
> >[MK]
> >
> >There is no need to restart the daemons. During cheetah development
> >the daemons which did care about fp/txrx/fc restarts learned to
> >listen for slot/cpu up/down events and do the right thing. This
> >used to work up to 3.2; after that I had to give up my cheetah, so
> >I can't vouch for anything later.
> >
> >>
> >>
> >>> Ian
> >>>
> >>> On Jun 12, 2008, at 6:24 PM, Andrew Sharp wrote:
> >>>
> >>>                        Cougar Boot Procedure Redesign
> >>>                        ______________________________
> >>>
> >>> Problem
> >>> =======
> >>>
> >>>     Booting takes far too long on Cougar, and in theory the
> >>>     embedded nodes should be rebootable w/o rebooting Linux on
> >>>     the Sibyte 1125.
> >>>
> >>> Reasons:
> >>>     1)    Image load from CF is intolerably slow.
> >>>     2)    After image load, Linux boot takes the longest but is
> >>>           the least likely to need rebooting, resulting in an
> >>>           unnecessary bottleneck.
> >>>
> >>> Solution
> >>> ========
> >>>
> >>>     Redesign the boot flow to allow the embedded cores to be
> >>>     independently booted if Linux is up.
> >>>
> >>> Proposal
> >>> ========
> >>>
> >>>     Take a phased approach to implementing a redesigned boot
> >>> procedure:
> >>>
> >>>         Phase I
> >>>         -------
> >>>         1)  Change SSC PROM to load and boot only Linux.
> >>>         2)  Change FP/TXRX PROM to write a magic cookie in a
> >>>             predefined memory location indicating its readiness
> >>>             for its image to be loaded.
> >>>         3)  Implement an early-start Linux daemon that waits for
> >>>             these boot magic cookies to be set by the embedded
> >>>             cores, loads their images to the correct memory
> >>>             locations, and signals to the FP/TXRX when finished.
> >>>             The FP and TXRX could boot while Linux completes its
> >>>             boot steps.
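
The Phase I handshake in steps 2 and 3 can be sketched roughly as
below.  This is only a simulation under stated assumptions: shared
memory is modelled as a bytearray, and the cookie values, offsets and
function names are all hypothetical, since the proposal does not
define any of them.  On real hardware the cookie and image writes
would also need whatever cache flushes or memory barriers the
platform requires, which this sketch ignores.

```python
import struct

# All constants are hypothetical; the proposal does not specify values.
COOKIE_OFFSET = 0x00         # predefined location of the boot cookie
IMAGE_OFFSET = 0x10          # where the embedded core expects its image
COOKIE_READY = 0x52445921    # written by FP/TXRX PROM: "ready for image"
COOKIE_LOADED = 0x4C4F4144   # written by Linux daemon: "image in place"


def read_cookie(mem):
    """Return the 32-bit little-endian cookie stored at COOKIE_OFFSET."""
    return struct.unpack_from("<I", mem, COOKIE_OFFSET)[0]


def write_cookie(mem, value):
    """Store a 32-bit little-endian cookie at COOKIE_OFFSET."""
    struct.pack_into("<I", mem, COOKIE_OFFSET, value)


def prom_signal_ready(mem):
    """Step 2: the embedded core's PROM announces it wants an image."""
    write_cookie(mem, COOKIE_READY)


def daemon_poll_once(mem, image):
    """Step 3: one polling pass of the early-start Linux daemon.

    Returns True if an image was loaded on this pass, else False.
    """
    if read_cookie(mem) != COOKIE_READY:
        return False
    mem[IMAGE_OFFSET:IMAGE_OFFSET + len(image)] = image
    # Write the cookie last so the core never acts on a partial image.
    write_cookie(mem, COOKIE_LOADED)
    return True


def prom_try_boot(mem, image_len):
    """Return the image bytes once the daemon has signalled, else None."""
    if read_cookie(mem) != COOKIE_LOADED:
        return None
    return bytes(mem[IMAGE_OFFSET:IMAGE_OFFSET + image_len])
```

The one design point worth noting is the ordering in
daemon_poll_once(): the image is copied first and the cookie is
flipped last, so the cookie doubles as the "finished" signal to the
FP/TXRX.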
> >>>
> >>>         Phase II
> >>>         --------
> >>>         1)  Through testing, determine what needs to be done to
> >>>             allow FP/TXRX to be rebooted independently without
> >>>             disturbing the Linux kernel or each other.  Current
> >>>             daemons that communicate with FP/TXRX are not
> >>>             expected to be much trouble since they had to handle
> >>>             this for Cheetah, although this has not been
> >>>             extensively tested on Cheetah in the last few
> >>>             releases.
> >>>
> >>> Expected Results
> >>> ================
> >>>
> >>> Phase I
> >>> -------
> >>>
> >>> Current boot time     Predicted boot time    Predicted savings
> >>> -----------------     -------------------    -----------------
> >>> 2 minutes, 57 secs    1 minute, 43.7 secs    1 minute, 13.3 secs
> >>>
> >>> 41% reduction in boot time: current boot time* is 2:57, resulting
> >>> boot time is estimated to be 1:43.7, for a savings of 1:13.3.  The
> >>> new method would boot 1.7 times faster (2 times faster, or twice
> >>> as fast, would be a 50% reduction in boot time).
> >>>
> >>> These estimates are based on the difference in image load time
> >>> for the FP/TXRX: 86 seconds for the PROM versus 12.7 seconds for
> >>> Linux (cold cache).
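
For anyone who wants to check the Phase I arithmetic, here is a small
sketch that derives the estimate from the three measured inputs quoted
in the proposal (2:57 current boot, 86 s PROM load, 12.7 s Linux
load).  The function and key names are mine, not from any existing
tool.  It assumes the only thing that changes is the FP/TXRX image
load time.

```python
def fmt_min_sec(seconds):
    """Format seconds as M:SS.s, working in tenths to avoid float drift."""
    tenths = round(seconds * 10)
    minutes, rem = divmod(tenths, 600)
    return f"{minutes}:{rem / 10:04.1f}"


def boot_estimate(current_s, prom_load_s, linux_load_s):
    """Estimate Phase I boot time from the change in image load time."""
    savings = prom_load_s - linux_load_s   # time no longer spent loading
    predicted = current_s - savings        # everything else is unchanged
    return {
        "savings_s": savings,
        "predicted_s": predicted,
        "reduction": savings / current_s,  # fraction of boot time removed
        "speedup": current_s / predicted,  # "times faster"
    }


# Inputs taken from the proposal: 2 min 57 s, 86 s, 12.7 s.
est = boot_estimate(2 * 60 + 57, 86.0, 12.7)
print("savings  ", fmt_min_sec(est["savings_s"]))
print("predicted", fmt_min_sec(est["predicted_s"]))
print("reduction", f"{est['reduction']:.1%}")
print("speedup  ", f"{est['speedup']:.2f}x")
```

With those inputs the load-time difference is 86 - 12.7 = 73.3
seconds, which is what gets subtracted from the 177-second baseline.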
> >>>
> >>>
> >>> Phase II
> >>> --------
> >>> If just rebooting one or both of the FP/TXRX nodes, boot time is
> >>> estimated to be in the sub-10-second range.  This would
> >>> substantially increase customer satisfaction and supportability,
> >>> and substantially improve developer efficiency.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> * Boot time measured from when PROM code starts loading the first
> >>> boot image to when nfxsh CLI is available.
> >>>
