AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:<20080612202748.5b26f6c2@ripper.onstor.net>
CFG:
PT:0
S:andy.sharp@onstor.com
RQ:
SSV:onstor-exch02.onstor.net
NSV:
SSH:
R:<ian.brown@onstor.com>,<dl-designreview@onstor.com>,<brian.stark@onstor.com>,<warren.gale@onstor.com>
MAID:1
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#mh/Mailbox/design review	0	02F5342D-628B-4BA3-B305-B499C3F49469@onstor.com
X-Sylpheed-End-Special-Headers: 1
Date: Thu, 12 Jun 2008 20:28:32 -0700
From: Andrew Sharp <andy.sharp@onstor.com>
To: Ian Brown <ian.brown@onstor.com>
Cc: dl-Design Review <dl-designreview@onstor.com>, Brian Stark
 <brian.stark@onstor.com>, Warren Gale <warren.gale@onstor.com>
Subject: Re: Proposed design for new(ish) boot procedure for Cougar
Message-ID: <20080612202832.5e5ff15b@ripper.onstor.net>
In-Reply-To: <02F5342D-628B-4BA3-B305-B499C3F49469@onstor.com>
References: <20080612182458.010d3d89@ripper.onstor.net>
	<02F5342D-628B-4BA3-B305-B499C3F49469@onstor.com>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

On Thu, 12 Jun 2008 18:34:00 -0700 Ian Brown <ian.brown@onstor.com>
wrote:

> In production, for the Cheetah, we have always rebooted the entire  
> box.  There were some daemons that relied on boot up order, thus I'd  
> guess that you would need to restart the daemons in phase 1 if
> you're going to just bounce an embedded core.

That's good to know.  What little I know about Cheetah operation would
likely fall into the "Lore" category.

Phase I is still rebooting the whole box.  Depending on the results of
testing, Phase II may never see the light of day. ~:^)


> Ian
> 
> On Jun 12, 2008, at 6:24 PM, Andrew Sharp wrote:
> 
>                        Cougar Boot Procedure Redesign
>                        ______________________________
> 
> Problem
> =======
> 
>     Booting takes far too long on Cougar, and in theory the embedded
>     nodes should be rebootable w/o rebooting Linux on the Sibyte 1125.
> 
> Reasons:
>     1)    Image load from CF is intolerably slow
>     2)    After image load, Linux boot takes the longest but is the
>           least likely to need rebooting, resulting in an unnecessary
>           bottleneck.
> 
> Solution
> ========
> 
>     Redesign the boot flow to allow the embedded cores to be
>     independently booted if Linux is up.
> 
> Proposal
> ========
> 
>     Take a phased approach to implementing a redesigned boot
>     procedure:
> 
>       Phase I
>       -------
>       1)  Change SSC PROM to load and boot only Linux.
>       2)  Change FP/TXRX PROM to write a magic cookie in a
>           predefined memory location indicating its readiness
>           for its image to be loaded.
>       3)  Implement an early-start Linux daemon that waits for these
>           boot magic cookies to be set by the embedded cores, loads
>           their images to the correct memory locations, and signals
>           the FP/TXRX when finished.  The FP and TXRX could boot
>           while Linux completes its boot steps.
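
To make the step 3 handshake concrete, here is a rough sketch in Python
that simulates it against a plain bytearray standing in for the shared
memory.  Everything specific in it is an assumption for illustration:
the cookie values, the offsets, the word size and byte order, and the
function names are all hypothetical, not taken from the Cougar PROM or
any existing daemon.

```python
import struct

# All constants below are hypothetical placeholders, not real Cougar values.
COOKIE_READY = 0xB007C0DE   # written by the FP/TXRX PROM: "load my image"
COOKIE_LOADED = 0x10ADED00  # written by the daemon: "image is in place"

# Per-core layout in the simulated region: (cookie offset, image offset).
CORES = {"FP": (0x00, 0x100), "TXRX": (0x08, 0x200)}


def read_cookie(mem, offset):
    """Return the 32-bit word at offset (big-endian chosen arbitrarily)."""
    return struct.unpack_from(">I", mem, offset)[0]


def write_cookie(mem, offset, value):
    """Store a 32-bit word at offset (big-endian chosen arbitrarily)."""
    struct.pack_into(">I", mem, offset, value)


def service_cores(mem, images):
    """One polling pass: load an image for every core that asked for one.

    Returns the names of the cores serviced in this pass.
    """
    serviced = []
    for name, (cookie_off, image_off) in CORES.items():
        if read_cookie(mem, cookie_off) != COOKIE_READY:
            continue  # this core has not announced readiness
        image = images[name]
        mem[image_off:image_off + len(image)] = image  # copy image into place
        write_cookie(mem, cookie_off, COOKIE_LOADED)   # tell the core to boot
        serviced.append(name)
    return serviced


# Simulated run: only the FP PROM has announced readiness so far.
mem = bytearray(0x300)
images = {"FP": b"fp-image", "TXRX": b"txrx-image"}
write_cookie(mem, CORES["FP"][0], COOKIE_READY)
first_pass = service_cores(mem, images)
second_pass = service_cores(mem, images)
```

In this simulation the first pass services only the FP, and the second
pass does nothing because the FP cookie has already been overwritten
with the "loaded" value.  A real daemon would poll (or block) on mapped
hardware memory rather than a bytearray, and would need whatever
cache and ordering guarantees the platform requires.
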
> 
>       Phase II
>       --------
>       1)  Through testing, determine what needs to be done to allow
>           FP/TXRX to be rebooted independently without disturbing
>           the Linux kernel or each other.  Current daemons that
>           communicate with FP/TXRX are not expected to be much
>           trouble since they had to handle this for Cheetah, although
>           this has not been extensively tested on Cheetah in the
>           last few releases.
> 
> Expected Results
> ================
> 
> Phase I
> -------
> 
> Current boot time           Predicted boot time        Predicted savings
> -----------------           -------------------        -----------------
> 2 minutes, 57 secs          1 minute, 43.7 secs        1 minute, 13.3 secs
> 
> Roughly a 41% reduction in boot time: current boot time* is 2:57 and
> the resulting boot time is estimated to be 1:43.7, a savings of
> 1:13.3.  Put another way, the new method would boot about 1.7 times
> faster (2 times faster, or twice as fast, would be a 50% reduction
> in boot time).
> 
> These estimates are based on the difference in image load time for
> the FP/TXRX: 86 seconds for the PROM versus 12.7 seconds for Linux
> (cold cache).
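
The arithmetic behind these figures can be checked with a few lines of
Python.  This is only a sketch that uses the numbers quoted above (2:57
current boot time, 86 seconds versus 12.7 seconds image load); the
variable names and the mmss() helper are made up for the example.

```python
# Inputs taken from the figures quoted in the proposal.
current_boot = 2 * 60 + 57   # current boot time in seconds (2:57)
prom_load = 86.0             # FP/TXRX image load time from the PROM, seconds
linux_load = 12.7            # same load done from Linux (cold cache), seconds

savings = prom_load - linux_load     # seconds saved per boot
predicted = current_boot - savings   # predicted boot time in seconds
reduction = savings / current_boot   # fraction of boot time removed
speedup = current_boot / predicted   # "N times faster" factor


def mmss(seconds):
    """Format a duration in seconds as M:SS.s (hypothetical helper)."""
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes)}:{secs:04.1f}"


print("savings:  ", mmss(savings))
print("predicted:", mmss(predicted))
print(f"reduction: {reduction:.0%}")
print(f"speedup:   {speedup:.1f}x")
```

Note that the speedup factor and the percentage reduction are two views
of the same ratio: a factor of 2 corresponds to a 50% reduction, so a
factor of about 1.7 lands a little under that.
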
> 
> 
> Phase II
> --------
> When rebooting only one or both of the FP/TXRX nodes, boot time is
> estimated to be under 10 seconds.  This would substantially improve
> customer satisfaction and supportability, as well as developer
> efficiency.
> 
> 
> 
> 
> 
> * Boot time measured from when PROM code starts loading the first boot
> image to when nfxsh CLI is available.
> 
