Date: Fri, 25 Jul 2008 11:40:00 -0700
From: Andrew Sharp <andy.sharp@onstor.com>
To: "Chris Vandever" <chris.vandever@onstor.com>
Subject: Re: please review 29639
Message-ID: <20080725114000.41a4adf5@ripper.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E0AE229BA@onstor-exch02.onstor.net>
References: <20080721181847.490681cb@ripper.onstor.net>
 <BB375AF679D4A34E9CA8DFA650E2B04E0AE229BA@onstor-exch02.onstor.net>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

On Wed, 23 Jul 2008 21:10:42 -0700 "Chris Vandever"
<chris.vandever@onstor.com> wrote:

> Inline...

Ditto.

> > -----Original Message-----
> > From: Andy Sharp
> > Sent: Monday, July 21, 2008 6:19 PM
> > To: Chris Vandever; Rendell Fong
> > Subject: Re: please review 29639
> > 
> > I've made some more changes since this review request first went
> > out, possibly only to fix review mentioned problems, but also to
> > include changes that were made to these files since I started this
> > ill-gotten venture.  See the two bugs mentioned near the end.
> > 
> > Don't know if Rendell needs to re-review anything, but his name was
> > mentioned in these comments to Chris, so I'm cc'ing you, Mr. R.
> > 
> > 
> > On Wed, 9 Jul 2008 19:04:21 -0700 "Chris Vandever"
> > <chris.vandever@onstor.com> wrote:
> > 
> > > ... //depot/dev/nfx-tree/code/sm-utils/sys-utils-api.h#4 edit
> > > *	@103-6 you ARE aware this isn't the coding standard,
> > > right?
> > 
> > Sure it is.  What's wrong with it?  The defines, right?
> 
> Yes, the defines.  The comments are supposed to be in a separate block
> preceding the #define, not on the same line.

While all sane programmers recognize that our so-called coding standard
is completely broken, I don't recall anything like that mentioned in it.
And I don't have time to look at it.  Does anyone actually care?

> > > ... //depot/dev/nfx-tree/code/sm-utils/sys-utils.c#3 edit
> > > *	@101-3, 145-163 you're planning on removing these before
> > > checking it in, yes?
> > 
> > Der.  Fixed.
> 
> What about 142-160?  Dead code is dead code; no point checking it in.

I would like to keep this specifically because this logic wasn't the
easiest to figure out, and the code a tad hard to come by.  I'm hoping
that maybe after this release I'll be able to get fast Linux reboots
working, and it would be nice to have this then.

> > > *	@138 my understanding was that
> > > autosupport_log_reboot_state() needs to be root, so it should be
> > > after the seteuid().
> > 
> > Fixed, just in case you're right.
> > 
> > > *	@165-189 I'm confused about where you're going here.  It
> > > looks like you're planning to always offline the volumes, and the
> > > only question is whether to wait for it to complete before
> > > continuing.  My question is in the case where we don't wait, will
> > > the child process that offlines the volumes screw anything up if
> > > we reboot in the middle of that code?  (I had run into a problem
> > > in split-brain testing where I would have liked to cleanly offline
> > > the volumes, but found that that took too long, and I was better
> > > off rebooting without offlining the volumes -- they got released
> > > and available for the other node to take over sooner.)
> > 
> > This is unused code at the moment, as there is no method to offline
> > all volumes in a quick shot.  The old nfxsh function, commented out
> > here, was quite slow, and Jonathon said it's better to just dump
> > them by rebooting than to take forever to offline this way --
> > downtime for users on systems with more than a couple volumes will
> > be vastly shortened.  More code can be written (kegg?) to quickly
> > flush all the logs of all volumes, or whatever he said, but that
> > hasn't been implemented yet.
> > 
> > There are no callers with the REBOOT_DONT_WAIT_VOLOFFLINE flag as of
> > yet, at least, who don't also call vol_offline_all() already.
> 
> My concern is that it looks like the intent is to eventually ALWAYS
> call vol_offline_all() or a faster equivalent, but not all callers
> will want to do so.  My only option will be whether to wait for it or
> not.  I can't specify not to do it at all.

Yes, that is the intent.  Currently there are no callers that could
even dream of a reason not to try and offline the volumes.  Code paths
that are causing a reboot when things are really hosed go straight to
genlib_reboot().
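
Roughly, the shape being discussed is the sketch below.  The flag value,
the signatures, and wait_for_vol_offline() are made up for illustration,
and the stubs only count calls; the real routines in sys-utils.c and
genlib.c may well look different.

```c
#include <assert.h>

/* Hypothetical value; the real flag is defined in sys-utils-api.h. */
#define REBOOT_DONT_WAIT_VOLOFFLINE 0x1u

/* Counters so the sketch can be exercised without a filer. */
static int offline_started;
static int offline_waited;
static int rebooted;

/* Stub: stands in for kicking off the offline of all volumes. */
static void vol_offline_all(void)
{
    offline_started++;
}

/* Hypothetical helper: block until the offline has completed. */
static void wait_for_vol_offline(void)
{
    offline_waited++;
}

/* Stub: the real genlib_reboot() is not expected to return. */
static void genlib_reboot(void)
{
    rebooted++;
}

/* The offline is always started; the flag only controls the wait. */
static void utils_reboot_system(unsigned int flags)
{
    vol_offline_all();
    if (!(flags & REBOOT_DONT_WAIT_VOLOFFLINE))
        wait_for_vol_offline();
    genlib_reboot();
}
```

Callers that can't tolerate the delay pass REBOOT_DONT_WAIT_VOLOFFLINE;
callers where things are really hosed skip all of this and go straight
to genlib_reboot().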

> > > *	@167, 175 the coding convention is to flag code that
> > > needs more work with "@@@" to make it easier to find it again
> > > later.  (I was used to "XXX" at other companies, but I think the
> > > main thing is to have a simple string that's easy to search for
> > > in pretty much any case.)
> > 
> > Fixed.  Apparently.
> 
> @182 we say for "non-cheetahs", but @185 we say "#ifdef COUGAR".  The
> comment is inconsistent.

Fixed.

> > 
> > > *	@186-194 shouldn't we already have these defined in
> > > some .h somewhere?
> > 
> > They weren't before.  This was lifted from cmd_upgrade.c I believe.
> 
> Okay, they should be defined in a single .h somewhere, but that can
> wait for another day.
> 
> @199 we used to call this for bobcats, also.  My understanding is that
> we didn't need it to kill hostidd, as that doesn't run on bobcats
> anyway.  Do we need to kill inetd?  Based on the comment, I think not,
> but I don't know for sure.

If you're talking about the system prevent_embedded_boot, no, we
didn't.  Bobcat embedded can't reboot on their own like cheetahs.

The call might have been there, but the guts were probably ifdef'd out
or something.

> > > *	@207-8 we think this should be done only for cheetahs,
> > > but the ifdefs @197 & 199-201 will cause it to be done for
> > > bobcat, also. I have no idea which is correct, but we should make
> > > sure the code and comments are consistent.
> > 
> > Fixed the comments.
> > 
> > > *	@224-244 I have no idea if this is correct.  You and
> > > Rendell would know better than me.
> > > *	@236 what's 3?  I think it's the timeout value for the
> > > response, but it should use a #define.
> > 
> > You don't think I actually wrote any of this crap, do you?
> > BTW, it's the number of retries.  A completely arbitrary but
> > reasonable number.  We have retries??!? <duck>
> 
> Nope, I figured it was copy/paste.  Still needs a #define.  One of
> JonG's pet peeves -- hard coded values.
> 
> @258 Max will shoot you for adding a return in a void function.  :)

The bullet will have his name on it then, because even functions that
return nothing still have to return.  Of course the hope is the return
never gets executed.
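
On the hard-coded 3: a named constant would look something like the
sketch below.  REBOOT_MSG_RETRIES and both function names are invented
for illustration, the sender is a stub, and the count is treated here as
the total number of attempts, which may not match what the copied code
actually does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical name for the literal 3 (the number of retries). */
#define REBOOT_MSG_RETRIES 3

/* Counts how many times the stub sender was called. */
static int attempts;

/* Stub sender: fails until the call number reaches succeed_on. */
static bool send_reboot_msg(int succeed_on)
{
    attempts++;
    return attempts >= succeed_on;
}

/* Try the send up to REBOOT_MSG_RETRIES times, stopping on success. */
static bool send_with_retries(int succeed_on)
{
    for (int i = 0; i < REBOOT_MSG_RETRIES; i++) {
        if (send_reboot_msg(succeed_on))
            return true;
    }
    return false;
}
```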

> 
> > 
> > > ... //depot/dev/nfx-tree/code/ssc-cluster/cluster-contrl-cfg.c#30 edit
> > > *	@256-7 these need to be indented 1 more to line up with
> > > the first parameter on the line above.
> > 
> > Did I edit this?  It looks OK to me.  I assume you're talking about
> > __FUNCTION__ and gClusterConfig->cc_childwhatever.  The first _
> > lines up with the opening double quote on the line above.  But
> > maybe I shouldn't assume.
> 
> Sorry about that, I was totally off on the line numbers.  It's at
> 2388-9 -- you changed the name of the called function to
> utils_reboot_system(), which is 1 character longer than before, so
> the parameters on the subsequent lines need to be shifted 1 char as
> well.

Probably not off until I started changing things all around.
Fixed.

> > 
> > > ... //depot/dev/nfx-tree/code/ssc-cluster/cluster-server-rpc.c#6 edit
> > > *	@627-9, 1015-7 indentation.
> > 
> > Did I fix it?
> 
> Yes, indeed.  Thanks.  However, I think both these functions will need
> to set REBOOT_DONT_WAIT_VOLOFFLINE, as they can't tolerate the delay
> that will be introduced when we add the vol offline functionality.

Done.

> > 
> > > ... //depot/dev/nfx-tree/code/ssc-genlib/genlib.c#2 edit
> > > *	@89 to be honest I never understood what the "-c" did on
> > > reboot in bsd to begin with, so for my own edification, why is it
> > > we no longer need it?
> > 
> > This is quite the esoteric MIPS thing, cold or warm reboot, we
> > don't have it, and we don't support it, big surprise.  But sometime
> > way back we added the option to reboot program on openbsd.  But the
> > code in the BSD kernel is the same for reboot and cold reboot, so
> > it's quite silly.
> > 
> > It refers to certain CPU registers which can be left un-reset when
> > the rest of the CPU is reset, and therefore possibly contain data
> > that might give a clue as to the reboot/reset reason, and/or be
> > used to pass [very] small amounts of data between reboots.
> 
> Thanks!  Sounds like yet another thing we "intended" to do something
> with, but never did.  Cleaning it up is good.  :)
> 
> > 
> > > *	@153 would you fix the doc on this to indicate the
> > > values for "secondary"?  I think that means a .h file change,
> > > too.  :-(
> Thanks.
> > 
> > Fixed the comments.  The .h was fine.
> 
> Your comment in the .c is clearer, but whatever.
> 
> > 
> > > ... //depot/dev/nfx-tree/code/ssc-genlib/linux.h#8 edit
> > > 		OK
> > >
> > > ... //depot/dev/nfx-tree/code/ssc-genlib/openbsd.h#8 edit
> > > 		OK
> > >
> > > ... //depot/dev/nfx-tree/code/ssc-nfxsh/cmd_flash.c#17 edit
> > > *	@787 I think we can delete this since it will be done
> > > @804. You should check with Ian before changing the actual text
> > > of the message -- I'm not sure how much pattern matching he may
> > > be doing on it, if any.
> > 
> > Fixed.
> > 
> > > *	@1223 we no longer will issue the reboot_boards() (or
> > > equivalent to reboot the other cores on cheetah).  I don't know
> > > for sure, but I'm guessing we should.
> > 
> > Anything that might send that message has been stopped.
> 
> Then, @1223 should we be calling utils_reboot_system() instead of
> genlib_system_reboot()?

No, because all our daemons have been killed, so things trying to use
those daemons will be illin', like utils_reboot_system();


> > > ... //depot/dev/nfx-tree/code/ssc-nfxsh/cmd_promupgrade.c#13 edit
> > > *	@1150, 1159 since you're changing these, would you also
> > > fix the spacing like you did @1148?  :-)
> > 
> > Sorry, line numbers must have changed, because all those lines are
> > the same.  Even use the dreaded spaces
> 
> No, you changed "if( !" to "if (!" @1148, so I was hoping to get you
> to do it @1150 and 1159 also since you were changing those lines
> anyway. No biggie.

Wow, completely didn't see it.
Fixed.
 
> > > *	@1234 I know that any sane OS will sync before it shuts
> > > down, but our bsd implementation has proven time and again not to
> > > be sane. So I have to ask, do we need the syncs for bsd?
> > 
> > Since both kernels do a sync before unmounting a filesystem, it
> > isn't just redundant, it's slower.  Slower is badder.  I'll add it
> > back for ya'.  In genlib.c
> 
> No need, I just wasn't sure our bsd did the right thing.  As long as
> we do the right thing, you're right -- slower is badder.  :)
> 
> > 
> > > *	@1237 we no longer run shutdown, kill inetd, kill emrs,
> > > kill pm. I have no idea if it matters.  :-(
> > 
> > cmd_system-openbsd.c, line 45 (a no-op on linux because reboot
> > command does it for us)
> 
> Okay, but how do we get there from here?  Maybe I'm missing
> something... :(

That function is called from utils_reboot_system().

> > 
> > > ... //depot/dev/nfx-tree/code/ssc-nfxsh/cmd_system.c#52 edit
> > > *	@326 we need to initialize rebootflags before this.
> > 
> > fixed.
> > 
> > > *	@327-338 and @412 I'm confused again.  We don't want to
> > > do the offline if fast is set.  (Maybe we just don't want to wait
> > > for it to complete, that I don't know.)  But, I think you want to
> > > change the sense of 327 so we'll call vol_offline_all() if fast is
> > > not set.
> > 
> > Yes, you're confused.  Or I am.
> 
> If fast is 0 we want to call vol_offline_all(), but we're not.  We're
> calling it when fast is !0.

fixed.
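
In other words, the corrected sense is as in the sketch below.
prepare_reboot() is a made-up name and vol_offline_all() is stubbed with
a guessed signature; this is not the actual code in cmd_system.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Counts calls so the sketch can be checked. */
static int offline_calls;

/* Stub standing in for the real vol_offline_all(). */
static void vol_offline_all(void)
{
    offline_calls++;
}

/* Hypothetical helper: offline the volumes only when fast is not set. */
static void prepare_reboot(bool fast)
{
    if (!fast)
        vol_offline_all();
}
```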

> > 
> > > *	@334 use "@@@" for consistency with the rest of the code
> > > base.
> > 
> > fixed.
> > 
> > > *	@342 we no longer call ensure_fs_writable(), which we
> > > claimed we needed to do to update the superblock time.  I have no
> > > idea if this is required for /sbin/halt or whether it might take
> > > care of whatever it needs itself.  (Novel concept.)
> > 
> > Probably a large source of corruption.  Why update the superblock?
> > If the fs hasn't been mounted writeable, it seems utterly silly to
> > mount it writeable to modify the superblock ... when the only
> > reason to update the superblock is to know when the fs was last
> > written to.
> 
> Well, when you put it that way it does sound pretty silly.  :)
> 
> > 
> > > *	@348 looks alive to me.  :-)  I think it will be a hard
> > > sell since non-root users get stuck in nfxsh and need a mechanism
> > > to issue a reboot.
> > 
> > done.  moved the comments to the right place.
> > 
> > > *	@427 we're no longer killing pm here.
> > 
> > yup, system_run_shutdown
> > 
> > > *	@429 this makes more sense than what was there!
> > >
> > > ... //depot/dev/nfx-tree/code/ssc-nfxsh/cmd_upgrade.c#32 edit
> > > *	Shouldn't the new conditional compilation be LINUX rather
> > > than COUGAR?
> > 
> > Nack.  We don't do conditional compilation on Linux or BSD, with
> > just a couple of exceptions.
> > Not sure where you're talking about though.
> 
> @1095 we no longer need the file_exists() function, but it's not due
> to COUGAR hardware, it's due to LINUX.  Same thing @1167 with
> kill_daemon(), and the call sites @1220, 1248.

Well, it's a good thing COUGAR==LINUX because ifdef LINUX isn't allowed.

> > > *	@1252 I'll take your word for it.
> > 
> > onstor init script does all this for us
> > 
> > > *	@1461-8 shouldn't we only do this for cheetah?
> > 
> > nope, for ev-furry-body.  my dog prefers that term.
> 
> Cute.
> 
> @1444 Don't we need to reboot the FP on bobcat?  I could easily be
> confused here.  :(

No, because resetting the FC kills all scsi traffic on cheetahs and
bobcats.  General reset takes care of FP after that.

> > 
> > > *	@1471 I have no idea why we need this, but I can see that
> > > complete_upgrade() used to do it, and I'd rather have the common
> > > code, even if it introduces a delay (that may or may not be
> > > necessary) for the other callers.  (Besides, it's "only cheetah".)
> > > So, okay.
> > 
> > only used for upgrade of primary flash, which, yes, is quite messy
> > 
> > > ... //depot/dev/nfx-tree/code/ssc-pm/pm.c#20 edit
> > > *	@1446 can we call utils_reboot_system() here?  Some of
> > > what it does may fail, but I think we might still get to
> > > genlib_reboot(). Maybe...
> > 
> > no, things are hosed in this case, we'd be pretty lucky if even this
> > worked.  almost certainly the pm_log won't work.
> 
> Oh, well.  Sigh...
> 
> ... //depot/dev/nfx-tree/code/ssc-vsd/vs-daemon.c#53 edit
> @13554-5 we'll need to set REBOOT_DONT_WAIT_VOLOFFLINE, as it can't
> tolerate the delay that will be introduced when we add the vol offline
> functionality.  Also, these need to get shifted right 1 to line up,
> since the function name is now 1 char longer.

fixed.

> ... //depot/dev/nfx-tree/code/sm-sct/taskmgr.c#20 edit
> @1199 Ian will probably want the message formatted like the others:
> "Node going down for reboot! (Initial configuration in NCM)."

fixed.

> > 
> > > My thoughts are this is too risky to put in for GA.  I'd rather
> > > see it go in early in a product cycle so we get a lot of in-house
> > > testing in a variety of scenarios.
> > 
> > I was starting to agree with Rendell who said this as well.  But
> > then we ended up with these dueling douche-bag bugs:
> > 
> > 24067 Cougar-Beta: system config restore causes partner node to reboot
> > 23034 - kernel crash: DBE in yenta_irq during shutdown
> 
> 23034 has already been closed, and this doesn't fix 24067 -- we still
> have the problem in do_restore_config() that we do shutdown_all(),
> then do_copy_files(), and finally the reboot.  We need to tell the FP
> and TXRX to reboot after the shutdown_all() and before the
> do_copy_files(), which evidently is what is taking so long and
> screwing us up.

Max checked in a change for 24067 that is the exact un-do of my change
that fixed 23034.  This changelist at least fixes 24067 a little nicer:
shuts down the FP ports in all cases, and gives a single place to add
in a correct fix for 24067 which Max should have done in the first
place but he's Mr. Regression/SlashAndBurn Max and I can't get him to
do the right thing, and I don't have time to do it myself at this
point.  The learning time would take longer than we have left.

> > We still need one more piece after this: one to quiesce the FP on
> > command.  I wonder who will write it.
> 
> Good question.  Unless there's a defect submitted or unless you want
> to do it, it won't get done.

24067 is the defect.

> > > Also, I'm hoping to get a change checked in in the next few days
> > > that adds a call to utils_reboot_system() in vsd.  Just a heads
> > > up.  (Sorry about that.)
> > 
> > Absorbed that one already ~:^)
> > 
> > > ChrisV
> > >
> > > -----Original Message-----
> > > From: Andy Sharp
> > > Sent: Tuesday, July 08, 2008 6:44 PM
> > > To: Chris Vandever
> > > Cc: Rendell Fong
> > > Subject: please review 29639
> > >
> > > I've been testing this on my filers -- so far so good with every
> > > variation I could try.  I don't know how to purposefully cause a
> > > core to crash, which would trigger chassisd reboot.
> > >
> > > Also, I don't know the defect for this.  Chris, is there one
> > > still?
> > >
> > > Thanks,
> > >
> > > a
> > >
> > >
> > > Change 29639 by andys@ripper on 2008/06/11 17:17:43 *pending*
> > >
> > >         TED
> > >
> > >         Implement a higher level general purpose reboot routine
> > >         that uses genlib_reboot to do the last little bit.
> > >
> > >         reviewed by
> > >
> > > Affected files ...
> > >
> > >
