AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:<20080721181844.43a4f3d6@ripper.onstor.net>
CFG:
PT:0
S:andy.sharp@onstor.com
RQ:
SSV:onstor-exch02.onstor.net
NSV:
SSH:
R:<chris.vandever@onstor.com>,<rendell.fong@onstor.com>
MAID:1
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#imap/andys@onstor.net@onstor-exch02.onstor.net/INBOX	43293	BB375AF679D4A34E9CA8DFA650E2B04E03E9A945@onstor-exch02.onstor.net
X-Sylpheed-End-Special-Headers: 1
Date: Mon, 21 Jul 2008 18:18:47 -0700
From: Andrew Sharp <andy.sharp@onstor.com>
To: "Chris Vandever" <chris.vandever@onstor.com>, Rendell Fong
 <rendell.fong@onstor.com>
Subject: Re: please review 29639
Message-ID: <20080721181847.490681cb@ripper.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E03E9A945@onstor-exch02.onstor.net>
References: <20080708184419.2ec4daaf@ripper.onstor.net>
 <BB375AF679D4A34E9CA8DFA650E2B04E03E9A945@onstor-exch02.onstor.net>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

I've made some more changes since this review request first went out,
mostly to fix problems mentioned in the review, but also to include
changes that were made to these files since I started this ill-gotten
venture.  See the two bugs mentioned near the end.

Don't know if Rendell needs to re-review anything, but his name was
mentioned in these comments to Chris, so I'm cc'ing you, Mr. R.


On Wed, 9 Jul 2008 19:04:21 -0700 "Chris Vandever"
<chris.vandever@onstor.com> wrote:

> ... //depot/dev/nfx-tree/code/sm-utils/sys-utils-api.h#4 edit
> *	@103-6 you ARE aware this isn't the coding standard, right?

Sure it is.  What's wrong with it?  The defines, right?

> ... //depot/dev/nfx-tree/code/sm-utils/sys-utils.c#3 edit
> *	@101-3, 145-163 you're planning on removing these before
> checking it in, yes?

Der.  Fixed.

> *	@138 my understanding was that autosupport_log_reboot_state()
> needs to be root, so it should be after the seteuid().

Fixed, just in case you're right.

> *	@165-189 I'm confused about where you're going here.  It
> looks like you're planning to always offline the volumes, and the only
> question is whether to wait for it to complete before continuing.  My
> question is in the case where we don't wait, will the child process
> that offlines the volumes screw anything up if we reboot in the
> middle of that code?  (I had run into a problem in split-brain
> testing where I would have liked to cleanly offline the volumes, but
> found that that took too long, and I was better off rebooting without
> offlining the volumes -- they got released and available for the
> other node to take over sooner.)

This is unused code at the moment, as there is no method to offline all
volumes in a quick shot.  The old nfxsh function, commented out here,
was quite slow, and Jonathon said it's better to just dump them by
rebooting than to take forever to offline this way -- downtime for users
on systems with more than a couple volumes will be vastly shortened.
More code can be written (kegg?) to quickly flush all the logs of all
volumes, or whatever he said, but that hasn't been implemented yet.

As of yet, at least, there are no callers that pass the
REBOOT_DONT_WAIT_VOLOFFLINE flag without also calling
vol_offline_all() themselves.
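
For what it's worth, the "wait or don't wait" question boils down to a
flag test like the sketch below.  Only the flag name comes from the
change; the bit value and the helper function are assumptions made up
for illustration, not what sys-utils.c actually does.

```c
/* Flag name is from the change; the bit value is an assumption. */
#define REBOOT_DONT_WAIT_VOLOFFLINE 0x01u

/*
 * Hypothetical helper: returns 1 if the caller wants the reboot path
 * to wait for the volume offline to complete, 0 if it should carry on
 * without waiting.
 */
static int reboot_should_wait_for_offline(unsigned int rebootflags)
{
    return (rebootflags & REBOOT_DONT_WAIT_VOLOFFLINE) == 0;
}
```

A caller that has already run vol_offline_all() itself would pass the
flag so the reboot path doesn't wait a second time.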

> *	@167, 175 the coding convention is to flag code that needs
> more work with "@@@" to make it easier to find it again later.  (I
> was used to "XXX" at other companies, but I think the main thing is
> to have a simple string that's easy to search for in pretty much any
> case.)

Fixed.  Apparently.

> *	@186-194 shouldn't we already have these defined in some .h
> somewhere?

They weren't before.  This was lifted from cmd_upgrade.c I believe.

> *	@207-8 we think this should be done only for cheetahs, but
> the ifdefs @197 & 199-201 will cause it to be done for bobcat, also.
> I have no idea which is correct, but we should make sure the code and
> comments are consistent.

Fixed the comments.

> *	@224-244 I have no idea if this is correct.  You and Rendell
> would know better than me.
> *	@236 what's 3?  I think it's the timeout value for the
> response, but it should use a #define.

You don't think I actually wrote any of this crap, do you?
BTW, it's the number of retries.  A completely arbitrary but reasonable
number.  We have retries??!? <duck>
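
For the record, the #define Chris is asking for could look something
like the sketch below.  The macro name, the callback type and both
functions are hypothetical, made up for illustration; none of it is
lifted from sys-utils.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name; the code under review passes a bare 3. */
#define REBOOT_MSG_RETRIES 3

/* Hypothetical sender type: returns 0 on success, nonzero on failure. */
typedef int (*send_fn)(void *ctx);

/*
 * Make one attempt, then up to `retries` more if it keeps failing.
 * Returns 0 as soon as a send succeeds, -1 if every attempt fails.
 */
static int send_with_retries(send_fn send, void *ctx, int retries)
{
    int attempt;

    assert(send != NULL);
    for (attempt = 0; attempt <= retries; attempt++) {
        if (send(ctx) == 0)
            return 0;
    }
    return -1;
}

/* Example sender: fails until the counter behind ctx reaches zero. */
static int flaky_send(void *ctx)
{
    int *failures_left = ctx;

    if (*failures_left > 0) {
        (*failures_left)--;
        return -1;
    }
    return 0;
}
```

With 3 retries the send is attempted at most four times in total.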

> ... //depot/dev/nfx-tree/code/ssc-cluster/cluster-contrl-cfg.c#30 edit
> *	@256-7 these need to be indented 1 more to line up with the
> first parameter on the line above.

Did I edit this?  It looks OK to me.  I assume you're talking about
__FUNCTION__ and gClusterConfig->cc_childwhatever.  The first _ lines
up with the opening double quote on the line above.  But maybe I
shouldn't assume.

> ... //depot/dev/nfx-tree/code/ssc-cluster/cluster-server-rpc.c#6 edit
> *	@627-9, 1015-7 indentation.

Did I fix it?

> ... //depot/dev/nfx-tree/code/ssc-genlib/genlib.c#2 edit
> *	@89 to be honest I never understood what the "-c" did on
> reboot in bsd to begin with, so for my own edification, why is it we
> no longer need it?

This is quite the esoteric MIPS thing: cold versus warm reboot.  We
don't have it, and we don't support it, big surprise.  But sometime way
back we added the option to the reboot program on openbsd.  The code in
the BSD kernel is the same for reboot and cold reboot, so it's quite
silly.

It refers to certain CPU registers which can be left un-reset when
the rest of the CPU is reset, and therefore possibly contain data that
might give a clue as to the reboot/reset reason, and/or be used to pass
[very] small amounts of data between reboots.

> *	@153 would you fix the doc on this to indicate the values for
> "secondary"?  I think that means a .h file change, too.  :-(  Thanks.

Fixed the comments.  The .h was fine.

> ... //depot/dev/nfx-tree/code/ssc-genlib/linux.h#8 edit
> 		OK
> 
> ... //depot/dev/nfx-tree/code/ssc-genlib/openbsd.h#8 edit
> 		OK
> 
> ... //depot/dev/nfx-tree/code/ssc-nfxsh/cmd_flash.c#17 edit
> *	@787 I think we can delete this since it will be done @804.
> You should check with Ian before changing the actual text of the
> message -- I'm not sure how much pattern matching he may be doing on
> it, if any.

Fixed.

> *	@1223 we no longer will issue the reboot_boards() (or
> equivalent to reboot the other cores on cheetah).  I don't know for
> sure, but I'm guessing we should.

Anything that might send that message has been stopped.

> ... //depot/dev/nfx-tree/code/ssc-nfxsh/cmd_promupgrade.c#13 edit
> *	@1150, 1159 since you're changing these, would you also fix
> the spacing like you did @1148?  :-)

Sorry, line numbers must have changed, because all those lines are the
same.  They even use the dreaded spaces.

> *	@1234 I know that any sane OS will sync before it shuts down,
> but our bsd implementation has proven time and again not to be sane.
> So I have to ask, do we need the syncs for bsd?

Since both kernels do a sync before unmounting a filesystem, it isn't
just redundant, it's slower.  Slower is badder.  I'll add it back for
ya'.  In genlib.c
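
Roughly what adding it back amounts to, as a sketch only.  The reboot
step is passed in as a callback so the ordering can be shown (and tried
out) without actually rebooting anything.  sync_then_reboot, reboot_fn
and fake_reboot are hypothetical names, not the real genlib.c code.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical type for whatever performs the actual reboot. */
typedef int (*reboot_fn)(void);

/*
 * Flush dirty buffers first, then hand off to the reboot step.
 * sync() is the POSIX call; it returns void, so there is nothing to
 * check before moving on.
 */
static int sync_then_reboot(reboot_fn do_reboot)
{
    assert(do_reboot != NULL);
    sync();
    return do_reboot();
}

/* Stand-in reboot step that only counts how often it was called. */
static int fake_reboot_calls = 0;

static int fake_reboot(void)
{
    fake_reboot_calls++;
    return 0;
}
```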

> *	@1237 we no longer run shutdown, kill inetd, kill emrs, kill
> pm. I have no idea if it matters.  :-(

cmd_system-openbsd.c, line 45 (a no-op on linux because reboot command
does it for us)

> ... //depot/dev/nfx-tree/code/ssc-nfxsh/cmd_system.c#52 edit
> *	@326 we need to initialize rebootflags before this.

fixed.

> *	@327-338 and @412 I'm confused again.  We don't want to do
> the offline if fast is set.  (Maybe we just don't want to wait for it
> to complete, that I don't know.)  But, I think you want to change the
> sense of 327 so we'll call vol_offline_all() if fast is not set.

Yes, you're confused.  Or I am.

> *	@334 use "@@@" for consistency with the rest of the code
> base.

fixed.

> *	@342 we no longer call ensure_fs_writable(), which we
> claimed we needed to do to update the superblock time.  I have no
> idea if this is required for /sbin/halt or whether it might take care
> of whatever it needs itself.  (Novel concept.)

Probably a large source of corruption.  Why update the superblock?  If
the fs hasn't been mounted writeable, it seems utterly silly to mount
it writeable to modify the superblock ... when the only reason to
update the superblock is to know when the fs was last written to.
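
If we ever wanted to know whether an fs is mounted read-only before
touching it, a check along these lines would do.  This is only a sketch
using POSIX statvfs(); the function name is hypothetical and it is not
what ensure_fs_writable() does.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/statvfs.h>

/*
 * Returns 1 if the filesystem containing `path` is mounted read-only,
 * 0 if it is mounted writable, and -1 if statvfs() fails (for
 * example because the path does not exist).
 */
static int fs_is_readonly(const char *path)
{
    struct statvfs sv;

    assert(path != NULL);
    if (statvfs(path, &sv) != 0)
        return -1;
    return (sv.f_flag & ST_RDONLY) ? 1 : 0;
}
```

A halt path could call this and simply leave a read-only fs alone
rather than remounting it writable just to stamp the superblock.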

> *	@348 looks alive to me.  :-)  I think it will be a hard sell
> since non-root users get stuck in nfxsh and need a mechanism to issue
> a reboot.

dore.  moved the comments to the right place.

> *	@427 we're no longer killing pm here.

yup, system_run_shutdown

> *	@429 this makes more sense than what was there!
> 
> ... //depot/dev/nfx-tree/code/ssc-nfxsh/cmd_upgrade.c#32 edit
> *	Shouldn't the new conditional compilation be LINUX rather
> than COUGAR?

Nack.  We don't do conditional compilation on Linux or BSD, with just a
couple of exceptions.
Not sure where you're talking about though.

> *	@1252 I'll take your word for it.

onstor init script does all this for us

> *	@1461-8 shouldn't we only do this for cheetah?

nope, for ev-furry-body.  my dog prefers that term.

> *	@1471 I have no idea why we need this, but I can see that
> complete_upgrade() used to do it, and I'd rather have the common code,
> even if it introduces a delay (that may or may not be necessary) for
> the other callers.  (Besides, it's "only cheetah".)  So, okay.

only used for upgrade of primary flash, which, yes, is quite messy

> ... //depot/dev/nfx-tree/code/ssc-pm/pm.c#20 edit
> *	@1446 can we call utils_reboot_system() here?  Some of what
> it does may fail, but I think we might still get to genlib_reboot().
> Maybe...

no, things are hosed in this case, we'd be pretty lucky if even this
worked.  almost certainly the pm_log won't work.

> My thoughts are this is too risky to put in for GA.  I'd rather see it
> go in early in a product cycle so we get a lot of in-house testing in
> a variety of scenarios.

I was starting to agree with Rendell who said this as well.  But then
we ended up with these dueling douche-bag bugs:

24067 Cougar-Beta: system config restore causes partner node to reboot
23034 - kernel crash: DBE in yenta_irq during shutdown

We still need one more piece after this: one to quiesce the FP on
command.  I wonder who will write it.

> Also, I'm hoping to get a change checked in in the next few days that
> adds a call to utils_reboot_system() in vsd.  Just a heads up.  (Sorry
> about that.)

Absorbed that one already ~:^)

> ChrisV
> 
> -----Original Message-----
> From: Andy Sharp 
> Sent: Tuesday, July 08, 2008 6:44 PM
> To: Chris Vandever
> Cc: Rendell Fong
> Subject: please review 29639
> 
> I've been testing this on my filers -- so far so good with every
> variation I could try.  I don't know how to purposefully cause a core
> to crash, which would trigger chassisd reboot.
> 
> Also, I don't know the defect for this.  Chris, is there one still?
> 
> Thanks,
> 
> a
> 
> 
> Change 29639 by andys@ripper on 2008/06/11 17:17:43 *pending*
> 
>         TED
>         
>         Implement a higher level general purpose reboot routine that uses
>         genlib_reboot to do the last little bit.
>         
>         reviewed by
> 
> Affected files ...
> 
> 
