AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:<20080617155547.1cb20ab8@ripper.onstor.net>
CFG:
PT:0
S:andy.sharp@onstor.com
RQ:
SSV:onstor-exch02.onstor.net
NSV:
SSH:
R:<kumarv@onstor.com>
MAID:1
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#imap/andys@onstor.net@onstor-exch02.onstor.net/INBOX	0	BB375AF679D4A34E9CA8DFA650E2B04E0A82D316@onstor-exch02.onstor.net
X-Sylpheed-End-Special-Headers: 1
Date: Tue, 17 Jun 2008 15:58:13 -0700
From: Andrew Sharp <andy.sharp@onstor.com>
To: "Kumar Vakacharla (HCL)" <kumarv@onstor.com>
Subject: Re: Review Request : TED22005
Message-ID: <20080617155813.3e658d0e@ripper.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E0A82D316@onstor-exch02.onstor.net>
References: <BB375AF679D4A34E9CA8DFA650E2B04E0A60C96D@onstor-exch02.onstor.net>
	<20080612114827.609ea49c@ripper.onstor.net>
	<BB375AF679D4A34E9CA8DFA650E2B04E0A6E85FD@onstor-exch02.onstor.net>
	<20080612131534.7825d252@ripper.onstor.net>
	<BB375AF679D4A34E9CA8DFA650E2B04E0A6E8D58@onstor-exch02.onstor.net>
	<20080616091448.37dd22b3@ripper.onstor.net>
	<BB375AF679D4A34E9CA8DFA650E2B04E0A6E8FF9@onstor-exch02.onstor.net>
	<20080616104017.7a66b93b@ripper.onstor.net>
	<BB375AF679D4A34E9CA8DFA650E2B04E0A82D316@onstor-exch02.onstor.net>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Hi Kumar,

As per my last email, I don't think we should be changing the code that
only affects cheetahs/nfxsh.  There's just no reason.

I don't think we should do all the fork/exec stuff either.  All we care
about is getting the system to reboot.

And we definitely should not be using asprintf.  You don't need any
additional memory if you don't do the exec stuff anyway.
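
For what it's worth, the simple version I mean is roughly the
following.  This is a sketch only: the helper name is made up, the
assert is just my own guard, and in genlib the argument would be
PLATFORM_REBOOT_CMD.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (name and signature are illustrative, not from
 * genlib.c): put the caller in its own process group, then run the
 * command through system().  No fork/exec/wait of our own and no
 * extra allocation.
 */
int run_in_own_pgrp(const char *cmd)
{
    assert(cmd != NULL);

    /*
     * Make this process the leader of a new process group.  This can
     * fail with EPERM if we are already a session leader, in which
     * case we are already a group leader, so report it and carry on.
     */
    if (setpgid(0, 0) == -1)
        perror("setpgid");

    /* Returns the wait status of the shell, or -1 on failure. */
    return system(cmd);
}
```

The sh and reboot processes that system() spawns inherit the caller's
new pgid, so a signal sent to the pm process group does not reach them.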

If you want to get a second opinion on the first two things, let me
know.

Cheers,

a


On Tue, 17 Jun 2008 15:48:23 -0700 "Kumar Vakacharla (HCL)"
<kumarv@onstor.com> wrote:

> Andy, 
> 
>   Please let me know whether I can go ahead with this fix. Also please
> look at the r320rel changelist 29678. This fix is supposed to go to
> R3.2.0.6-patch this week. 
> 
> Thanks,
> Kumar.
> 
> -----Original Message-----
> From: Kumar Vakacharla (HCL) 
> Sent: Monday, June 16, 2008 12:34 PM
> To: Andy Sharp
> Subject: RE: Review Request : TED22005
> 
> Andy, 
> 
> Please find my responses inline. 
> 
> -----Original Message-----
> From: Andy Sharp 
> Sent: Monday, June 16, 2008 10:40 AM
> To: Kumar Vakacharla (HCL)
> Subject: Re: Review Request : TED22005
> 
> > Change 29620 by perforce@kumarv-DEV on 2008/06/10 18:23:56 *pending*
> > 
> > 	TED00022005 (LSI-PA 6989) Each system hung after a number of
> > crashes (minority pcc state)
> > 	
> > 	Fix Description: 
> > 	Make pgid of reboot process different from pm process
> > group. The processes in pm group may receive SIGTERM during reboot
> > operation. 
> 
> nfx-tree/code/sm-chassis/chassisd-bc.c
> 
>      looks good
> 
> nfx-tree/code/sm-chassis/chassisd-cg.c
> 
>      looks good
> 
> nfx-tree/code/sm-chassis/chassisd-msg.c
> 
>      looks good
> 
> nfx-tree/code/ssc-genlib/genlib.c
> -----------------------------
>      line 15: why is _GNU_SOURCE being defined?
> Kumar>> This is because "asprintf" failed to compile for linux without
> this macro. The man page also mentions this macro.   
> -----------------------------	
> 
> 	line 85, why do the fork/wait/etc?  why not just
>      setpgid(0,0);system(PLATFORM_REBOOT_CMD); ?
> 
> 	Kumar> If we invoke system("/sbin/reboot"), it will in turn
> create two processes, "sh" and "/sbin/reboot", with the same pgid. As I
> mentioned earlier, the problem here is that when "sh" gets a SIGTERM
> from reboot (kill(-1, SIGTERM)), it may in turn send the SIGTERM to
> its group, and hence reboot will also be affected, as both share the
> same pgid. So I have avoided the system() call, which creates one more
> unnecessary process, "sh", that sometimes triggers this issue, as I
> have seen.
>          
> -----------------------------
>      line 141, this doesn't need to be modified, it is only called on
>      cheetahs and only from nfxsh.
> 	
> Kumar> Agreed, but I think the problem could also happen in the normal
> reboot (though it is much less probable), so it is better to make the
> pgid of the reboot process unique.
> 
> 
> 
> Thanks,
> Kumar.
> -----------------------------
> On Mon, 16 Jun 2008 09:43:36 -0700 "Kumar Vakacharla (HCL)"
> <kumarv@onstor.com> wrote:
> 
> > Hi Andy,
> > 
> > Sorry for that.  I have reopened the change 29620 now. 
> > 
> > 
> > Thanks,
> > Kumar.
> > 
> > -----Original Message-----
> > From: Andy Sharp 
> > Sent: Monday, June 16, 2008 9:15 AM
> > To: Kumar Vakacharla (HCL)
> > Subject: Re: Review Request : TED22005
> > 
> > Hi Kumar,
> > 
> > The changelist 29620 has no files in it.
> > 
> > They're probably in your default change list?  You can move them to
> > 29620 with the reopen command:
> > 
> > p4 reopen -c 29620 <file>
> > 
> > 
> > 
> > On Fri, 13 Jun 2008 22:03:42 -0700 "Kumar Vakacharla (HCL)"
> > <kumarv@onstor.com> wrote:
> > 
> > > Andy, 
> > > 
> > >   I have modified the code according to your comments and ready
> > > for review.
> > > 
> > >   Please let me know if you see any problems. 
> > > 
> > > P4CLIENT=kumarv-DEV
> > > P4 Change Id: 29620
> > > PATH: /homes/kumarv/work/dev/
> > >
> >
> ========================================================================
> > > =
> > > kumarv@compile2>p4 describe 29620
> > > Change 29620 by perforce@kumarv-DEV on 2008/06/10 18:23:56
> > > *pending*
> > > 
> > >         TED00022005 (LSI-PA 6989) Each system hung after a number
> > > of crashes
> > >         (minority pcc state)
> > > 
> > >         Fix Description:
> > >         	Make pgid of reboot process different from pm
> > > process group.
> > >     		The processes in pm group may receive SIGTERM
> > > during the reboot 		operation.
> > > Affected files ...
> > > 
> > > ... //depot/dev/nfx-tree/code/sm-chassis/chassisd-bc.c#12 edit
> > > ... //depot/dev/nfx-tree/code/sm-chassis/chassisd-cg.c#10 edit
> > > ... //depot/dev/nfx-tree/code/sm-chassis/chassisd-msg.c#12 edit
> > > ... //depot/dev/nfx-tree/code/ssc-genlib/genlib.c#1 edit
> > >
> >
> ========================================================================
> > > =
> > > 
> > > Since I need to provide a patch for 3.2.0.5, I have made similar
> > > changes in r320 branch. Please review them too. 
> > > 
> > > P4CLIENT=kumarv-r320rel
> > > P4 Change Id: 29678
> > > PATH: /homes/kumarv/work/r320rel/
> > >
> >
> ========================================================================
> > > ====
> > > kumarv@linux-compile>p4 describe 29678
> > > Change 29678 by perforce@kumarv-r320rel on 2008/06/13 15:26:31
> > > *pending*
> > > 
> > >            TED00022005 (LSI-PA 6989) Each system hung after a
> > > number of crashes
> > >         (minority pcc state)
> > > 
> > >         Fix Description:
> > >         Make pgid of reboot process different from pm process
> > > group. The processes in pm group may receive SIGTERM during the
> > > reboot operation.
> > > 
> > > Affected files ...
> > > 
> > > ... //depot/r320rel/nfx-tree/code/sm-chassis/chassisd-bc.c#1 edit
> > > ... //depot/r320rel/nfx-tree/code/sm-chassis/chassisd-msg.c#1 edit
> > > ... //depot/r320rel/nfx-tree/code/sm-chassis/chassisd.c#1 edit
> > > > ... //depot/r320rel/nfx-tree/code/ssc-genlib/cm-reboot-linux.c#1 edit
> > > > ... //depot/r320rel/nfx-tree/code/ssc-genlib/cm-reboot-openbsd.c#1 edit
> > > > ... //depot/r320rel/nfx-tree/code/ssc-genlib/genlib-linux.c#2 edit
> > > > ... //depot/r320rel/nfx-tree/code/ssc-genlib/genlib-openbsd.c#1 edit
> > > 
> > >
> >
> ========================================================================
> > > ====
> > > 
> > > Thanks,
> > > Kumar.
> > > 
> > > -----Original Message-----
> > > From: Andy Sharp 
> > > Sent: Thursday, June 12, 2008 1:16 PM
> > > To: Kumar Vakacharla (HCL)
> > > Subject: Re: Review Request : TED22005
> > > 
> > > Feel free to come by and talk about it.  Right now, I'm doing some
> > > follow on work to some code that Chris Vandever is soon to check
> > > in that will be the start of an attempt to consolidate all
> > > attempts to reboot the system from our code, including daemons
> > > and nfxsh.  So perhaps if you concentrated on adding it to the
> > > genlib code, that might be enough for now, and the other places
> > > in our code that unwisely do something like system("reboot") on
> > > their own will be cleaned up later.
> > > 
> > > BTW, I don't think the reboot program should be immune to SIGTERM.
> > > Perhaps I might want to be able to kill the reboot program from
> > > some other program, who knows?
> > > 
> > > 
> > > On Thu, 12 Jun 2008 12:15:37 -0700 "Kumar Vakacharla (HCL)"
> > > <kumarv@onstor.com> wrote:
> > > 
> > > > Hi Andy, 
> > > > 
> > > > > I understand it. In fact I tried a similar thing in our code
> > > > > initially. Then I realized that there are many places where we
> > > > > reboot the system using system("reboot"). So I thought that
> > > > > instead of changing it in multiple places I could make the
> > > > > change in the BSD reboot code itself, so that even future calls
> > > > > to system("reboot") won't break it. I think the reboot process is
> > > > > not supposed to be terminated by SIGTERM from other processes,
> > > > > and that's why I made the fix there.
> > > > 
> > > > Anyways, I will try to do it as you suggested.
> > > > 
> > > > Thanks,
> > > > Kumar.
> > > > 
> > > > -----Original Message-----
> > > > From: Andy Sharp 
> > > > Sent: Thursday, June 12, 2008 11:48 AM
> > > > To: Kumar Vakacharla (HCL)
> > > > Subject: Re: Review Request : TED22005
> > > > 
> > > > Hi Kumar,
> > > > 
> > > > I've had a chance to take a look at this, and while you're
> > > > right, this is one viable approach, I would much prefer to
> > > > stick to a design philosophy of modifying our code first and
> > > > system/distro code only as a last resort.
> > > > 
> > > > Can we instead code up a method whereby the reboot command is
> > > > run in a process that is not part of the initial process
> > > > group?  Ie, do a fork;setpgrp;do_system(reboot) kind of thing?
> > > > 
> > > > Thanks,
> > > > 
> > > > a
> > > > 
> > > > On Tue, 10 Jun 2008 18:28:36 -0700 "Kumar Vakacharla (HCL)"
> > > > <kumarv@onstor.com> wrote:
> > > > 
> > > > > Andy, 
> > > > > 
> > > > >  
> > > > > 
> > > > > Can you please review the fix for this defect?
> > > > > 
> > > > >  
> > > > > 
> > > > >  
> > > > > 
> > > > > Defect : 
> > > > > 
> > > > >  
> > > > > 
> > > > > TED00022005 (LSI-PA 6989) Each system hung after a number of
> > > > > crashes (minority pcc state)
> > > > > 
> > > > >  
> > > > > 
> > > > >  
> > > > > 
> > > > > Root Cause:  
> > > > > 
> > > > > "reboot" process is getting killed in the middle of reboot
> > > > > operation hence the system hangs. 
> > > > > 
> > > > >  
> > > > > 
> > > > > Details: 
> > > > > 
> > > > >  
> > > > > 
> > > > > > During the reboot process:
> > > > > > 
> > > > > > -          The reboot program (/sbin/reboot) issues "kill(-1,
> > > > > > SIGTERM)" to kill all the processes in the system except
> > > > > > "init" and itself.
> > > > > > 
> > > > > > -          When any of the forked shells (e.g. "sh support.sh"
> > > > > > or shells created by the system() call) receives this signal,
> > > > > > it sometimes in turn sends that signal to its whole process
> > > > > > group using kill(0, SIGTERM). Since the reboot process also
> > > > > > belongs to the same process group, it gets killed, and hence
> > > > > > the system hangs.
> > > > > 
> > > > >  
> > > > > 
> > > > > Fix Description: 
> > > > > 
> > > > >  
> > > > > 
> > > > > >             Initially I tried cleaning up the processes
> > > > > > (support.sh, emrscron, pm, etc.) before reboot issues "kill
> > > > > > (-1, SIGTERM)". I also tried changing the order in which we
> > > > > > terminate these processes during the cleanup.  But neither of
> > > > > > these approaches worked out.
> > > > > 
> > > > >  
> > > > > 
> > > > >             Finally, the fix would be to ignore the SIGTERM
> > > > > signal during "reboot" operation. 
> > > > > 
> > > > >  
> > > > > 
> > > > >  
> > > > > 
> > > > > Affected Files:
> > > > > 
> > > > > *
> /homes/kumarv//work/dev/openbsd/src/sbin/reboot/reboot.c
> > > > > 
> > > > >  
> > > > > 
> > > > > P4CLIENT=kumarv-DEV
> > > > > 
> > > > > P4 Change Id: 29620
> > > > > 
> > > > >  
> > > > > 
> > > > > Please let me know if you need any clarifications. 
> > > > > 
> > > > >  
> > > > > 
> > > > >  
> > > > > 
> > > > > Thanks,
> > > > > Kumar.
> > > > > 
