AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:<20080617162404.2153dd6e@ripper.onstor.net>
CFG:
PT:0
S:andy.sharp@onstor.com
RQ:
SSV:onstor-exch02.onstor.net
NSV:
SSH:
R:<kumarv@onstor.com>
MAID:1
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#imap/andys@onstor.net@onstor-exch02.onstor.net/INBOX	41246	BB375AF679D4A34E9CA8DFA650E2B04E0A6E915D@onstor-exch02.onstor.net
X-Sylpheed-End-Special-Headers: 1
Date: Tue, 17 Jun 2008 16:24:50 -0700
From: Andrew Sharp <andy.sharp@onstor.com>
To: "Kumar Vakacharla (HCL)" <kumarv@onstor.com>
Subject: Re: Review Request : TED22005
Message-ID: <20080617162450.61d0f2f7@ripper.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E0A6E915D@onstor-exch02.onstor.net>
References: <BB375AF679D4A34E9CA8DFA650E2B04E0A60C96D@onstor-exch02.onstor.net>
 <20080612114827.609ea49c@ripper.onstor.net>
 <BB375AF679D4A34E9CA8DFA650E2B04E0A6E85FD@onstor-exch02.onstor.net>
 <20080612131534.7825d252@ripper.onstor.net>
 <BB375AF679D4A34E9CA8DFA650E2B04E0A6E8D58@onstor-exch02.onstor.net>
 <20080616091448.37dd22b3@ripper.onstor.net>
 <BB375AF679D4A34E9CA8DFA650E2B04E0A6E8FF9@onstor-exch02.onstor.net>
 <20080616104017.7a66b93b@ripper.onstor.net>
 <BB375AF679D4A34E9CA8DFA650E2B04E0A6E915D@onstor-exch02.onstor.net>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Kumar,

Oops, it looks like I got interrupted while writing this and never
finished it or sent it to you.  My apologies.

a

On Mon, 16 Jun 2008 12:34:12 -0700 "Kumar Vakacharla (HCL)"
<kumarv@onstor.com> wrote:

> Andy, 
> 
> Please find my responses inline. 
> 
> -----Original Message-----
> From: Andy Sharp 
> Sent: Monday, June 16, 2008 10:40 AM
> To: Kumar Vakacharla (HCL)
> Subject: Re: Review Request : TED22005
> 
> > Change 29620 by perforce@kumarv-DEV on 2008/06/10 18:23:56 *pending*
> > 
> > 	TED00022005 (LSI-PA 6989) Each system hung after a number of
> > crashes (minority pcc state)
> > 	
> > 	Fix Description: 
> > 	Make pgid of reboot process different from pm process
> > group. The processes in pm group may receive SIGTERM during reboot
> > operation. 
> 
> nfx-tree/code/sm-chassis/chassisd-bc.c
> 
>      looks good
> 
> nfx-tree/code/sm-chassis/chassisd-cg.c
> 
>      looks good
> 
> nfx-tree/code/sm-chassis/chassisd-msg.c
> 
>      looks good
> 
> nfx-tree/code/ssc-genlib/genlib.c
> -----------------------------
>      line 15: why is _GNU_SOURCE being defined?
> Kumar>> This is because "asprintf" failed to compile for linux without
> this macro. The man page also mentions this macro.   
> -----------------------------	

Don't use asprintf.  ~:^)
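
Something like the following would avoid the _GNU_SOURCE define
altogether.  This is only a sketch: the helper name and the format
string are made up for illustration, so adapt it to whatever string
genlib.c is actually building.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not from genlib.c): format "<prog> <arg>" into a
 * caller-supplied buffer using plain snprintf(), which needs no
 * _GNU_SOURCE.  Returns 0 on success, -1 on error or truncation.
 */
static int
build_cmd(char *buf, size_t len, const char *prog, const char *arg)
{
	int n = snprintf(buf, len, "%s %s", prog, arg);

	/* snprintf returns the length it wanted; >= len means truncated. */
	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}
```

A fixed-size buffer on the stack is plenty for a short command line, and
there is nothing to free afterwards.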

> 	line 85, why do the fork/wait/etc?  why not just
>      setpgid(0,0);system(PLATFORM_REBOOT_CMD); ?
> 
> 	Kumar> If we invoke system command "system("/sbin/reboot")" it
> will in turn creates 2 processes "sh" and "/sbin/reboot" with the same
> pgid. As I mentioned earlier the problem here is when "sh" gets a
> SIGTERM signal from reboot (kill(-1, SIGTERM)it may in turn send the
> SIGTERM to its group and hence reboot will also be affected as both
> will share the same pgid. So I have avoided system command that
> creates one more unnecessary process "sh" which sometimes triggers
> this issue as I have seen. 

Well, we don't care about the number of processes being created -- we
are rebooting after all.  As for the other scenario, if what you are
saying were a problem, then it could affect anyone typing "reboot" on
the shell command line, and that just isn't the case.
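
To be concrete, this is roughly what I have in mind.  Treat it as an
untested sketch: the helper name is made up, and the command is passed
in as a parameter here where genlib would use PLATFORM_REBOOT_CMD.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (the name is not from genlib.c): move the calling
 * process into its own process group, then run the command via system().
 * Returns whatever system() returns: a wait status (decode it with
 * WIFEXITED/WEXITSTATUS) or -1 on failure.
 */
static int
run_in_own_pgrp(const char *cmd)
{
	/*
	 * pid 0, pgid 0: make the caller the leader of a new process group.
	 * This fails with EPERM if the caller is a session leader.
	 */
	if (setpgid(0, 0) != 0)
		perror("setpgid");	/* report it, but still try to run cmd */

	return system(cmd);
}
```

The shell and the command it runs inherit the new pgid, so they are no
longer in the pm process group.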

> -----------------------------
>      line 141, this doesn't need to be modified, it is only called on
>      cheetahs and only from nfxsh.
> 	
> Kumar> Agree, but I think the problem could happen even in the normal
> reboot (though very less probable) also and better to make pgid of
> reboot process unique. 

I don't see how it can happen when the "system reboot" command is run
by the user, and I'm unaware of any reported cases where anyone claims
it did.

> 
> 
> Thanks,
> Kumar.
> -----------------------------
> On Mon, 16 Jun 2008 09:43:36 -0700 "Kumar Vakacharla (HCL)"
> <kumarv@onstor.com> wrote:
> 
> > Hi Andy,
> > 
> > Sorry for that.  I have reopened the change 29620 now. 
> > 
> > 
> > Thanks,
> > Kumar.
> > 
> > -----Original Message-----
> > From: Andy Sharp 
> > Sent: Monday, June 16, 2008 9:15 AM
> > To: Kumar Vakacharla (HCL)
> > Subject: Re: Review Request : TED22005
> > 
> > Hi Kumar,
> > 
> > The changelist 29620 has no files in it.
> > 
> > They're probably in your default change list?  You can move them to
> > 29620 with the reopen command:
> > 
> > p4 reopen -c 29620 <file>
> > 
> > 
> > 
> > On Fri, 13 Jun 2008 22:03:42 -0700 "Kumar Vakacharla (HCL)"
> > <kumarv@onstor.com> wrote:
> > 
> > > Andy, 
> > > 
> > >   I have modified the code according to your comments and ready
> > > for review.
> > > 
> > >   Please let me know if you see any problems. 
> > > 
> > > P4CLIENT=kumarv-DEV
> > > P4 Change Id: 29620
> > > PATH: /homes/kumarv/work/dev/
> > >
> >
> ========================================================================
> > > =
> > > kumarv@compile2>p4 describe 29620
> > > Change 29620 by perforce@kumarv-DEV on 2008/06/10 18:23:56
> > > *pending*
> > > 
> > >         TED00022005 (LSI-PA 6989) Each system hung after a number
> > > of crashes
> > >         (minority pcc state)
> > > 
> > >         Fix Description:
> > > >         	Make pgid of reboot process different from pm process group.
> > > >         	The processes in pm group may receive SIGTERM during the reboot operation.
> > > Affected files ...
> > > 
> > > ... //depot/dev/nfx-tree/code/sm-chassis/chassisd-bc.c#12 edit
> > > ... //depot/dev/nfx-tree/code/sm-chassis/chassisd-cg.c#10 edit
> > > ... //depot/dev/nfx-tree/code/sm-chassis/chassisd-msg.c#12 edit
> > > ... //depot/dev/nfx-tree/code/ssc-genlib/genlib.c#1 edit
> > >
> >
> ========================================================================
> > > =
> > > 
> > > Since I need to provide a patch for 3.2.0.5, I have made similar
> > > changes in r320 branch. Please review them too. 
> > > 
> > > P4CLIENT=kumarv-r320rel
> > > P4 Change Id: 29678
> > > PATH: /homes/kumarv/work/r320rel/
> > >
> >
> ========================================================================
> > > ====
> > > kumarv@linux-compile>p4 describe 29678
> > > Change 29678 by perforce@kumarv-r320rel on 2008/06/13 15:26:31
> > > *pending*
> > > 
> > >            TED00022005 (LSI-PA 6989) Each system hung after a
> > > number of crashes
> > >         (minority pcc state)
> > > 
> > >         Fix Description:
> > >         Make pgid of reboot process different from pm process
> > > group. The processes in pm group may receive SIGTERM during the
> > > reboot operation.
> > > 
> > > Affected files ...
> > > 
> > > ... //depot/r320rel/nfx-tree/code/sm-chassis/chassisd-bc.c#1 edit
> > > ... //depot/r320rel/nfx-tree/code/sm-chassis/chassisd-msg.c#1 edit
> > > ... //depot/r320rel/nfx-tree/code/sm-chassis/chassisd.c#1 edit
> > > > ... //depot/r320rel/nfx-tree/code/ssc-genlib/cm-reboot-linux.c#1 edit
> > > > ... //depot/r320rel/nfx-tree/code/ssc-genlib/cm-reboot-openbsd.c#1 edit
> > > > ... //depot/r320rel/nfx-tree/code/ssc-genlib/genlib-linux.c#2 edit
> > > > ... //depot/r320rel/nfx-tree/code/ssc-genlib/genlib-openbsd.c#1 edit
> > > 
> > >
> >
> ========================================================================
> > > ====
> > > 
> > > Thanks,
> > > Kumar.
> > > 
> > > -----Original Message-----
> > > From: Andy Sharp 
> > > Sent: Thursday, June 12, 2008 1:16 PM
> > > To: Kumar Vakacharla (HCL)
> > > Subject: Re: Review Request : TED22005
> > > 
> > > Feel free to come by and talk about it.  Right now, I'm doing some
> > > follow on work to some code that Chris Vandever is soon to check
> > > in that will be the start of an attempt to consolidate all
> > > attempts to reboot the system from our code, including daemons
> > > and nfxsh.  So perhaps if you concentrated on adding it to the
> > > genlib code, that might be enough for now, and the other places
> > > in our code that unwisely do something like system("reboot") on
> > > their own will be cleaned up later.
> > > 
> > > BTW, I don't think the reboot program should be immune to SIGTERM.
> > > Perhaps I might want to be able to kill the reboot program from
> > > some other program, who knows?
> > > 
> > > 
> > > On Thu, 12 Jun 2008 12:15:37 -0700 "Kumar Vakacharla (HCL)"
> > > <kumarv@onstor.com> wrote:
> > > 
> > > > Hi Andy, 
> > > > 
> > > > I understand it. In fact I have tried similar thing in our code
> > > > initially. Then I realized that there are many places we reboot
> > > > the system using system("reboot")". So I thought instead of
> > > > changing it in multiple places I can make it in reboot code of
> > > > BSD itself so that even future calls to system(reboot) won't
> > > > break it. I think reboot process is not supposed to be
> > > > terminated by SIGTERM from the other processes and that's why I
> > > > made the fix there.  
> > > > 
> > > > Anyways, I will try to do it as you suggested.
> > > > 
> > > > Thanks,
> > > > Kumar.
> > > > 
> > > > -----Original Message-----
> > > > From: Andy Sharp 
> > > > Sent: Thursday, June 12, 2008 11:48 AM
> > > > To: Kumar Vakacharla (HCL)
> > > > Subject: Re: Review Request : TED22005
> > > > 
> > > > Hi Kumar,
> > > > 
> > > > I've had a chance to take a look at this, and while you're
> > > > right, this is one viable approach, I would much prefer to
> > > > stick to a design philosophy of modifying our code first and
> > > > system/distro code only as a last resort.
> > > > 
> > > > Can we instead code up a method whereby the reboot command is
> > > > run in a process that is not part of the initial process
> > > > group?  Ie, do a fork;setpgrp;do_system(reboot) kind of thing?
> > > > 
> > > > Thanks,
> > > > 
> > > > a
> > > > 
> > > > On Tue, 10 Jun 2008 18:28:36 -0700 "Kumar Vakacharla (HCL)"
> > > > <kumarv@onstor.com> wrote:
> > > > 
> > > > > Andy, 
> > > > > 
> > > > >  
> > > > > 
> > > > > Can you please review the fix for this defect?
> > > > > 
> > > > >  
> > > > > 
> > > > >  
> > > > > 
> > > > > Defect : 
> > > > > 
> > > > >  
> > > > > 
> > > > > TED00022005 (LSI-PA 6989) Each system hung after a number of
> > > > > crashes (minority pcc state)
> > > > > 
> > > > >  
> > > > > 
> > > > >  
> > > > > 
> > > > > Root Cause:  
> > > > > 
> > > > > "reboot" process is getting killed in the middle of reboot
> > > > > operation hence the system hangs. 
> > > > > 
> > > > >  
> > > > > 
> > > > > Details: 
> > > > > 
> > > > >  
> > > > > 
> > > > > During the reboot process... ".  
> > > > > 
> > > > > -          reboot program (/sbin/reboot) issues "kill(-1,
> > > > > SIGTERM)" to kill all the processes in the system except
> > > > > "init" and himself.
> > > > > 
> > > > > -          When any of the forked shells (e.g. "sh support.sh"
> > > > > or shells created by system command) receives this signal they
> > > > > sometimes in turn send that signal to all the group using
> > > > > kill(0, SIGTERM). Since the reboot process also belongs to the
> > > > > same process group it gets killed hence the system hangs. 
> > > > > 
> > > > >  
> > > > > 
> > > > > Fix Description: 
> > > > > 
> > > > >  
> > > > > 
> > > > >             Initially tried by cleaning up the processes
> > > > > (support.sh. emrscron, pm, etc) before reboot issues "kill
> > > > > (-1, SIGTERM)". Also tried changing the order in which we
> > > > > terminate these processes during the cleanup.  But both of
> > > > > these approaches didn't work out. 
> > > > > 
> > > > >  
> > > > > 
> > > > >             Finally, the fix would be to ignore the SIGTERM
> > > > > signal during "reboot" operation. 
> > > > > 
> > > > >  
> > > > > 
> > > > >  
> > > > > 
> > > > > Affected Files:
> > > > > 
> > > > > *
> /homes/kumarv//work/dev/openbsd/src/sbin/reboot/reboot.c
> > > > > 
> > > > >  
> > > > > 
> > > > > P4CLIENT=kumarv-DEV
> > > > > 
> > > > > P4 Change Id: 29620
> > > > > 
> > > > >  
> > > > > 
> > > > > Please let me know if you need any clarifications. 
> > > > > 
> > > > >  
> > > > > 
> > > > >  
> > > > > 
> > > > > Thanks,
> > > > > Kumar.
> > > > > 
