AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:
CFG:
PT:0
S:andy.sharp@onstor.com
RQ:
SSV:onstor-exch02.onstor.net
NSV:
SSH:
R:<kumarv@onstor.com>
MAID:1
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#imap/andys@onstor.net@onstor-exch02.onstor.net/INBOX	0	BB375AF679D4A34E9CA8DFA650E2B04E0A6E8FF9@onstor-exch02.onstor.net
X-Sylpheed-End-Special-Headers: 1
Date: Mon, 16 Jun 2008 10:40:17 -0700
From: Andrew Sharp <andy.sharp@onstor.com>
To: "Kumar Vakacharla (HCL)" <kumarv@onstor.com>
Subject: Re: Review Request : TED22005
Message-ID: <20080616104017.7a66b93b@ripper.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E0A6E8FF9@onstor-exch02.onstor.net>
References: <BB375AF679D4A34E9CA8DFA650E2B04E0A60C96D@onstor-exch02.onstor.net>
	<20080612114827.609ea49c@ripper.onstor.net>
	<BB375AF679D4A34E9CA8DFA650E2B04E0A6E85FD@onstor-exch02.onstor.net>
	<20080612131534.7825d252@ripper.onstor.net>
	<BB375AF679D4A34E9CA8DFA650E2B04E0A6E8D58@onstor-exch02.onstor.net>
	<20080616091448.37dd22b3@ripper.onstor.net>
	<BB375AF679D4A34E9CA8DFA650E2B04E0A6E8FF9@onstor-exch02.onstor.net>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

> Change 29620 by perforce@kumarv-DEV on 2008/06/10 18:23:56 *pending*
> 
> 	TED00022005 (LSI-PA 6989) Each system hung after a number of
> crashes (minority pcc state)
> 	
> 	Fix Description: 
> 	Make pgid of reboot process different from pm process group. 
> 	The processes in pm group may receive SIGTERM during reboot
> operation. 

nfx-tree/code/sm-chassis/chassisd-bc.c

     looks good

nfx-tree/code/sm-chassis/chassisd-cg.c

     looks good

nfx-tree/code/sm-chassis/chassisd-msg.c

     looks good

nfx-tree/code/ssc-genlib/genlib.c

     line 15: why is _GNU_SOURCE being defined?

     line 85: why do the fork/wait/etc?  Why not just
     setpgid(0, 0); system(PLATFORM_REBOOT_CMD); ?

     line 141: this doesn't need to be modified; it is only called on
     cheetahs and only from nfxsh.
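
     For the line 85 comment, roughly what I have in mind is below.  It
     is only a sketch, not the genlib code: run_in_own_pgrp() is a
     made-up name, and the real caller would pass PLATFORM_REBOOT_CMD.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (name invented for this sketch).
 *
 * Put the calling process into a process group of its own, then run
 * cmd via system().  The shell started by system() inherits the new
 * pgid, so a kill(0, SIGTERM) issued by a process in the old group
 * is not delivered to it.
 *
 * Returns the exit status of cmd, or -1 on any failure.
 */
int
run_in_own_pgrp(const char *cmd)
{
	int status;

	/* Fails with EPERM if the caller is a session leader. */
	if (setpgid(0, 0) == -1)
		return -1;

	status = system(cmd);
	if (status == -1 || !WIFEXITED(status))
		return -1;

	return WEXITSTATUS(status);
}
```

     One case where a fork would still be needed: setpgid() fails with
     EPERM when the caller is a session leader, so a daemon that has
     called setsid() would have to fork before changing its group.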





On Mon, 16 Jun 2008 09:43:36 -0700 "Kumar Vakacharla (HCL)"
<kumarv@onstor.com> wrote:

> Hi Andy,
> 
> Sorry for that.  I have reopened the change 29620 now. 
> 
> 
> Thanks,
> Kumar.
> 
> -----Original Message-----
> From: Andy Sharp 
> Sent: Monday, June 16, 2008 9:15 AM
> To: Kumar Vakacharla (HCL)
> Subject: Re: Review Request : TED22005
> 
> Hi Kumar,
> 
> The changelist 29620 has no files in it.
> 
> They're probably in your default change list?  You can move them to
> 29620 with the reopen command:
> 
> p4 reopen -c 29620 <file>
> 
> 
> 
> On Fri, 13 Jun 2008 22:03:42 -0700 "Kumar Vakacharla (HCL)"
> <kumarv@onstor.com> wrote:
> 
> > Andy, 
> > 
> >   I have modified the code according to your comments and it is
> > ready for review.
> > 
> >   Please let me know if you see any problems. 
> > 
> > P4CLIENT=kumarv-DEV
> > P4 Change Id: 29620
> > PATH: /homes/kumarv/work/dev/
> >
> > ========================================================================
> > kumarv@compile2>p4 describe 29620
> > Change 29620 by perforce@kumarv-DEV on 2008/06/10 18:23:56 *pending*
> > 
> >         TED00022005 (LSI-PA 6989) Each system hung after a number of
> >         crashes (minority pcc state)
> > 
> >         Fix Description:
> >         Make pgid of reboot process different from pm process group.
> >         The processes in pm group may receive SIGTERM during the
> >         reboot operation.
> > 
> > Affected files ...
> > 
> > ... //depot/dev/nfx-tree/code/sm-chassis/chassisd-bc.c#12 edit
> > ... //depot/dev/nfx-tree/code/sm-chassis/chassisd-cg.c#10 edit
> > ... //depot/dev/nfx-tree/code/sm-chassis/chassisd-msg.c#12 edit
> > ... //depot/dev/nfx-tree/code/ssc-genlib/genlib.c#1 edit
> >
> > ========================================================================
> > 
> > Since I need to provide a patch for 3.2.0.5, I have made similar
> > changes in r320 branch. Please review them too. 
> > 
> > P4CLIENT=kumarv-r320rel
> > P4 Change Id: 29678
> > PATH: /homes/kumarv/work/r320rel/
> >
> > ========================================================================
> > kumarv@linux-compile>p4 describe 29678
> > Change 29678 by perforce@kumarv-r320rel on 2008/06/13 15:26:31 *pending*
> > 
> >         TED00022005 (LSI-PA 6989) Each system hung after a number of
> >         crashes (minority pcc state)
> > 
> >         Fix Description:
> >         Make pgid of reboot process different from pm process group.
> >         The processes in pm group may receive SIGTERM during the
> >         reboot operation.
> > 
> > Affected files ...
> > 
> > ... //depot/r320rel/nfx-tree/code/sm-chassis/chassisd-bc.c#1 edit
> > ... //depot/r320rel/nfx-tree/code/sm-chassis/chassisd-msg.c#1 edit
> > ... //depot/r320rel/nfx-tree/code/sm-chassis/chassisd.c#1 edit
> > ... //depot/r320rel/nfx-tree/code/ssc-genlib/cm-reboot-linux.c#1 edit
> > ... //depot/r320rel/nfx-tree/code/ssc-genlib/cm-reboot-openbsd.c#1 edit
> > ... //depot/r320rel/nfx-tree/code/ssc-genlib/genlib-linux.c#2 edit
> > ... //depot/r320rel/nfx-tree/code/ssc-genlib/genlib-openbsd.c#1 edit
> > 
> >
> > ========================================================================
> > 
> > Thanks,
> > Kumar.
> > 
> > -----Original Message-----
> > From: Andy Sharp 
> > Sent: Thursday, June 12, 2008 1:16 PM
> > To: Kumar Vakacharla (HCL)
> > Subject: Re: Review Request : TED22005
> > 
> > Feel free to come by and talk about it.  Right now I'm doing some
> > follow-on work to code that Chris Vandever will soon check in.  It
> > is the start of an effort to consolidate every place our code
> > reboots the system, including daemons and nfxsh.  So perhaps if you
> > concentrated on adding it to the genlib code, that might be enough
> > for now, and the other places in our code that unwisely do
> > something like system("reboot") on their own will be cleaned up
> > later.
> > 
> > BTW, I don't think the reboot program should be immune to SIGTERM.
> > Perhaps I might want to be able to kill the reboot program from some
> > other program, who knows?
> > 
> > 
> > On Thu, 12 Jun 2008 12:15:37 -0700 "Kumar Vakacharla (HCL)"
> > <kumarv@onstor.com> wrote:
> > 
> > > Hi Andy, 
> > > 
> > > I understand. In fact, I initially tried something similar in our
> > > code. Then I realized that there are many places where we reboot
> > > the system using system("reboot"). So I thought that instead of
> > > changing it in multiple places, I could make the change in the
> > > BSD reboot code itself, so that even future calls to
> > > system("reboot") won't break it. I think the reboot process is not
> > > supposed to be terminated by SIGTERM from other processes, and
> > > that's why I made the fix there.
> > > 
> > > Anyways, I will try to do it as you suggested.
> > > 
> > > Thanks,
> > > Kumar.
> > > 
> > > -----Original Message-----
> > > From: Andy Sharp 
> > > Sent: Thursday, June 12, 2008 11:48 AM
> > > To: Kumar Vakacharla (HCL)
> > > Subject: Re: Review Request : TED22005
> > > 
> > > Hi Kumar,
> > > 
> > > I've had a chance to take a look at this, and while you're right,
> > > this is one viable approach, I would much prefer to stick to a
> > > design philosophy of modifying our code first and system/distro
> > > code only as a last resort.
> > > 
> > > Can we instead code up a method whereby the reboot command is run
> > > in a process that is not part of the initial process group?  I.e.,
> > > do a fork; setpgrp; do_system(reboot) kind of thing?
> > > 
> > > Thanks,
> > > 
> > > a
> > > 
> > > On Tue, 10 Jun 2008 18:28:36 -0700 "Kumar Vakacharla (HCL)"
> > > <kumarv@onstor.com> wrote:
> > > 
> > > > Andy, 
> > > > 
> > > > Can you please review the fix for this defect?
> > > > 
> > > > Defect:
> > > > 
> > > > TED00022005 (LSI-PA 6989) Each system hung after a number of
> > > > crashes (minority pcc state)
> > > > 
> > > > Root Cause:
> > > > 
> > > > The "reboot" process is getting killed in the middle of the
> > > > reboot operation, hence the system hangs.
> > > > 
> > > > Details:
> > > > 
> > > > During the reboot process:
> > > > 
> > > > -          The reboot program (/sbin/reboot) issues "kill(-1,
> > > > SIGTERM)" to kill all the processes in the system except "init"
> > > > and itself.
> > > > 
> > > > -          When any of the forked shells (e.g. "sh support.sh"
> > > > or shells created by system()) receives this signal, it
> > > > sometimes in turn sends that signal to its whole process group
> > > > using kill(0, SIGTERM). Since the reboot process also belongs to
> > > > the same process group, it gets killed and the system hangs.
> > > > 
> > > > Fix Description:
> > > > 
> > > >             Initially tried cleaning up the processes
> > > > (support.sh, emrscron, pm, etc.) before reboot issues "kill(-1,
> > > > SIGTERM)". Also tried changing the order in which we terminate
> > > > these processes during the cleanup.  Neither of these
> > > > approaches worked out.
> > > > 
> > > >             Finally, the fix would be to ignore the SIGTERM
> > > > signal during the "reboot" operation.
> > > > 
> > > > Affected Files:
> > > > 
> > > > *         /homes/kumarv//work/dev/openbsd/src/sbin/reboot/reboot.c
> > > > 
> > > > P4CLIENT=kumarv-DEV
> > > > P4 Change Id: 29620
> > > > 
> > > > Please let me know if you need any clarifications.
> > > > 
> > > > Thanks,
> > > > Kumar.
> > > > 
