AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:<20080325185149.5a12e570@ripper.onstor.net>
CFG:
PT:0
S:andy.sharp@onstor.com
RQ:
SSV:onstor-exch02.onstor.net
NSV:
SSH:
R:<mike.lee@onstor.com>,<tim.gardner@onstor.com>,<schandra@onstor.com>,<rendell.fong@onstor.com>,<kumarv@onstor.com>,<larry.scheer@onstor.com>,<maxim.kozlovsky@onstor.com>
MAID:1
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#imap/andys@onstor.net@onstor-exch02.onstor.net/INBOX	0	BB375AF679D4A34E9CA8DFA650E2B04E07A8DAF6@onstor-exch02.onstor.net
X-Sylpheed-End-Special-Headers: 1
Date: Tue, 25 Mar 2008 18:53:39 -0700
From: Andrew Sharp <andy.sharp@onstor.com>
To: "Mike Lee" <mike.lee@onstor.com>
Cc: "Tim Gardner" <tim.gardner@onstor.com>, "Svati Chandra"
 <schandra@onstor.com>, "Rendell Fong" <rendell.fong@onstor.com>, "Kumar
 Vakacharla (HCL)" <kumarv@onstor.com>, "Larry Scheer"
 <larry.scheer@onstor.com>, "Maxim Kozlovsky" <maxim.kozlovsky@onstor.com>
Subject: Re: vsd issue
Message-ID: <20080325185339.6779233c@ripper.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E07A8DAF6@onstor-exch02.onstor.net>
References: <BB375AF679D4A34E9CA8DFA650E2B04E07A8DAC3@onstor-exch02.onstor.net>
	<BB375AF679D4A34E9CA8DFA650E2B04E07A8DAF6@onstor-exch02.onstor.net>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

This tldp reference describes Linux with libc5, is at least 12 years
old, and is incorrect for our Linux, which uses libc6.  Do not use
#include <bsd/signal.h>.

sigaction(2) is always the correct thing to use if possible, especially
since signal(2) has no defined semantics in a multithreaded program and
almost certainly does the opposite of what is desired.
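
For reference, a minimal sketch of installing a handler with
sigaction(2).  The names got_signal, on_signal and install_handler are
made up for illustration and are not from the vsd or rmc sources.
SA_RESTART is assumed here because it gives the BSD-style restart of
interrupted system calls; drop it if the code expects EINTR.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Hypothetical flag and handler names, for illustration only. */
static volatile sig_atomic_t got_signal = 0;

static void on_signal(int signo)
{
    (void)signo;
    got_signal = 1;             /* do only async-signal-safe work here */
}

/* Install on_signal for signo. Returns 0 on success, -1 on error. */
static int install_handler(int signo)
{
    struct sigaction sa;

    assert(signo > 0);
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);   /* block no additional signals in the handler */
    sa.sa_flags = SA_RESTART;   /* restart interrupted system calls */

    return sigaction(signo, &sa, NULL);
}
```

Unlike signal(), the handler stays installed after it fires and the
signal being handled is blocked while the handler runs, so there is no
window where a second delivery hits the default action.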

Yes, Linux has sigrelse(3).


On Tue, 25 Mar 2008 18:30:27 -0700 "Mike Lee" <mike.lee@onstor.com>
wrote:

> 
> Tim:
> Recall the following email that I forwarded you, which initiated the
> conversion effort that Kumar has completed and I'm reviewing.
> I just consulted Larry to check whether we're using bsd/signal.h for
> our bobcat/cheetah compile, and it does not seem like we are.  I will
> forward Larry's finding shortly.
> However, as such, do you still think we need to go through with this
> conversion?
> Thanks.
> -Mike
> 
> 
> -----Original Message-----
> From: Svati Chandra 
> Sent: Tuesday, March 18, 2008 3:47 PM
> To: Rendell Fong
> Cc: Mike Lee
> Subject: RE: vsd issue
> 
> 
> 
> Hi Rendell,
> 
> Linux and BSD semantics for the signal() function differ
> as described below. BSD implements reliable signals;
> to get the same behavior on Linux we would need
> to compile with #include <bsd/signal.h>
> 
> http://tldp.org/LDP/lpg/node136.html
> 
> When using bsd/signal.h, setjmp and longjmp will save and
> restore the signal mask, but signal.h will not.
> However sigrelse is not available in Linux, so maybe it's
> better to use the POSIX sigsetjmp and siglongjmp methods
> if we want to use reliable signals.
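
A rough sketch of the sigsetjmp/siglongjmp pattern mentioned above,
under the assumption of a single-threaded process.  All names
(recover_point, on_usr1, is_blocked, jump_restores_mask) are invented
for the example.  Passing a nonzero second argument to sigsetjmp is
what asks it to save the signal mask, and siglongjmp then restores it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf recover_point;    /* hypothetical name */

static void on_usr1(int signo)
{
    (void)signo;
    siglongjmp(recover_point, 1);   /* restores the mask saved by sigsetjmp */
}

/* Returns 1 if signo is blocked in the current mask, 0 if not, -1 on error. */
static int is_blocked(int signo)
{
    sigset_t cur;

    assert(signo > 0);
    if (sigprocmask(SIG_BLOCK, NULL, &cur) != 0)
        return -1;
    return sigismember(&cur, signo);
}

/* Returns 1 if the handler jumped back and SIGUSR1 is unblocked again. */
static int jump_restores_mask(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    if (sigsetjmp(recover_point, 1) == 0) {   /* nonzero: save the signal mask */
        raise(SIGUSR1);   /* handler runs with SIGUSR1 blocked, then jumps */
        return 0;         /* reached only if the jump did not happen */
    }
    return is_blocked(SIGUSR1) == 0;
}
```

With plain setjmp/longjmp on libc6 the mask is not restored, so SIGUSR1
would stay blocked after the jump out of the handler.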
> 
> Let me know how things work out with this change.
> 
> In general, "sigaction" is more portable than "signal".
> 
> Thanks,
> Svati.
> 
> 
> 
> 
> -----Original Message-----
> From: Svati Chandra 
> Sent: Tuesday, March 18, 2008 10:36 AM
> To: Mike Lee; Rendell Fong
> Cc: Tim Gardner
> Subject: RE: vsd issue
> 
> 
> Hi Mike,
> 
> The good thing is that you can reproduce it.
> Could you try with a change that saves the transaction entries on
> stack and doesn't clear them as they complete. This may help us
> localize the area of structure corruption.
> 
> In the function vsd_popTxnState, we could bracket the area that
> clears the stack under #ifndef DEBUG.
> 
> #ifndef DEBUG
>     txn->txnStack[txn->numStackElements].procFunc      = NULL;
>     txn->txnStack[txn->numStackElements].context       = NULL;
>     txn->txnStack[txn->numStackElements].returnState   = 0;
>     txn->txnStack[txn->numStackElements].evtDetectFlag = 0;
> #endif
> 
> 
> Thanks Mike,
> Svati.
> 
> 
> 
> 
> 
> -----Original Message-----
> From: Mike Lee 
> Sent: Tuesday, March 18, 2008 12:48 AM
> To: Rendell Fong; Svati Chandra
> Cc: Raj Kumar; Tim Gardner
> Subject: RE: vsd issue
> 
> 
> Rendell:
> I just tried it.  The change did not cure vsd, and vsd crashed after 3
> enables/disables.
> However, I think we still need to take care of the rmc changes you
> indicated.
> Thanks.
> -Mike
> 
> -----Original Message-----
> From: Rendell Fong
> Sent: Mon 3/17/2008 10:06 PM
> To: Mike Lee; Svati Chandra
> Cc: Raj Kumar; Tim Gardner
> Subject: RE: vsd issue
>  
> I haven't tried it.  If you start down that path, there are also
> several other places inside of rmc that might need the same
> adjustment.
> 
> So far I've linked in the dmalloc library and done some testing.  The
> problems seem to go away with dmalloc usage.
> This suggests that there's something that is timing sensitive because
> dmalloc will just slow things down.
> 
> Or perhaps there are conflicts between malloc/free calls and the
> rmc signal handler?
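
On that last question: malloc and free are not async-signal-safe, so a
handler that allocates (directly or through something it calls) can
corrupt the heap if it interrupts malloc or free in the main code.
Whether the rmc handler does this is not established in this thread.
If it does, one possible guard is to block the signal around the
allocator call.  The sketch below assumes a single-threaded process,
and free_with_signal_blocked is a hypothetical helper, not an existing
function.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdlib.h>

/* Hypothetical wrapper: free ptr while signo is blocked, so a handler
 * for signo cannot interrupt free(). Returns 0 on success, -1 on error. */
static int free_with_signal_blocked(void *ptr, int signo)
{
    sigset_t block, saved;

    assert(signo > 0);
    sigemptyset(&block);
    sigaddset(&block, signo);
    if (sigprocmask(SIG_BLOCK, &block, &saved) != 0)
        return -1;

    free(ptr);      /* the handler for signo cannot run in here */

    /* Restore the previous mask; a pending signo is delivered now. */
    return sigprocmask(SIG_SETMASK, &saved, NULL);
}
```

The other common approach is to have the handler only set a flag and
do the real work from the main loop, which avoids the problem without
touching every allocation site.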
> 
> 
> 
> -----Original Message-----
> From: Mike Lee
> Sent: Mon 3/17/2008 8:10 PM
> To: Mike Lee; Rendell Fong; Svati Chandra
> Cc: Raj Kumar
> Subject: RE: vsd issue
>  
> Rendell & Svati:
> 
> I noticed that vsd_receiveMessage() is using nfx_select(), which uses
> select(), which as Chris V discovered last week, requires timeout
> adjustment.
> Have we tried making that change yet?
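
The exact adjustment is not spelled out here.  Assuming it refers to
Linux select() overwriting the timeval with the time that was not
slept, the usual fix is to rebuild the timeout (and the fd set) before
every call.  wait_readable below is a made-up helper to show the
shape; it is not nfx_select().

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for fd to become readable.
 * Returns select()'s result: >0 readable, 0 timed out, -1 error. */
static int wait_readable(int fd, long timeout_ms)
{
    fd_set rfds;
    struct timeval tv;

    assert(fd >= 0 && fd < FD_SETSIZE);

    /* Rebuild both arguments on every call: select() overwrites the
     * fd set, and on Linux it also updates tv to the unslept time. */
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec  = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    return select(fd + 1, &rfds, NULL, NULL, &tv);
}
```

A loop that reuses one timeval across calls will see its timeout shrink
toward zero on Linux and eventually spin, which BSD code often does not
expect.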
> 
> I don't think that it is the solution, since I observed a corrupt
> program counter between two of my breakpoints (that somehow recovered).
> 
> However, I think it is worthwhile to fix this one if it hasn't been
> done yet.
> 
> -Mike
> -----Original Message-----
> From: Mike Lee 
> Sent: Monday, March 17, 2008 5:03 PM
> To: Rendell Fong; Svati Chandra
> Subject: RE: vsd issue
> 
> 
> Rendell:
> If you/Svati are still on this one, is this problem reproducible?
> If so, can you please tell me how?
> Thanks.
> -Mike
> 
> -----Original Message-----
> From: Rendell Fong 
> Sent: Monday, March 17, 2008 10:23 AM
> To: Svati Chandra
> Cc: Mike Lee
> Subject: RE: vsd issue
> 
> 
> I'm open to suggestions.  Where the crash occurs does not seem
> indicative of the area in which the bug is.
> 
> After studying the vsd code, I've noticed the following so far:
> 
> 1) vs-daemon.c line 12660: same flag tested 3 times
> 
> 2) vs-daemon.c line 6354: when ctx is NULL can't goto
> VSD_CREATE_VS_RUN_TIME_FAILED since ctx is later referenced (must be
> non-NULL)
> 
> 3) vs-util.c: vsd_allocTxn() always allocates transaction entries
> starting from the beginning of the array.  Not necessarily an error
> but may cause messages from prior transaction to get processed in the
> context of a new transaction.
> 
> 
> -----Original Message-----
> From: Svati Chandra
> Sent: Monday, March 17, 2008 7:21 AM
> To: Rendell Fong; Mike Lee
> Cc: Tim Gardner
> Subject: RE: vsd issue
> 
> 
> Hi Rendell,
> 
> dmalloc has a couple of tunables, e.g. how often the heap is checked,
> etc.
> Maybe we could verify/experiment with those.
> 
> Thanks,
> Svati.
> 
> 
> -----Original Message-----
> From: Rendell Fong 
> Sent: Saturday, March 15, 2008 8:00 AM
> To: Mike Lee
> Cc: Svati Chandra; Tim Gardner
> Subject: RE: vsd issue
> 
> I think that VSD_RETURN_TO_PARENT pops the current vsdTxn (ctx entry)
> off the stack.  When the same ctx entry is reused the context
> parameter is being initialized (set to NULL).  So I don't think it's
> necessary. What I don't understand is why dmalloc didn't catch this
> problem.
> 
> 
> -----Original Message-----
> From: Mike Lee
> Sent: Fri 3/14/2008 7:18 PM
> To: Rendell Fong
> Cc: Svati Chandra; Tim Gardner
> Subject: RE: vsd issue
>  
> Rendell:
> 
> At the very least, do you think we should set txn->context to NULL
> after those three calls to VSD_CHECK_AND_FREE_PTR()?
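
What that suggestion would look like, as a sketch only.
FREE_AND_NULL and struct txn_sketch are stand-ins invented for the
example; the real VSD_CHECK_AND_FREE_PTR macro and transaction type
are not shown in this thread and may already do part of this.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical macro, not the real VSD_CHECK_AND_FREE_PTR:
 * free the pointer and clear it in one step. */
#define FREE_AND_NULL(p) \
    do {                 \
        free(p);         \
        (p) = NULL;      \
    } while (0)

struct txn_sketch {      /* stand-in for the real transaction type */
    void *context;
};

/* Safe to call more than once: free(NULL) is a no-op. */
static void release_context(struct txn_sketch *txn)
{
    assert(txn != NULL);
    FREE_AND_NULL(txn->context);
}
```

Clearing the pointer turns a later double free into a harmless
free(NULL) and makes a stale use fault at address zero instead of
silently touching recycled memory.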
> 
> -Mike
> 
> -----Original Message-----
> From: Rendell Fong
> Sent: Fri 3/14/2008 6:28 PM
> To: Mike Lee
> Cc: Svati Chandra; Tim Gardner
> Subject: RE: vsd issue
>  
> After further analysis that place wasn't it.  Somehow the ctx->req
> address is invalid such that a seg fault occurs when it is freed.
> Where it's getting messed up isn't obvious at the moment. 
> 
> 
> -----Original Message-----
> From: Rendell Fong 
> Sent: Friday, March 14, 2008 6:03 PM
> To: Mike Lee
> Cc: Svati Chandra; Tim Gardner
> Subject: FW: vsd issue
> 
> It looks like vs-daemon.c: line 12629 needs to be deleted.
> 
> 
> -----Original Message-----
> From: Rendell Fong 
> Sent: Friday, March 14, 2008 5:56 PM
> To: Mike Lee; Svati Chandra
> Cc: Tim Gardner
> Subject: RE: vsd issue
> 
> This is the vsd crash that we were able to reproduce by just enabling
> the vsvr CSOAK-VS3 on g11r10.
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x2b6df5f4 in free () from /lib/libc.so.6
> (gdb) bt
> #0  0x2b6df5f4 in free () from /lib/libc.so.6
> #1  0x0043e06c in vsd_removeVsReqProc (vsdTxn=0x2bbb3008) at
> vs-daemon.c:12733
> #2  0x00453530 in vsd_processTxnRunQueue (maxProcess=128) at
> vs-util.c:1455
> #3  0x0040ddf8 in main (argc=2, argv=0x7f914e24) at vs-daemon.c:2736
> (gdb)
> 
> -----Original Message-----
> From: Mike Lee 
> Sent: Friday, March 14, 2008 5:52 PM
> To: Svati Chandra
> Cc: Jobi Ariyamannil; Rendell Fong; Vikas Saini
> Subject: vsd issue
> 
> 
> Hi Svati:
> 
> Rendell and Vikas are chasing a VSD crash related to free().  
> 
> I recall that you had found a TAILQ related bug last night.
> If so, do you plan to check in a fix for it?
> 
> Thanks.
> -Mike
> 
> 
> 
> 
> 
> 
