AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:<20080730110852.078335de@ripper.onstor.net>
CFG:
PT:0
S:andy.sharp@onstor.com
RQ:
SSV:onstor-exch02.onstor.net
NSV:
SSH:
R:<sripal.surendiran@onstor.com>,<ed.kwan@onstor.com>
MAID:1
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#imap/andys@onstor.net@onstor-exch02.onstor.net/INBOX	0	48905013.6090604@onstor.com
X-Sylpheed-End-Special-Headers: 1
Date: Wed, 30 Jul 2008 11:08:59 -0700
From: Andrew Sharp <andy.sharp@onstor.com>
To: Sripal <sripal.surendiran@onstor.com>
Cc: Ed Kwan <ed.kwan@onstor.com>
Subject: Re: Need a clarification on the bug TED24831
Message-ID: <20080730110859.1371ac0a@ripper.onstor.net>
In-Reply-To: <48905013.6090604@onstor.com>
References: <48905013.6090604@onstor.com>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

The system was out of memory.  The page allocation failed, and the code
did not account for that failure all the way back up the call chain; it
probably switched to an invalid stack when unsleep was called or when it
returned.  I don't think there's anything to do at this point.
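
For reference, the NULL check asked about below could look roughly like
the following.  This is only a userland sketch under stated assumptions:
struct proc and struct slpque are cut-down stand-ins for the kernel
structures, unsleep_guarded is a made-up name, and the splhigh/splx calls
and the LOOKUP hash are left out.  It would stop the fault in the list
walk, but it would not address the failed allocation that probably led
there.

```c
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the kernel structures. */
struct proc {
    struct proc *p_forw;    /* next process on the same sleep queue */
    void        *p_wchan;   /* wait channel; NULL when not sleeping */
};

struct slpque {
    struct proc  *sq_head;  /* first sleeper on this queue */
    struct proc **sq_tailp; /* address of the last p_forw, or of sq_head */
};

/*
 * Unlink p from qp.  Returns 1 if p was found and removed, 0 if the
 * walk reached the end of the list without finding p (the case that
 * dereferences NULL in the unguarded loop).
 */
static int
unsleep_guarded(struct slpque *qp, struct proc *p)
{
    struct proc **hp = &qp->sq_head;

    /* Same walk as the original, but stop at the end of the list. */
    while (*hp != NULL && *hp != p)
        hp = &(*hp)->p_forw;
    if (*hp == NULL)
        return 0;   /* queue and p_wchan disagree: state is inconsistent */

    *hp = p->p_forw;
    if (qp->sq_tailp == &p->p_forw)
        qp->sq_tailp = hp;
    p->p_wchan = NULL;
    return 1;
}
```

A caller would still have to decide what a 0 return means; in a kernel
the honest answer is usually a panic with a useful message rather than
carrying on.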

On Wed, 30 Jul 2008 16:57:15 +0530 Sripal
<sripal.surendiran@onstor.com> wrote:

> Andy,
> 
> I am working on the BSD crash (TED24831) and need your suggestions on
> it.
> 
> It looks like a page fault occurred and the page daemon tried to
> allocate a new page for that process. After allocating the new page,
> the daemon tried to change the process to the runnable state by calling
> unsleep (removing it from the sleep queue). A segmentation fault
> occurred when unsleep searched the sleep queue data structure for the
> process being woken up.
> 
> Here is the backtrace:
> 
> Program terminated with signal 11, Segmentation fault.
> #0  0x8011d594 in unsleep ()
> (gdb) bt
> #0  0x8011d594 in unsleep ()
> #1  0x8011da60 in setrunnable ()
> #2  0x80127678 in selwakeup ()
> #3  0x80121c60 in logwakeup ()
> #4  0x80121f74 in log ()
> #5  0x801e3a48 in vm_page_alloc ()
> #6  0x801db27c in vm_fault ()
> #7  0x801f599c in trap ()
> #8  0x80100c80 in MipsKernGenException ()
> #9  0x8011ec54 in timeout_add ()
> #10 0x8011d178 in tsleep ()
> #11 0x801e480c in vm_pageout_page ()
> #12 0x801e45f0 in vm_pageout_scan ()
> #13 0x801e4ae0 in vm_pageout ()
> #14 0x8011000c in start_pagedaemon ()
> 
> Here is what the unsleep source code looks like:
> 
> src/sys/kern/kern_synch.c
> 
> void
> unsleep(p)
>     register struct proc *p;
> {
>     register struct slpque *qp;
>     register struct proc **hp;
>     int s;
> 
>     s = splhigh();
>     if (p->p_wchan) {
>         hp = &(qp = &slpque[LOOKUP(p->p_wchan)])->sq_head;
>         while (*hp != p)
>             hp = &(*hp)->p_forw;   /* <-- crash occurred here
>                                     * (line corresponding to address
>                                     * 0x8011d594) */
>         *hp = p->p_forw;
>         if (qp->sq_tailp == &p->p_forw)
>             qp->sq_tailp = hp;
>         p->p_wchan = 0;
>     }
>     splx(s);
> }
> 
> I checked the latest BSD code to see if this problem is fixed, but
> there are a lot of changes in the synchronization code. Should the fix
> be just to check for NULL before dereferencing? I have no clue about
> the root cause, since we have only the crash dump.
> 
> Thanks,
> Sripal.
> 
> 
> 
