AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:
CFG:
PT:0
S:andy.sharp@lsi.com
RQ:
SSV:mhbs.lsil.com
NSV:
SSH:
R:<Ed.Kwan@lsi.com>
MAID:2
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#imap/LSI/INBOX	0	2B044E14371DA244B71F8BF2514563F5040218F2@cosmail03.lsi.com
X-Sylpheed-End-Special-Headers: 1
Date: Tue, 3 Nov 2009 11:02:05 -0800
From: Andrew Sharp <andy.sharp@lsi.com>
To: "Kwan, Ed" <Ed.Kwan@lsi.com>
Subject: Re: Cougar Linux crashes
Message-ID: <20091103110205.6af5b513@ripper.onstor.net>
In-Reply-To: <2B044E14371DA244B71F8BF2514563F5040218F2@cosmail03.lsi.com>
References: <2B044E14371DA244B71F8BF2514563F5040218F2@cosmail03.lsi.com>
Organization: LSI
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Hi Ed,

As best I can tell, there is a bug in the TXRX code that causes the
kernel to take a page fault during an interrupt.  The TXRX puts a bogus
value into the mgmtbus, which the SSC has to interpret as an address
but which isn't a valid one.  In most of these cases it is most likely
NULL, but any bogus address will do it.

As far as I can tell, 4.0.2.7 has a fix for the majority of these, but
precious little can be done on the SSC side because of the way the
mgmtbus code is designed.  It is a shared-memory queueing system with
little or no error handling; packets are processed the way any normal
networking device would process them, interrupt driven and so forth.
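
To make the failure mode concrete, here is a rough user-space sketch of
what a range-checked physical-to-virtual translation over a shared
memory window could look like.  This is not the real mgmtbus code: the
names (SHM_PHYS_BASE, SHM_SIZE, shm_window, shm_phys2virt_checked) and
the window layout are all made up for illustration.  It only shows how a
NULL or otherwise bogus peer-supplied address could be rejected instead
of dereferenced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical window layout; NOT the real mgmtbus definitions. */
#define SHM_PHYS_BASE 0x10000000ULL   /* assumed physical base of the window */
#define SHM_SIZE      0x00100000ULL   /* assumed window size: 1 MiB */

static_assert(SHM_SIZE > 0, "shared window must not be empty");

/* Stands in for the locally mapped shared-memory window. */
static uint8_t shm_window[SHM_SIZE];

/*
 * Translate a peer-supplied physical address into a local pointer.
 * Returns NULL when any part of [phys, phys + len) falls outside the
 * window, so the caller can drop the packet rather than fault on it.
 */
static void *shm_phys2virt_checked(uint64_t phys, uint64_t len)
{
    if (phys < SHM_PHYS_BASE)
        return NULL;                  /* catches NULL and low garbage */

    uint64_t off = phys - SHM_PHYS_BASE;

    /* Written so that off + len cannot overflow. */
    if (off >= SHM_SIZE || len > SHM_SIZE - off)
        return NULL;

    return &shm_window[off];
}
```

With these assumed values, a call such as shm_phys2virt_checked(0, 16)
returns NULL rather than a wild pointer.  A check like this would only
keep the receiver from faulting; it does nothing for the sender that
produced the bad value in the first place.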

When this happens, the mgmtbus is dead and unrecoverable.  Various
hacks could possibly keep the kernel up a little longer, but no error
messages or data could be transmitted from the TXRX to the SSC for
logging or anything like that, so there would be little point in
keeping the kernel up.  If the TXRX hasn't already crashed due to its
bug, it will soon reset the system because it times out trying to ping
the SSC.


On Tue, 3 Nov 2009 11:25:55 -0700 "Kwan, Ed" <Ed.Kwan@lsi.com> wrote:

> Hi Andy,
> 
> Need you to take a closer look at these:
> 
> TED 27548 LSI
> Call Trace :
> [<ffffffff82024d18>] panic+0x88/0x1c0
> [<ffffffff8200782c>] die+0xdc/0xf8
> [<ffffffff820123d4>] do_page_fault+0x3b4/0x3e0
> [<ffffffff82001f20>] ret_from_exception+0x0/0x20
> [<ffffffff82007188>] show_regs+0x38/0x470
> [<ffffffff820076fc>] show_registers+0x14/0x68
> [<ffffffff820077cc>] die+0x7c/0xf8
> [<ffffffff82196878>] MGMTBUS_PHYS2VIRT+0xc8/0x128
> [<ffffffff8219699c>] mgmtbus_hard_start_xmit+0xc4/0x188
> [<ffffffff821b9d40>] __qdisc_run+0x70/0x1e8
> [<ffffffff821ac9fc>] dev_queue_xmit+0x1d4/0x458
> [<ffffffff82223618>] eee_dgram_sendmsg+0x2b8/0x440
> [<ffffffff8219c1d0>] sock_sendmsg+0x98/0xe8
> [<ffffffff8219c468>] sys_sendmsg+0x248/0x320
> [<ffffffff820107c8>] handle_sys+0x108/0x124
> 
> TED 27706 LSI Milpitas
> Call Trace :
> [<ffffffff82024d18>] panic+0x88/0x1c0
> [<ffffffff8200782c>] die+0xdc/0xf8
> [<ffffffff820123d4>] do_page_fault+0x3b4/0x3e0
> [<ffffffff82001f20>] ret_from_exception+0x0/0x20
> [<ffffffff82007188>] show_regs+0x38/0x470
> [<ffffffff820076fc>] show_registers+0x14/0x68
> [<ffffffff820077cc>] die+0x7c/0xf8
> [<ffffffff82196878>] MGMTBUS_PHYS2VIRT+0xc8/0x128
> [<ffffffff8219699c>] mgmtbus_hard_start_xmit+0xc4/0x188
> [<ffffffff821b9d40>] __qdisc_run+0x70/0x1e8
> [<ffffffff821ac9fc>] dev_queue_xmit+0x1d4/0x458
> [<ffffffff82223618>] eee_dgram_sendmsg+0x2b8/0x440
> [<ffffffff8219c1d0>] sock_sendmsg+0x98/0xe8
> [<ffffffff8219c468>] sys_sendmsg+0x248/0x320
> [<ffffffff820107c8>] handle_sys+0x108/0x124
> 
> TED 27711 LSI Shanghai
> Call Trace :
> [<ffffffff82024d18>] panic+0x88/0x1c0
> [<ffffffff8200782c>] die+0xdc/0xf8
> [<ffffffff820123d4>] do_page_fault+0x3b4/0x3e0
> [<ffffffff82001f20>] ret_from_exception+0x0/0x20
> [<ffffffff82007188>] show_regs+0x38/0x470
> [<ffffffff820076fc>] show_registers+0x14/0x68
> [<ffffffff820077cc>] die+0x7c/0xf8
> [<ffffffff82196878>] MGMTBUS_PHYS2VIRT+0xc8/0x128
> [<ffffffff8219699c>] mgmtbus_hard_start_xmit+0xc4/0x188
> [<ffffffff821b9d40>] __qdisc_run+0x70/0x1e8
> [<ffffffff821ac9fc>] dev_queue_xmit+0x1d4/0x458
> [<ffffffff82222e20>] relay_hard_start_xmit+0x1a0/0x230
> [<ffffffff821acb34>] dev_queue_xmit+0x30c/0x458
> [<ffffffff821c8be4>] ip_push_pending_frames+0x30c/0x4e8
> [<ffffffff821ec72c>] udp_push_pending_frames+0x16c/0x450
> [<ffffffff821edb04>] udp_sendmsg+0x324/0x6e0
> [<ffffffff8219c1d0>] sock_sendmsg+0x98/0xe8
> 
> Thanks,
> Ed
> 
