AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:
CFG:
PT:0
S:andy.sharp@lsi.com
RQ:
SSV:mhbs.lsil.com
NSV:
SSH:
R:<Bill.Fisher@lsi.com>,<David.Olien@lsi.com>,<Maxim.Kozlovsky@lsi.com>,<Larry.Scheer@lsi.com>,<Rendell.Fong@lsi.com>
MAID:2
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#imap/LSI/INBOX	0	4BBA7BE2.1020705@lsi.com
X-Sylpheed-End-Special-Headers: 1
Date: Mon, 5 Apr 2010 18:21:36 -0700
From: Andrew Sharp <andy.sharp@lsi.com>
To: "Fisher, Bill" <Bill.Fisher@lsi.com>
Cc: "Olien, David" <David.Olien@lsi.com>, "Kozlovsky, Maxim"
 <Maxim.Kozlovsky@lsi.com>, "Scheer, Larry" <Larry.Scheer@lsi.com>, "Fong,
 Rendell" <Rendell.Fong@lsi.com>
Subject: Re: Do you have a fix for this tuxrx crash?
Message-ID: <20100405182136.1d3a31ce@ripper.onstor.net>
In-Reply-To: <4BBA7BE2.1020705@lsi.com>
References: <DEC609CD0E54B2448DAF023C89AE9755EB50C54A@cosmail02.lsi.com>
	<20100401172005.18845dfb@ripper.onstor.net>
	<DEC609CD0E54B2448DAF023C89AE9755EB50C54C@cosmail02.lsi.com>
	<6C678488C5CEE74F813A4D1948FD2DC7B7BF9276@cosmail02.lsi.com>
	<20100401185243.742bec06@ripper.onstor.net>
	<DEC609CD0E54B2448DAF023C89AE9755EB50C54E@cosmail02.lsi.com>
	<6C678488C5CEE74F813A4D1948FD2DC7B7BF9556@cosmail02.lsi.com>
	<861DA0537719934884B3D30A2666FECC010E4AC9C1@cosmail02.lsi.com>
	<6C678488C5CEE74F813A4D1948FD2DC7B7C5C2D7@cosmail02.lsi.com>
	<20100405154521.68b942a6@ripper.onstor.net>
	<4BBA7BE2.1020705@lsi.com>
Organization: LSI
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

On Mon, 5 Apr 2010 18:10:10 -0600 William Fisher <bill.fisher@lsi.com>
wrote:

> Andrew Sharp wrote:
> > I'll do it.  This buggy-ass code is originally mine, after all.
> > 
> 
> 	What kernel did it EVER work on?

It worked on ALL of our kernels, since this is the first time anyone
has had a problem with it.
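
For anyone reading this later, here is a minimal user-space sketch of
the failure mode discussed in the thread below.  It assumes the config
bytes are read through a signed char pointer (the thread suggests this
but never shows the declaration) and a 32-bit int.  The names cfg,
mac_suffix_len_signed, mac_suffix_len_masked and write_mac_suffix are
hypothetical stand-ins, not the driver's code; cfg[2], cfg[1] and
cfg[0] play the role of offsets 230, 229 and 228.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Characters "%02X:%02X:%02X" would emit (excluding the NUL) when the
 * bytes are widened as signed values.  The casts model the sign
 * extension a negative char undergoes on its way into a variadic call:
 * with a 32-bit int, (signed char)0x82 becomes 0xFFFFFF82. */
int mac_suffix_len_signed(const signed char *cfg, int idx)
{
    return snprintf(NULL, 0, "%02X:%02X:%02X",
                    (unsigned int)cfg[2],
                    (unsigned int)cfg[1],
                    (unsigned int)(cfg[0] | (0x10 | idx)));
}

/* Same format, but each byte is narrowed to unsigned char first, so
 * every field is exactly two hex digits. */
int mac_suffix_len_masked(const signed char *cfg, int idx)
{
    return snprintf(NULL, 0, "%02X:%02X:%02X",
                    (unsigned int)(unsigned char)cfg[2],
                    (unsigned int)(unsigned char)cfg[1],
                    (unsigned int)((unsigned char)cfg[0] | (0x10 | idx)));
}

/* Bounded write: returns 0 if the suffix fit in `room` bytes
 * (including the NUL), -1 if it was truncated or formatting failed. */
int write_mac_suffix(char *dst, size_t room, const signed char *cfg, int idx)
{
    int n;

    assert(dst != NULL);
    n = snprintf(dst, room, "%02X:%02X:%02X",
                 (unsigned int)(unsigned char)cfg[2],
                 (unsigned int)(unsigned char)cfg[1],
                 (unsigned int)((unsigned char)cfg[0] | (0x10 | idx)));
    return (n >= 0 && (size_t)n < room) ? 0 : -1;
}
```

With bytes 0x49, 0xF4, 0x82 the signed variant needs well over the 8
characters left after "00:07:34:", while the masked variant needs
exactly 8.  If every byte happens to be below 0x80, the two variants
agree, which would be consistent with the code never failing before.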

> > On Mon, 5 Apr 2010 16:42:58 -0600 "Olien, David"
> > <David.Olien@lsi.com> wrote:
> > 
> > 
> >>OK, so this patch or one like it fixes the problem.
> >>
> >>I've always hated format strings.
> >>
> >>Sam is the patch's originator.
> >>
> >>But would one of you submit the change to the tree?
> >>
> >>I think going with an unsigned pointer type
> >>for magicmanagementbusringconfig would be more
> >>consistent.
> >>
> >>dave
> >>
> >>
> >>-----Original Message-----
> >>From: Kozlovsky, Maxim 
> >>Sent: Monday, April 05, 2010 3:35 PM
> >>To: Olien, David; Scheer, Larry; Sharp, Andy
> >>Cc: Fong, Rendell; Fisher, Bill
> >>Subject: RE: Do you have a fix for this tuxrx crash?
> >>
> >>This is not a bug in sprintf() or compiler. As Andy rightfully
> >>pointed out it is a bug in our code (or, rather, his code :) ).
> >>
> >>%02 means the value is zero padded to at least 2 characters, but it
> >>does not mean output at most 2 characters.
> >>
> >>For example,
> >>
> >>#include <stdio.h>
> >>
> >>int main(void)
> >>{
> >>        printf("%02x\n", 257);
> >>        return 0;
> >>}
> >>
> >>Will output 101, not 01.
> >>
> >>The fix is ok, or alternatively you can change the type of
> >>magicmanagementbusringconfig to uint8_t to make it shorter.
> >>
> >>
> >>-----Original Message-----
> >>From: Olien, David 
> >>Sent: Monday, April 05, 2010 2:48 PM
> >>To: Olien, David; Scheer, Larry; Sharp, Andy
> >>Cc: Fong, Rendell; Kozlovsky, Maxim; Fisher, Bill
> >>Subject: RE: Do you have a fix for this tuxrx crash?
> >>
> >>I THINK this works around a bug in the sprintf
> >>
> >>Here's the PATCHED version of the code
> >>
> >>    for (idx = 0; idx < chip_max_units; idx++) {
> >>        char ons_mac_str[] = "00:07:34:00:00:00";
> >>
> >>        sprintf(&ons_mac_str[9], "%02X:%02X:%02X",
> >>                (uint8_t)*(magicmanagementbusringconfig + 230),
> >>                (uint8_t)*(magicmanagementbusringconfig + 229),
> >>                (uint8_t)*(magicmanagementbusringconfig + 228) |
> >>                (0x10 | idx));
> >>
> >>        sbmac_setup_hwaddr(idx, ons_mac_str);
> >>    }
> >>
> >>If you look at the stack from the crash, you can pick out
> >>the character string "ons_mac_str" at the top of the
> >>stack listing.  It's easy to see, because it's ASCII characters.
> >>
> >>But WITHOUT this patch, you see lots of "f" characters on the stack.
> >>It looks like MAYBE the dereferences of magicmanagementbusring
> >>are being sign extended.  But I would have thought that the
> >>format string from the sprintf() would have still put only the
> >>lower two characters from the format into ons_mac_str[].
> >>
> >>But instead it looks like it's putting the entire sign-extended
> >>value into ons_mac_str[], and effectively overflowing it.
> >>This probably overwrites some other stack variable, like
> >>chip_max_units, maybe.
> >>
> >>dave
> >>
> >>-----Original Message-----
> >>From: Olien, David 
> >>Sent: Monday, April 05, 2010 2:20 PM
> >>To: Scheer, Larry; Sharp, Andy
> >>Cc: Fong, Rendell; Kozlovsky, Maxim; Fisher, Bill
> >>Subject: RE: Do you have a fix for this tuxrx crash?
> >>
> >>Here's a patch Sam proposed.  I THINK this fixes something.
> >>
> >>s/dolien/workspaces/tuxrx-3/tuxrx/linux/kernel/linux-mips-2.6/drivers/net
> >>3012,3014c3012,3014
> >>< 				(uint8_t)*(magicmanagementbusringconfig + 230),
> >>< 				(uint8_t)*(magicmanagementbusringconfig + 229),
> >>< 				(uint8_t)*(magicmanagementbusringconfig + 228) | (0x10 | idx));
> >>---
> >>> 				*(magicmanagementbusringconfig + 230),
> >>> 				*(magicmanagementbusringconfig + 229),
> >>> 				*(magicmanagementbusringconfig + 228) | (0x10 | idx));
> >>
> >>dolien@dolien-debian:~/workspaces/tuxrx-1/tuxrx/linux/kernel/linux-mips-2.6/drivers/net$
> >>
> >>-----Original Message-----
> >>From: Scheer, Larry 
> >>Sent: Monday, April 05, 2010 10:58 AM
> >>To: Sharp, Andy; Olien, David
> >>Cc: Fong, Rendell; Kozlovsky, Maxim; Fisher, Bill
> >>Subject: RE: Do you have a fix for this tuxrx crash?
> >>
> >>Same symptoms with kernel. Still see the oops in sbmac_init_module
> >>DBE Phys Addr 0010068200
> >>________________________________________
> >>From: Andrew Sharp [andy.sharp@lsi.com]
> >>Sent: Thursday, April 01, 2010 6:52 PM
> >>To: Olien, David
> >>Cc: Scheer, Larry; Fong, Rendell; Kozlovsky, Maxim; Fisher, Bill
> >>Subject: Re: Do you have a fix for this tuxrx crash?
> >>
> >>None of this makes any sense.  Larry, try this kernel on the txrx
> >>instead of the one you built.
> >>
> >>On Thu, 1 Apr 2010 18:59:02 -0600 "Olien, David"
> >><David.Olien@lsi.com> wrote:
> >>
> >>
> >>>I don't understand MIPS assembly code well yet.
> >>>
> >>>But from what I can tell, the panic is occurring
> >>>in the sbmac_init_module() function,
> >>>in the loop that contains the call to alloc_etherdev(),
> >>>at an instruction somewhat before that call.
> >>>
> >>>In that loop, there is a call to sbmac_setup_hwaddr()
> >>>that has been inlined, I think.  sbmac_setup_hwaddr()
> >>>calls sbmac_addr2reg().  In the disassembly just before
> >>>the panic, you can see the call to sbmac_addr2reg().
> >>>
> >>>But I don't know MIPS assembly and register usage
> >>>well enough to tell exactly which memory reference
> >>>instruction is faulting.
> >>>
> >>>dave
> >>>
> >>>-----Original Message-----
> >>>From: Scheer, Larry
> >>>Sent: Thursday, April 01, 2010 5:37 PM
> >>>To: Sharp, Andy
> >>>Cc: Olien, David; Fong, Rendell; Kozlovsky, Maxim; Fisher, Bill
> >>>Subject: RE: Do you have a fix for this tuxrx crash?
> >>>
> >>>I was trying to recall what is different about bottom blades. I
> >>>think they are supposed to get their MAC address a different way
> >>>from the top blade. Something about their address being the top
> >>>blade's + 01 or something like that.
> >>>
> >>>I am copying Brian because he knows what the difference is.
> >>>
> >>>I tried to remove the IP address from the seep by setting it to
> >>>0.0.0.0. That didn't change anything; still getting the same panic.
> >>>
> >>>Larry
> >>>________________________________________
> >>>From: Andrew Sharp [andy.sharp@lsi.com]
> >>>Sent: Thursday, April 01, 2010 5:20 PM
> >>>To: Scheer, Larry
> >>>Cc: Olien, David; Fong, Rendell; Kozlovsky, Maxim; Fisher, Bill
> >>>Subject: Re: Do you have a fix for this tuxrx crash?
> >>>
> >>>On Thu, 1 Apr 2010 17:41:34 -0600 "Scheer, Larry"
> >>><Larry.Scheer@lsi.com> wrote:
> >>>
> >>>
> >>>>Hi David,
> >>>>   Andy said you might have a fix for a TXRX crash we are seeing
> >>>>on a QA system. We are booting the bottom blade and see a crash in
> >>>>sbmac_init.
> >>>>
> >>>>I suspect this problem may only be occurring on the bottom blades.
> >>>>What filer are you using? I could check to see if it, too is the
> >>>>bottom blade.
> >>>>
> >>>>If you do, could you send to me the diffs.
> >>>>
> >>>>Thanks,
> >>>>
> >>>>Larry
> >>>>
> >>>>DBE physical address: 0010068200
> >>>
> >>>                        ^^^^^^^^^^
> >>>This is a very puzzling address; I cannot see why sbmac_init_module
> >>>would be hitting that address.  Something is strange.  Possibly
> >>>there is something very recently broken in the branch.
> >>>
> >>>Dave, was this what you were seeing?
> >>>
> >>>It should be hitting 10064000 - 10067000, but should never get to
> >>>8000 as far as I can tell.  Not in that function, which is pretty
> >>>simple.
> >>>
> >>>
> >>>
> >>>>Data bus error, epc == ffffffff8332ab84, ra == ffffffff8332ab7c
> >>>>Oops[#1]:
> >>>>Cpu 0
> >>>>$ 0   : 0000000000000000 0000000014001fe0 0000000001200008 00000000000000ff
> >>>>$ 4   : a80000000b05bec4 0000000000000007 0000000000000034 0000000000000007
> >>>>$ 8   : 0000000000000000 0000000000000008 0000000000000041 0000000000000008
> >>>>$12   : a80000000b05bee7 0000000000000010 0000000000000000 ffffffff83248910
> >>>>$16   : 000000000000000f a80000000b05beda a80000000b05beca 9000000010068208
> >>>>$20   : 000000000000003a 000000000000002d 0000000000000005 a80000000b05beca
> >>>>$24   : 0000000000000000 0000000000000030
> >>>>$28   : a80000000b058000 a80000000b05beb0 a80000000b05bec4 ffffffff8332ab7c
> >>>>Hi    : 000000000000000f
> >>>>Lo    : 0000000000000000
> >>>>epc   : ffffffff8332ab84 sbmac_init_module+0x26c/0x5d0     Not tainted
> >>>>ra    : ffffffff8332ab7c sbmac_init_module+0x264/0x5d0
> >>>>Status: 14001fe3 KX SX UX KERNEL EXL IE
> >>>>Cause : 0080801c
> >>>>PrId  : 01041100
> >>>>Modules linked in:
> >>>>Process swapper (pid: 1, threadinfo=a80000000b058000, task=a80000000b057870)
> >>>>Stack : 0000000014001fe1 ffffffff83304d60 0734070083307810 3a37303a3030ffff
> >>>>        46463a37303a3433 463a323846464646 0034394646464646 ffffffff832b0000
> >>>>        0000000000000000 ffffffff833334f0 0000000000000000 fffffffffffffffe
> >>>>        0000000000000000 ffffffff832b0000 ffffffff83330000 ffffffff83330000
> >>>>        ffffffff83330000 ffffffff833147d8 0000000000000000 0000000000000000
> >>>>        0000000000000000 0000000000000000 0000000000000000 0000000000000000
> >>>>        0000000000000000 0000000000000000 0000000000000000 0000000000000000
> >>>>        0000000000000000 0000000000000000 0000000000000000 0000000000000000
> >>>>        0000000000000000 0000000000000000 0000000000000000 ffffffff83004b90
> >>>>        0000000000000000 ffffffff83004b80 5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a
> >>>>        ...
> >>>>Call Trace:
> >>>>[<ffffffff8332ab84>] sbmac_init_module+0x26c/0x5d0
> >>>>[<ffffffff833147d8>] kernel_init+0x1d0/0x3f8
> >>>>[<ffffffff83004b90>] kernel_thread_helper+0x10/0x18
> >>>>
> >>>>
> >>>>Code: 26d60001  fe620000  de620000 <dfa20030> 16c2ff89  66731000  3c028338  3c038338  0000a82d
> >>>>Can't open proc stat
> >>>>Crashdump not saved, prom device open error
> >>>>primary crash already saved... crash #2 (Attempted to kill init!) will be ignored
> >>>>Kernel panic - not syncing: Attempted to kill init!
> >>>>Rebooting in 5 seconds..
> 
