AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:
CFG:
PT:0
S:andy.sharp@lsi.com
RQ:
SSV:mhbs.lsil.com
NSV:
SSH:
R:<Maxim.Kozlovsky@lsi.com>,<David.Olien@lsi.com>,<Larry.Scheer@lsi.com>,<Rendell.Fong@lsi.com>,<Bill.Fisher@lsi.com>
MAID:2
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#imap/LSI/INBOX	0	861DA0537719934884B3D30A2666FECC010E4AC9C1@cosmail02.lsi.com
X-Sylpheed-End-Special-Headers: 1
Date: Mon, 5 Apr 2010 15:44:34 -0700
From: Andrew Sharp <andy.sharp@lsi.com>
To: "Kozlovsky, Maxim" <Maxim.Kozlovsky@lsi.com>
Cc: "Olien, David" <David.Olien@lsi.com>, "Scheer, Larry"
 <Larry.Scheer@lsi.com>, "Fong, Rendell" <Rendell.Fong@lsi.com>, "Fisher,
 Bill" <Bill.Fisher@lsi.com>
Subject: Re: Do you have a fix for this tuxrx crash?
Message-ID: <20100405154434.25e24029@ripper.onstor.net>
In-Reply-To: <861DA0537719934884B3D30A2666FECC010E4AC9C1@cosmail02.lsi.com>
References: <DEC609CD0E54B2448DAF023C89AE9755EB50C54A@cosmail02.lsi.com>
	<20100401172005.18845dfb@ripper.onstor.net>
	<DEC609CD0E54B2448DAF023C89AE9755EB50C54C@cosmail02.lsi.com>
	<6C678488C5CEE74F813A4D1948FD2DC7B7BF9276@cosmail02.lsi.com>
	<20100401185243.742bec06@ripper.onstor.net>
	<DEC609CD0E54B2448DAF023C89AE9755EB50C54E@cosmail02.lsi.com>
	<6C678488C5CEE74F813A4D1948FD2DC7B7BF9556@cosmail02.lsi.com>
	<861DA0537719934884B3D30A2666FECC010E4AC9C1@cosmail02.lsi.com>
Organization: LSI
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Yes, Max, David and Sam are right.  You can change the types of
magic... and ons_mac_str to unsigned char.  Sheesh, why do we let
people have mac addresses with bytes greater than 0x7f?  Obviously
dangerous.
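
For reference, here is a minimal sketch of the failure mode. It is not
the driver code: format_mac() and sample[] are hypothetical names, the
byte values are illustrative, and it assumes a 32-bit int and formats
the whole string rather than writing at &ons_mac_str[9]. A negative
signed char is widened before it reaches the %02X conversion, so each
such byte prints as eight hex digits instead of two.

```c
#include <stdio.h>
#include <stddef.h>

/* Illustrative bytes only: 0x07, 0x82 and 0x94 stored as signed chars. */
static const signed char sample[3] = { 7, -126, -108 };

/*
 * Hypothetical helper, not the driver's function.  Formats a MAC string
 * and returns the length snprintf() needs, excluding the trailing NUL.
 * as_unsigned == 0 mimics the unpatched code (the value is sign extended),
 * as_unsigned != 0 mimics the uint8_t cast in the patch.
 */
static int format_mac(char *buf, size_t len, const signed char *b,
                      int as_unsigned)
{
    unsigned int v[3];
    int i;

    for (i = 0; i < 3; i++) {
        if (as_unsigned)
            v[i] = (unsigned char)b[i];   /* 0x82 stays 0x82 */
        else
            v[i] = (unsigned int)b[i];    /* 0x82 becomes 0xFFFFFF82 */
    }
    /* snprintf() cannot overrun buf, unlike the sprintf() in the loop. */
    return snprintf(buf, len, "00:07:34:%02X:%02X:%02X", v[0], v[1], v[2]);
}
```

Called with as_unsigned == 0 on the sample bytes, this needs 29
characters plus the NUL, against the 18 bytes that ons_mac_str[]
provides.  With as_unsigned != 0 it needs 17 plus the NUL, which fits.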

On Mon, 5 Apr 2010 16:34:39 -0600 "Kozlovsky, Maxim"
<Maxim.Kozlovsky@lsi.com> wrote:

> This is not a bug in sprintf() or the compiler. As Andy rightfully
> pointed out, it is a bug in our code (or, rather, his code :) ).
> 
> %02 means the value is zero-padded to at least 2 characters; it
> does not mean the output is at most 2 characters.
> 
> For example,
> 
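> #include <stdio.h>
> 
> int main(void)
> {
>         /* 257 is 0x101: three hex digits, so the width of 2 has no effect. */
>         printf("%02x\n", 257);
>         return 0;
> }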
> 
> Will output 101, not 01.
> 
> The fix is ok, or alternatively you can change the type of
> magicmanagementbusringconfig to uint8_t to make it shorter.
> 
> 
> -----Original Message-----
> From: Olien, David 
> Sent: Monday, April 05, 2010 2:48 PM
> To: Olien, David; Scheer, Larry; Sharp, Andy
> Cc: Fong, Rendell; Kozlovsky, Maxim; Fisher, Bill
> Subject: RE: Do you have a fix for this tuxrx crash?
> 
> I THINK this works around a bug in sprintf().
> 
> Here's the PATCHED version of the code:
> 
>     for (idx = 0; idx < chip_max_units; idx++) {
>         char ons_mac_str[] = "00:07:34:00:00:00";
> 
>         sprintf(&ons_mac_str[9], "%02X:%02X:%02X",
>                 (uint8_t)*(magicmanagementbusringconfig + 230),
>                 (uint8_t)*(magicmanagementbusringconfig + 229),
>                 (uint8_t)*(magicmanagementbusringconfig + 228) |
>                         (0x10 | idx));
> 
>         sbmac_setup_hwaddr(idx, ons_mac_str);
>     }
> 
> If you look at the stack from the crash, you can pick out
> the character string "ons_mac_str" at the top of the
> stack listing.  It's easy to see, because it's ASCII characters.
> 
> But WITHOUT this patch, you see lots of "f" characters on the stack.
> It looks like MAYBE the dereferences of magicmanagementbusringconfig
> are being sign extended.  But I would have thought that the
> format string from the sprintf() would have still put only the
> lower two characters from the format into ons_mac_str[].
> 
> But instead it looks like it's putting the entire sign-extended
> value into ons_mac_str[], and effectively overflowing it.
> This probably overwrites some other stack variable, like
> chip_max_units, maybe.
> 
> dave
> 
> -----Original Message-----
> From: Olien, David 
> Sent: Monday, April 05, 2010 2:20 PM
> To: Scheer, Larry; Sharp, Andy
> Cc: Fong, Rendell; Kozlovsky, Maxim; Fisher, Bill
> Subject: RE: Do you have a fix for this tuxrx crash?
> 
> Here's a patch Sam proposed.  I THINK this fixes something.
> 
> s/dolien/workspaces/tuxrx-3/tuxrx/linux/kernel/linux-mips-2.6/drivers/net
> 3012,3014c3012,3014
> < 				(uint8_t)*(magicmanagementbusringconfig + 230),
> < 				(uint8_t)*(magicmanagementbusringconfig + 229),
> < 				(uint8_t)*(magicmanagementbusringconfig + 228) | (0x10 | idx));
> ---
> > 				*(magicmanagementbusringconfig + 230),
> > 				*(magicmanagementbusringconfig + 229),
> > 				*(magicmanagementbusringconfig + 228) | (0x10 | idx));
> dolien@dolien-debian:~/workspaces/tuxrx-1/tuxrx/linux/kernel/linux-mips-2.6/drivers/net$
> 
> -----Original Message-----
> From: Scheer, Larry 
> Sent: Monday, April 05, 2010 10:58 AM
> To: Sharp, Andy; Olien, David
> Cc: Fong, Rendell; Kozlovsky, Maxim; Fisher, Bill
> Subject: RE: Do you have a fix for this tuxrx crash?
> 
> Same symptoms with kernel. Still see the oops in sbmac_init_module
> DBE Phys Addr 0010068200
> ________________________________________
> From: Andrew Sharp [andy.sharp@lsi.com]
> Sent: Thursday, April 01, 2010 6:52 PM
> To: Olien, David
> Cc: Scheer, Larry; Fong, Rendell; Kozlovsky, Maxim; Fisher, Bill
> Subject: Re: Do you have a fix for this tuxrx crash?
> 
> None of this makes any sense.  Larry, try this kernel on the txrx
> instead of the one you built.
> 
> On Thu, 1 Apr 2010 18:59:02 -0600 "Olien, David" <David.Olien@lsi.com>
> wrote:
> 
> > I don't understand MIPS assembly code well yet.
> >
> > But from what I can tell, the panic is occurring
> > in the sbmac_init_module() function,
> > in the loop that contains the call to alloc_etherdev(),
> > at an instruction somewhat before that call.
> >
> > In that loop, there is a call to sbmac_setup_hwaddr()
> > that has been inlined, I think.  sbmac_setup_hwaddr()
> > calls sbmac_addr2reg().  In the disassembly just before
> > the panic, you can see the call to sbmac_addr2reg().
> >
> > But I don't know MIPS assembly and register usage
> > well enough to tell exactly which memory reference
> > instruction is faulting.
> >
> > dave
> >
> > -----Original Message-----
> > From: Scheer, Larry
> > Sent: Thursday, April 01, 2010 5:37 PM
> > To: Sharp, Andy
> > Cc: Olien, David; Fong, Rendell; Kozlovsky, Maxim; Fisher, Bill
> > Subject: RE: Do you have a fix for this tuxrx crash?
> >
> > I was trying to recall what is different about bottom blades. I
> > think they are supposed to get their mac address a different way
> > from the top blade. Something about their address being the top
> > blade's + 01 or something like that.
> >
> > I am copying Brian because he knows what the difference is.
> >
> > I tried to remove the IP address from the seep by setting it to
> > 0.0.0.0. That didn't change anything; still getting the same panic.
> >
> > Larry
> > ________________________________________
> > From: Andrew Sharp [andy.sharp@lsi.com]
> > Sent: Thursday, April 01, 2010 5:20 PM
> > To: Scheer, Larry
> > Cc: Olien, David; Fong, Rendell; Kozlovsky, Maxim; Fisher, Bill
> > Subject: Re: Do you have a fix for this tuxrx crash?
> >
> > On Thu, 1 Apr 2010 17:41:34 -0600 "Scheer, Larry"
> > <Larry.Scheer@lsi.com> wrote:
> >
> > > Hi David,
> > >    Andy said you might have a fix for a TXRX crash we are seeing
> > > on a QA system. We are booting the bottom blade and see a crash in
> > > sbmac_init.
> > >
> > > I suspect this problem may only be occurring on the bottom blades.
> > > > What filer are you using? I could check to see if it, too, is a
> > > > bottom blade.
> > > >
> > > > If you do, could you send me the diffs?
> > >
> > > Thanks,
> > >
> > > Larry
> > >
> > > DBE physical address: 0010068200
> >                         ^^^^^^^^^^
> > This is a very puzzling address, I cannot see why sbmac_init_module
> > would be hitting that address.  Something is strange.  Possibly
> > there is something very recently broken in the branch.
> >
> > Dave, was this what you were seeing?
> >
> > It should be hitting 10064000 - 10067000, but should never get to
> > 8000 as far as I can tell.  Not in that function, which is pretty
> > simple.
> >
> >
> > > Data bus error, epc == ffffffff8332ab84, ra == ffffffff8332ab7c
> > > Oops[#1]:
> > > Cpu 0
> > > $ 0   : 0000000000000000 0000000014001fe0 0000000001200008
> > > 00000000000000ff $ 4   : a80000000b05bec4 0000000000000007
> > > 0000000000000034 0000000000000007 $ 8   : 0000000000000000
> > > 0000000000000008 0000000000000041 0000000000000008 $12   :
> > > a80000000b05bee7 0000000000000010 0000000000000000
> > > ffffffff83248910 $16   : 000000000000000f a80000000b05beda
> > > a80000000b05beca 9000000010068208 $20   : 000000000000003a
> > > 000000000000002d 0000000000000005 a80000000b05beca $24   :
> > > 0000000000000000 0000000000000030 $28   : a80000000b058000
> > > a80000000b05beb0 a80000000b05bec4 ffffffff8332ab7c Hi    :
> > > 000000000000000f Lo    : 0000000000000000 epc   : ffffffff8332ab84
> > > sbmac_init_module+0x26c/0x5d0     Not tainted ra    :
> > > ffffffff8332ab7c sbmac_init_module+0x264/0x5d0 Status: 14001fe3
> > > KX SX UX KERNEL EXL IE Cause : 0080801c
> > > PrId  : 01041100
> > > Modules linked in:
> > > Process swapper (pid: 1, threadinfo=a80000000b058000,
> > > task=a80000000b057870) Stack : 0000000014001fe1 ffffffff83304d60
> > > 0734070083307810 3a37303a3030ffff 46463a37303a3433
> > > 463a323846464646 0034394646464646 ffffffff832b0000
> > > 0000000000000000 ffffffff833334f0 0000000000000000
> > > fffffffffffffffe 0000000000000000 ffffffff832b0000
> > > ffffffff83330000 ffffffff83330000 ffffffff83330000
> > > ffffffff833147d8 0000000000000000 0000000000000000
> > > 0000000000000000 0000000000000000 0000000000000000
> > > 0000000000000000 0000000000000000 0000000000000000
> > > 0000000000000000 0000000000000000 0000000000000000
> > > 0000000000000000 0000000000000000 0000000000000000
> > > 0000000000000000 0000000000000000 0000000000000000
> > > ffffffff83004b90 0000000000000000 ffffffff83004b80
> > > 5a5a5a5a5a5a5a5a 5a5a5a5a5a5a5a5a ... Call Trace:
> > > [<ffffffff8332ab84>] sbmac_init_module+0x26c/0x5d0
> > > [<ffffffff833147d8>] kernel_init+0x1d0/0x3f8 [<ffffffff83004b90>]
> > > kernel_thread_helper+0x10/0x18
> > >
> > >
> > > Code: 26d60001  fe620000  de620000 <dfa20030> 16c2ff89  66731000
> > > 3c028338  3c038338  0000a82d Can't open proc stat
> > > Crashdump not saved, prom device open error
> > > primary crash already saved... crash #2 (Attempted to kill init!)
> > > will be ignored Kernel panic - not syncing: Attempted to kill
> > > init! Rebooting in 5 seconds..
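
The ASCII in the Stack listing above can be decoded mechanically.  The
sketch below is one way to do it, assuming the kernel is little-endian
so that the least significant byte of each 64-bit word sits at the
lowest address.  That byte order is an assumption, and
stack_words_to_ascii() is a hypothetical helper, not anything in the
tree.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Unpack 64-bit stack words into bytes, least significant byte first
 * (little-endian assumption), writing '.' for anything that is not
 * printable ASCII.  out must hold at least 8 * n + 1 bytes.
 */
static void stack_words_to_ascii(const uint64_t *w, size_t n, char *out)
{
    size_t i, k = 0;
    int b;

    for (i = 0; i < n; i++) {
        for (b = 0; b < 8; b++) {
            unsigned char c = (unsigned char)(w[i] >> (8 * b));
            out[k++] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
        }
    }
    out[k] = '\0';
}
```

Fed the four words starting at 3a37303a3030ffff, this gives
"..00:07:34:07:FFFFFF82:FFFFFF94." under that assumption.  The expected
"00:07:34:" prefix shows up intact, and the last two fields are eight
hex digits each, so the string runs well past the 18 bytes reserved for
ons_mac_str[].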
