Date: Fri, 22 May 2009 16:24:01 -0700
From: Andrew Sharp <andy.sharp@onstor.com>
To: Bill Fisher <Bill.Fisher@onstor.com>
Cc: Brian Stark <brian.stark@onstor.com>
Subject: Re: Perhaps you know this one
Message-ID: <20090522162401.40188cdf@ripper.onstor.net>
In-Reply-To: <4A173250.5010009@onstor.com>
References: <102AB4F33EBBDB4C91915B145C8E9FB312972FBABB@exch1.onstor.net>
	<4A173250.5010009@onstor.com>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Bill,

Just to reiterate, so we're clear: unless Brian or I say otherwise,
your priority remains tuxrx.  Please don't spend any time on this
instead of working on tuxrx unless told to do so.

Thanks,

a

On Fri, 22 May 2009 16:16:32 -0700 Bill Fisher <Bill.Fisher@onstor.com>
wrote:

> Jonathan Goldick wrote:
> > Do you know of any way to determine if eee_poll is still being
> > called by a bobcat FP core? 
> 
> I would look at the various counters to see if they are being
> incremented. I presume this will require the debugger, since I don't
> recall an explicit procedure to do that. Since the polling buckets
> are compiled out, that eliminates that case.
> >
> > I do not mean whether dont_poll is TRUE, as in "eee poll off" but
> > whether the base routine eee_poll is still properly set up to be
> > called by the 1280 core.  We have a reproducible bug at Shopzilla
> > where the polling loops effectively stop on all cores, with no
> > watchdog crash.  Also, rcon still works.
> >
>  >
> Since the watchdogs are being serviced and rcon is working, that
> immediately implies that the outer-loop cases are still being
> called: the periodic calls and the watchdog checking.
> 
> 
> The registered "polling" functions are stored in a table via the
> registerPollXX() procedure on the FP. What does the table dump show?
> I presume it shows that everything was registered properly.
> The next question is under what conditions it never gets back into
> the major function-calling cases after it has called the
> do_periodic_XXX() procedure or, as you noted, the checking/processing
> of the timers.
> 
> The reason rcon is still working is that it gets called once every
> so many poll iterations, depending on the load on the machine.
> Hence it's hard-wired into the basement of the eee_poll() procedure.
> 
> >
> > I believe I know the last routine to be called by the polling
> > routine, evm_io_clockTic via eee_processTimers,  but have yet to
> > determine any way it can get stuck without a watchdog.
> >
> Good question. From my reading of the code so far, I don't
> have an answer right now.
> >
> > Another possibility is that it  somehow steps on memory such that
> > eee_poll is no longer called by the system.
> >
> That would imply the poll function table is trashed. Isn't
> there an rcon shell command to dump the registered polling functions?
> >
> > I know of no way to get a traceback of code run in the context of
> > eee_poll but if you have any ideas on that I would appreciate it.
> > I have basically worked out evm_io_clockTic from indirect time
> > stamps rather than a thread stack.
> >
> 
> I will have to look at the code to get a better handle on how this
> might ever happen. What I described above came from "converting" the
> eee_poll() code to our Linux thread.
> 
> Later,
> 
> 
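For reference, the poll-loop structure Bill describes in the thread above (functions registered in a table via registerPollXX(), timer processing along the lines of eee_processTimers, and rcon serviced once every so many iterations) can be pictured with a minimal C sketch. Everything in it is hypothetical: register_poll(), poll_once(), process_timers(), rcon_service(), RCON_EVERY_N and the counter names are illustrative stand-ins, not the actual FP code, and the fixed rcon interval is a simplification of the load-dependent behavior Bill mentions.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical sketch of an FP-style poll loop. None of these names come
 * from the real eee_poll() source; they only model the thread's description.
 */
typedef void (*poll_fn_t)(void);

#define MAX_POLL_FNS 16
#define RCON_EVERY_N 64UL /* assumed fixed; the real interval depends on load */

struct poll_entry {
    const char *name;
    poll_fn_t   fn;
};

/* Table filled by register_poll(), a stand-in for registerPollXX(). */
static struct poll_entry poll_table[MAX_POLL_FNS];
static size_t poll_count;

/* Progress counters: sampling these twice from a debugger shows whether
 * the loop is still cycling. */
static volatile unsigned long poll_iterations;
static volatile unsigned long timer_passes;
static volatile unsigned long rcon_calls;

/* Returns 0 on success, -1 if fn is NULL or the table is full. */
int register_poll(const char *name, poll_fn_t fn)
{
    if (fn == NULL || poll_count >= MAX_POLL_FNS)
        return -1;
    poll_table[poll_count].name = name;
    poll_table[poll_count].fn = fn;
    poll_count++;
    return 0;
}

/* Roughly what a "dump registered pollers" command might print. */
void dump_poll_table(void)
{
    for (size_t i = 0; i < poll_count; i++)
        printf("poll[%zu]: %s\n", i,
               poll_table[i].name ? poll_table[i].name : "(unnamed)");
}

/* Stand-in for eee_processTimers() and the clock-tick work it drives. */
static void process_timers(void)
{
    timer_passes++;
}

/* Stand-in for the rcon servicing hard-wired into the loop. */
static void rcon_service(void)
{
    rcon_calls++;
}

/* One pass: registered pollers, then timers, then rcon every N passes.
 * If any registered function never returns, all three stop together. */
void poll_once(void)
{
    for (size_t i = 0; i < poll_count; i++)
        poll_table[i].fn();
    process_timers();
    poll_iterations++;
    if (poll_iterations % RCON_EVERY_N == 0)
        rcon_service();
}
```

Calling poll_once() 128 times with an empty table advances poll_iterations and timer_passes to 128 and rcon_calls to 2. Because rcon lives inside the same loop in this model, a working rcon implies the loop is still cycling, which matches Bill's reasoning about the outer-loop cases.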
