AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:<20090708155522.566d29ec@ripper.onstor.net>
CFG:
PT:0
S:andy.sharp@onstor.com
RQ:
SSV:mail.onstor.net
NSV:
SSH:
R:<sandrine.boulanger@onstor.com>
MAID:1
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#imap/andys@onstor.net@exch1.onstor.net/INBOX	0	102AB4F33EBBDB4C91915B145C8E9FB31377A82E1A@exch1.onstor.net
X-Sylpheed-End-Special-Headers: 1
Date: Wed, 8 Jul 2009 15:55:29 -0700
From: Andrew Sharp <andy.sharp@onstor.com>
To: Sandrine Boulanger <sandrine.boulanger@onstor.com>
Subject: Re: what is kswapd0 on Cougar? It takes too much cpu and slows down
 ssc on latest Cougar dev build (07/06/09)
Message-ID: <20090708155529.7976d8b1@ripper.onstor.net>
In-Reply-To: <102AB4F33EBBDB4C91915B145C8E9FB31377A82E1A@exch1.onstor.net>
References: <20090708145842.3de06c1e@ripper.onstor.net>
	<102AB4F33EBBDB4C91915B145C8E9FB31377A82E1A@exch1.onstor.net>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Unfortunately, that run of top is on a healthy system.  Can I come take
a look at it?  I'll take the long trek to your office ~:^)
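
In case it helps tell a healthy capture from a starved one at a glance,
here is a minimal sketch that reads the Mem: and Swap: summary lines
from "top -b -n 1" output. The function names and the 5% free-memory
cutoff are my own assumptions for illustration, not anything from the
build, and it assumes the older procps layout with "k" suffixes seen in
this thread.

```python
import re

# Matches the "Mem:" and "Swap:" summary lines printed by procps top in
# batch mode, e.g. "Mem:    466460k total,   459428k used,     7032k free, ..."
SUMMARY_RE = re.compile(
    r"^(Mem|Swap):\s+(\d+)k total,\s+(\d+)k used,\s+(\d+)k free", re.MULTILINE
)


def memory_summary(top_text):
    """Return {'Mem': (total, used, free), 'Swap': (total, used, free)} in kB.

    If the text holds several snapshots, the last one wins.
    """
    return {
        m.group(1): tuple(int(m.group(i)) for i in (2, 3, 4))
        for m in SUMMARY_RE.finditer(top_text)
    }


def looks_starved(top_text, min_free_fraction=0.05):
    """Heuristic: swap is fully used AND free RAM is below the cutoff.

    Low free RAM alone is normal on Linux (page cache), so both conditions
    are required. min_free_fraction is an arbitrary illustrative value.
    """
    summary = memory_summary(top_text)
    if "Mem" not in summary or "Swap" not in summary:
        return False
    mem_total, _, mem_free = summary["Mem"]
    swap_total, _, swap_free = summary["Swap"]
    swap_exhausted = swap_total > 0 and swap_free == 0
    return swap_exhausted and mem_free < min_free_fraction * mem_total


if __name__ == "__main__":
    # Figures taken from the top output quoted further down this thread.
    sample = (
        "Mem:    466460k total,   459428k used,     7032k free,      164k buffers\n"
        "Swap:    30232k total,    30232k used,        0k free,     5392k cached\n"
    )
    print(memory_summary(sample))
    print(looks_starved(sample))
```

To check a saved capture, read the file and pass its text to
looks_starved(). This is only a first-pass filter; it does not say
which process is eating the memory.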

On Wed, 8 Jul 2009 15:11:30 -0700 Sandrine Boulanger
<sandrine.boulanger@onstor.com> wrote:

> It was in the path below. I copied it
> to /homes/sandrineb/traces/top-output.txt
> 
> -----Original Message-----
> From: Andy Sharp 
> Sent: Wednesday, July 08, 2009 2:59 PM
> To: Sandrine Boulanger
> Subject: Re: what is kswapd0 on Cougar? It takes too much cpu and
> slows down ssc on latest Cougar dev build (07/06/09)
> 
> Not yet.  He pointed me to a file I couldn't get to so I'm waiting for
> that.
> 
> On Wed, 8 Jul 2009 14:26:26 -0700 Sandrine Boulanger
> <sandrine.boulanger@onstor.com> wrote:
> 
> > Andy, did you get a chance to review this? This cluster is useless
> > right now.
> > 
> > -----Original Message-----
> > From: Yogesh Sawant 
> > Sent: Wednesday, July 08, 2009 6:14 AM
> > To: Yogesh Sawant; Andy Sharp; Sandrine Boulanger
> > Cc: Maxim Kozlovsky; Dilip Jha; Sandeep Chavan
> > Subject: RE: what is kswapd0 on Cougar? It takes too much cpu and
> > slows down ssc on latest Cougar dev build (07/06/09)
> > 
> > Hi Andy,
> > 
> > I couldn't send the text file for some unknown reason (I'm using
> > Konqueror on Ubuntu and it does not like MS Outlook).
> > 
> > The text file can be found here:  g5r204:/tmp/top_output.txt
> > I generated it by running
> > g5r204:/root/ysawant/capture_top_output.bash  while "vol show" was
> > in progress.
> > 
> > Thanks,
> > Yogesh Sawant
> > 
> > ________________________________________
> > From: Yogesh Sawant
> > Sent: Wednesday, July 08, 2009 6:31 PM
> > To: Andy Sharp; Sandrine Boulanger
> > Cc: Maxim Kozlovsky; Dilip Jha; Sandeep Chavan
> > Subject: RE: what is kswapd0 on Cougar? It takes too much cpu and
> > slows down ssc on latest Cougar dev build (07/06/09)
> > 
> > Hi Andy,
> > 
> > I ran "vol show" and captured the output of "top -b -n 1" at
> > 5-second intervals; please see the attached text file.
> > 
> > g5r204 diag> vol show
> > Operation failed. Timeout.
> > g5r204 diag>
> > 
> > I see these in the log:
> > 
> > Jul  8 05:47:43 g5r204 : 0:0:nfxsh:NOTICE: cmd[1]: vol show : status[0]
> > Jul  8 05:47:43 g5r204 : 0:0:tape-driver:ERROR: tape_rpc: tape_sess_lookup app sdm cpu 0 slot 0 failed
> > Jul  8 05:47:44 g5r204 : 0:0:cluster2:ERROR: ClusterCtrl_GetDbVer: ubik_call failed, code 5376, rc 30
> > Jul  8 05:47:53 g5r204 last message repeated 11 times
> > Jul  8 05:47:53 g5r204 : 0:0:tape-driver:ERROR: tape_rpc: tape_sess_lookup app sdm cpu 0 slot 0 failed
> > Jul  8 05:47:54 g5r204 : 0:0:cluster2:ERROR: ClusterCtrl_GetDbVer: ubik_call failed, code 5376, rc 30
> > Jul  8 05:48:03 g5r204 last message repeated 11 times
> > Jul  8 05:48:04 g5r204 : 0:0:tape-driver:ERROR: tape_rpc: tape_sess_lookup app sdm cpu 0 slot 0 failed
> > Jul  8 05:48:04 g5r204 : 0:0:cluster2:ERROR: ClusterCtrl_GetDbVer: ubik_call failed, code 5376, rc 30
> > Jul  8 05:48:07 g5r204 last message repeated 3 times
> > 
> > 
> > Another attempt:
> > 
> > g5r204 diag> lun show disk -t free
> > Failed to get response from SPM
> > PCC node unknown. Cannot process message
> > % Command failure.
> > g5r204 diag>
> > 
> > Jul  8 05:55:13 g5r204 : 0:0:nfxsh:NOTICE: cmd[2]: lun show disk -t free : status[11]
> > Jul  8 05:55:14 g5r204 : 0:0:cluster2:ERROR: ClusterCtrl_GetDbVer: ubik_call failed, code 5376, rc 30
> > Jul  8 05:55:20 g5r204 last message repeated 7 times
> > Jul  8 05:55:20 g5r204 : 0:0:ea:ERROR: ea_evmGetVolList[646]: Failed to get volume list, rc[8]
> > Jul  8 05:55:20 g5r204 : 0:0:ea:INFO: Error string[Operation failed. Timeout.] len[26]
> > Jul  8 05:55:21 g5r204 : 0:0:cluster2:ERROR: ClusterCtrl_GetDbVer: ubik_call failed, code 5376, rc 30
> > Jul  8 05:55:22 g5r204 : 0:0:cluster2:ERROR: ClusterCtrl_GetDbVer: ubik_call failed, code 5376, rc 30
> > 
> > Thanks,
> > Yogesh Sawant
> > 
> > ________________________________________
> > From: Andy Sharp
> > Sent: Wednesday, July 08, 2009 8:28 AM
> > To: Sandrine Boulanger
> > Cc: Maxim Kozlovsky; Yogesh Sawant; Dilip Jha; Sandeep Chavan
> > Subject: Re: what is kswapd0 on Cougar? It takes too much cpu and
> > slows down ssc on latest Cougar dev build (07/06/09)
> > 
> > Yup, it's out of memory.  Let me guess, you are testing a system
> > with 5-million lun/path combinations?  Just kidding.  If you can
> > log in, try to capture the output of top -b -n 1 and then we can
> > figure out what process(es) is(are) sucking the life out of the
> > thing.
> > 
> > Cheers,
> > 
> > a
> > 
> > 
> > On Tue, 7 Jul 2009 18:57:09 -0700 Sandrine Boulanger
> > <sandrine.boulanger@onstor.com> wrote:
> > 
> > > I see those too on the console:
> > >
> > > Out of memory: kill process 971 (pm) score 1062 or a child
> > > Killed process 983 (ncmd)
> > >
> > > Out of memory: kill process 804 (exim4) score 2819 or a child
> > > Killed process 7699 (exim4)
> > >
> > > I'm starting to wonder if we should put back a stable 4.0.2.x
> > > build on those systems to be able to use them for the test
> > > automation development...
> > >
> > > _____________________________________________
> > > From: Sandrine Boulanger
> > > Sent: Tuesday, July 07, 2009 6:41 PM
> > > To: Sandrine Boulanger; Andy Sharp
> > > Cc: Jonathan Goldick; Maxim Kozlovsky; Yogesh Sawant; Dilip Jha;
> > > Sandeep Chavan
> > > Subject: RE: what is kswapd0 on Cougar? It takes too much cpu and
> > > slows down ssc on latest Cougar dev build (07/06/09)
> > >
> > > Well, it can't be HW; now g9r204 is showing this too. I have no
> > > idea how to recover from this other than a power cycle, and even
> > > then we'll eventually end up there again. What's happening?
> > >
> > > SiByte User Watchdog in danger of initiating system reset in 4.1 seconds
> > > SiByte User Watchdog in danger of initiating system reset in 4.1 seconds
> > > SiByte User Watchdog in danger of initiating system reset in 4.1 seconds
> > > SiByte User Watchdog in danger of initiating system reset in 4.1 seconds
> > > SiByte User Watchdog in danger of initiating system reset in 4.1 seconds
> > > SiByte User Watchdog in danger of initiating system reset in 4.1 seconds
> > > SiByte User Watchdog in danger of initiating system reset in 4.1 seconds
> > > SiByte User Watchdog in danger of initiating system reset in 4.1 seconds
> > >
> > > _____________________________________________
> > > From: Sandrine Boulanger
> > > Sent: Tuesday, July 07, 2009 6:22 PM
> > > To: Andy Sharp
> > > Cc: Jonathan Goldick; Maxim Kozlovsky; Yogesh Sawant; Dilip Jha;
> > > Sandeep Chavan
> > > Subject: what is kswapd0 on Cougar? It takes too much cpu and
> > > slows down ssc on latest Cougar dev build (07/06/09)
> > >
> > > top - 18:14:33 up  5:52,  1 user,  load average: 10.41, 7.52, 5.75
> > > Tasks:  70 total,   2 running,  68 sleeping,   0 stopped,   0 zombie
> > > Cpu(s):  1.6%us, 10.6%sy,  0.0%ni,  0.0%id,  7.2%wa, 79.4%hi,  1.2%si,  0.0%st
> > > Mem:    466460k total,   459428k used,     7032k free,      164k buffers
> > > Swap:    30232k total,    30232k used,        0k free,     5392k cached
> > >
> > >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> > >    50 root      10  -5     0    0    0 D 21.5  0.0   1:11.95 kswapd0
> > >   984 root      15   0 16780  444  184 S  7.5  0.1   0:38.44 pm
> > >  7665 root      18   0 22188 1524  852 D  5.9  0.3   0:01.50 nfxsh
> > > 21206 root      10  -5 18680 1460  376 S  5.9  0.3   3:50.38 cluster_contrl
> > >  7668 root      18   0 22188 1544  868 R  5.6  0.3   0:01.46 nfxsh
> > > 21205 root      10  -5 18680 1352  284 S  5.6  0.3   0:16.33 cluster_contrl
> > >
> > > I reconfigured the g9r204/g5r204 cluster again because I kept
> > > having cluster errors with any build. The cluster now seems
> > > stable, but executing anything on the SSC is super slow.
> > >
> > > Brian, are you aware of g5r204 having HW issues? It is stuck with
> > > "SiByte User Watchdog in danger of initiating system reset in 8.2
> > > seconds" messages on the console, no way to interrupt and access
> > > the prompt.
> > >
> > > The ssc consoles are 10.2.203.235 9039 for g8r204 and 9041 for
> > > g5r204.
> > >
