AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:<20090708105159.59e8ed40@ripper.onstor.net>
CFG:
PT:0
S:andy.sharp@onstor.com
RQ:
SSV:mail.onstor.net
NSV:
SSH:
R:<Yogesh.Sawant@onstor.com>
MAID:1
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#imap/andys@onstor.net@exch1.onstor.net/INBOX	0	2779531E7C760D4491C96305019FEEB52AD1AFD62E@exch1.onstor.net
X-Sylpheed-End-Special-Headers: 1
Date: Wed, 8 Jul 2009 10:52:14 -0700
From: Andrew Sharp <andy.sharp@onstor.com>
To: Yogesh Sawant <Yogesh.Sawant@onstor.com>
Subject: Re: what is kswapd0 on Cougar? It takes too much cpu and slows down
 ssc on latest Cougar dev build (07/06/09)
Message-ID: <20090708105214.2a355825@ripper.onstor.net>
In-Reply-To: <2779531E7C760D4491C96305019FEEB52AD1AFD62E@exch1.onstor.net>
References: <102AB4F33EBBDB4C91915B145C8E9FB31377A82E02@exch1.onstor.net>
	<102AB4F33EBBDB4C91915B145C8E9FB31377A82E04@exch1.onstor.net>
	<20090707195838.517273c1@ripper.onstor.net>
	<2779531E7C760D4491C96305019FEEB52AD1AFD62D@exch1.onstor.net>
	<2779531E7C760D4491C96305019FEEB52AD1AFD62E@exch1.onstor.net>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Can you just copy the file to your home directory somewhere?  Thanks.

On Wed, 8 Jul 2009 06:13:46 -0700 Yogesh Sawant
<Yogesh.Sawant@onstor.com> wrote:

> Hi Andy,
> 
> I couldn't send the text file for some unknown reason (I'm using
> Konqueror on ubuntu and it does not like ms outlook).
> 
> The text file can be found here:  g5r204:/tmp/top_output.txt
> I generated it by running
> g5r204:/root/ysawant/capture_top_output.bash  while "vol show" was in
> progress.
> 
> Thanks,
> Yogesh Sawant
> 
> ________________________________________
> From: Yogesh Sawant
> Sent: Wednesday, July 08, 2009 6:31 PM
> To: Andy Sharp; Sandrine Boulanger
> Cc: Maxim Kozlovsky; Dilip Jha; Sandeep Chavan
> Subject: RE: what is kswapd0 on Cougar? It takes too much cpu and
> slows down ssc on latest Cougar dev build (07/06/09)
> 
> Hi Andy,
> 
> I ran "vol show" and captured output of "top -b -n 1" at intervals of
> 5 seconds, please see attached text file.
> 
> g5r204 diag> vol show
> Operation failed. Timeout.
> g5r204 diag>
> 
> I see these in the log:
> 
> Jul  8 05:47:43 g5r204 : 0:0:nfxsh:NOTICE: cmd[1]: vol show : status[0]
> Jul  8 05:47:43 g5r204 : 0:0:tape-driver:ERROR: tape_rpc: tape_sess_lookup app sdm cpu 0 slot 0 failed
> Jul  8 05:47:44 g5r204 : 0:0:cluster2:ERROR: ClusterCtrl_GetDbVer: ubik_call failed, code 5376, rc 30
> Jul  8 05:47:53 g5r204 last message repeated 11 times
> Jul  8 05:47:53 g5r204 : 0:0:tape-driver:ERROR: tape_rpc: tape_sess_lookup app sdm cpu 0 slot 0 failed
> Jul  8 05:47:54 g5r204 : 0:0:cluster2:ERROR: ClusterCtrl_GetDbVer: ubik_call failed, code 5376, rc 30
> Jul  8 05:48:03 g5r204 last message repeated 11 times
> Jul  8 05:48:04 g5r204 : 0:0:tape-driver:ERROR: tape_rpc: tape_sess_lookup app sdm cpu 0 slot 0 failed
> Jul  8 05:48:04 g5r204 : 0:0:cluster2:ERROR: ClusterCtrl_GetDbVer: ubik_call failed, code 5376, rc 30
> Jul  8 05:48:07 g5r204 last message repeated 3 times
> 
> 
> Another attempt:
> 
> g5r204 diag> lun show disk -t free
> Failed to get response from SPM
> PCC node unknown. Cannot process message
> % Command failure.
> g5r204 diag>
> 
> Jul  8 05:55:13 g5r204 : 0:0:nfxsh:NOTICE: cmd[2]: lun show disk -t free : status[11]
> Jul  8 05:55:14 g5r204 : 0:0:cluster2:ERROR: ClusterCtrl_GetDbVer: ubik_call failed, code 5376, rc 30
> Jul  8 05:55:20 g5r204 last message repeated 7 times
> Jul  8 05:55:20 g5r204 : 0:0:ea:ERROR: ea_evmGetVolList[646]: Failed to get volume list, rc[8]
> Jul  8 05:55:20 g5r204 : 0:0:ea:INFO: Error string[Operation failed. Timeout.] len[26]
> Jul  8 05:55:21 g5r204 : 0:0:cluster2:ERROR: ClusterCtrl_GetDbVer: ubik_call failed, code 5376, rc 30
> Jul  8 05:55:22 g5r204 : 0:0:cluster2:ERROR: ClusterCtrl_GetDbVer: ubik_call failed, code 5376, rc 30
> 
> Thanks,
> Yogesh Sawant
> 
> ________________________________________
> From: Andy Sharp
> Sent: Wednesday, July 08, 2009 8:28 AM
> To: Sandrine Boulanger
> Cc: Maxim Kozlovsky; Yogesh Sawant; Dilip Jha; Sandeep Chavan
> Subject: Re: what is kswapd0 on Cougar? It takes too much cpu and
> slows down ssc on latest Cougar dev build (07/06/09)
> 
> Yup, it's out of memory.  Let me guess, you are testing a system with
> 5-million lun/path combinations?  Just kidding.  If you can log in,
> try to capture the output of top -b -n 1 and then we can figure out
> what process(es) is(are) sucking the life out of the thing.
> 
> Cheers,
> 
> a
> 
> 
> On Tue, 7 Jul 2009 18:57:09 -0700 Sandrine Boulanger
> <sandrine.boulanger@onstor.com> wrote:
> 
> > I see those too on the console:
> >
> > Out of memory: kill process 971 (pm) score 1062 or a child
> > Killed process 983 (ncmd)
> >
> > Out of memory: kill process 804 (exim4) score 2819 or a child
> > Killed process 7699 (exim4)
> >
> > I'm starting to wonder if we should put back a stable 4.0.2.x build
> > on those systems to be able to use them for the test automation
> > development...
> >
> > _____________________________________________
> > From: Sandrine Boulanger
> > Sent: Tuesday, July 07, 2009 6:41 PM
> > To: Sandrine Boulanger; Andy Sharp
> > Cc: Jonathan Goldick; Maxim Kozlovsky; Yogesh Sawant; Dilip Jha;
> > Sandeep Chavan
> > Subject: RE: what is kswapd0 on Cougar? It takes too much cpu and
> > slows down ssc on latest Cougar dev build (07/06/09)
> >
> > Well, it can't be HW; now g9r204 is showing this too. I have no
> > idea how to recover from this other than a power cycle, and we'll
> > eventually end up there again. What's happening?
> >
> > SiByte User Watchdog in danger of initiating system reset in 4.1 seconds
> > SiByte User Watchdog in danger of initiating system reset in 4.1 seconds
> > SiByte User Watchdog in danger of initiating system reset in 4.1 seconds
> > SiByte User Watchdog in danger of initiating system reset in 4.1 seconds
> > SiByte User Watchdog in danger of initiating system reset in 4.1 seconds
> > SiByte User Watchdog in danger of initiating system reset in 4.1 seconds
> > SiByte User Watchdog in danger of initiating system reset in 4.1 seconds
> > SiByte User Watchdog in danger of initiating system reset in 4.1 seconds
> >
> > _____________________________________________
> > From: Sandrine Boulanger
> > Sent: Tuesday, July 07, 2009 6:22 PM
> > To: Andy Sharp
> > Cc: Jonathan Goldick; Maxim Kozlovsky; Yogesh Sawant; Dilip Jha;
> > Sandeep Chavan
> > Subject: what is kswapd0 on Cougar? It takes too much cpu and
> > slows down ssc on latest Cougar dev build (07/06/09)
> >
> > top - 18:14:33 up  5:52,  1 user,  load average: 10.41, 7.52, 5.75
> > Tasks:  70 total,   2 running,  68 sleeping,   0 stopped,   0 zombie
> > Cpu(s):  1.6%us, 10.6%sy,  0.0%ni,  0.0%id,  7.2%wa, 79.4%hi,  1.2%si,  0.0%st
> > Mem:    466460k total,   459428k used,     7032k free,      164k buffers
> > Swap:    30232k total,    30232k used,        0k free,     5392k cached
> >
> >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >    50 root      10  -5     0    0    0 D 21.5  0.0   1:11.95 kswapd0
> >   984 root      15   0 16780  444  184 S  7.5  0.1   0:38.44 pm
> >  7665 root      18   0 22188 1524  852 D  5.9  0.3   0:01.50 nfxsh
> > 21206 root      10  -5 18680 1460  376 S  5.9  0.3   3:50.38 cluster_contrl
> >  7668 root      18   0 22188 1544  868 R  5.6  0.3   0:01.46 nfxsh
> > 21205 root      10  -5 18680 1352  284 S  5.6  0.3   0:16.33 cluster_contrl
> >
> > I reconfigured the cluster g9r204/g5r204 again because I kept
> > getting cluster errors with any build. The cluster now seems stable,
> > but executing anything on the SSC is super slow.
> >
> > Brian, are you aware of g5r204 having HW issues? It is stuck with
> > "SiByte User Watchdog in danger of initiating system reset in 8.2
> > seconds" messages on the console, no way to interrupt and access the
> > prompt.
> >
> > The ssc consoles are 10.2.203.235 9039 for g8r204 and 9041 for
> > g5r204.
> >
