Date: Wed, 6 Jan 2010 13:48:14 -0800
From: Andrew Sharp <andy.sharp@lsi.com>
To: "Keiffer, John" <John.Keiffer@lsi.com>
Cc: "Limato, Dave" <Dave.Limato@lsi.com>, "Jin, Danqing"
 <Danqing.Jin@lsi.com>, "Vandever, Chris" <Chris.Vandever@lsi.com>
Subject: Re: What's causing my System Resources to hang?
Message-ID: <20100106134814.0dfe1da8@ripper.onstor.net>
In-Reply-To: <85A1D09038E3C1438820EF7A7FAFDD3001080D2BFF@cosmail01.lsi.com>
References: <85A1D09038E3C1438820EF7A7FAFDD3001080D2B44@cosmail01.lsi.com>
	<D7A889C980962746B30DE07864593C02CCBF3BC4@cosmail02.lsi.com>
	<20100106121312.6f0ae557@ripper.onstor.net>
	<85A1D09038E3C1438820EF7A7FAFDD3001080D2BFF@cosmail01.lsi.com>
Organization: LSI
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

cluster_server is already the biggest memory hog, so it may be part of
the problem, but evm_* has historically been a troublesome child, so
perhaps it's causing all the problems.

A thrashed system is often caused by one bug: one daemon uses all the
cpu [bogusly] requesting stuff from another daemon, causing that other
daemon to use all the memory, and boink.  One daemon can do all the
damage by itself as well, but almost always both cpu and memory have to
be hammered at the same time to cause the thrashing.
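A quick way to eyeball both suspects at once is to sort a batch-mode
snapshot by %CPU and by resident memory.  This is just a sketch
(assuming stock procps `top` output; the sample rows here are stand-ins
for a real `top -b -n 1` capture on the box):

```shell
# snapshot.txt stands in for captured `top -b -n 1` process lines.
# In top's default layout: field 6 is RES (resident KB), field 9 is %CPU.
snapshot="$(cat <<'EOF'
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1138 root      25   0 18672 3448 1780 R 98.7  0.8  18:18.68 evm_cfgd
 1029 root      10  -5 21000 5092 1736 S  0.0  1.2   0:01.55 cluster_server
    1 root      15   0  3152  712  612 S  0.0  0.2   0:06.05 init
EOF
)"

# Drop the header row, then sort numerically on the field of interest.
echo "$snapshot" | awk 'NR>1' | sort -k9 -rn | head -n 1   # busiest CPU
echo "$snapshot" | awk 'NR>1' | sort -k6 -rn | head -n 1   # biggest RSS
```

On a live system you'd pipe `top -b -n 1` straight into the same awk/sort
instead of the here-doc; anything with a stupid big RES relative to total
memory is your memory suspect, and the top %CPU line is the spinner.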

On Wed, 6 Jan 2010 13:23:36 -0700 "Keiffer, John"
<John.Keiffer@lsi.com> wrote:

> Well at that time I was also getting SyBite error messages and
> nothing was really working. It was actually hard to reboot at that
> time.
> 
> Today Danqing is helping me look into issues with vsvr's. We just
> rebooted the same system, and now top is not looking good as far as
> CPU usage, and it appears the culprit is evm_cfgd. This may or may
> not be similar to the end result I had before.
> 
> g7r62:~# top -b -n 1
> top - 12:22:27 up 20 min,  2 users,  load average: 1.26, 1.22, 0.93
> Tasks:  68 total,   3 running,  65 sleeping,   0 stopped,   0 zombie
> Cpu(s): 49.4%us, 46.4%sy,  0.0%ni,  0.8%id,  0.6%wa,  2.7%hi,  0.1%si,  0.0%st
> Mem:    433548k total,    88788k used,   344760k free,     4576k buffers
> Swap:    30232k total,        0k used,    30232k free,    37128k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  1138 root      25   0 18672 3448 1780 R 98.7  0.8  18:18.68 evm_cfgd
>     1 root      15   0  3152  712  612 S  0.0  0.2   0:06.05 init
>     2 root      11  -5     0    0    0 S  0.0  0.0   0:00.00 kthreadd
>     3 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
>     4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
>     5 root      10  -5     0    0    0 S  0.0  0.0   0:00.01 events/0
>     6 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 khelper
>    24 root      10  -5     0    0    0 S  0.0  0.0   0:00.42 kblockd/0
>    25 root      10  -5     0    0    0 S  0.0  0.0   0:00.45 ata/0
>    26 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 ata_aux
>    48 root      25   0     0    0    0 S  0.0  0.0   0:00.00 pdflush
>    49 root      15   0     0    0    0 S  0.0  0.0   0:00.00 pdflush
>    50 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 kswapd0
>    51 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 aio/0
>    66 root      11  -5     0    0    0 S  0.0  0.0   0:00.07 pccardd
>    68 root      11  -5     0    0    0 S  0.0  0.0   0:00.07 pccardd
>    71 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 scsi_eh_0
>    95 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 scsi_eh_1
>   105 root      10  -5     0    0    0 S  0.0  0.0   0:00.09 kjournald
>   186 root      15  -4  4440  632  376 S  0.0  0.1   0:00.10 udevd
>   543 root      11  -5     0    0    0 S  0.0  0.0   0:00.90 kjournald
>   544 root      10  -5     0    0    0 S  0.0  0.0   0:00.25 kjournald
>   674 daemon    15   0  2564  516  404 S  0.0  0.1   0:00.70 portmap
>   757 root      15   0  2312  668  564 S  0.0  0.2   0:00.31 syslogd
>   763 root      15   0  2000  404  328 S  0.0  0.1   0:00.05 klogd
>   842 root      15   0 21408 1592  996 S  0.0  0.4   0:00.00 sshd
>   860 statd     24   0  2908  800  696 S  0.0  0.2   0:00.00 rpc.statd
>   893 ntp       15   0  6768 1348 1056 S  0.0  0.3   0:00.09 ntpd
>   905 daemon    21   0  3804  444  324 S  0.0  0.1   0:00.00 atd
>   912 root      15   0  4828  840  668 S  0.0  0.2   0:00.00 cron
>   924 root      19   0  4896 1528 1244 S  0.0  0.4   0:00.55 cfmond.sh
>   997 root      15   0 17180 1688 1088 S  0.0  0.4   0:01.44 pm
>  1003 root      15   0 16500 1940 1464 S  0.0  0.4   0:00.30 elog
>  1013 root      15   0 18108 3020 1564 S  0.0  0.7   0:01.05 ncmd
>  1018 root      15   0 16916 2276 1412 S  0.0  0.5   0:00.51 eventd
>  1019 root      15   0 17128 2008 1524 R  0.0  0.5   0:00.32 timekeeper
>  1023 root      15   0  4212  716  592 S  0.0  0.2   0:00.19 chassisd
>  1029 root      10  -5 21000 5092 1736 S  0.0  1.2   0:01.55 cluster_server
>  1036 root      15   0  1996  544  476 S  0.0  0.1   0:00.00 getty
>  1045 root      10  -5 19020 2916 1812 S  0.0  0.7   0:00.37 cluster_contrl
>  1046 root      10  -5 19084 2264 1096 S  0.0  0.5   0:00.67 cluster_contrl
>  1091 root      15   0 17524 2864 1616 S  0.0  0.7   0:00.57 sdm_cfgd
>  1139 root      15   0 18436 3564 1936 S  0.0  0.8   0:00.37 ea
>  1145 root      15   0 17056 2080 1388 S  0.0  0.5   0:00.13 spm
>  1146 root      15   0 17844 2480 1212 S  0.0  0.6   0:00.05 ipmd
>  1150 root      15   0 17044 1796 1332 S  0.0  0.4   0:00.23 tape-driver
>  1151 root      15   0 17848 2096 1552 S  0.0  0.5   0:00.08 ndmp_cfgd
>  1158 root      15   0 17164 2356 1804 S  0.0  0.5   0:00.17 auth-agent
>  1178 root      15   0 19484 3644 2052 S  0.0  0.8   0:00.24 vsd
>  1219 root      15   0 16928 1928 1388 S  0.0  0.4   0:00.16 vtmd
>  1232 root      15   0 17728 2336 1660 S  0.0  0.5   0:00.24 sanmd
>  1239 root      15   0 18168 3636 2604 S  0.0  0.8   0:00.39 cifsd
>  1240 root      15   0 17164 1168  632 S  0.0  0.3   0:00.17 auth-agent
>  1241 root      15   0 18168 3636 2604 S  0.0  0.8   0:00.23 cifsd
>  1243 root      15   0 17164 1176  628 S  0.0  0.3   0:00.02 auth-agent
>  1267 root      15   0 18132 1928 1416 S  0.0  0.4   0:00.17 cluster_relay
>  1268 root      15   0 19776 3532 2252 S  0.0  0.8   0:00.34 snmpd
>  1275 root      15   0 16616 2056 1564 S  0.0  0.5   0:00.19 asd
>  1565 Debian-e  20   0 20324 1456  884 S  0.0  0.3   0:00.00 exim4
>  1572 root      15   0 17492 2496 1844 S  0.0  0.6   0:00.10 sscccc
>  1576 root      17   0 16296 1560 1060 S  0.0  0.4   0:00.06 crashsaved
>  2294 root      16   0 24128 3444 2796 S  0.0  0.8   0:00.35 sshd
>  2314 root      15   0  4992 1780 1384 S  0.0  0.4   0:00.07 bash
>  2331 root      15   0 23208 4044 2808 S  0.0  0.9   0:00.16 nfxsh
>  2818 root      16   0 24128 3436 2772 S  0.0  0.8   0:00.74 sshd
>  2822 root      15   0  4992 1780 1384 S  0.0  0.4   0:00.03 bash
>  5709 root      19   0  1924  404  344 S  0.0  0.1   0:00.00 sleep
>  5710 root      15   0  4448 1076  840 R  0.0  0.2   0:00.02 top
> 
> -----Original Message-----
> From: Andrew Sharp [mailto:andy.sharp@lsi.com] 
> Sent: Wednesday, January 06, 2010 12:13 PM
> To: Limato, Dave
> Cc: Keiffer, John
> Subject: Re: What's causing my System Resources to hang?
> 
> Always run 
> 
> top -b -n 1
> 
> so as to get a full list.  But your system is out of
> memory, so the basic idea is to try and see if there is one process
> that's swollen in size to some stupid big number, like more than 100MB
> of resident memory.
> 
> 
> On Wed, 6 Jan 2010 11:22:22 -0700 "Limato, Dave"
> <Dave.Limato@lsi.com> wrote:
> 
> > More specifically, he tries to disable a vsvr and the command times
> > out. He is consistently getting in this state when running the
> > automated tests.
> > 
> > From: Keiffer, John
> > Sent: Wednesday, January 06, 2010 10:17 AM
> > To: Sharp, Andy
> > Cc: Limato, Dave
> > Subject: What's causing my System Resources to hang?
> > 
> > System resources full?...
> > 
> > top - 15:27:00 up 11 min,  2 users,  load average: 17.11, 15.34, 7.46
> > Tasks:  76 total,   6 running,  70 sleeping,   0 stopped,   0 zombie
> > Cpu(s):  1.0%us,  2.0%sy,  0.0%ni,  0.0%id, 26.5%wa, 69.4%hi,  1.0%si,  0.0%st
> > Mem:    433548k total,   427080k used,     6468k free,      196k buffers
> > Swap:    30232k total,    29580k used,      652k free,     5076k cached
> > 
> >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >  1223 root      15   0 19528  352  272 R  4.9  0.1   0:07.93 vsd
> >  1070 root      15   0 17444  320  204 S  4.8  0.1   0:09.50 pm
> >  1202 root      15   0 17108  320  268 R  2.9  0.1   0:07.21 tape-driver
> >  1249 root      15   0 19840  336  252 S  2.7  0.1   0:05.99 snmpd
> >  1250 root      15   0 16616  216  176 S  2.7  0.0   0:06.78 asd
> >    50 root      10  -5     0    0    0 D  2.5  0.0   1:21.00 kswapd0
> >  2492 root      15   0 16500  400  360 S  2.5  0.1   0:01.90 elog
> >  1166 root      15   0 18500  360  268 S  2.4  0.1   0:06.70 ea
> >  1203 root      15   0 17912  216  172 S  2.2  0.0   0:05.91 ndmp_cfgd
> >  1236 root      15   0 17228  268  208 S  2.2  0.1   0:04.88 auth-agent
> >  1803 root      15   0 23652  268  184 S  2.2  0.1   0:04.79 nfxsh
> >  2513 root      15   0  4512  840  688 R  2.1  0.2   0:01.16 top
> >  1232 root      15   0 18364  212  116 S  1.9  0.0   0:07.18 sanmd
> >  2517 root      18   0  7000  224  168 D  1.9  0.1   0:00.21 ncmd
> >  1108 root      15   0  4212  100   68 S  1.7  0.0   0:03.28 chassisd
> >  1420 root      15   0 17556  232  180 S  1.7  0.1   0:06.42 sscccc
> >  1167 root      15   0 17760  200  120 S  1.6  0.0   0:06.54 spm
