Date: Thu, 8 Oct 2009 11:26:51 -0700
From: Andrew Sharp <andy.sharp@lsi.com>
To: "Liu, Yifeng" <Yifeng.Liu@lsi.com>
Cc: "Johnson, Dave" <Dave.Johnson@lsi.com>, "Irie, Shin"
 <Shin.Irie@lsi.com>, "Sharp, Andy" <Andy.Sharp@lsi.com>
Subject: Re: running processes queue hitting huge spikes on beast...
Message-ID: <20091008112651.71e0cb38@ripper.onstor.net>
In-Reply-To: <3D228269E7866B4B85BE8E4FC5A4447A0CBCF8@cosmail02.lsi.com>
References: <C5277CB418429641BC1498607A9F480593D3B405@cosmail01.lsi.com>
	<DEC609CD0E54B2448DAF023C89AE9755E250CF1A@cosmail02.lsi.com>
	<C5277CB418429641BC1498607A9F480593D3B47C@cosmail01.lsi.com>
	<A1FEB16D007D2E4DAE212D51980EC3B9DB3601D7@sikmail02.lsi.com>
	<20091006234405.36cc1a2a@ripper.onstor.net>
	<3D228269E7866B4B85BE8E4FC5A4447A0CB9A7@cosmail02.lsi.com>
	<A1FEB16D007D2E4DAE212D51980EC3B9DB360390@sikmail02.lsi.com>
	<C5277CB418429641BC1498607A9F480593D9FFEC@cosmail01.lsi.com>
	<3D228269E7866B4B85BE8E4FC5A4447A0CBCF8@cosmail02.lsi.com>
Organization: LSI
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On Wed, 7 Oct 2009 20:19:26 -0600 "Liu, Yifeng" <Yifeng.Liu@lsi.com>
wrote:

> So, what is the actual problem here, the instantaneous spawning of
> apache processes in response to EMRS upload requests or the already
> spawned apache processes eating up the resources?

The latter.  You could spawn quite a few processes and it wouldn't hurt
anything as long as they didn't do anything.  That's why I'm advocating
raising the max time slice value: new processes would have to wait
longer before they got resources, but the overall processing throughput
would increase, dispatching these "humps" quicker, thereby relieving
the congestion sooner.

For example, if a flurry of processing requests comes in, let's say it
takes 10 minutes to clear that out and return to a normal steady state.
If you increase the max time slice, it might reduce that to 7 minutes.
Yes, it will hurt response time, but it sounds like that's already
completely in the crapper, far beyond what this would do to it.  So it
might actually improve it.  Actually, the response-time issue is more
likely related to the kernel bug that is triggered at the same
time.
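
[For reference: on a CFS kernel (2.6.23 or later) this tuning maps to the
scheduler sysctls below.  This is a sketch only -- it assumes the box runs
CFS rather than the older O(1) scheduler, and the values are illustrative
guesses, not tested numbers.]

```
# /etc/sysctl.conf sketch -- assumes a CFS kernel (2.6.23+); older O(1)
# schedulers do not expose these knobs.  Values are illustrative only.
kernel.sched_latency_ns = 60000000          # stretch the scheduling period
kernel.sched_min_granularity_ns = 10000000  # longer minimum slice per task
```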

> # send the kpi file every hour (the actual minute of this cron job is
> # random; seeded by the mac address, so not all field machines send at
> # the same time)
> 49	*	*	*	*	emrscron -s kpi_stats
>
> A quick check for when the kpi_stats are getting uploaded to the dw
> will give the answer to how good the randomness is, but I am not sure
> if it is being randomized at all, since they seem to be uploaded at
> the exact same time every hour to the DW.
>
> -----Original Message-----
> From: Johnson, Dave
> Sent: October 7, 2009 17:17
> To: Irie, Shin; Liu, Yifeng; Sharp, Andy
> Cc: Scheer, Larry; Currin, Shawn; Collins, Caeli; Cook, Neil; Duffy,
> Bill; Hong, Xian; Junod, JeanPaul; LaReau, Rich; Lewis, Cheryl; Mora,
> Carlos; Onstor-cs-mail-archive; Ryman, Summer; Suzuki, Takuji;
> Swenson, Timothy; Thiessen, Joachim; DL-ONStor-Customer Service
> Group; DL-ONStor-Engineering; Keiffer, John; Onstor-cs-mail-archive;
> Piela, Ben; Roldan, Arnaldo; Seidel, Jan; Shankar, Shiva
> Subject: RE: running processes queue hitting huge spikes on beast...
>
> Shin,
>
> Our emails crossed each other...
>
> "# send the kpi file every hour (the actual minute of this cron job
> # is random; seeded by the mac address, so not all field machines send
> # at the same time)
> 24	*	*	*	*	emrscron -s kpi_stats
>
> The algorithm to randomize the actual minutes is not good enough??"
>=20
> Is random the right answer here?  I would think mod'ing the unique
> portion of our unit serial number with 3600 would give the ideal
> number of seconds past the hour to begin the upload, which should
> result in a nearly uniform load on the server (given enough
> units sold :)
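
[A minimal sketch of Dave's mod-3600 idea; the serial number below is a
made-up example, and the real emrscron would presumably derive this
itself:]

```shell
#!/bin/sh
# Sketch: spread uploads uniformly across the hour by taking the numeric
# portion of the unit serial number mod 3600.  The serial value here is a
# hypothetical example, not a real ONStor serial.
serial=184467                 # assumed unique portion of the serial number
offset=$(( serial % 3600 ))   # seconds past the hour to begin the upload
echo "start upload $offset seconds past the hour"
# prints "start upload 867 seconds past the hour"
```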
>=20
> Only about 200 units "phoned home" over the course of the one upload
> session I observed, which started at 29 min past the hour and lasted
> about a minute.  That the box becomes non-responsive during this time
> doesn't bode well for the scalability of the current setup.
>
> After checking more based on Shin's comments, the load seems most
> correlated with the firing off of the upload.cgi script.  The launching
> and tear-down of all those processes simultaneously is killing the
> server.  The apache processes appear to be launched already and
> waiting idle for the remote systems to connect, so they shouldn't be
> consuming many resources besides memory (of which there seems to be
> plenty).  Even so, I would modify the /etc/apache2/httpd.conf file to
> raise the StartServers and MinSpareServers settings and especially to
> lower MaxClients accordingly, as well as turning KeepAlive off.
> Check /home/djohnson/httpd.conf.patch for my comments and suggestions.
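
[A sketch of the kind of httpd.conf change Dave is describing, using the
prefork MPM directives; the specific values are guesses for illustration,
not the ones in his patch file:]

```
# Sketch only -- the real numbers are in /home/djohnson/httpd.conf.patch.
# Prefork MPM tuning: pre-spawn workers and cap the stampede.
StartServers          20    # have workers ready before the top of the hour
MinSpareServers       20
MaxSpareServers       40
MaxClients            50    # cap simultaneous upload.cgi children
KeepAlive             Off   # don't hold connections open after an upload
```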
>
> Until the client-side load distribution is fixed, this issue will
> continue.
>
> -=dave
>
> -----Original Message-----
> From: Irie, Shin
> Sent: Wednesday, October 07, 2009 3:53 PM
> To: Liu, Yifeng; Sharp, Andy
> Cc: Johnson, Dave; Scheer, Larry; Currin, Shawn; Collins, Caeli;
> Cook, Neil; Duffy, Bill; Hong, Xian; Johnson, Dave; Junod, JeanPaul;
> LaReau, Rich; Lewis, Cheryl; Mora, Carlos; Onstor-cs-mail-archive;
> Ryman, Summer; Suzuki, Takuji; Swenson, Timothy; Thiessen, Joachim;
> DL-ONStor-Customer Service Group; DL-ONStor-Engineering; Johnson,
> Dave; Keiffer, John; Onstor-cs-mail-archive; Piela, Ben; Roldan,
> Arnaldo; Seidel, Jan; Shankar, Shiva
> Subject: RE: running processes queue hitting huge spikes on beast...
>
> I ran "top -b -d 5 -n 2000" to see the trend.  The load average
> value goes up around the top of the hour and the half hour.
>
> Around 21:00 yesterday
> top - 21:00:03 up 21 days,  2:12,  5 users,  load average: 0.48, 0.81, 1.44
> top - 21:00:08 up 21 days,  2:12,  5 users,  load average: 0.52, 0.81, 1.44
> top - 21:00:13 up 21 days,  2:12,  5 users,  load average: 0.64, 0.83, 1.44
> top - 21:00:18 up 21 days,  2:13,  5 users,  load average: 4.83, 1.69, 1.72
> top - 21:00:23 up 21 days,  2:13,  5 users,  load average: 8.69, 2.54, 1.99
> top - 21:00:28 up 21 days,  2:13,  5 users,  load average: 12.31, 3.40, 2.27
> top - 21:00:33 up 21 days,  2:13,  5 users,  load average: 15.73, 4.26, 2.55
> top - 21:00:38 up 21 days,  2:13,  5 users,  load average: 17.52, 4.82, 2.74
> top - 21:00:43 up 21 days,  2:13,  5 users,  load average: 17.23, 4.97, 2.80
>
> Around 21:30 yesterday
> top - 21:29:08 up 21 days,  2:41,  5 users,  load average: 6.44, 1.97, 1.44
> top - 21:29:13 up 21 days,  2:41,  5 users,  load average: 14.33, 3.68, 1.99
> top - 21:29:18 up 21 days,  2:42,  5 users,  load average: 22.15, 5.48, 2.58
> top - 21:29:23 up 21 days,  2:42,  5 users,  load average: 29.27, 7.23, 3.17
> top - 21:29:28 up 21 days,  2:42,  5 users,  load average: 35.57, 8.90, 3.73
> top - 21:29:33 up 21 days,  2:42,  5 users,  load average: 42.01, 10.68, 4.33
> top - 21:29:53 up 21 days,  2:42,  5 users,  load average: 59.27, 16.73, 6.46
> top - 21:29:58 up 21 days,  2:42,  5 users,  load average: 58.29, 17.24, 6.68
> top - 21:30:03 up 21 days,  2:42,  5 users,  load average: 57.22, 17.70, 6.89
> top - 21:30:08 up 21 days,  2:42,  5 users,  load average: 55.20, 17.93, 7.02
>
> The spike around the half hour should be from this emrs cron entry.
>
> # gather stats every hour (on the half hour)
> 30	*	*	*	*	emrscron -g stats
>
> I'm not sure where the peak around the top of the hour comes from, but
> this could be a candidate.
>
> # send the kpi file every hour (the actual minute of this cron job is
> # random; seeded by the mac address, so not all field machines send at
> # the same time)
> 24	*	*	*	*	emrscron -s kpi_stats
>
> The algorithm to randomize the actual minutes is not good enough??
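
[One way the seeding could work, as a sketch: treat the MAC's hex digits
as one number mod 60.  This is a guess at the scheme, not the actual
emrscron code.  Note that vendor-assigned MACs share OUI prefixes and
often ship in sequential runs, so clustering of the resulting minutes is
plausible even if the arithmetic is correct.]

```shell
#!/bin/sh
# Sketch: derive a minute (0-59) from a MAC address by treating its hex
# digits as a single number mod 60.  Assumed scheme, not the real
# emrscron seeding; the MAC below is an example value.
mac="00:1a:2b:3c:4d:5e"
hex=$(printf '%s' "$mac" | tr -d ':')
minute=$(( 0x$hex % 60 ))
echo "upload minute: $minute"    # prints 30 for this example MAC
```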
>
>
> Other cron entries for EMRS are:
>
> # gather config every day (at 23:05)
> 5	23	*	*	*	emrscron -g config
>
> # gather h_res_stats data every 3 min
> */3	*	*	*	*	emrscron -g h_res_stats
>
> # keep the last 7 days of config and stats files (removing excess
> # daily at 00:53)
> 53	0	*	*	*	emrscron -t config; emrscron -t stats
>
> # roll the log files for h_res_stats and keep only 7 log files
> # around; send files to onstor
> 59	23	*	*	*	emrscron -t h_res_stats; emrscron -s all
>
>
> --
> Irie
>
> -----Original Message-----
> From: Liu, Yifeng=20
> Sent: Thursday, October 08, 2009 2:53 AM
> To: Sharp, Andy; Irie, Shin
> Cc: Johnson, Dave; Scheer, Larry; Currin, Shawn; Collins, Caeli;
> Cook, Neil; Duffy, Bill; Hong, Xian; Irie, Shin; Johnson, Dave;
> Junod, JeanPaul; LaReau, Rich; Lewis, Cheryl; Liu, Yifeng; Mora,
> Carlos; Onstor-cs-mail-archive; Ryman, Summer; Suzuki, Takuji;
> Swenson, Timothy; Thiessen, Joachim; DL-ONStor-Customer Service
> Group; DL-ONStor-Engineering; Johnson, Dave; Keiffer, John;
> Onstor-cs-mail-archive; Piela, Ben; Roldan, Arnaldo; Seidel, Jan;
> Shankar, Shiva Subject: RE: running processes queue hitting huge
> spikes on beast...
>
> One thing that I can't understand is why the slow responsiveness
> always happens on the whole and half hours.
>
> If it is closely related to EMRS, it should depend on the number of
> EMRS client requests coming in.
>
> In a separate email I will forward John Rogers' finding on what the
> beast's slow responsiveness could be related to, as a reference.
>
> Thanks
>
> -----Original Message-----
> From: Andrew Sharp [mailto:andy.sharp@lsi.com]
> Sent: October 6, 2009 23:44
> To: Irie, Shin
> Cc: Johnson, Dave; Scheer, Larry; DL-ONStor-cstech
> Subject: Re: running processes queue hitting huge spikes on beast...
> On Tue, 6 Oct 2009 18:27:43 -0600 "Irie, Shin" <Shin.Irie@lsi.com>
> wrote:
>
> > I think it's related to EMRS.  Lots of upload.cgi processes were
> > running around 1700 today.
>
> > top - 17:00:29 up 20 days, 22:13,  5 users,  load average: 19.82, 5.84, 3.70
> > Tasks: 334 total,  60 running, 274 sleeping,   0 stopped,   0 zombie
> > Cpu(s): 89.4%us,  9.8%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.5%si,  0.0%st
> > Mem:   7262544k total,  2527472k used,  4735072k free,    96368k buffers
> > Swap:  7815580k total,       64k used,  7815516k free,   782512k cached
> >
> >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> > 11598 emrs      15   0 47908  43m 3676 S   12  0.6   0:09.83 EMRS_event_cons
> >  4080 emrs      16   0  252m 250m 3072 S    6  3.5 356:06.93 emrs_mq.pl
>                                ^^^^
> Holy flying pigs, Batman, who gave this thing so much thrust?
> 250m *resident* for a user-space process?  Not to mention the 10-20m
> for each of the others (that we can see).  It says 60 running,
> including the ones in iowait.  These processes are total porkers.  No
> wonder the system caves.  They all get I/O contentious at some point,
> I'm betting, causing the large load average.  If this were the mysql
> daemon, that'd be one thing, but this is a friggin perl program.  The
> system's got enough memory to get bottlenecked on CPU, but it's
> thrashing.
>
> I wonder why it says 0.0%wa when I can see two processes in this
> short list alone that are in I/O wait.  Maybe because the CPU
> contention is also impressive: sixty processes adding up to almost
> 100% cpu, but most of them in the 5-6% range.  That's a lot of task
> switching.  We might help ourselves by increasing the maximum time
> slice on this system.  Processes might have to wait longer to get
> started, but there'd be less thrashing.  Or just add 3-4 more cores.
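
[The task-switching theory is checkable: on Linux, the `ctxt` line in
/proc/stat is a running count of context switches since boot, so sampling
it twice during a spike gives the rate.  A quick sketch:]

```shell
#!/bin/sh
# Sketch: measure system-wide context switches per second by sampling the
# monotonically increasing "ctxt" counter in /proc/stat (Linux-only).
a=$(awk '/^ctxt/ {print $2}' /proc/stat)
sleep 5
b=$(awk '/^ctxt/ {print $2}' /proc/stat)
echo "$(( (b - a) / 5 )) context switches/sec"
```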
>=20
> > 11872 emrs      17   0 11760 9.8m 1692 R    6  0.1   0:00.28 upload.cgi
> > 11856 emrs      17   0 15996  13m 1816 R    5  0.2   0:00.40 upload.cgi
> > 11871 emrs      17   0 11104 9444 1692 R    5  0.1   0:00.26 upload.cgi
> > 11820 emrs      15   0 23400  20m 3572 D    5  0.3   0:00.61 upload.cgi
> > 11825 emrs      17   0 17952  15m 1844 R    4  0.2   0:00.45 upload.cgi
> > 11857 emrs      17   0 17600  15m 1828 R    4  0.2   0:00.44 upload.cgi
> > 11881 emrs      17   0 10440 8784 1684 R    4  0.1   0:00.21 upload.cgi
> > 11848 emrs      17   0 17820  15m 1828 R    4  0.2   0:00.46 upload.cgi
> > 11858 emrs      17   0 16940  14m 1828 R    4  0.2   0:00.40 upload.cgi
> > 11869 emrs      17   0 10048 8400 1684 R    4  0.1   0:00.19 upload.cgi
> > 11884 emrs      17   0 10044 8360 1684 R    4  0.1   0:00.18 upload.cgi
> > 11841 emrs      17   0 17952  15m 1844 D    3  0.2   0:00.44 upload.cgi
> > 11852 emrs      17   0 16148  13m 1824 R    3  0.2   0:00.38 upload.cgi
