Date: Mon, 10 Nov 2008 11:13:22 -0800
From: Andrew Sharp <andy.sharp@onstor.com>
To: Sandrine Boulanger <sandrine.boulanger@onstor.com>
Subject: Re: status after reboot
Message-ID: <20081110111322.2f6eb04f@ripper.onstor.net>
In-Reply-To: <2779531E7C760D4491C96305019FEEB5175D5BE1F5@exch1.onstor.net>
References: <20081109211317.15c5d3e0@ripper.onstor.net>
	<2779531E7C760D4491C96305019FEEB5175D5BE1F5@exch1.onstor.net>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

The previous version of exim may have left the cluster daemon fairly
hosed, causing the cluster errors.  I'd suggest another reboot ~:^(
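
For checking the queue once the node is back up, here is a rough sketch
in Python that tallies frozen versus total messages from `exim -bp`
output like the listing quoted below.  The function name summarize_queue
and the regex are my own assumptions about the listing format, not
anything that ships with exim; `exiqgrep -z -c` already gives the frozen
count on the node itself, so this is only handy for pasted output.

```python
import re

# First line of a queue entry as printed by `exim -bp`:
#   <age> <size> <message-id> <sender> [*** frozen ***]
# The pattern is an assumption based on the listing quoted in this thread.
ENTRY = re.compile(r"^(\d+[mhd])\s+(\S+)\s+(\w{6}-\w{6}-\w{2})\s+(<[^>]*>)(.*)$")


def summarize_queue(listing):
    """Return total and frozen message counts for `exim -bp` style text."""
    total = frozen = 0
    for raw in listing.splitlines():
        line = raw.lstrip("> ")  # drop mail-quote markers and indentation
        match = ENTRY.match(line)
        if match is None:
            continue  # recipient lines and blank lines are skipped
        total += 1
        if "*** frozen ***" in match.group(5):
            frozen += 1
    return {"total": total, "frozen": frozen}


# Two entries copied from the quoted listing: one frozen, one not.
SAMPLE = """\
14h  1.4K 1KzNxO-0000MP-Es <> *** frozen ***
          root@g11r10

 4h  2.5K 1KzWqA-0003i8-JO <g12r10@onstor.com>
          raj.kumar@onstor.com
          sandrine.boulanger@onstor.com
"""

print(summarize_queue(SAMPLE))
```

It only looks at the first line of each entry, so recipient lines and
mail-quote prefixes are ignored.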

On Mon, 10 Nov 2008 09:44:46 -0800 Sandrine Boulanger
<sandrine.boulanger@onstor.com> wrote:

> No exim processes hung on g11r10 this morning, but it is still
> getting cluster errors, maybe because of the other nodes.
> 
> G12r10 has many processes stuck.
>
> g11r10:/var/log/onstor# ps ax | grep exim
> 11536 ?        Ss     0:00 /usr/sbin/exim4 -bd -q30m
> 19912 pts/0    R+     0:00 grep exim
>
> g11r10:/var/log/onstor# tail -f messages | grep -i error
> Nov 10 09:36:15 g11r10 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey: no reply bck -1
> Nov 10 09:36:15 g11r10 : 0:0:cluster2:ERROR: cluster_getFilerNameList: cannot get cluster rec, code 30
>
> g11r10:/var/log/onstor# exiqgrep -z -c
> 13 matches out of 18 messages
>
> g11r10:/var/log/onstor# exim -bp
> 14h  1.4K 1KzNxO-0000MP-Es <> *** frozen ***
>           root@g11r10
>
> 13h  1.4K 1KzOuE-0004M2-Ul <> *** frozen ***
>           root@g11r10
>
> 12h  1.4K 1KzPr5-00007S-Oh <> *** frozen ***
>           root@g11r10
>
> 11h  1.4K 1KzQn2-0004NE-Jp <> *** frozen ***
>           root@g11r10
>
> 10h  1.4K 1KzRhz-00069l-DS <> *** frozen ***
>           root@g11r10
>
>  9h  1.4K 1KzSdi-000253-Hs <> *** frozen ***
>           root@g11r10
>
>  8h  1.4K 1KzTbK-0006Q7-2b <> *** frozen ***
>           root@g11r10
>
>  7h  1.4K 1KzUWB-0002GC-C5 <> *** frozen ***
>           root@g11r10
>
>  6h  1.4K 1KzVUi-0006Nq-H3 <> *** frozen ***
>           root@g11r10
>
>  5h  1.4K 1KzWPV-00021L-CX <> *** frozen ***
>           root@g11r10
>
>  4h  2.5K 1KzWqA-0003i8-JO <g12r10@onstor.com>
>           raj.kumar@onstor.com
>           sandrine.boulanger@onstor.com
>
>  4h  1.4K 1KzXLd-0006UH-Cg <> *** frozen ***
>           root@g11r10
>
>  3h  1.4K 1KzYGU-0002TK-8G <> *** frozen ***
>           root@g11r10
>
>  2h  1.4K 1KzZDZ-0006oR-1Q <> *** frozen ***
>           root@g11r10
>
> 73m  2.5K 1KzZeN-0008Q8-7b <g12r10@onstor.com>
>           raj.kumar@onstor.com
>           sandrine.boulanger@onstor.com
>
> 53m  2.5K 1KzZxi-0001e2-LT <g12r10@onstor.com>
>           raj.kumar@onstor.com
>           sandrine.boulanger@onstor.com
>
> 23m  2.5K 1KzaQl-0003KS-D0 <g12r10@onstor.com>
>           raj.kumar@onstor.com
>           sandrine.boulanger@onstor.com
>
> 13m  2.5K 1KzaaR-0003vg-8Q <g12r10@onstor.com>
>           raj.kumar@onstor.com
>           sandrine.boulanger@onstor.com
>
> g11r10:/var/log/onstor#
> 
> -----Original Message-----
> From: Andy Sharp
> Sent: Sunday, November 09, 2008 9:13 PM
> To: Sandrine Boulanger
> Subject: Re: status after reboot
>
> I put yet another test version on g11r10, and I'm hoping for the best.
> I'd like to see how it's doing in the morning.
>
> Thanks,
>
> a
>
> On Sun, 9 Nov 2008 14:33:43 -0800 Sandrine Boulanger
> <sandrine.boulanger@onstor.com> wrote:
>
> > Too bad. Good luck. I'm wondering if we would be better off using the
> > Bobcat method for sending emails.
> >
> > -----Original Message-----
> > From: Andy Sharp
> > Sent: Sunday, November 09, 2008 11:30 AM
> > To: Sandrine Boulanger
> > Subject: Re: status after reboot
> >
> > I'm running an even more experimental version on my cougar, which I
> > thought would work better, but it has similar stuck processes on it.
> > I'll work on it some more today.  ~:^(
> >
> > On Sat, 8 Nov 2008 11:11:28 -0800 Sandrine Boulanger
> > <sandrine.boulanger@onstor.com> wrote:
> >
> > > If you said you thawed everything on g11r10 last night and fixed the
> > > hosts file, how come we have so many stuck since then? I have never
> > > seen that many on cougar soak yet. What can we try next?
> > >
> > > g11r10:~# ps ax -o pid,ppid,tt,wchan,state,start,time,command | grep exim
> > >   771  1311 ?        wait   S 04:29:54 00:00:00 /usr/sbin/exim4 -q
> > >   775   771 ?        select S 04:29:55 00:00:01 /usr/sbin/exim4 -q
> > >   802     1 ?        wait   S 16:33:19 00:00:00 /usr/sbin/exim4 -q
> > >   811   802 ?        select S 16:33:20 00:00:02 /usr/sbin/exim4 -q
> > >  1311     1 ?        select S 16:34:07 00:00:00 /usr/sbin/exim4 -bd -q30m
> > >  1972     1 ?        select S 23:12:02 00:00:01 /usr/sbin/exim4 -Mc 1KyhzG-0000Vc-Dz
> > >  6102     1 ?        select S 23:14:02 00:00:01 /usr/sbin/exim4 -Mc 1Kyi1C-0001aP-7m
> > >  6475     1 ?        select S 06:50:04 00:00:00 /usr/sbin/exim4 -Mc 1Kyp8V-0001gQ-MV
> > >  9068  1311 ?        wait   S 04:59:54 00:00:00 /usr/sbin/exim4 -q
> > >  9080  9068 ?        select S 04:59:58 00:00:00 /usr/sbin/exim4 -q
> > >  9875  6694 pts/1    -      R 11:07:46 00:00:00 grep exim
> > > 13653     1 ?        select S 05:20:03 00:00:00 /usr/sbin/exim4 -Mc 1KynjP-0003YC-0e
> > > 15420     1 ?        select S 07:30:03 00:00:00 /usr/sbin/exim4 -Mc 1KyplD-00040B-6C
> > > 15558     1 ?        select S 09:30:04 00:00:00 /usr/sbin/exim4 -Mc 1KyrdL-00042U-M0
> > > 15897  1311 ?        wait   S 05:29:54 00:00:00 /usr/sbin/exim4 -q
> > > 15902 15897 ?        select S 05:29:55 00:00:00 /usr/sbin/exim4 -q
> > > 17151  1311 ?        wait   S 03:29:54 00:00:00 /usr/sbin/exim4 -q
> > > 17154 17151 ?        select S 03:29:55 00:00:01 /usr/sbin/exim4 -q
> > > 20159     1 ?        select S 07:38:04 00:00:00 /usr/sbin/exim4 -MCS -MCP -MC remote_smtp mail.onstor.com 66.201.51.107 2 1Kypr0-00056v-3v
> > > 20866     1 ?        select S 07:40:07 00:00:00 /usr/sbin/exim4 -MCS -MCP -MC remote_smtp mail.onstor.com 66.201.51.107 3 1Kypn8-0004uM-69
> > > 20988     1 ?        select S 07:42:04 00:00:00 /usr/sbin/exim4 -MCS -MCP -MC remote_smtp mail.onstor.com 66.201.51.107 2 1KypjG-0003v2-8F
> > > 21430     1 ?        select S 05:40:04 00:00:01 /usr/sbin/exim4 -Mc 1Kyo2l-0005Zd-MT
> > > 22904     1 ?        select S 23:06:03 00:00:01 /usr/sbin/exim4 -Mc 1KyhtS-0005x7-NV
> > > 26583  1311 ?        wait   S 03:59:54 00:00:00 /usr/sbin/exim4 -q
> > > 26586 26583 ?        select S 03:59:55 00:00:00 /usr/sbin/exim4 -q
> > > 32630     1 ?        select S 02:30:05 00:00:00 /usr/sbin/exim4 -Mc 1Kyl4t-0008TS-Kt
> > > g11r10:~#
> > >
> > > -----Original Message-----
> > > From: Andy Sharp
> > > Sent: Friday, November 07, 2008 7:35 PM
> > > To: Sandrine Boulanger
> > > Subject: Re: status after reboot
> > >
> > > I fixed g11r10; the format of the hosts file is very important to
> > > exim for some reason.  The bare node name has to come before the
> > > node.sc0 name.
> > >
> > > I unfroze the messages, and on the next queue run they all got
> > > thrown away.
> > >
> > > I'm having trouble getting to g12r10.
> > >
> > > On Fri, 7 Nov 2008 17:43:22 -0800 Sandrine Boulanger
> > > <sandrine.boulanger@onstor.com> wrote:
> > >
> > > > The 2 nodes I rebooted have extra processes, and one of them
> > > > already has 160 frozen messages.
> > > >
> > > > g12r10:/var/log/onstor# ps ax | grep exim
> > > >   816 ?        S      0:00 /usr/sbin/exim4 -q
> > > >   819 ?        S      0:00 /usr/sbin/exim4 -q
> > > >  1263 ?        Ss     0:00 /usr/sbin/exim4 -bd -q30m
> > > > 13474 pts/0    R+     0:00 grep exim
> > > >
> > > > g11r10:/var/log/onstor# ps ax | grep exim
> > > >   802 ?        S      0:00 /usr/sbin/exim4 -q
> > > >   811 ?        S      0:00 /usr/sbin/exim4 -q
> > > >  1311 ?        Ss     0:00 /usr/sbin/exim4 -bd -q30m
> > > > 25076 pts/0    R+     0:00 grep exim
> > > >
> > > > g11r10:/var/log/onstor# exiqgrep -z -c
> > > > 160 matches out of 161 messages
> > > >
