X-MimeOLE: Produced By Microsoft Exchange V6.5
Received: by onstor-exch02.onstor.net 
	id <01C7207A.A0376D63@onstor-exch02.onstor.net>; Fri, 15 Dec 2006 10:55:44 -0800
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Content-class: urn:content-classes:message
Subject: RE: Updates
Date: Fri, 15 Dec 2006 10:55:44 -0800
Message-ID: <BB375AF679D4A34E9CA8DFA650E2B04E01C0B34C@onstor-exch02.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E0A90E1@onstor-exch02.onstor.net>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Updates
thread-index: Acce+b3Mfy+xIPS2TRi4/4armFj97QAAFnpQAAA/8cAAAA8pAAAAMVEwAAtKwPAAAB7EkAAAXasgAAF+nMAAG+bGsAAAEIdhABPlQMAAH9fUUAABwNUiAADAE1A=
From: "Eric Barrett" <eric.barrett@onstor.com>
To: "Larry Scheer" <larry.scheer@onstor.com>,
	"Paul Hammer" <paul.hammer@onstor.com>,
	"Tim Gardner" <tim.gardner@onstor.com>,
	"Sandrine Boulanger" <sandrine.boulanger@onstor.com>,
	"Ed Kwan" <ed.kwan@onstor.com>,
	"Charissa Willard" <charissa.willard@onstor.com>
Cc: "Kevin Matthews" <kevin.matthews@onstor.com>,
	"Brian Baker" <brian.baker@onstor.com>,
	"Vikas Saini" <vikas.saini@onstor.com>,
	"dl-Clio" <dl-Clio@onstor.com>

Larry,

No problem.  Yes, seen it twice in the field, one with an upgrade to
1.3.3.1, the other with an upgrade to 1.3.3.3.  This suggests the
problem is not Clio-specific, but does affect it.  In the field (with
mostly Bobcats) this is a very hard thing to work around, as you first
have to identify the problem, and then, once you do, you have about 30
seconds to remount the /usr/local/agile file system read-write and
modify pmtab before the whole thing goes down again.

The symptoms were the same, snmpd crashing over and over, so when I saw
this occurring on Mightydog I knew exactly what to do to work around it.
(I was the one who modified pmtab here.)  Apparently on Cheetahs you
have a bit more time before the TXRX gives up the ghost.

We didn't know about the bad libraries until you found that out, so we
don't have checksums.  I've left voicemails with the two customers
affected hoping to get these, but they haven't returned my calls yet.

Also, for what it's worth, I've gone back and run cksum on every
opt-build Bobcat and Cheetah libucdmibs.so I can find, both from
/n/build-trees and our FTP site.  None of them match the cksum you got
from Mightydog's.  Looks like it's getting corrupted by the install
process somehow.
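A scan like the one described above could be scripted roughly as follows. This is a minimal sketch, not the commands actually used: it reimplements the POSIX cksum CRC in Python so the values are comparable with `cksum` output, then walks a tree looking for files with a given name. The function names `posix_cksum` and `scan_tree`, and the reference value passed in, are hypothetical.

```python
import os

def _make_table():
    """Build the MSB-first CRC table for polynomial 0x04C11DB7 (the cksum CRC)."""
    table = []
    for i in range(256):
        c = i << 24
        for _ in range(8):
            c = ((c << 1) ^ 0x04C11DB7) if c & 0x80000000 else (c << 1)
            c &= 0xFFFFFFFF
        table.append(c)
    return table

_TABLE = _make_table()

def posix_cksum(data: bytes) -> int:
    """Return the checksum that POSIX cksum prints for `data`."""
    crc = 0
    for b in data:
        crc = ((crc << 8) & 0xFFFFFFFF) ^ _TABLE[(crc >> 24) ^ b]
    # cksum also feeds the byte count into the CRC, low-order octet first.
    n = len(data)
    while n:
        crc = ((crc << 8) & 0xFFFFFFFF) ^ _TABLE[(crc >> 24) ^ (n & 0xFF)]
        n >>= 8
    return crc ^ 0xFFFFFFFF

def scan_tree(root, filename, reference):
    """Yield (path, checksum, size, matches_reference) for each `filename` under `root`."""
    for dirpath, _dirs, files in os.walk(root):
        if filename in files:
            path = os.path.join(dirpath, filename)
            with open(path, "rb") as fh:
                data = fh.read()
            crc = posix_cksum(data)
            yield path, crc, len(data), crc == reference
```

Calling `scan_tree` on a build tree with the checksum taken from the affected system would list every candidate library and flag any that match; if none match, the corrupted copy did not come from any build in that tree.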



-----Original Message-----
From: Larry Scheer
Sent: Friday, December 15, 2006 10:43 AM
To: Eric Barrett; Paul Hammer; Tim Gardner; Sandrine Boulanger; Ed Kwan;
Charissa Willard
Cc: Kevin Matthews; Brian Baker; Vikas Saini; dl-Clio
Subject: RE: Updates


Eric,
   When I sent my first email, I didn't know this issue was occurring on
more than just dogfood and that customers were seeing the exact same
problem. Are we certain it is the exact same problem?

Is it the exact same files that were corrupted in the field? Or is it
similar files? By that I mean one or two the same, but not all, or
others? The upgrade of dogfood was to Clio and that is not out in the
field. But I understand that doesn't mean it can't be the same problem.

Yes, I did preserve the files from dogfood for analysis. I also looked
at /var/log/messages, but I am not sure I was looking at the file from
the time of the upgrade. I am still inspecting those files.

I'll talk to you later today to get more info on what the customers are
seeing.

Larry

-----Original Message-----
From: Eric Barrett
Sent: Fri 12/15/2006 10:15 AM
To: Larry Scheer; Paul Hammer; Tim Gardner; Sandrine Boulanger; Ed Kwan;
Charissa Willard
Cc: Kevin Matthews; Brian Baker; Vikas Saini; dl-Clio
Subject: RE: Updates

I have to say, based on the fact we've had three occurrences of exactly
the same problem, two in the field, I strongly disagree with "degrading
compact flash" as an answer.  Were there any associated I/O errors in
the BSD /var/log/messages?  We always see these when a compact flash
fails in the field.

I think it's far more likely that the upgrade did not complete
successfully even if the 'system upgrade' command said it did.

Did we preserve the old files from Dogfood for analysis?

________________________________

From: Larry Scheer
Sent: Thursday, December 14, 2006 7:45 PM
To: Paul Hammer; Eric Barrett; Tim Gardner; Sandrine Boulanger; Ed Kwan;
Charissa Willard
Cc: Kevin Matthews; Brian Baker; Vikas Saini; dl-Clio
Subject: RE: Updates



I spent some time investigating the upgrade problems that occurred on
dogfood. Of the list of files reported previously, these files had
problems:

/usr/local/agile/lib/libucdagent.so
/usr/local/agile/lib/libucdmibs.so
/usr/local/agile/web/js/console/login.js
/usr/local/agile/web/js/console/view.js
/usr/local/agile/web/js/domain/util.js



I found that although their file size was the same, their checksums did
not match the source files stored in the compressed tar file used for
the release distribution.
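A comparison of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `compare_with_tarball` is hypothetical, the member names are assumed to be stored in the archive relative to the install root, and SHA-256 is swapped in for cksum because any digest works for an equality check.

```python
import hashlib
import os
import tarfile

def sha256_of(data: bytes) -> str:
    """Hex digest used only to test two byte strings for equality."""
    return hashlib.sha256(data).hexdigest()

def compare_with_tarball(tar_path, installed_root, members):
    """Compare installed files with the same-named members of a release tarball.

    Returns a dict mapping each member name to 'ok', 'size-differs',
    'content-differs', or 'not-a-regular-file'.
    """
    results = {}
    with tarfile.open(tar_path, "r:*") as tar:  # "r:*" auto-detects compression
        for name in members:
            source = tar.extractfile(name)  # raises KeyError if the member is absent
            if source is None:
                results[name] = "not-a-regular-file"
                continue
            expected = source.read()
            with open(os.path.join(installed_root, name), "rb") as fh:
                actual = fh.read()
            if len(actual) != len(expected):
                results[name] = "size-differs"
            elif sha256_of(actual) != sha256_of(expected):
                results[name] = "content-differs"
            else:
                results[name] = "ok"
    return results
```

Files reported as 'content-differs' are the interesting case here: same size as the source, different bytes.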


Examining the contents of the files I found that the files were
corrupted. They were not copies of the files from a previous release but
were versions of the files for 2.1.0.0 with either nulls or somewhat
random bits stored in the file. The files were unpacked by the upgrade
program from the source distribution and stored on the compact flash
incorrectly.
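To characterize corruption like this, it can help to list exactly where a bad copy departs from the good one and whether each damaged region is all nulls. The following is a rough sketch, assuming both copies are already read into memory and are the same size; `diff_runs` is a hypothetical helper, not a tool used in this investigation.

```python
def diff_runs(good: bytes, bad: bytes):
    """Return a list of (offset, length, all_nulls) for each run of differing bytes.

    `all_nulls` is True when every byte of the run in `bad` is 0x00.
    """
    if len(good) != len(bad):
        raise ValueError("files differ in size; compare sizes first")
    runs = []
    start = None
    # Iterate one position past the end so a run that reaches EOF is closed.
    for i in range(len(good) + 1):
        differs = i < len(good) and good[i] != bad[i]
        if differs and start is None:
            start = i
        elif not differs and start is not None:
            runs.append((start, i - start, not any(bad[start:i])))
            start = None
    return runs
```

Note that a nulled-out region can show up as several shorter runs if the good file happens to contain zero bytes inside it, since those positions compare equal. Run offsets and lengths that line up with block or page boundaries would be a hint about where in the write path the damage occurred.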


I was able to replace these files with copies from the source
distribution without any problems. After I copied the good files over
the corrupted ones on the compact flash, their checksums verified. I
modified the pmtab to start snmpd and rebooted dogfood. The system is
running and snmpd is staying up.


Although I was unable to determine the root cause, the most likely
reason for this type of problem is a degrading compact flash. I
recommended that Brian and Kevin replace this flash at their earliest
convenience. After talking to Andy Sharp and Tim I ruled out network and
memory problems as a cause. If memory were an issue, the system would
have been failing in other, more obvious ways. If the network were an
issue, the entire upgrade would have failed in a more obvious manner.


I will add this information to the bug report.


Larry


________________________________

From: Paul Hammer
Sent: Thursday, December 14, 2006 8:58 AM
To: Eric Barrett; Tim Gardner; Sandrine Boulanger; Ed Kwan; Charissa
Willard
Cc: Kevin Matthews; Brian Baker; Vikas Saini; dl-Clio
Subject: RE: Updates


Can we run the tests some place other than on MD?


________________________________

From: Eric Barrett
Sent: Thu 12/14/2006 8:57 AM
To: Tim Gardner; Sandrine Boulanger; Ed Kwan; Charissa Willard; Paul
Hammer
Cc: Kevin Matthews; Brian Baker; Vikas Saini; dl-Clio
Subject: RE: Updates

I would also suggest comparison to mktg3 -- it's the other node in the
cluster.  It was not affected by the snmp issue despite also being
upgraded to Clio.


_____________________________________________
From: Tim Gardner
Sent: Wednesday, December 13, 2006 7:43 PM
To: Sandrine Boulanger; Ed Kwan; Charissa Willard; Paul Hammer
Cc: Kevin Matthews; Brian Baker; Eric Barrett; Vikas Saini; dl-Clio
Subject: RE: Updates

/mnt/usr/local/agile/lib/libucdagent.so
/mnt/usr/local/agile/lib/libucdmibs.so

These files are both used by snmp. We suspect that they may be the cause
of the snmp problem.
We need to do another upgrade to dogfood. But prior to the upgrade, we
should reinstall just the libraries that have checksum differences and
try to start snmp to verify that the libraries are the cause of the
problem.

The next issue will be to try to understand why system upgrade either
munged the libraries or simply failed to upgrade them. Before
overwriting them we should see if their checksums match the checksums
of the libraries from the previous release that was on dogfood.
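The munged-versus-not-upgraded question can be framed as a three-way comparison. The sketch below is illustrative only: `classify` and its return labels are hypothetical, the three paths are whatever copies of the installed, new-release, and previous-release library are at hand, and SHA-256 stands in for cksum since only equality matters.

```python
import hashlib

def digest(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def classify(installed, new_release, previous_release):
    """Say which release, if either, the installed file matches."""
    d = digest(installed)
    if d == digest(new_release):
        return "matches-new-release"       # upgrade wrote the file correctly
    if d == digest(previous_release):
        return "matches-previous-release"  # upgrade failed to replace the file
    return "matches-neither"               # file was munged somewhere along the way
```

Running this before overwriting the libraries preserves the answer even after the bad copies are gone.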


_____________________________________________
From: Sandrine Boulanger
Sent: Wednesday, December 13, 2006 6:56 PM
To: Sandrine Boulanger; Ed Kwan; Charissa Willard; Paul Hammer
Cc: Kevin Matthews; Brian Baker; Eric Barrett; Vikas Saini; dl-Clio
Subject: RE: Updates

I copied the upgrade log (which was on the other flash) to
~sandrineb/traces/sys_upgrade.log. It looks like those files were
supposed to be upgraded and they were; check the Installing part. I
don't know why the checksums would not match, then.

_____________________________________________
From: Sandrine Boulanger
Sent: Wednesday, December 13, 2006 6:46 PM
To: Ed Kwan; Charissa Willard; Paul Hammer
Cc: Kevin Matthews; Brian Baker; Eric Barrett; Vikas Saini; dl-Clio
Subject: RE: Updates

I ran a system compare; here are the results. We should get the system
upgrade log file on Dogfood.

The following files are different:
/mnt/etc/newsyslog.conf => normal, always different
/mnt/usr/local/agile/lib/libucdagent.so
/mnt/usr/local/agile/lib/libucdmibs.so
/mnt/usr/local/agile/etc/pmtab => scary, did someone modify that one?
/mnt/usr/local/agile/web/js/console/login.js
/mnt/usr/local/agile/web/js/console/view.js
/mnt/usr/local/agile/web/js/domain/util.js
/mnt/var/log/sendmail.st
/mnt/version => normal, always different
Dogfood diag>

_____________________________________________
From: Ed Kwan
Sent: Wednesday, December 13, 2006 6:44 PM
To: Charissa Willard; Paul Hammer
Cc: Kevin Matthews; Brian Baker; Eric Barrett; Vikas Saini; dl-Clio
Subject: RE: Updates

Checksums of some libraries on dogfood don't match the release, e.g.

Dogfood# cksum libucdmibs.so
2089852042 3278706 libucdmibs.so

[edk@edk-linux lib]$ pwd
/n/build-trees/R2.1.0.0/R2.1.0.0-120906/nfx-tree/Build/ch/opt/lib
[edk@edk-linux lib]$ cksum libucdmibs.so
4212168562 3278706 libucdmibs.so

Sandrine is running a "system compare" on dogfood.

_____________________________________________
From: Charissa Willard
Sent: Wednesday, December 13, 2006 1:20 PM
To: Paul Hammer
Cc: Kevin Matthews; Brian Baker; Ed Kwan; Eric Barrett; Vikas Saini;
dl-Clio
Subject: RE: Updates

Paul,

Ed's been looking into 16471 on MD. We still haven't been able to get a
stack trace.

-Charissa

===== State: Assigned by:edk at 12/12/2006 7:01:15 PM =====

Ran the unstripped version of snmpd in gdb on dogfood, and still didn't
get a good stack trace:

(gdb) bt
#0 0x5ffe31a0 in ?? ()
warning: Hit heuristic-fence-post without finding
warning: enclosing function for address 0x5ffe46c0

I set breakpoints at main, __main, __start and __init, and snmpd got
SIGSEGV before hitting the break points.

Ran "readelf snmpd" on my Linux box, and the executable instructions in
snmpd don't map to the pc (0x5ffe31a0).
There are 15 shared libraries used by snmpd, but gdb says they are not
even loaded at the time:

(gdb) info sharedlibrary
No shared libraries loaded at this time.

I played with ldd and the "LD_TRACE_LOADED_OBJECTS" environment
variable, but I can't get any useful info.

May need to compile all the libraries with "-g" and load them on
dogfood...

_____________________________________________
From: Paul Hammer
Sent: Wednesday, December 13, 2006 1:14 PM
To: Eric Barrett; Vikas Saini; dl-Clio
Cc: Kevin Matthews; Brian Baker
Subject: RE: Updates

This one should have been submitted separately against 2.1; we need core
Dev to have a look at this one in Clio. Agree that it looks similar to
the one at Shopzilla. However, the version at the customer did not core
on MD, nor did it leave a core file; it appears that now, with 2.1, we
have a core.


_____________________________________________
From: Eric Barrett
Sent: Wednesday, December 13, 2006 1:10 PM
To: Vikas Saini; Paul Hammer; dl-Clio
Cc: Kevin Matthews; Brian Baker
Subject: RE: Updates

snmpd issue is already filed as 16471.

_____________________________________________
From: Vikas Saini
Sent: Wednesday, December 13, 2006 1:08 PM
To: Paul Hammer; dl-Clio
Cc: Eric Barrett; Kevin Matthews; Brian Baker
Subject: RE: Updates
Importance: High

Clio is going fine. First regression is complete. Second regression pass
is going on and should be complete soon. Just sent the Clio Soak
status.

There are still 2 defects in Dev's court that need resolution.

QA has verified all the defects which were targeted for Clio. Right now
we are verifying the defects which were not targeted for Clio but got
resolved in Clio.

I haven't seen any defect on the snmpd issue. Will file one soon. Yes,
that should be resolved.

Thanks
Vikas


_____________________________________________
From: Paul Hammer
Sent: Wednesday, December 13, 2006 1:01 PM
To: dl-Clio
Cc: Eric Barrett; Kevin Matthews; Brian Baker
Subject: Updates

Any updates on Clio status?

Soak Updates? Uptime numbers?

Number of defects in Dev that we need to address still in Clio?

Defects in QA that we need to resolve?

I have not seen any update on the Defect found with SNMP on MD, assume
this is MF for Clio too.

-Paul



