AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:<20080310121547.5a6091b5@ripper.onstor.net>
CFG:
PT:0
S:andy.sharp@onstor.com
RQ:
SSV:onstor-exch02.onstor.net
NSV:
SSH:
R:<raj.kumar@onstor.com>,<chris.vandever@onstor.com>,<larry.scheer@onstor.com>
MAID:1
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#imap/andys@onstor.net@onstor-exch02.onstor.net/INBOX	0	BB375AF679D4A34E9CA8DFA650E2B04E08C0FA4F@onstor-exch02.onstor.net
X-Sylpheed-End-Special-Headers: 1
Date: Mon, 10 Mar 2008 12:15:52 -0700
From: Andrew Sharp <andy.sharp@onstor.com>
To: "Raj Kumar" <raj.kumar@onstor.com>
Cc: "Chris Vandever" <chris.vandever@onstor.com>, "Larry Scheer"
 <larry.scheer@onstor.com>
Subject: Re: system config reset
Message-ID: <20080310121552.58b53407@ripper.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E08C0FA4F@onstor-exch02.onstor.net>
References: <BB375AF679D4A34E9CA8DFA650E2B04E03E9A693@onstor-exch02.onstor.net>
	<BB375AF679D4A34E9CA8DFA650E2B04E08C0FA4F@onstor-exch02.onstor.net>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Cougar doesn't currently use FTI, so you're saved.

I don't think you committed your changes during the initial config
menus.  You can log your telnet session using the screen command in
order to capture everything.  If you're not familiar, stop by and I'll
fill you in.
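
Once you have a captured session, a rough way to check whether a commit
happened before the exit is sketched here. It assumes the log contains
prompts shaped like "Enter Selection: 7", as in the menu paste further down
this thread, and that 4 is "Commit Changes" and 7 is "Exit" as shown there.
The function names are made up for illustration. If the submenus reuse the
same prompt with different numbering, this will give false answers, so treat
it as a first pass only.

```python
import re

# Assumption: prompts in the captured session look like "Enter Selection: 7".
SELECTION_RE = re.compile(r"Enter Selection:\s*(\d+)")

COMMIT_OPTION = 4  # "4. Commit Changes" in the pasted menu
EXIT_OPTION = 7    # "7. Exit" in the pasted menu


def selections(log_text):
    """Return the menu selections found in a captured session, in order."""
    return [int(m.group(1)) for m in SELECTION_RE.finditer(log_text)]


def committed_before_exit(log_text):
    """True if a commit was selected somewhere before the last exit."""
    picks = selections(log_text)
    if EXIT_OPTION not in picks:
        return False  # the session never reached an exit selection
    last_exit = len(picks) - 1 - picks[::-1].index(EXIT_OPTION)
    return COMMIT_OPTION in picks[:last_exit]
```

Run it over the file that screen writes; a False result would be consistent
with the changes never having been committed.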

Cheers,

a

PS: Meanwhile, I'll try it on my cougar to see what happens.
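
For comparing pmtab against what ps shows, something along these lines may
help. It is only a sketch: it assumes pmtab lines look like
"initwait: /onstor/bin/elog" (the only form visible in this thread), treats
lines starting with "#" as comments (an assumption), and expects ps ax
output with the usual PID TTY STAT TIME COMMAND columns. The function names
are hypothetical.

```python
def pmtab_programs(pmtab_text):
    """Return program paths from pmtab lines of the form 'action: /path'."""
    programs = []
    for line in pmtab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # "#" comments are an assumption
            continue
        _action, sep, rest = line.partition(":")
        if sep and rest.strip():
            programs.append(rest.strip().split()[0])
    return programs


def running_commands(ps_text):
    """Return every token seen in the COMMAND column of `ps ax` output."""
    tokens = set()
    for line in ps_text.splitlines():
        fields = line.split(None, 4)  # PID TTY STAT TIME COMMAND...
        if len(fields) == 5 and fields[0].isdigit():
            tokens.update(fields[4].split())
    return tokens


def not_running(pmtab_text, ps_text):
    """List pmtab programs that do not appear in the ps output."""
    running = running_commands(ps_text)
    return [p for p in pmtab_programs(pmtab_text) if p not in running]
```

An empty list from not_running() means everything pmtab names is up, which
would point at pmtab itself being short rather than apps failing to start.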


On Mon, 10 Mar 2008 12:09:35 -0700 "Raj Kumar" <raj.kumar@onstor.com>
wrote:

> I assume that's why clusterDB and cluster.conf are not created since
> the cluster services didn't start?
> 
> To make progress, should I just copy the pmtab and start pm?
> 
> _____________________________________________
> From: Chris Vandever 
> Sent: Monday, March 10, 2008 12:07 PM
> To: Raj Kumar; Larry Scheer; Andy Sharp
> Subject: RE: system config reset
> 
> Actually, it does.  That went in with the FTI changes so we don't
> start the bulk of the apps until we're configured.  We just start
> what's needed for nfxsh to run so we can do the config.  That
> explains the missing log entries (the apps weren't started because
> they weren't in pmtab), and it explains why we didn't see the message
> I expected from ClusterCtrl_Init().
> 
> ChrisV
> 
> _____________________________________________
> From: Raj Kumar 
> Sent: Monday, March 10, 2008 12:04 PM
> To: Chris Vandever; Larry Scheer; Andy Sharp
> Subject: RE: system config reset
> 
> Does pmtab get wiped out during config reset? I didn't think so.
> 
> g11r10:~# cat /onstor/etc/pmtab
> initwait: /onstor/bin/elog
> initwait: /onstor/bin/sscccc
> 
> g11r10:~# 
> g11r10:~# ps ax | grep onstor
>  4403 ?        Ss     0:00 /onstor/bin/sshd
>  4591 ?        Ss     0:00 /onstor/bin/pm
>  4603 ?        S      0:12 /onstor/bin/elog
>  4612 ?        S      0:00 /onstor/bin/sscccc
>  7025 ?        Ss     0:00 /bin/sh /onstor/bin/emrscron -g stats
>  7335 ?        S      0:00 /bin/sh /onstor/bin/support.sh -e
> nfxsh_connect  -g stats -s --
>  9325 ?        Ss     0:00 /bin/sh /onstor/bin/emrscron -g h_res_stats
>  9357 pts/0    R+     0:00 grep onstor
> g11r10:~#
> 
> _____________________________________________
> From: Chris Vandever 
> Sent: Monday, March 10, 2008 12:01 PM
> To: Raj Kumar; Larry Scheer; Andy Sharp
> Subject: RE: system config reset
> 
> What apps does ps show running?
> 
> Based on the elog, clustering hasn't even started (but as evidenced by
> the second reboot, there are more messages missing from the elog than
> IN the elog).  The messages that look like they're from clustering are
> actually from libcluster being called by an app that starts prior to
> clustering.  I see the 'system config reset' at 10:24 with the initial
> reboot at 10:28.  The only app that made it to the log is elog:
> 
> Mar 10 10:24:37 g11r10 : 0:0:eventd:CRITICAL: Process-EVENT Node: Name
> 'local', State Down, Msg 'Node going down for reboot! ('system config
> reset' issued from nfxsh).'
> Mar 10 10:28:10 g11r10 pm: /onstor/bin/elog: finished initialization.
> Mar 10 10:30:20 g11r10 : 0:0:cluster2:ERROR: Cluster_RetrieveConfig:
> Cluster cfg file /onstor/conf/cluster.conf missing or corrupted or
> node intentionally removed from cluster, defaulting to standalone
> mode, err 0
> 
> There's another boot at 10:35, and based on the subsequent reboot we
> made it at least as far as sscccc (which is at the end of pmtab just
> before sendmail):
> 
> Mar 10 10:35:28 g11r10 pm: /onstor/bin/elog: finished initialization.
> Mar 10 10:36:28 g11r10 : 0:0:cluster2:ERROR: Cluster_RetrieveConfig:
> Cluster cfg file /onstor/conf/cluster.conf missing or corrupted or
> node intentionally removed from cluster, defaulting to standalone
> mode, err 0
> Mar 10 10:38:00 g11r10 pm: pm_terminate: child 1732
> (/onstor/bin/sscccc) terminated
> Mar 10 10:38:01 g11r10 pm: pm_terminate: child 1713 (/onstor/bin/elog)
> terminated
> 
> ChrisV
> _____________________________________________
> From: Raj Kumar 
> Sent: Monday, March 10, 2008 11:41 AM
> To: Chris Vandever; Larry Scheer; dl-Cougar
> Subject: RE: system config reset
> 
> Yes, those messages are in elog after the 2nd attempt of config reset
> (didn't see them after the last attempt though). Elogs at
> /n/newcorevol/defect_22743
> 
> From /etc/onstor/initial-config option 3:
> 
> Current Settings:
>    Node Name: g11r10
>    Date & Time: Mon Mar 10 11:40:47 PDT 2008
>    Network Settings:
>       Mgmt port 1 IP: 10.2.10.11 NETMASK: 255.255.0.0
>       Mgmt port 2 IP: address NETMASK: netmask
>       Current default route: 10.2.0.1
> 
> 
> Pending changes:
> 
> Press 'Enter' to continue...
> 
> _____________________________________________
> From: Chris Vandever 
> Sent: Monday, March 10, 2008 11:37 AM
> To: Raj Kumar; Larry Scheer; dl-Cougar
> Subject: RE: system config reset
> 
> Can I get the full elogs?  They should contain a message like the
> following:
> 
> Cluster_RetrieveConfig: Cluster cfg file cluster.conf missing or
> corrupted or node intentionally removed from cluster, defaulting to
> standalone mode, err 0
> 
> When cluster_contrl starts it should create the missing cluster.conf
> file UNLESS it is unable to get an IP address for the local host.
> Then, it will complain:
> 
> ClusterCtrl_InitUbik: fail to find any IP address
> 
> And it will exit.  So, the question is, what happened to the IP
> address?
> 
> ChrisV
> 
> _____________________________________________
> From: Raj Kumar 
> Sent: Monday, March 10, 2008 11:24 AM
> To: Larry Scheer; dl-Cougar
> Subject: RE: system config reset
> 
> I couldn't cut and paste all those screens but I did set all those
> before exiting the script.
> 
> _____________________________________________
> From: Larry Scheer 
> Sent: Monday, March 10, 2008 11:23 AM
> To: Raj Kumar; dl-Cougar
> Subject: RE: system config reset
> 
> Going strictly by the information you provided it is because you reset
> the configuration and exited the configuration script without setting
> any configuration information. You have no IP address, hostname,
> default route, etc.
> 
> _____________________________________________
> From: Raj Kumar 
> Sent: Monday, March 10, 2008 11:00 AM
> To: dl-Cougar
> Subject: FW: system config reset
> 
> Any idea?
> 
> _____________________________________________
> From: Raj Kumar 
> Sent: Monday, March 10, 2008 10:51 AM
> To: dl-QA
> Subject: system config reset
> 
> Hi,
> 
> I did a config reset on cougar soak g11r10 (already tried twice). After
> the reset, the filer's services don't come up because the cluster DB and
> cluster.conf are missing. Any ideas?
> 
> g11r10:~# ls -l /onstor/conf/     
> total 1433
> -rw-r--r-- 1 root root 693561 Feb 19 20:43 R4.0.0.0-021908.bom
> -rw-r--r-- 1 root root 693237 Feb 14 14:41 R4.0.0.0DBG-021408.bom
> lrwxrwxrwx 1 root root     19 Feb 20 13:11 current.bom ->
> R4.0.0.0-021908.bom
> -rw-r--r-- 1 root root   2046 Feb  6 16:05 emrs_client.pem
> -rw-r--r-- 1 root root   1363 Feb  6 16:05 emrs_server.crt
> drwx------ 2 root root  12288 Feb  7 20:00 lost+found
> lrwxrwxrwx 1 root root     22 Feb 20 13:06 previous.bom ->
> R4.0.0.0DBG-021408.bom
> -rw-r--r-- 1 root root  53742 Feb  6 16:05 sdm-devcap
> g11r10:~#
> 
>      1. Configure Administrative Settings
> 
> 
>      2. Configure Network Settings
> 
> 
>      3. Display Current Settings
> 
> 
>      4. Commit Changes
> 
> 
>      5. Help
> 
> 
>      6. Copy Configuration Files From Secondary Flash
> 
> 
>      7. Exit
> 
> 
>     Enter Selection: 7
> 
> Value Entered is 7
> 
> .
> Setting up networking....
> Configuring network interfaces...SIOCADDRT: Network is unreachable
> run-parts: /etc/network/if-up.d/addroutes exited with return code 7
> address: Host name lookup failure
> ifconfig: `--help' gives usage information.
> Failed to bring up eth1.
> done.
> Starting portmap daemon....
> INIT: Entering runlevel: 2
> Starting system log daemon: syslogd.
> Starting kernel log daemon: klogd.
> Starting portmap daemon...Already running..
> Starting automounter: loading autofs4 kernel module, no automount maps
> defined.
> Setting NIS domainname to: NASgateway.
> Starting NIS services: ypserv yppasswdd ypxfrd ypbind.
> Starting MTA: exim4.
> * ALERT: exim paniclog /var/log/exim4/paniclog has non-zero size, mail
> system possibly broken
> Starting internet superserver: inetd.
> Starting OpenBSD Secure Shell server: sshd.
> Starting NFS common utilities: statd.
> Starting NTP server: ntpd.
> Starting deferred execution scheduler: atd.
> Starting periodic command scheduler: crond.
> Starting ONStor services: mgmtbus/onstor/bin/emrscron -f 
>  pm.
> 
> OnStor GNU/Linux 4.0 g11r10 duart0
> 
> g11r10 login: Mar 10 10:46:46 g11r10 : 0:0:cluster2:ERROR:
> cluster_iUpdateRecordData: no reply bck -1 
> Mar 10 10:46:47 g11r10 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> no reply bck -1 
> Mar 10 10:46:47 g11r10 : 0:0:nfxsh:NOTICE: cmd[0]: elog display
> enable : status[11]
> Mar 10 10:47:36 g11r10 : 0:0:cluster2:ERROR: cluster_iGetRecordData:
> no reply bck -1 
> Mar 10 10:48:09 g11r10 last message repeated 3 times
> Mar 10 10:48:19 g11r10 last message repeated 2 times
> Mar 10 10:48:29 g11r10 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> no reply bck -1 
> Mar 10 10:48:29 g11r10 : 0:0:cluster2:ERROR: cluster_iGetRecordData:
> no reply bck -1 
> Mar 10 10:48:39 g11r10 last message repeated 2 times
> Mar 10 10:48:50 g11r10 : 0:0:cluster2:ERROR: cluster_getRecordIdByKey:
> no reply bck -1 
> Mar 10 10:48:51 g11r10 : 0:0:cluster2:ERROR: cluster_iGetRecordData:
> no reply bck -1 
> Mar 10 10:49:01 g11r10 last message repeated 2 times
> 
> Thanks.
> 
> --kumar :-)
> 
