Date: Tue, 30 Oct 2007 15:17:29 -0700
From: Andrew Sharp <andy.sharp@onstor.com>
To: "Mike Lee" <mike.lee@onstor.com>
Cc: "Vikas Saini" <vikas.saini@onstor.com>, "Tim Gardner"
 <tim.gardner@onstor.com>
Subject: Re: cluster db prob
Message-ID: <20071030151729.69272a2a@ripper.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E030E397C@onstor-exch02.onstor.net>
References: <BB375AF679D4A34E9CA8DFA650E2B04E030E397C@onstor-exch02.onstor.net>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

On Tue, 30 Oct 2007 14:59:08 -0700 "Mike Lee" <mike.lee@onstor.com>
wrote:

> Andy:
> The message of interest that suggests a panic is highlighted in red.
> Thanks.
> -Mike
> 
> >  -----Original Message-----
> > From: 	Vikas Saini  
> > Sent:	Monday, October 29, 2007 2:02 PM
> > To:	Mike Lee
> > Subject:	RE: cluster db prob
> > 
> > Starting NIS services: ypbindstart-stop-daemon: nothing in /proc -
> > not mounted? (Success)

/proc is not mounted, most likely because the root filesystem is mounted
read-only (RO).  This must be fixed before anything else will work.
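For reference, here's a rough sketch of how to confirm and recover from
that state from a shell on the box.  The helper function is mine, not an
ONStor tool, and the recovery commands are standard Linux, assuming the
usual mount layout; since /proc isn't mounted, check /etc/mtab rather
than /proc/mounts:

```shell
# Rough sketch (not an ONStor tool): given one /etc/mtab-style line,
# report whether the root filesystem is mounted read-only.
is_root_ro() {
  # $1: one mount-table line, e.g. "/dev/root / ext3 ro 0 0"
  echo "$1" | awk '$2 == "/" {
      n = split($4, opts, ",")
      for (i = 1; i <= n; i++)
          if (opts[i] == "ro") { print "ro"; exit }
      print "rw"
  }'
}

# Typical recovery once confirmed (run as root):
#   mount -o remount,rw /      # remount root read-write
#   mount -t proc proc /proc   # then /proc can be mounted
```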

> > binding to YP server...................................fp0:
> > [dp_proxy_init]: pass 1
> > tx0: [dp_proxy_init]: pass 1
> > tx0: pte_setup, region = 4, phys_offset = 0x58000000, virt =
> > 0x4000000000, psize = 28000000, vsize = 28000000, pgsize = 8000000,
> > readwrite_size = 0x28000000
> > fp0: pte_setup, region = 2, phys_offset = 0x30000000, virt =
> > 0x2000000000, psize = 50000000, vsize = 50000000, pgsize = 8000000,
> > readwrite_size = 0x50000000
> > fp0: pte_setup, region = 6, phys_offset = 0x30000000, virt =
> > 0x6000000000, psize = 50000000, vsize = 50000000, pgsize = 8000000,
> > readwrite_size = 0x50000000
> > fp0: Zeroing the WKA area, 0x73000000
> > tx0: pte_setup, region = 7, phys_offset = 0x58000000, virt =
> > 0x7000000000, psize = 28000000, vsize = 28000000, pgsize = 8000000,
> > readwrite_size = 0x28000000
> > tx0: Zeroing the WKA area, 0x73000000
> > ....tx0: unusable buffer, 0x200ffffc00, phys_base = cffffc00,
> > phys_end = 100000300
> > ....failed (backgrounded).

Root mounted RO.

> > .
> > fp0: [dp_proxy_init]: pass 2
> > fp0: sb1250dm_initModule: sb1250dmCfg[0]=0x1000507e00
> > fp0: sb1250dm_initModule: sb1250dmCfg[1]=0x1000507c80
> > fp0: sb1250dm_initModule: sb1250dmCfg[2]=0x1000507b00
> > fp0: sb1250dm_initModule: sb1250dmCfg[3]=0x1000507980
> > fp0: esm_stackInit(): Enter
> > fp0: esm_stackInit(): Leave
> > fp0: bmc12500_install_allocators: install RUNTIME allocators
> > fp0: bmc12500Eth_initModule w/AUTO TX Retrans enabled
> > fp0: bcm12500Eth_initModule SEQ RRX Retrans enabled
> > fp0: isModel2280: Model 2220/40/60
> > fp0: rmc: rmc_init(): RMC version 2.0.2 - myslot[1] mycpu[2]
> > myapp_id[79]
> > fp0: fs_initModule
> > Starting MTA:open: Read-only file system

Ditto

> > touch: cannot touch `/var/lib/exim4/config.autogenerated.tmp':
> > Read-only file system
> > chown: cannot access `/var/lib/exim4/config.autogenerated.tmp': No
> > such file or directory
> > chown: changing ownership of `/var/lib/exim4/config.autogenerated':
> > Read-only file system
> > chmod: cannot access `/var/lib/exim4/config.autogenerated.tmp': No
> > such file or directory
> > chmod: changing permissions of
> > `/var/lib/exim4/config.autogenerated': Read-only file system
> > /usr/sbin/update-exim4.conf: line 286: cannot create temp file for
> > here document: Read-only file system
> > /usr/sbin/update-exim4.conf: line 435:
> > /var/lib/exim4/config.autogenerated.tmp: Read-only file system
> > 2013-03-04 08:29:16 Cannot open main log file
> > "/var/log/exim4/mainlog": Read-only file system: euid=0 egid=0
> > 2013-03-04 08:29:16 non-existent configuration file(s):
> > /var/lib/exim4/config.autogenerated.tmp
> > 2013-03-04 08:29:16 Cannot open main log file
> > "/var/log/exim4/mainlog": Read-only file system: euid=0 egid=0
> > exim[756]: 2013-03-04 08:29:16 non-existent configuration file(s):
> > /var/lib/exim4/config.autogenerated.tmp
> > exim[756]: 2013-03-04 08:29:16 Cannot open main log file
> > "/var/log/exim4/mainlog": Read-only file system: euid=0 egid=0
> > exim[756]: exim: could not open panic log - aborting: see message(s)
> > above
> > exim: could not open panic log - aborting: see message(s) above
> > Invalid new configfile /var/lib/exim4/config.autogenerated.tmp
> > not installing /var/lib/exim4/config.autogenerated.tmp to 
> > /var/lib/exim4/config.autogenerated
> > tx0: ! phyid 11 reg 0 wrote 1340 got 1140
> > tx0: gt_mii_writeSMI: write verify failed! phyid 12 reg 0 wrote 1340
> > got 1140
> > * Not starting internet superserver: no services enabled.
> > tx0: gt_mii_writeSMI: write verify failed! phyid 13 reg 0 wrote 1340
> > got 1140
> > tx0: Initializing profiler _start@0xffffffff83000000
> > _end@0xffffffff83754b90 textsize 7687056.
> > tx0: altcpu_start(1, 0xffffffff834489dc)
> > tx1: ECC exception handler Initialized
> > tx1: handler already registered for ipl = 3 (0xffffffff834459d8)
> > tx1: handler already registered for ipl = 2 (0xffffffff834459d8)
> > tx0: writing 0xffffffff8547bfe8 to 0xffffffff808af3d8
> > tx0: writing 0xffffffff839a3c70 to 0xffffffff808af3f8
> > tx0: writing 0x1 to 0xffffffff808af418
> > tx0: writing 0xffffffff834489dc to 0xffffffff808af3b8
> > tx1: rmc: rmc_init(): RMC version 2.0.2 - myslot[1] mycpu[1]
> > myapp_id[79]
> > tx1: sb1250dm_openChnl: sb1250dmCfg[2]=0x1001d6fc00
> > tx1: Initializing profiler _start@0xffffffff83000000
> > _end@0xffffffff83754b90 textsize 7687056.
> > Starting OpenBSD Secure Shell server: sshdfp0: dump ra sm create
> > fp0: fs_initModule done on slot 1 cpu 2
> > fp0: efs_sscAppId: 130, nfsAppId: 20, mcpuAppId: 67
> > fp0: sb1250dm_openChnl: sb1250dmCfg[0]=0x1000507e00
> > fp0: sb1250dm_openChnl: sb1250dmCfg[1]=0x1000507c80
> > fp0: tpl_fp_init: init completeInitializing profiler
> > _start@0xffffffff83000000 _end@0xffffffff83727800 textsize 7501824.
> > tx0: TXRX0:1 > 2: Port : fp1.0 is DOWN
> > tx0: 
> > tx0: 3: luc_link_down: 359: lport fp1.0 DOWN.
> > tx0: 
> > tx0: 4: luc_link_down:381: Spurious link down notification. Port
> > fp1.0.
> > tx0: 
> > tx0: 5: Port : fp1.1 is DOWN
> > tx0: 
> > tx0: 6: luc_link_down: 359: lport fp1.1 DOWN.
> > tx0: 
> > tx0: 7: luc_link_down:381: Spurious link down notification. Port
> > fp1.1.
> > tx0: 
> > fp0: core_init_buffers: total buffers = 504
> > fp0: altcpu_start(1, 0xffffffff83401b20)
> > fp0: writing 0xffffffff86205fe8 to 0xffffffff808af3d8
> > fp0: writing 0xffffffff83758f90 to 0xffffffff808af3f8
> > fp0: writing 0x1 to 0xffffffff808af418
> > fp0: writing 0xffffffff83401b20 to 0xffffffff808af3b8
> > fp1: ECC exception handler Initialized
> > fp1: handler already registered for ipl = 3 (0xffffffff833fe7a8)
> > fp1: handler already registered for ipl = 2 (0xffffffff833fe7a8)
> > fp1: Initializing profiler _start@0xffffffff83000000
> > _end@0xffffffff83727800 textsize 7501824.
> > tx0: 8: Port : fp1.2 is DOWN
> > tx0: 
> > tx0: 9: luc_link_down: 359: lport fp1.2 DOWN.
> > tx0: 
> > tx0: 10: luc_link_down:381: Spurious link down notification. Port
> > fp1.2.
> > tx0: 
> > tx0: 11: Port : fp1.3 is DOWN
> > tx0: 
> > tx0: 12: luc_link_down: 359: lport fp1.3 DOWN.
> > tx0: 
> > tx0: 13: luc_link_down:381: Spurious link down notification. Port
> > fp1.3.
> > tx0: 
> > .
> > Starting NFS common utilities: statdstart-stop-daemon: nothing in
> > /proc - not mounted? (Success)
> > .
> > Starting NTP server: ntpd.
> > Starting deferred execution scheduler: atd.
> > Starting periodic command scheduler: crond/usr/sbin/cron: can't open
> > or create /var/run/crond.pid: Read-only file system
> >  failed!
> > Starting ONStor services: mgmtbusstart-stop-daemon: nothing
> > in /proc - not mounted? (Success)
> > .
> > fp0: FP0:1 > warning: rmc_pm_handle_failure(): sess
> > {unknown_app:pm.0.0} down
> > fp0: rmc_context_rm_sess: removing listen session[0x100050a000] from
> > list, flags[c2020x].
> > tx1: TXRX1:1 > warning: rmc_pm_handle_failure(): sess
> > {unknown_app:pm.0.0} down
> > tx1: rmc_context_rm_sess: removing listen session[0x100512f6c0] from
> > list, flags[c2020x].
> > fp0: warning: rmc_pm_handle_failure(): sess {unknown_app:pm.0.0}
> > down fp0: rmc_context_rm_sess: removing listen
> > session[0x100050a000] from list, flags[c2020x].
> > tx1: warning: rmc_pm_handle_failure(): sess {unknown_app:pm.0.0}
> > down tx1: rmc_context_rm_sess: removing listen
> > session[0x100512f6c0] from list, flags[c2020x].
> > fp0: warning: rmc_pm_handle_failure(): sess {unknown_app:pm.0.0}
> > down fp0: rmc_context_rm_sess: removing listen
> > session[0x100050a000] from list, flags[c2020x].
> > tx1: warning: rmc_pm_handle_failure(): sess {unknown_app:pm.0.0}
> > down tx1: rmc_context_rm_sess: removing listen
> > session[0x100512f6c0] from list, flags[c2020x].
> > INIT: Id "T0" respawning too fast: disabled for 5 minutes
> > INIT: no more processes left in this runlevel

Well, I don't see any kernel panic.  But it does look like the console
getty can't start, probably because /dev can't be mounted, again because
of the aforementioned root filesystem not being mounted RW.
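For context on that INIT message: "T0" is the id field of an
/etc/inittab entry.  Something like the following (illustrative only,
not the actual bobcat inittab) is what init keeps trying to respawn;
when the process exits immediately, e.g. because its tty device is
missing, init disables the entry for 5 minutes, which matches the
message in the log:

```
# id:runlevels:action:process -- a serial-console getty entry
T0:23:respawn:/sbin/getty -L ttyS0 9600 vt100
```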

> > 
> > _____________________________________________
> > From: Mike Lee 
> > Sent: Monday, October 29, 2007 1:51 PM
> > To: Vikas Saini
> > Subject: RE: cluster db prob
> > 
> > Vikas: 
> > Tim is asking me to investigate if it turns out to be a bug.
> > So, please keep me posted.  Thanks.
> > -Mike
> > 
> >  -----Original Message-----
> > From: 	Vikas Saini  
> > Sent:	Monday, October 29, 2007 1:49 PM
> > To:	Mike Lee
> > Subject:	RE: cluster db prob
> > 
> > It is a clustering problem... here is the email which I sent earlier
> > to these guys...... I am setting up my env to reproduce the
> > problem... What details are you looking for? I can collect them in
> > case I can reproduce it.
> > 
> > 
> > Clustering is behaving in a very unreliable manner in bobcat Linux.
> > Here is what I have tried since last Thursday and the output I got.
> > 
> > Eng63 is the PCC and it has around 5-6 vsvr and 5-6 volumes.
> > 
> > I got another system, eng60. Both eng63 and eng60 were on the same
> > bobcat Linux build, which I believe is sub5.
> > 
> > I added eng60 to the eng63 cluster (cluster add followed by
> > cluster commit); it didn't work, and I kept getting ubik- and
> > cluster-related error messages on both eng63 and eng60.
> > 
> > I rebooted eng60... no change in behavior
> > 
> > I rebooted both eng63 and eng60, and still the same problem.
> > 
> > I removed eng60 from the eng63 cluster (cluster delete followed by
> > cluster commit); it rebooted eng60 and removed it from the cluster,
> > but I also lost my clusterdb on eng63. All the vsvr information and
> > domain information is gone...
> > 
> > 
> > Anyway, I tried adding eng60 back to the eng63 cluster and it
> > worked fine. I tried 2 times and both times it was OK.
> > 
> > After 2 successful attempts, I tried to remove eng60 from the eng63
> > cluster, and now eng60 is in a weird state where it never came up
> > after reboot. I am getting the following error messages:
> > 
> > fp0: FP0:1 > warning: rmc_pm_handle_failure(): sess
> > {unknown_app:pm.0.0} down
> > tx1: TXRX1:1 > warning: rmc_pm_handle_failure(): sess
> > {unknown_app:pm.0.0} down
> > fp0: warning: rmc_pm_handle_failure(): sess {unknown_app:pm.0.0}
> > down tx1: warning: rmc_pm_handle_failure(): sess
> > {unknown_app:pm.0.0} down fp0: warning: rmc_pm_handle_failure():
> > sess {unknown_app:pm.0.0} down tx1: warning:
> > rmc_pm_handle_failure(): sess {unknown_app:pm.0.0} down INIT: Id
> > "T0" respawning too fast: disabled for 5 minutes INIT: no more
> > processes left in this runlevel fp0: warning:
> > rmc_pm_handle_failure(): sess {unknown_app:pm.0.0} down tx1:
> > warning: rmc_pm_handle_failure(): sess {unknown_app:pm.0.0} down
> > fp0: warning: rmc_pm_handle_failure(): sess {unknown_app:pm.0.0}
> > down
> > 
> > 
> > 
> > Also, on eng63, since its cluster DB was gone, I tried to restore
> > the clusterDB by copying it from a previous SGA; after copying the
> > clusterDB, when I rebooted, even eng63 went into a hung state with
> > the same error messages described above.
> > 
> > I am going to open a generic defect for clustering and copy all the
> > info, but it looks like we might have quite a few issues to resolve
> > before clustering works in bobcat Linux.
> > 
> > 
> > 
> > _____________________________________________
> > From: Mike Lee 
> > Sent: Monday, October 29, 2007 1:45 PM
> > To: Vikas Saini
> > Subject: cluster db prob
> > 
> > Vikas:
> > Tim mentioned you found a cluster db problem today on bobcat-linux.
> > Can you please tell me where I can find details?
> > Thanks.
> > -Mike
