AF:
NF:0
PS:10
SRH:1
SFN:
DSR:
MID:
CFG:
PT:0
S:andy.sharp@lsi.com
RQ:
SSV:mhbs.lsil.com
NSV:
SSH:
R:<Dave.Limato@lsi.com>,<Mai.Ly@lsi.com>,<Raj.Kumar@lsi.com>,<dl-qa@lsi.com>,<Larry.Scheer@lsi.com>,<Jobi.Ariyamannil@lsi.com>,<Svati.Chandra@lsi.com>
MAID:2
X-Sylpheed-Privacy-System:
X-Sylpheed-Sign:0
SCF:#mh/Mailbox/sent
RMID:#imap/LSI/INBOX	0	D7A889C980962746B30DE07864593C02CF29B5E5@cosmail02.lsi.com
X-Sylpheed-End-Special-Headers: 1
Date: Fri, 26 Mar 2010 16:12:44 -0700
From: Andrew Sharp <andy.sharp@lsi.com>
To: "Limato, Dave" <Dave.Limato@lsi.com>
Cc: "Ly, Mai" <Mai.Ly@lsi.com>, "Kumar, Raj" <Raj.Kumar@lsi.com>,
 DL-ONStor-QA <dl-qa@lsi.com>, "Scheer, Larry" <Larry.Scheer@lsi.com>,
 "Ariyamannil, Jobi" <Jobi.Ariyamannil@lsi.com>, Svati Chandra
 <Svati.Chandra@lsi.com>
Subject: Re: NBU for DMA
Message-ID: <20100326161244.3e348097@ripper.onstor.net>
In-Reply-To: <D7A889C980962746B30DE07864593C02CF29B5E5@cosmail02.lsi.com>
References: <D7A889C980962746B30DE07864593C02CF29B534@cosmail02.lsi.com>
	<0BAA09DBFAD04A4DBB6CE240807CB3B9010C7FE66D@cosmail03.lsi.com>
	<D7A889C980962746B30DE07864593C02CF29B591@cosmail02.lsi.com>
	<0BAA09DBFAD04A4DBB6CE240807CB3B9010C7FE6A4@cosmail03.lsi.com>
	<D7A889C980962746B30DE07864593C02CF29B5A3@cosmail02.lsi.com>
	<0BAA09DBFAD04A4DBB6CE240807CB3B9010C7FE6B9@cosmail03.lsi.com>
	<D7A889C980962746B30DE07864593C02CF29B5B6@cosmail02.lsi.com>
	<0BAA09DBFAD04A4DBB6CE240807CB3B9010C7FE6C5@cosmail03.lsi.com>
	<EEF62483CEBDF841AA70EDD55C39B8B4E9434309@cosmail03.lsi.com>
	<0BAA09DBFAD04A4DBB6CE240807CB3B9010C7FE6E5@cosmail03.lsi.com>
	<EEF62483CEBDF841AA70EDD55C39B8B4E943431E@cosmail03.lsi.com>
	<D7A889C980962746B30DE07864593C02CF29B5E5@cosmail02.lsi.com>
Organization: LSI
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Sounds like another bug needs to be filed!


On Fri, 26 Mar 2010 17:04:39 -0600 "Limato, Dave" <Dave.Limato@lsi.com>
wrote:

> It's failing; let me do this shameful workaround and get back to you.
> 
> From: Ly, Mai
> Sent: Friday, March 26, 2010 4:00 PM
> To: Kumar, Raj; Limato, Dave; DL-ONStor-QA
> Subject: RE: NBU for DMA
> 
> 
> Dave,
> 
> Can you run ndmp status -a? Has the dump started yet?
> 
> From: Kumar, Raj
> Sent: Friday, March 26, 2010 3:52 PM
> To: Ly, Mai; Limato, Dave; DL-ONStor-QA
> Subject: RE: NBU for DMA
> 
> Looks like there is a hung ndmp in MD, possibly from one of the
> earlier failures. This ndmp might still be holding the write lock on
> the tape. To clean it up:
> 
> 
> 1.       Kill all the ndmp sessions in nfxsh (ndmp delete session
> -a); if that doesn't work, kill the processes from the bash shell
> 
> 2.       # kill <tape-driver>
> 
> 3.       Do the robotic inventory from NBU
> 
> 4.       Restart the backup
> 
> 
> 
> Dogfood:/var/log/onstor# ps -ef | grep ndmp
> root      1270   987  0 Mar23 ?        00:00:26 /onstor/bin/ndmp_cfgd
> root     18612  1270  0 Mar25 ?        00:13:29 /onstor/bin/ndmp_cfgd
> root     32379  1270  0 15:49 ?        00:00:00 /onstor/bin/ndmp_cfgd
> root     32394 26690  0 15:49 pts/1    00:00:00 grep ndmp
> 
> Dogfood:/var/log/onstor# ps -ef | grep tape
> root      1269   987  0 Mar23 ?        00:00:37 /onstor/bin/tape-driver
> Dogfood:/var/log/onstor#
> Dogfood:/var/log/onstor#
> 
> From: Ly, Mai
> Sent: Friday, March 26, 2010 3:27 PM
> To: Kumar, Raj; Limato, Dave; DL-ONStor-QA
> Subject: RE: NBU for DMA
> 
> Here is the error Raj
> I think there is some issue with the Tape Library.
> 
> 
> 3/26/2010 2:48:03 PM - requesting resource nbu-prod-1-hcart-robot-tld-0-10.0.0.222
> 3/26/2010 2:48:03 PM - requesting resource nbu-prod-1.NBU_CLIENT.MAXJOBS.10.0.0.222
> 3/26/2010 2:48:03 PM - requesting resource nbu-prod-1.NBU_POLICY.MAXJOBS.POL-1-MSFTP
> 3/26/2010 2:48:03 PM - granted resource nbu-prod-1.NBU_CLIENT.MAXJOBS.10.0.0.222
> 3/26/2010 2:48:03 PM - granted resource nbu-prod-1.NBU_POLICY.MAXJOBS.POL-1-MSFTP
> 3/26/2010 2:48:03 PM - granted resource 0017L3
> 3/26/2010 2:48:03 PM - granted resource HP.ULTRIUM4-SCSI.000
> 3/26/2010 2:48:03 PM - granted resource nbu-prod-1-hcart-robot-tld-0-10.0.0.222
> 3/26/2010 2:48:03 PM - estimated 0 kbytes needed
> 3/26/2010 2:48:04 PM - started process bpbrm (3012)
> 3/26/2010 2:48:04 PM - connecting
> 3/26/2010 2:48:04 PM - connected; connect time: 00:00:00
> 3/26/2010 2:48:06 PM - mounting 0017L3
> 3/26/2010 3:03:52 PM - Error bptm(pid=3140) error requesting media, TpErrno = Robot operation failed
> 3/26/2010 3:03:52 PM - Warning bptm(pid=3140) media id 0017L3 load operation reported an error
> 3/26/2010 3:03:52 PM - current media 0017L3 complete, requesting next media Any
> 3/26/2010 3:07:15 PM - current media -- complete, awaiting next media Any Reason: Robotic library is down on server., Media Server: nbu-prod-1, Robot Number: 0, Robot Type: TLD, Media ID: N/A, Drive Name: N/A, Volume Pool: NetBackup, Storage Unit: nbu-prod-1-hcart-robot-tld-0-10.0.0.222, Drive Scan Host: N/A
> 3/26/2010 3:20:44 PM - Error ndmpagent(pid=3896) terminated by parent process
> 
> From: Kumar, Raj
> Sent: Friday, March 26, 2010 2:55 PM
> To: Limato, Dave; DL-ONStor-QA
> Subject: RE: NBU for DMA
> 
> Ok, this failure is due to NBU, either misconfiguration or no free
> tapes available.
> 
> Mai --> D Girl?
> 
> From: Limato, Dave
> Sent: Friday, March 26, 2010 2:50 PM
> To: Kumar, Raj; DL-ONStor-QA
> Subject: RE: NBU for DMA
> 
> This is it...
> 
> 3/26/2010 11:52:20 AM - requesting resource nbu-prod-1-hcart-robot-tld-0-10.0.0.222
> 3/26/2010 11:52:20 AM - requesting resource nbu-prod-1.NBU_CLIENT.MAXJOBS.10.0.0.222
> 3/26/2010 11:52:20 AM - requesting resource nbu-prod-1.NBU_POLICY.MAXJOBS.POL-1-MSFTP
> 3/26/2010 11:52:20 AM - Error nbjm(pid=484) NBU status: 96, EMM status: No media is available
> unable to allocate new media for backup, storage unit has none available(96)
> 
> It looks like you or DMA-Girl is in and running new backup now.
> 
> Do you guys like the Term DMA-Girl? Or Lady-DMA (Play on Lady-GaGa).
> 
> 
> 
> From: Kumar, Raj
> Sent: Friday, March 26, 2010 2:39 PM
> To: Limato, Dave; DL-ONStor-QA
> Subject: RE: NBU for DMA
> 
> Actually, I was asking for the latest data from the NBU GUI (detailed
> status) under Activity Monitor, like the one below, for the last
> failed backup.
> 
> 3/25/2010 11:50:07 PM - granted resource HP.ULTRIUM4-SCSI.000
> 3/25/2010 11:50:07 PM - granted resource nbu-prod-1-hcart-robot-tld-0-10.0.0.222
> 3/25/2010 11:50:08 PM - mounting 0019L3
> 3/25/2010 11:51:06 PM - mounted; mount time: 00:00:58
> 3/25/2010 11:51:10 PM - positioning 0019L3 to file 1
> 3/25/2010 11:51:15 PM - positioned 0019L3; position time: 00:00:05
> 3/25/2010 11:51:15 PM - begin writing
> client process aborted(50)
> 
> 
> From: Limato, Dave
> Sent: Friday, March 26, 2010 2:32 PM
> To: Kumar, Raj; DL-ONStor-QA
> Subject: RE: NBU for DMA
> 
> \\10.0.0.222\qe\Qualifications\users\davel\NBULogs
> 
> 
> 
> From: Kumar, Raj
> Sent: Friday, March 26, 2010 2:20 PM
> To: Limato, Dave; DL-ONStor-QA
> Subject: RE: NBU for DMA
> 
> Can you send the latest error messages from NBU? Is it still failing?
> The failure below is not due to NBU. It's due to a failure on our
> side.
> 
> 
> 
> From: Limato, Dave
> Sent: Friday, March 26, 2010 2:17 PM
> To: Kumar, Raj; DL-ONStor-QA
> Subject: RE: NBU for DMA
> 
> I know, let me make this more clear.
> 
> Backup failed at 02:33 as per your message.
> 
> Now I can't start the next backup because NBU reports error status 50
> (below).
> 
> So does it think there is a session open? No tape available? NDMPD
> hosed and needs a kill? Reboot?
> 
> ======
> According to NBU, Backup Failed 23:51.
> 
> 3/25/2010 11:50:07 PM - granted resource HP.ULTRIUM4-SCSI.000
> 3/25/2010 11:50:07 PM - granted resource nbu-prod-1-hcart-robot-tld-0-10.0.0.222
> 3/25/2010 11:50:08 PM - mounting 0019L3
> 3/25/2010 11:51:06 PM - mounted; mount time: 00:00:58
> 3/25/2010 11:51:10 PM - positioning 0019L3 to file 1
> 3/25/2010 11:51:15 PM - positioned 0019L3; position time: 00:00:05
> 3/25/2010 11:51:15 PM - begin writing
> client process aborted(50)
> 
> 23:50 Elog shows this. And it still seems to be going.
> 
> Elog:
> Mar 25 23:50:26 Dogfood : 0:0:tape-driver:NOTICE: tape_do_locate: dev 0x4ab700, cookie 0x4b1da8
> Mar 25 23:50:26 Dogfood : 0:0:tape-driver:NOTICE: tape_do_locate: Current Partition 0 block number 1
> Mar 25 23:50:26 Dogfood : 0:0:ssc_ndmp:INFO: Tape opened for Sess: 1269399603 device: NRNU526h, mode: RD_WR.
> Mar 25 23:50:26 Dogfood : 1:3:scsi:NOTICE: 8309: ispfc:sp2.1:1920 SCSI error, No Sense dev[0xcb0501] WWN:500110a0008c218b:0 CS[15] SF[1000] SS[818] ExpDL[8] RDL[8] - cdb 0x5e
> Mar 25 23:50:36 Dogfood : 1:2:efs:INFO: 8310: FS: s_home       0x10200000187 - dumpStart - dump_restore - dump progress[4]: total records: 65938946; dumped: 18910484; remaining: 47028462; estimated time remaining: 19:53:1; percent complete: 28%; throughput: 42384384 bytes sec
> Mar 25 23:50:37 Dogfood : 1:2:efs:INFO: 8311: FS: s_home       0x10200000187 - dumpStart - dump_restore - [4] dump LOG message: Thu Mar 25 23:50:36 2010 dump progress[4]: total records: 65938946; dumped: 18910484; remaining: 47028462; estimated time remaining: 19:53:1; percent complete: 28%; throughput: 42384384 bytes sec
> Mar 25 23:55:36 Dogfood : 1:3:efs:INFO: 8312: FS: s_home       0x10200000187 - dumpStart - dump_restore - dump progress[4]: total records: 65938946; dumped: 19258342; remaining: 46680604; estimated time remaining: 11:11:17; percent complete: 29%; throughput: 74769408 bytes sec
> Mar 25 23:55:37 Dogfood : 1:3:efs:INFO: 8313: FS: s_home       0x10200000187 - dumpStart - dump_restore - [4] dump LOG message: Thu Mar 25 23:55:36 2010 dump progress[4]: total records: 65938946; dumped: 19258342; remaining: 46680604; estimated time remaining: 11:11:17; percent complete: 29%; throughput: 74769408 bytes sec
> Mar 25 23:57:46 Dogfood : 1:0:bsdrl:INFO: 266: arp: 10.0.0.244 moved from 00:0B:DB:A8:D0:EE on fp1.0 unit 4
> Mar 25 23:57:46 Dogfood : 1:0:bsdrl:INFO: 267: arp: 10.0.0.244 moved from 00:07:E9:18:7C:F9 on fp1.0 unit 4
> Mar 26 00:00:02 Dogfood : 1:4:efs:INFO: 8314: FS: nx_corevol   0x10200000163 - snapAdmin - snp - snapshot daily.0 removal initiated
> Mar 26 00:00:36 Dogfood : 1:4:efs:INFO: 8315: FS: s_home       0x10200000187 - dumpStart - dump_restore - dump progress[4]: total records: 65938946; dumped: 19578982; remaining: 46359964; estimated time remaining: 12:3:29; percent complete: 29%; throughput: 68898816 bytes sec
> Mar 26 00:00:37 Dogfood : 1:4:efs:INFO: 8316: FS: s_home       0x10200000187 - dumpStart - dump_restore - [4] dump LOG message: Fri Mar 26 00:00:36 2010 dump progress[4]: total records: 65938946; dumped: 19578982; remaining: 46359964; estimated time remaining: 12:3:29; percent complete: 29%; throughput: 68898816 bytes sec
> Mar 26 00:00:43 Dogfood : 1:3:efs:NOTICE: 8317: FS: nx_corevol   0x10200000163 - snapAdmin - snp - snap remove complete for daily.0 id 8
> Mar 26 00:00:43 Dogfood : 1:4:efs:INFO: 8318: FS: nx_corevol   0x10200000163 - snapAdmin - snp - snapshot daily.0 creation initiated
> Mar 26 00:01:24 Dogfood : 1:4:efs:NOTICE: 8319: FS: nx_corevol   0x10200000163 - snapAdmin - snp - snap create complete for daily.0 id 8
> Mar 26 00:01:24 Dogfood : 1:3:efs:INFO: 8320: FS: s_home       0x10200000187 - snapAdmin - snp - snapshot daily.1 removal initiated
> Mar 26 00:03:17 Dogfood : 1:5:efs:NOTICE: 8321: FS: s_home       0x10200000187 - snapAdmin - snp - snap remove complete for daily.1 id 3
> Mar 26 00:03:17 Dogfood : 1:4:efs:INFO: 8322: FS: s_home       0x10200000187 - snapAdmin - snp - snapshot daily.0 creation initiated
> Mar 26 00:04:55 Dogfood : 1:4:efs:NOTICE: 8323: FS: s_home       0x10200000187 - snapAdmin - snp - snap create complete for daily.0 id 3
> Mar 26 00:04:55 Dogfood : 1:4:efs:INFO: 8324: FS: nx-sysadm    0xa0000018b - snapAdmin - snp - snapshot daily.1 removal initiated
> Mar 26 00:05:37 Dogfood : 1:2:efs:INFO: 8325: FS: s_home       0x10200000187 - dumpStart - dump_restore - dump progress[4]: total records: 65938946; dumped: 19842669; remaining: 46096277; estimated time remaining: 14:35:2; percent complete: 30%; throughput: 56641536 bytes sec
> Mar 26 00:05:37 Dogfood : 1:2:efs:INFO: 8326: FS: s_home       0x10200000187 - dumpStart - dump_restore - [4] dump LOG message: Fri Mar 26 00:05:37 2010 dump progress[4]: total records: 65938946; dumped: 19842669; remaining: 46096277; estimated time remaining: 14:35:2; percent complete: 30%; throughput: 56641536 bytes sec
> Mar 26 00:05:37 Dogfood : 1:2:efs:NOTICE: 8327: FS: nx-sysadm    0xa0000018b - snapAdmin - snp - snap remove complete for daily.1 id 5
> Mar 26 00:05:37 Dogfood : 1:2:efs:INFO: 8328: FS: nx-sysadm    0xa0000018b - snapAdmin - snp - snapshot daily.0 creation initiated
> Mar 26 00:06:07 Dogfood : 1:5:efs:NOTICE: 8329: FS: nx-sysadm    0xa0000018b - snapAdmin - snp - snap create complete for daily.0 id 5
> Mar 26 00:10:37 Dogfood : 1:4:efs:INFO: 8330: FS: s_home       0x10200000187 - dumpStart - dump_restore - dump progress[4]: total records: 65938946; dumped: 20018814; remaining: 45920132; estimated time remaining: 21:43:49; percent complete: 30%; throughput: 37868544 bytes sec
> Mar 26 00:10:37 Dogfood : 1:4:efs:INFO: 8331: FS: s_home       0x10200000187 - dumpStart - dump_restore - [4] dump LOG me
> 
> Then rmc failed as you documented below.
> 
> 
> 
> 
> From: Kumar, Raj
> Sent: Friday, March 26, 2010 1:25 PM
> To: Limato, Dave; DL-ONStor-QA
> Subject: RE: NBU for DMA
> 
> This failure happened yesterday due to the RMC issue that I sent.
> 
> Today's backup also failed due to a different RMC issue.
> 
> Mar 26 02:33:31 10.0.2.2 : 1:3:efs:INFO: 8765: FS: s_home       0x10200000187 - dumpStart - dump_restore - yield: Operation aborted during yield
> Mar 26 02:33:31 10.0.2.2 : 1:3:efs:WARNING: 8766: FS: s_home       0x10200000187 - dumpStart - dump_restore - fs_dr_rmcSendMessage: fs_dr_yield timed out; retrying send anyway; retry:1.
> Mar 26 02:33:31 10.0.2.2 : 1:3:efs:ERROR: 8767: FS: s_home       0x10200000187 - dumpStart - dump_restore - fs_dr_rmcSendMessage: rmc send failed after 1 retry.
> Mar 26 02:33:31 10.0.2.2 : 1:3:efs:ERROR: 8768: FS: s_home       0x10200000187 - dumpStart - dump_restore - fs_dumpSendLog: rmc send message failed
> Mar 26 02:33:31 10.0.2.2 : 1:3:efs:INFO: 8769: FS: s_home       0x10200000187 - dumpStart - dump_restore - [4] io output stats: dump paused due to ndmp/tape flow control 6871586167 (usec)
> Mar 26 02:33:31 10.0.2.2 : 1:3:efs:INFO: 8770: FS: s_home       0x10200000187 - dumpStart - dump_restore - [4] io output stats: recs written 24414149 bytes written 1575005580288
> Mar 26 02:33:31 10.0.2.2 : 1:3:efs:INFO: 8771: FS: s_home       0x10200000187 - dumpStart - dump_restore - [4] Pass III stats: nDirs 4845215 blks 4923382 compact 3186168
> Mar 26 02:33:31 10.0.2.2 : 1:4:efs:INFO: 8772: FS: s_home       0x10200000187 - dumpStart - snp - snapshot dump_4 removal initiated
> 
> From: Limato, Dave
> Sent: Friday, March 26, 2010 12:51 PM
> To: DL-ONStor-QA
> Subject: NBU for DMA
> 
> DMA-Girl,
> 
> Can you log into NBU for Mightydog and see what's going on? The logs
> rolled over on ndmp.trace (TED00028290), and I don't see a message in
> elog at 23:50 to say the backup aborted.
> 
> 
> 1.     My question is: did we hit EOM, and did I fail to do something
> with tapes prior config (like clear tapes)?
> 
> 2.     Does it think the other tapes are full? Or did we hit EOM and
> fail to load the next tape?
> 
> 3.     Perhaps I missed the SPAN = Y option?
> 
> 
> NBU says Status 50
> The client backup terminated abnormally. For example, this error
> occurs if a NetBackup master or media server is shut down or rebooted
> when a backup or restore is in progress.
> 
> /25/2010 8:36:08 PM - begin writing
> 3/25/2010 11:49:08 PM - current media 0016L3 complete, requesting next media Any
> 3/25/2010 11:49:08 PM - current media -- complete, awaiting next media Any<<<<<<<<<<<<<<<<< Reason: Drives are in use., Media Server: nbu-prod-1, Robot Number: 0, Robot Type: TLD, Media ID: N/A, Drive Name: N/A, Volume Pool: NetBackup, Storage Unit: nbu-prod-1-hcart-robot-tld-0-10.0.0.222, Drive Scan Host: N/A
> 3/25/2010 11:50:07 PM - granted resource 0019L3
> 3/25/2010 11:50:07 PM - granted resource HP.ULTRIUM4-SCSI.000
> 3/25/2010 11:50:07 PM - granted resource nbu-prod-1-hcart-robot-tld-0-10.0.0.222
> 3/25/2010 11:50:08 PM - mounting 0019L3
> 3/25/2010 11:51:06 PM - mounted; mount time: 00:00:58
> 3/25/2010 11:51:10 PM - positioning 0019L3 to file 1
> 3/25/2010 11:51:15 PM - positioned 0019L3; position time: 00:00:05
> 3/25/2010 11:51:15 PM - begin writing
> client process aborted(50)
> 
> Thanks
> 
> Dave Limato - Sr. QA Engineer - LSI Corporation - ONStor Product Test
> - desk 408-433-8742  - cell 510.329.9994 -- dave.limato@lsi.com
> 
