tape drive flakiness

Michael George yellowdog-general@lists.terrasoftsolutions.com
Tue Oct 15 05:29:00 2002


I have an Xserve running YDL 2.3 and I'm having some trouble with an 
Ecrix tape drive attached to it.

This drive worked fine when it was connected to an Intel RedHat box, 
with only the occasional weirdness (once every 2-3 months, and then a 
retry would get things right).  However, on the Xserve we've had this 
deck fail during backups 3 times in the past 6 days.

This time is the worst, as I cannot even access the device anymore.  
When trying to do "mt -f /dev/st0 status" I get:

/dev/st0: No such device or address

But I know the device is there, because when the machine booted, I got:

Oct  8 09:01:01 stout kernel: SCSI subsystem driver Revision: 1.00
Oct  8 09:01:01 stout kernel: sym.18.2.0: setting PCI_COMMAND_PARITY...
Oct  8 09:01:01 stout kernel: sym.18.2.1: setting PCI_COMMAND_PARITY...
Oct  8 09:01:01 stout kernel: sym0: <896> rev 0x5 on pci bus 18 device 2 function 0 irq 52
Oct  8 09:01:01 stout kernel: sym0: No NVRAM, ID 7, Fast-40, LVD, parity checking
Oct  8 09:01:01 stout kernel: sym0: SCSI BUS has been reset.
Oct  8 09:01:01 stout kernel: sym1: <896> rev 0x5 on pci bus 18 device 2 function 1 irq 52
Oct  8 09:01:01 stout kernel: sym1: No NVRAM, ID 7, Fast-40, LVD, parity checking
Oct  8 09:01:01 stout kernel: sym1: SCSI BUS has been reset.
Oct  8 09:01:01 stout kernel: scsi0 : sym-2.1.17a
Oct  8 09:01:01 stout kernel: scsi1 : sym-2.1.17a
Oct  8 09:01:01 stout kernel: blk: queue dff40c28, I/O limit 4095Mb (mask 0xffffffff)
Oct  8 09:01:01 stout kernel:   Vendor: ECRIX     Model: VXA-1              Rev: 2A6A
Oct  8 09:01:01 stout kernel:   Type:   Sequential-Access                   ANSI SCSI revision: 02
Oct  8 09:01:01 stout kernel: blk: queue dff40228, I/O limit 4095Mb (mask 0xffffffff)
Oct  8 09:01:01 stout kernel: mesh: configured for synchronous 5 MB/s
Oct  8 09:01:01 stout kernel: st: Version 20020805, bufsize 32768, wrt 30720, max init. bufs 4, s/g segs 16
Oct  8 09:01:01 stout kernel: Attached scsi tape st0 at scsi0, channel 0, id 4, lun 0

When the drive failed on 10/12/2002, I had this in /var/log/messages:

Oct 12 02:30:53 stout kernel: invalidate: busy buffer
Oct 12 02:31:08 stout last message repeated 87 times
Oct 12 02:31:44 stout kernel: st0: Error with sense data: Info fld=0x187, Current st09:00: sense key Hardware Error
Oct 12 02:31:44 stout kernel: Additional sense indicates Mechanical positioning error
Oct 12 02:36:20 stout kernel: st0: Error with sense data: Info fld=0x168, Current st09:00: sense key Illegal Request
Oct 12 02:36:20 stout kernel: Additional sense indicates Write append error
Oct 12 02:36:20 stout kernel: st0: Error on write filemark.
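
To see at a glance how often each kind of failure shows up, the st0 
sense-key lines can be tallied per day.  The following is only a rough 
sketch, assuming Python is available on the box and that the log lines 
are unwrapped, one entry per line, as they are in /var/log/messages.  
The names count_sense_keys and report are made up for this example.

```python
import re
from collections import Counter

# Matches st driver lines of the form seen in the excerpts, e.g.
#   Oct 12 02:31:44 stout kernel: st0: Error with sense data: ... sense key Hardware Error
SENSE_RE = re.compile(
    r"^(?P<date>\w{3}\s+\d+) \d\d:\d\d:\d\d \S+ kernel: "
    r"(?P<dev>st\d+): Error with sense data: .*sense key (?P<key>.+?)\s*$"
)


def count_sense_keys(lines):
    """Count (date, device, sense key) triples in an iterable of syslog lines."""
    counts = Counter()
    for line in lines:
        m = SENSE_RE.match(line)
        if m:
            # Normalise "Oct  8" and "Oct 8" to the same key.
            date = " ".join(m.group("date").split())
            counts[(date, m.group("dev"), m.group("key"))] += 1
    return counts


def report(path="/var/log/messages"):
    """Print one summary line per (date, device, sense key)."""
    with open(path, errors="replace") as f:
        for (date, dev, key), n in sorted(count_sense_keys(f).items()):
            print(f"{date} {dev}: {key} x{n}")
```

Calling report() as root would print one line per day, device and 
sense key; lines that do not match the pattern are simply skipped.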

Yesterday, I was able to eject the tape and do the backup seemingly 
okay, but this was in the logfile:

Oct 14 09:25:28 stout kernel: st0: Error with sense data: Current st09:00: sense key Not Ready
Oct 14 09:25:28 stout kernel: Additional sense indicates Medium not present
Oct 14 09:25:55 stout kernel: invalidate: busy buffer
Oct 14 09:26:09 stout last message repeated 167 times
Oct 14 09:28:11 stout sshd(pam_unix)[13573]: session closed for user george
Oct 14 09:28:22 stout kernel: sym0:4:0: ABORT operation started.
Oct 14 09:28:22 stout kernel: sym0:4:control msgout: 80 6.
Oct 14 09:28:22 stout kernel: sym0:4:0: ABORT operation complete.
Oct 14 09:28:22 stout kernel: sym0: unexpected disconnect
Oct 14 09:28:22 stout kernel: sym0:4:0: DEVICE RESET operation started.
Oct 14 09:28:22 stout kernel: sym0:4:0: DEVICE RESET operation failed.
Oct 14 09:28:22 stout kernel: sym0:4:0: BUS RESET operation started.
Oct 14 09:28:22 stout kernel: sym0:4:0: BUS RESET operation failed.
Oct 14 09:28:27 stout kernel: sym0:4:0: HOST RESET operation started.
Oct 14 09:28:27 stout kernel: sym0:4:0: HOST RESET operation failed.
Oct 14 09:28:37 stout kernel: scsi: device set offline - command error recover failed: host 0 channel 0 id 4 lun 0
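
The "device set offline" address (host 0 channel 0 id 4 lun 0) is the 
same one st0 was attached at during boot, which may be why mt can no 
longer open the device until the next reboot; that reading is a guess, 
not something the logs prove.  A small sketch for cross-checking the 
two messages automatically follows.  The function name offline_tapes 
is made up for this example, and the patterns cover only the message 
formats shown in the excerpts here.

```python
import re

# "Attached scsi tape st0 at scsi0, channel 0, id 4, lun 0"
ATTACH_RE = re.compile(
    r"Attached scsi tape (?P<dev>st\d+) at scsi(?P<host>\d+), "
    r"channel (?P<channel>\d+), id (?P<id>\d+), lun (?P<lun>\d+)"
)
# "scsi: device set offline - command error recover failed: host 0 channel 0 id 4 lun 0"
OFFLINE_RE = re.compile(
    r"scsi: device set offline - command error recover failed: "
    r"host (?P<host>\d+) channel (?P<channel>\d+) id (?P<id>\d+) lun (?P<lun>\d+)"
)


def _address(match):
    """Return the (host, channel, id, lun) tuple from a regex match."""
    return tuple(int(match.group(k)) for k in ("host", "channel", "id", "lun"))


def offline_tapes(lines):
    """Return the set of tape devices set offline since they were last attached."""
    attached = {}   # SCSI address -> device name, e.g. (0, 0, 4, 0) -> "st0"
    offline = set()
    for line in lines:
        m = ATTACH_RE.search(line)
        if m:
            attached[_address(m)] = m.group("dev")
            # A fresh attach (e.g. after a reboot) clears the offline state.
            offline.discard(m.group("dev"))
            continue
        m = OFFLINE_RE.search(line)
        if m:
            dev = attached.get(_address(m))
            if dev is not None:
                offline.add(dev)
    return offline
```

Fed the unwrapped lines of /var/log/messages, this returns the tape 
devices whose most recent attach was followed by an offline message.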

Today, /var/log/messages has only this at the time when the backup 
should have run:

Oct 15 02:30:14 stout kernel: invalidate: busy buffer
Oct 15 02:30:15 stout last message repeated 37 times

It kinda looks like hardware failure, but after rebooting, the backup 
worked just fine...

We're running kernel 2.4.20-pre5 benh.  I'm going to check for a newer 
kernel right now...

Thanks!

-Michael