Re: Dead drive on Pro 4

Elvis85 · 2012-10-03

Hi,
I had to replace my good old NV+ (burnt in a fire) with a brand new Pro 4 last July. The disks I used are 4x 2TB WD Caviar Black (Model: WD2002FAEX-007BA0, Firmware: 05.01D05).
Everything went fine until today, when it tried to run its scheduled disk scrubbing and reported: "The scheduled disk scrubbing was skipped because the operating system volume was degraded."

Seeing that error, I went into Frontview -> Status -> Health, where the status of every disk was OK (and the volume was fully redundant according to Volume Status). But there was definitely something odd with disk 2, since I could not get its SMART+ information and its temperature read 0 C. I rebooted my Pro, and disk 2 was dead after the reboot.

Since I was a bit skeptical about this (though I am well aware that a disk failure is the most likely cause), I checked the logs, specifically the system and kernel logs, and found some interesting entries showing why disk 2 was dropped (this is system.log, which includes entries from kernel.log):

Oct  2 12:00:01 Gozer RAIDiator: The on-line filesystem consistency check has started for Volume C.
Oct  2 12:00:11 Gozer noflushd[5422]: Disks spinning up after 8284 minutes.
Oct  2 12:00:11 Gozer RAIDiator: Volume consistency check started for Volume C. (Gozer) : The on-line filesystem consistency check has started for Volume C.
Oct  2 12:00:12 Gozer kernel: EXT4-fs (dm-3): INFO: recovery required on readonly filesystem
Oct  2 12:00:12 Gozer kernel: EXT4-fs (dm-3): write access will be enabled during recovery
Oct  2 12:00:15 Gozer kernel: EXT4-fs (dm-3): recovery complete
Oct  2 12:00:15 Gozer kernel: EXT4-fs (dm-3): mounted filesystem with ordered data mode. Opts: acl,user_xattr
Oct  2 12:05:06 Gozer RAIDiator: The on-line filesystem consistency check completed without errors for Volume C.
Oct  2 12:05:06 Gozer RAIDiator: Volume consistency check completed for Volume C. (Gozer) : The on-line filesystem consistency check completed without errors for Volume C.
Oct  2 12:07:04 Gozer syslogd 1.4.1#18: restart.
Oct  2 12:08:54 Gozer syslogd 1.4.1#18: restart.
Oct  2 12:09:17 Gozer syslogd 1.4.1#18: restart.
Oct  2 12:39:54 Gozer noflushd[5422]: Spinning down disks.
Oct  2 12:39:54 Gozer noflushd[5422]: Disks spinning up after 0 minutes.
Oct  2 12:40:56 Gozer kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct  2 12:40:56 Gozer kernel: ata2.00: failed command: FLUSH CACHE EXT
Oct  2 12:40:56 Gozer kernel: ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Oct  2 12:40:56 Gozer kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct  2 12:40:56 Gozer kernel: ata2.00: status: { DRDY }
Oct  2 12:40:56 Gozer kernel: ata2: hard resetting link
Oct  2 12:41:02 Gozer kernel: ata2: link is slow to respond, please be patient (ready=0)
Oct  2 12:41:06 Gozer kernel: ata2: COMRESET failed (errno=-16)
Oct  2 12:41:06 Gozer kernel: ata2: hard resetting link
Oct  2 12:41:12 Gozer kernel: ata2: link is slow to respond, please be patient (ready=0)
Oct  2 12:41:16 Gozer kernel: ata2: COMRESET failed (errno=-16)
Oct  2 12:41:16 Gozer kernel: ata2: hard resetting link
Oct  2 12:41:22 Gozer kernel: ata2: link is slow to respond, please be patient (ready=0)
Oct  2 12:41:51 Gozer kernel: ata2: COMRESET failed (errno=-16)
Oct  2 12:41:51 Gozer kernel: ata2: limiting SATA link speed to 1.5 Gbps
Oct  2 12:41:51 Gozer kernel: ata2: hard resetting link
Oct  2 12:41:56 Gozer kernel: ata2: COMRESET failed (errno=-16)
Oct  2 12:41:56 Gozer kernel: ata2: reset failed, giving up
Oct  2 12:41:56 Gozer kernel: ata2.00: disabled
Oct  2 12:41:56 Gozer kernel: ata2.00: device reported invalid CHS sector 0
Oct  2 12:41:56 Gozer kernel: ata2: EH complete
Oct  2 12:41:56 Gozer kernel: sd 1:0:0:0: [sdb] Unhandled error code
Oct  2 12:41:56 Gozer kernel: sd 1:0:0:0: [sdb]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Oct  2 12:41:56 Gozer kernel: sd 1:0:0:0: [sdb] CDB: Write(10): 2a 00 00 00 00 48 00 00 02 00
Oct  2 12:41:56 Gozer kernel: end_request: I/O error, dev sdb, sector 72
Oct  2 12:41:56 Gozer kernel: end_request: I/O error, dev sdb, sector 72
Oct  2 12:41:56 Gozer kernel: md: super_written gets error=-5, uptodate=0
Oct  2 12:41:56 Gozer kernel: md/raid1:md0: Disk failure on sdb1, disabling device.
Oct  2 12:41:56 Gozer kernel: <1>md/raid1:md0: Operation continuing on 3 devices.
Oct  2 12:41:57 Gozer kernel: RAID1 conf printout:
Oct  2 12:41:57 Gozer kernel:  --- wd:3 rd:4
Oct  2 12:41:57 Gozer kernel:  disk 0, wo:0, o:1, dev:sda1
Oct  2 12:41:57 Gozer kernel:  disk 1, wo:1, o:0, dev:sdb1
Oct  2 12:41:57 Gozer kernel:  disk 2, wo:0, o:1, dev:sdc1
Oct  2 12:41:57 Gozer kernel:  disk 3, wo:0, o:1, dev:sdd1
Oct  2 12:41:57 Gozer kernel: RAID1 conf printout:
Oct  2 12:41:57 Gozer kernel:  --- wd:3 rd:4
Oct  2 12:41:57 Gozer kernel:  disk 0, wo:0, o:1, dev:sda1
Oct  2 12:41:57 Gozer kernel:  disk 2, wo:0, o:1, dev:sdc1
Oct  2 12:41:57 Gozer kernel:  disk 3, wo:0, o:1, dev:sdd1
Oct  2 13:00:06 Gozer kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO

Note that the issue first appears at 12:40, and that I have a weekly automatic volume consistency check at 12:00. According to this log, a FLUSH CACHE EXT command to disk 2 (ata2) times out, every attempt to reset the SATA link fails, and the kernel gives up and disables the disk, after which md drops sdb1 from the array. I've looked for similar entries since I started using my new Pro and found one on August 14, except that time the link managed to recover from the exception. Here is the log:

Aug 14 12:00:13 Gozer noflushd[5422]: Disks spinning up after 2185 minutes.
Aug 14 12:00:21 Gozer RAIDiator: The on-line filesystem consistency check has started for Volume C.
Aug 14 12:00:21 Gozer RAIDiator: Volume consistency check started for Volume C. (Gozer) : The on-line filesystem consistency check has started for Volume C.
Aug 14 12:00:22 Gozer kernel: EXT4-fs (dm-3): INFO: recovery required on readonly filesystem
Aug 14 12:00:22 Gozer kernel: EXT4-fs (dm-3): write access will be enabled during recovery
Aug 14 12:00:30 Gozer kernel: EXT4-fs (dm-3): recovery complete
Aug 14 12:00:30 Gozer kernel: EXT4-fs (dm-3): mounted filesystem with ordered data mode. Opts: acl,user_xattr
Aug 14 12:00:43 Gozer syslogd 1.4.1#18: restart.
Aug 14 12:03:24 Gozer syslogd 1.4.1#18: restart.
Aug 14 12:04:43 Gozer RAIDiator: The on-line filesystem consistency check completed without errors for Volume C.
Aug 14 12:04:43 Gozer RAIDiator: Volume consistency check completed for Volume C. (Gozer) : The on-line filesystem consistency check completed without errors for Volume C.
Aug 14 12:35:24 Gozer noflushd[5422]: Spinning down disks.
Aug 14 12:35:24 Gozer noflushd[5422]: Disks spinning up after 0 minutes.
Aug 14 12:36:24 Gozer kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Aug 14 12:36:24 Gozer kernel: ata2.00: failed command: FLUSH CACHE EXT
Aug 14 12:36:24 Gozer kernel: ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Aug 14 12:36:24 Gozer kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 14 12:36:24 Gozer kernel: ata2.00: status: { DRDY }
Aug 14 12:36:24 Gozer kernel: ata2: hard resetting link
Aug 14 12:36:30 Gozer kernel: ata2: link is slow to respond, please be patient (ready=0)
Aug 14 12:36:34 Gozer kernel: ata2: COMRESET failed (errno=-16)
Aug 14 12:36:34 Gozer kernel: ata2: hard resetting link
Aug 14 12:36:40 Gozer kernel: ata2: link is slow to respond, please be patient (ready=0)
Aug 14 12:36:44 Gozer kernel: ata2: COMRESET failed (errno=-16)
Aug 14 12:36:44 Gozer kernel: ata2: hard resetting link
Aug 14 12:36:50 Gozer kernel: ata2: link is slow to respond, please be patient (ready=0)
Aug 14 12:37:19 Gozer kernel: ata2: SATA link down (SStatus 0 SControl 300)
Aug 14 12:37:19 Gozer kernel: ata2.00: link offline, clearing class 1 to NONE
Aug 14 12:37:19 Gozer kernel: ata2: hard resetting link
Aug 14 12:37:25 Gozer kernel: ata2: link is slow to respond, please be patient (ready=0)
Aug 14 12:37:27 Gozer kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 14 12:37:27 Gozer kernel: ata2.00: configured for UDMA/133
Aug 14 12:37:27 Gozer kernel: ata2.00: retrying FLUSH 0xea Emask 0x4
Aug 14 12:37:27 Gozer kernel: ata2.00: device reported invalid CHS sector 0
Aug 14 12:37:27 Gozer kernel: ata2: EH complete
Aug 14 13:05:30 Gozer noflushd[5422]: Spinning down disks.

Note that in both cases the exception occurs 30-40 minutes after the volume consistency check completes, and about one minute after a disk spin-down.

I still need to test my presumed faulty disk with the WD tools, but I would like to know whether anyone has any thoughts on this. For now, I have put it back into my Pro, and it is resyncing. I know this is not a solution, but I'm curious to see whether the same error comes back.

I've seen blacey's post and his logs seem pretty similar to what I have, even though I'm on RAIDiator 4.2.21.

Thanks!

bpeddada · 2012-10-05

Hi,

When you see:

Aug 14 12:36:44 Gozer kernel: ata2: COMRESET failed (errno=-16)

COMRESET failed is detected by smartctl.

This means the drive is bad. Please call and open a case. Have the logs downloaded and ready to be emailed.

Netgear Support.