Disk failure detected, but not logged or reported (by email)

Question

Pretty scary event over the weekend. I was poking around the logs looking for something else and noticed this:

Sep  1 16:23:48 ReadyNAS01 kernel: RAID1 conf printout:
Sep  1 16:23:48 ReadyNAS01 kernel:  --- wd:5 rd:6
Sep  1 16:23:48 ReadyNAS01 kernel:  disk 0, wo:0, o:1, dev:sda1
Sep  1 16:23:48 ReadyNAS01 kernel:  disk 1, wo:0, o:1, dev:sdb1
Sep  1 16:23:48 ReadyNAS01 kernel:  disk 2, wo:1, o:0, dev:sdc1
Sep  1 16:23:48 ReadyNAS01 kernel:  disk 3, wo:0, o:1, dev:sdd1
Sep  1 16:23:48 ReadyNAS01 kernel:  disk 4, wo:0, o:1, dev:sde1
Sep  1 16:23:48 ReadyNAS01 kernel:  disk 5, wo:0, o:1, dev:sdf1

I thought "That's odd, why is disk #2 different?" So I logged into Frontview and saw the drive was listed as dead (not failed, not failing, but dead). I immediately replaced the drive, got the volume back up and running with protection, and started digging further into the syslogs.

Regardless of whether the drive is actually bad, I have become used to the ReadyNAS emailing me about any issues. In this case, one more drive failure would have resulted in total data loss. I checked and double-checked the logs in Frontview: aside from the backups, there is nothing listed on September 1st. I checked my email (and spam folder), and nothing was sent. Email is clearly working, since I got the emails when replacing the drive and rebuilding the volume.

This is partly a word of caution not to rely on automated logs and alerts, but also a question: does anyone know what could have caused the lack of notification?

*Technical Details*
ReadyNAS Pro 6
RAIDiator 4.2.15
Below are the actual log entries for the command that caused the disk to be marked dead:

Sep  1 16:23:48 ReadyNAS01 kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep  1 16:23:48 ReadyNAS01 kernel: ata3.00: failed command: SMART
Sep  1 16:23:48 ReadyNAS01 kernel: ata3.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
Sep  1 16:23:48 ReadyNAS01 kernel:          res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Sep  1 16:23:48 ReadyNAS01 kernel: ata3.00: status: { DRDY }
Sep  1 16:23:48 ReadyNAS01 kernel: ata3: hard resetting link
Sep  1 16:23:48 ReadyNAS01 kernel: ata3: link is slow to respond, please be patient (ready=0)
Sep  1 16:23:48 ReadyNAS01 kernel: ata3: COMRESET failed (errno=-16)
Sep  1 16:23:48 ReadyNAS01 kernel: ata3: hard resetting link
Sep  1 16:23:48 ReadyNAS01 kernel: ata3: link is slow to respond, please be patient (ready=0)
Sep  1 16:23:48 ReadyNAS01 kernel: ata3: COMRESET failed (errno=-16)
Sep  1 16:23:48 ReadyNAS01 kernel: ata3: hard resetting link
Sep  1 16:23:48 ReadyNAS01 kernel: ata3: link is slow to respond, please be patient (ready=0)
Sep  1 16:23:48 ReadyNAS01 kernel: ata3: COMRESET failed (errno=-16)
Sep  1 16:23:48 ReadyNAS01 kernel: ata3: limiting SATA link speed to 1.5 Gbps
Sep  1 16:23:48 ReadyNAS01 kernel: ata3: hard resetting link
Sep  1 16:23:48 ReadyNAS01 kernel: ata3: COMRESET failed (errno=-16)
Sep  1 16:23:48 ReadyNAS01 kernel: ata3: reset failed, giving up
Sep  1 16:23:48 ReadyNAS01 kernel: ata3.00: disabled
Sep  1 16:23:48 ReadyNAS01 kernel: ata3: EH complete
Sep  1 16:23:48 ReadyNAS01 kernel: sd 2:0:0:0: [sdc] Unhandled error code
Sep  1 16:23:48 ReadyNAS01 kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Sep  1 16:23:48 ReadyNAS01 kernel: sd 2:0:0:0: [sdc] CDB: Read(10): 28 00 00 80 02 9c 00 00 20 00
Sep  1 16:23:48 ReadyNAS01 kernel: end_request: I/O error, dev sdc, sector 8389276
Sep  1 16:23:48 ReadyNAS01 kernel: sd 2:0:0:0: [sdc] Unhandled error code
Sep  1 16:23:48 ReadyNAS01 kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Sep  1 16:23:48 ReadyNAS01 kernel: sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 00 00 00 0c 00 00 02 00
Sep  1 16:23:48 ReadyNAS01 kernel: end_request: I/O error, dev sdc, sector 12
Sep  1 16:23:48 ReadyNAS01 kernel: end_request: I/O error, dev sdc, sector 12
Sep  1 16:23:48 ReadyNAS01 kernel:  **************** super written barrier kludge on md0: error==IO 0xfffffffb
Sep  1 16:23:48 ReadyNAS01 kernel: sd 2:0:0:0: [sdc] Unhandled error code
Sep  1 16:23:48 ReadyNAS01 kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Sep  1 16:23:48 ReadyNAS01 kernel: sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 00 00 00 0c 00 00 02 00
Sep  1 16:23:48 ReadyNAS01 kernel: end_request: I/O error, dev sdc, sector 12
Sep  1 16:23:48 ReadyNAS01 kernel: md: super_written gets error=-5, uptodate=0
Sep  1 16:23:48 ReadyNAS01 kernel: raid1: Disk failure on sdc1, disabling device.
Sep  1 16:23:48 ReadyNAS01 kernel: raid1: Operation continuing on 5 devices.
Sep  1 16:23:48 ReadyNAS01 kernel: RAID1 conf printout:
Sep  1 16:23:48 ReadyNAS01 kernel:  --- wd:5 rd:6
Sep  1 16:23:48 ReadyNAS01 kernel:  disk 0, wo:0, o:1, dev:sda1
Sep  1 16:23:48 ReadyNAS01 kernel:  disk 1, wo:0, o:1, dev:sdb1
Sep  1 16:23:48 ReadyNAS01 kernel:  disk 2, wo:1, o:0, dev:sdc1
Sep  1 16:23:48 ReadyNAS01 kernel:  disk 3, wo:0, o:1, dev:sdd1
Sep  1 16:23:48 ReadyNAS01 kernel:  disk 4, wo:0, o:1, dev:sde1
Sep  1 16:23:48 ReadyNAS01 kernel:  disk 5, wo:0, o:1, dev:sdf1
Sep  1 16:23:48 ReadyNAS01 kernel: RAID1 conf printout:
Sep  1 16:23:48 ReadyNAS01 kernel:  --- wd:5 rd:6
Sep  1 16:23:48 ReadyNAS01 kernel:  disk 0, wo:0, o:1, dev:sda1
Sep  1 16:23:48 ReadyNAS01 kernel:  disk 1, wo:0, o:1, dev:sdb1
Sep  1 16:23:48 ReadyNAS01 kernel:  disk 3, wo:0, o:1, dev:sdd1
Sep  1 16:23:48 ReadyNAS01 kernel:  disk 4, wo:0, o:1, dev:sde1
Sep  1 16:23:48 ReadyNAS01 kernel:  disk 5, wo:0, o:1, dev:sdf1
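
Since Frontview stayed silent, one option is to scan the kernel log for these lines independently. The following is only a minimal sketch, not anything RAIDiator provides: the function name `find_failed_members` is made up for illustration, and it assumes that `o:0` in the conf printout marks a member that is no longer operational, which matches what happened to sdc1 here.

```python
import re

# md's failure message, e.g. "raid1: Disk failure on sdc1, disabling device."
FAILURE_RE = re.compile(r"Disk failure on (\w+), disabling device")
# One member line of a "RAID1 conf printout", e.g. "disk 2, wo:1, o:0, dev:sdc1"
MEMBER_RE = re.compile(r"disk (\d+), wo:(\d), o:(\d), dev:(\w+)")


def find_failed_members(lines):
    """Return a sorted list of member devices that the log reports as failed.

    Hypothetical helper: 'lines' is any iterable of syslog lines.
    """
    failed = set()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            failed.add(match.group(1))
            continue
        match = MEMBER_RE.search(line)
        # Assumption: o:0 means the member is not operational.
        if match and match.group(3) == "0":
            failed.add(match.group(4))
    return sorted(failed)
```

Fed the log excerpt above, this would be expected to report sdc1. Where the syslog file lives on the NAS, and how to send a mail from the result, are left open because they depend on the setup.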

de_niro · Answer

I'd advise you to upgrade to the latest firmware, 4.2.21. I suppose the system sends an email only if the disk shows ATA errors. Can you post your third disk's SMART report here?

alaeth · Answer

"Disk Failure" is the first alert listed, and in fact you can't disable it (the option is checked, and greyed out).

I'll contemplate upgrading the firmware, but I've had bad experiences with that in the past (lost a lot of customized settings and applications) and would prefer to try to solve the issue from the back end. Plus, the changes in 4.2.16 and newer that make it impossible to downgrade make me nervous, since I no longer have enough space to fully back up the NAS.

I don't have the SMART codes from the NAS (obviously, since it detected the drive as dead), but here they are from HDDScan.

Model: ST31500341AS
Firmware: CC1H
Serial: 9VS3AQRR
LBA: 2930277168

Report By: HDDScan for Windows version 3.3
Report Date: 9/5/2012 9:13:03 PM


 Num  Attribute Name  Value  Worst  Raw(hex)  Threshold  
 001 Raw Read Error Rate  117 099 00000009B2-4A8A 006 
 003 Spin Up Time  100 092 0000000000-0000 000 
 004 Start/Stop Count  100 100 0000000000-024A 020 
 005 Reallocation Sector Count  100 100 0000000000-0000 036 
 007 Seek Error Rate  082 060 0000000A56-3097 030 
 009 Power-On Hours Count  075 075 0000000000-5742 000 
 010 Spin Retry Count  100 100 0000000000-0002 097 
 012 Device Power Cycle Count  100 100 0000000000-0024 020 
 184 End To End Error Count  100 100 0000000000-0000 099 
 187 Reported Uncorrectable Error  100 100 0000000000-0000 000 
 188 Reported Command Timeouts  099 099 0000010001-0001 000 
 189 High Fly Writes  001 001 0000000000-02F5 000 
 190 Airflow Temperature  075 049 25 C  045 
 190 Airflow Temperature Minimum 075 049 25 C 045 
 190 Airflow Temperature Maximum 075 049 25 C 045 
 194 HDA Temperature  025 051 25 C  000 
 194 HDA Temperature Minimum 025 051 19 C 000 
 194 HDA Temperature Maximum 025 051 49 C 000 
 195 Error Rate 042 028 00000009B2-4A8A 000 
 197 Current Pending Errors Count  100 100 0000000000-0000 000 
 198 Uncorrectable Errors Count  100 100 0000000000-0000 000 
 199 UltraDMA CRC Errors  200 200 0000000000-0000 000 
 240 Heads Flying Hours  100 253 345ECC0000-570E 000 
 241 Total Host Writes  100 253 0000002F48-4C93 000 
 242 Total Host Reads  100 253 00000026E0-F00A 000
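
For anyone who wants to check a report like this mechanically, here is a rough sketch. It relies on the usual SMART convention that an attribute counts as failed when its normalized value drops to or below a non-zero threshold; the function names and the row pattern are my own assumptions about this HDDScan layout, not part of any tool.

```python
import re

# One report row: Num, Attribute Name, Value, Worst, Raw, Threshold.
# The raw column is either hex like "0000000000-5742" or a temperature like "25 C".
ROW_RE = re.compile(
    r"^\s*(\d{3})\s+(.+?)\s+(\d{3})\s+(\d{3})\s+(\S+(?: C)?)\s+(\d{3})\s*$"
)


def parse_smart_rows(lines):
    """Parse report lines into dicts; lines that do not look like rows are skipped."""
    rows = []
    for line in lines:
        match = ROW_RE.match(line)
        if match:
            num, name, value, worst, raw, threshold = match.groups()
            rows.append({
                "id": int(num),
                "name": name,
                "value": int(value),
                "worst": int(worst),
                "raw": raw,
                "threshold": int(threshold),
            })
    return rows


def below_threshold(rows):
    """Return rows whose normalized value is at or below a non-zero threshold."""
    return [r for r in rows if r["threshold"] > 0 and r["value"] <= r["threshold"]]


def raw_to_int(raw):
    """Convert a hex raw value such as '0000000000-5742' to an integer."""
    return int(raw.replace("-", ""), 16)
```

By that rule none of the rows above has crossed its threshold (Spin Retry Count, for example, is 100 against 097), and `raw_to_int("0000000000-5742")` gives 22338 for the Power-On Hours raw value. Some raw fields pack several counters into one number, so the plain conversion is only meaningful for simple counts.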

alaeth · Answer

Another disk (disk 1 this time) has been marked as "dead" with no notifications or emails.

This time I removed the disk, rebooted, and replaced it. The drive has no SMART errors and was able to resync fine. I set up a webcam facing the front panel, since the ONLY indication I have of these failures is the OLED staying on with "Vol C Unprotected".
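
A less awkward stopgap than the webcam might be to poll the md status directly. This is a sketch under assumptions: it presumes shell access to the NAS and a readable `/proc/mdstat` in the standard Linux md format, where a status string such as `[UU_UUU]` shows a missing member as `_`. The function name `degraded_arrays` is hypothetical.

```python
import re

ARRAY_RE = re.compile(r"^(md\d+)\s*:")
# Status at the end of the blocks line, e.g. "[6/5] [UU_UUU]"
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return {array_name: status_string} for every array with a missing member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        match = ARRAY_RE.match(line)
        if match:
            current = match.group(1)
            continue
        match = STATUS_RE.search(line)
        # An underscore in the status string marks a member that is not active.
        if match and current and "_" in match.group(3):
            degraded[current] = match.group(3)
    return degraded
```

Run from cron with the contents of `/proc/mdstat` as input, a non-empty result could trigger whatever alert is convenient. It would not explain why Frontview failed to notify, but it does not depend on Frontview noticing.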

I will attempt to back up what critical data I can onto spare drives and update the firmware (against my better judgement).

Could a replacement drive not listed in the HCL be to blame? I recently had to replace two drives due to increasing SMART reallocation errors, but I used WDC drives that are not listed (see this cross-post)
