alaeth
Aspirant
Sep 03, 2012

Disk failure detected, but not logged or reported (by email)

Pretty scary event over the weekend. I was poking around the logs looking for something else and noticed this:
Sep  1 16:23:48 ReadyNAS01 kernel: RAID1 conf printout:
Sep 1 16:23:48 ReadyNAS01 kernel: --- wd:5 rd:6
Sep 1 16:23:48 ReadyNAS01 kernel: disk 0, wo:0, o:1, dev:sda1
Sep 1 16:23:48 ReadyNAS01 kernel: disk 1, wo:0, o:1, dev:sdb1
Sep 1 16:23:48 ReadyNAS01 kernel: disk 2, wo:1, o:0, dev:sdc1
Sep 1 16:23:48 ReadyNAS01 kernel: disk 3, wo:0, o:1, dev:sdd1
Sep 1 16:23:48 ReadyNAS01 kernel: disk 4, wo:0, o:1, dev:sde1
Sep 1 16:23:48 ReadyNAS01 kernel: disk 5, wo:0, o:1, dev:sdf1


I thought, "That's odd, why is disk #2 different?" So I logged into Frontview and saw the drive was listed as dead (not failed, not failing, but dead). I immediately replaced the drive, got back up and running with protection, and started digging further into the syslogs.

Regardless of whether the drive is actually bad, I have become used to the ReadyNAS emailing me about any issues. In this case, one more drive failure would have resulted in total data loss. I checked and double-checked the logs in Frontview: aside from the backups, there is nothing listed for September 1st. I checked my email (and spam folder) and nothing was sent. Email is clearly working, since I got the emails when replacing the drive and rebuilding the volume.

This is partly a word of caution not to rely on automated logs and alerts, but also a question: does anyone know what could have caused the lack of notification?


*Technical Details*
ReadyNAS Pro 6
RAIDiator 4.2.15
Below are the actual log entries for the command that caused the disk to be marked dead:
Sep  1 16:23:48 ReadyNAS01 kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep 1 16:23:48 ReadyNAS01 kernel: ata3.00: failed command: SMART
Sep 1 16:23:48 ReadyNAS01 kernel: ata3.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
Sep 1 16:23:48 ReadyNAS01 kernel: res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Sep 1 16:23:48 ReadyNAS01 kernel: ata3.00: status: { DRDY }
Sep 1 16:23:48 ReadyNAS01 kernel: ata3: hard resetting link
Sep 1 16:23:48 ReadyNAS01 kernel: ata3: link is slow to respond, please be patient (ready=0)
Sep 1 16:23:48 ReadyNAS01 kernel: ata3: COMRESET failed (errno=-16)
Sep 1 16:23:48 ReadyNAS01 kernel: ata3: hard resetting link
Sep 1 16:23:48 ReadyNAS01 kernel: ata3: link is slow to respond, please be patient (ready=0)
Sep 1 16:23:48 ReadyNAS01 kernel: ata3: COMRESET failed (errno=-16)
Sep 1 16:23:48 ReadyNAS01 kernel: ata3: hard resetting link
Sep 1 16:23:48 ReadyNAS01 kernel: ata3: link is slow to respond, please be patient (ready=0)
Sep 1 16:23:48 ReadyNAS01 kernel: ata3: COMRESET failed (errno=-16)
Sep 1 16:23:48 ReadyNAS01 kernel: ata3: limiting SATA link speed to 1.5 Gbps
Sep 1 16:23:48 ReadyNAS01 kernel: ata3: hard resetting link
Sep 1 16:23:48 ReadyNAS01 kernel: ata3: COMRESET failed (errno=-16)
Sep 1 16:23:48 ReadyNAS01 kernel: ata3: reset failed, giving up
Sep 1 16:23:48 ReadyNAS01 kernel: ata3.00: disabled
Sep 1 16:23:48 ReadyNAS01 kernel: ata3: EH complete
Sep 1 16:23:48 ReadyNAS01 kernel: sd 2:0:0:0: [sdc] Unhandled error code
Sep 1 16:23:48 ReadyNAS01 kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Sep 1 16:23:48 ReadyNAS01 kernel: sd 2:0:0:0: [sdc] CDB: Read(10): 28 00 00 80 02 9c 00 00 20 00
Sep 1 16:23:48 ReadyNAS01 kernel: end_request: I/O error, dev sdc, sector 8389276
Sep 1 16:23:48 ReadyNAS01 kernel: sd 2:0:0:0: [sdc] Unhandled error code
Sep 1 16:23:48 ReadyNAS01 kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Sep 1 16:23:48 ReadyNAS01 kernel: sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 00 00 00 0c 00 00 02 00
Sep 1 16:23:48 ReadyNAS01 kernel: end_request: I/O error, dev sdc, sector 12
Sep 1 16:23:48 ReadyNAS01 kernel: end_request: I/O error, dev sdc, sector 12
Sep 1 16:23:48 ReadyNAS01 kernel: **************** super written barrier kludge on md0: error==IO 0xfffffffb
Sep 1 16:23:48 ReadyNAS01 kernel: sd 2:0:0:0: [sdc] Unhandled error code
Sep 1 16:23:48 ReadyNAS01 kernel: sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Sep 1 16:23:48 ReadyNAS01 kernel: sd 2:0:0:0: [sdc] CDB: Write(10): 2a 00 00 00 00 0c 00 00 02 00
Sep 1 16:23:48 ReadyNAS01 kernel: end_request: I/O error, dev sdc, sector 12
Sep 1 16:23:48 ReadyNAS01 kernel: md: super_written gets error=-5, uptodate=0
Sep 1 16:23:48 ReadyNAS01 kernel: raid1: Disk failure on sdc1, disabling device.
Sep 1 16:23:48 ReadyNAS01 kernel: raid1: Operation continuing on 5 devices.
Sep 1 16:23:48 ReadyNAS01 kernel: RAID1 conf printout:
Sep 1 16:23:48 ReadyNAS01 kernel: --- wd:5 rd:6
Sep 1 16:23:48 ReadyNAS01 kernel: disk 0, wo:0, o:1, dev:sda1
Sep 1 16:23:48 ReadyNAS01 kernel: disk 1, wo:0, o:1, dev:sdb1
Sep 1 16:23:48 ReadyNAS01 kernel: disk 2, wo:1, o:0, dev:sdc1
Sep 1 16:23:48 ReadyNAS01 kernel: disk 3, wo:0, o:1, dev:sdd1
Sep 1 16:23:48 ReadyNAS01 kernel: disk 4, wo:0, o:1, dev:sde1
Sep 1 16:23:48 ReadyNAS01 kernel: disk 5, wo:0, o:1, dev:sdf1
Sep 1 16:23:48 ReadyNAS01 kernel: RAID1 conf printout:
Sep 1 16:23:48 ReadyNAS01 kernel: --- wd:5 rd:6
Sep 1 16:23:48 ReadyNAS01 kernel: disk 0, wo:0, o:1, dev:sda1
Sep 1 16:23:48 ReadyNAS01 kernel: disk 1, wo:0, o:1, dev:sdb1
Sep 1 16:23:48 ReadyNAS01 kernel: disk 3, wo:0, o:1, dev:sdd1
Sep 1 16:23:48 ReadyNAS01 kernel: disk 4, wo:0, o:1, dev:sde1
Sep 1 16:23:48 ReadyNAS01 kernel: disk 5, wo:0, o:1, dev:sdf1
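Since the only trace of the failure was in the kernel log, here is a minimal sketch of a check that scans syslog lines for the two tell-tale entries above: the explicit "Disk failure on ..." line, and a conf printout where `wd` is lower than `rd`. I am assuming `wd` is the number of working devices and `rd` the number of devices in the array, which matches "Operation continuing on 5 devices" in this log. The function name is made up for illustration, and how to schedule it or send mail from the NAS is left out.

```python
import re

# Summary line of the md conf printout, e.g. "kernel: --- wd:5 rd:6".
# Assumption: wd = working devices, rd = devices in the array.
SUMMARY_RE = re.compile(r"--- wd:(\d+) rd:(\d+)")

# Explicit failure line, e.g. "raid1: Disk failure on sdc1, disabling device."
FAILURE_RE = re.compile(r"raid\d+: Disk failure on (\w+), disabling device")


def find_raid_problems(lines):
    """Return a list of human-readable problems found in syslog lines.

    Hypothetical helper, not part of RAIDiator or Frontview.
    """
    problems = []
    for line in lines:
        failure = FAILURE_RE.search(line)
        if failure:
            problems.append(f"disk failure on {failure.group(1)}")
            continue
        summary = SUMMARY_RE.search(line)
        if summary:
            working, total = int(summary.group(1)), int(summary.group(2))
            if working < total:
                problems.append(
                    f"array degraded: {working} of {total} devices working"
                )
    return problems


# Sample lines copied from the log in this post.
sample = [
    "Sep 1 16:23:48 ReadyNAS01 kernel: raid1: Disk failure on sdc1, disabling device.",
    "Sep 1 16:23:48 ReadyNAS01 kernel: raid1: Operation continuing on 5 devices.",
    "Sep 1 16:23:48 ReadyNAS01 kernel: RAID1 conf printout:",
    "Sep 1 16:23:48 ReadyNAS01 kernel: --- wd:5 rd:6",
]

for problem in find_raid_problems(sample):
    print(problem)
```

Run against the real syslog file instead of `sample`, anything this returns would be worth an alert, independent of whatever Frontview decides to report.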

3 Replies

  • I'd advise upgrading to the latest firmware, 4.2.21. I suppose the system sends an email only if the disk reports ATA errors. Can you post your 3rd disk's SMART report here?
  • "Disk Failure" is the first alert listed, and in fact you can't disable it (the option is checked, and greyed out).

    I'll contemplate upgrading the firmware, but I've had bad experiences with that in the past (lost a lot of customized settings and applications) and would prefer to try to solve the issue from the back end. Plus, the changes in 4.2.16 and newer that make it impossible to downgrade make me nervous, since I no longer have enough space to fully back up the NAS.

    I don't have the SMART codes from the NAS (obviously, since it detected the drive as dead), but here they are from HDDScan.
    Model: ST31500341AS
    Firmware: CC1H
    Serial: 9VS3AQRR
    LBA: 2930277168

    Report By: HDDScan for Windows version 3.3
    Report Date: 9/5/2012 9:13:03 PM


    Num  Attribute Name                  Value  Worst  Raw(hex)         Threshold
    001  Raw Read Error Rate             117    099    00000009B2-4A8A  006
    003  Spin Up Time                    100    092    0000000000-0000  000
    004  Start/Stop Count                100    100    0000000000-024A  020
    005  Reallocation Sector Count       100    100    0000000000-0000  036
    007  Seek Error Rate                 082    060    0000000A56-3097  030
    009  Power-On Hours Count            075    075    0000000000-5742  000
    010  Spin Retry Count                100    100    0000000000-0002  097
    012  Device Power Cycle Count        100    100    0000000000-0024  020
    184  End To End Error Count          100    100    0000000000-0000  099
    187  Reported Uncorrectable Error    100    100    0000000000-0000  000
    188  Reported Command Timeouts       099    099    0000010001-0001  000
    189  High Fly Writes                 001    001    0000000000-02F5  000
    190  Airflow Temperature             075    049    25 C             045
    190  Airflow Temperature Minimum     075    049    25 C             045
    190  Airflow Temperature Maximum     075    049    25 C             045
    194  HDA Temperature                 025    051    25 C             000
    194  HDA Temperature Minimum         025    051    19 C             000
    194  HDA Temperature Maximum         025    051    49 C             000
    195  Error Rate                      042    028    00000009B2-4A8A  000
    197  Current Pending Errors Count    100    100    0000000000-0000  000
    198  Uncorrectable Errors Count      100    100    0000000000-0000  000
    199  UltraDMA CRC Errors             200    200    0000000000-0000  000
    240  Heads Flying Hours              100    253    345ECC0000-570E  000
    241  Total Host Writes               100    253    0000002F48-4C93  000
    242  Total Host Reads                100    253    00000026E0-F00A  000
  • Another disk (disk 1 this time) has been marked as "dead" with no notifications or emails.

    This time I removed the disk, rebooted, and replaced it. The drive has no SMART errors and was able to resync fine. I set up a webcam facing the front panel, since the ONLY indication I have of these failures is the OLED staying on with "Vol C Unprotected".

    I will attempt to back up what critical data I can on spare drives and update the firmware (against my better judgement).

    Could replacement drives not listed in the HCL be to blame? I recently had to replace two drives due to increasing SMART reallocation errors, but I used WDC drives that are not listed (see this cross-post)
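For anyone reading the HDDScan report in the second reply: the Raw(hex) column is easier to judge in decimal. Below is a minimal sketch that converts a few of the plain counters. It assumes the dash in values like `0000000000-5742` is only a visual separator inside one hex number; for the rows used here the left half is all zeros, so the result is the same either way. Attributes whose raw field packs several values together (for example Raw Read Error Rate or Reported Command Timeouts on this drive) are deliberately left out. `raw_to_int` is a made-up helper name.

```python
def raw_to_int(raw):
    """Convert an HDDScan raw value such as '0000000000-5742' to an integer.

    Assumption: the dash is just a separator within a single hex number.
    """
    return int(raw.replace("-", ""), 16)


# Raw values copied from the HDDScan report in this thread.
counters = {
    "Start/Stop Count": "0000000000-024A",
    "Power-On Hours Count": "0000000000-5742",
    "Device Power Cycle Count": "0000000000-0024",
    "High Fly Writes": "0000000000-02F5",
}

for name, raw in counters.items():
    print(f"{name}: {raw_to_int(raw)}")
```

Read this way, the drive reports 22338 power-on hours, 586 start/stops, 36 power cycles and 757 high fly writes.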
