Forum Discussion
blacey
Aspirant · Sep 29, 2012
3 drive failures w/in 10 days of 4.2.22 upgrade? #19570015
Within two days of upgrading my Ultra 6 (six 2 TB drives) to 4.2.22, it marked a drive dead. I always keep a spare, so I swapped the drive out and completed a cross-ship RMA with Seagate. Before the new drive even arrived, a second drive was marked dead with the same symptoms (spin-control/command-retry latency). I was away on travel, so I remotely disabled the Power Scheduler and powered the NAS down for a couple of days.
Last night I returned home to find the first replacement drive at my door, so I powered up the NAS to grab the logs for RMA processing with Seagate, and the NAS started re-syncing the drive that it had previously marked dead. I let it resync overnight, and at this point the purportedly dead drive seems to be working fine.
I hadn't had a drive failure in a long time, yet I experienced two dead drives within one week of the 4.2.22 firmware upgrade. In both cases the drives passed the daily SMART tests but were failed due to spin/command latencies. Is it possible that the drives are fine and that there is a bug in the 4.2.22 firmware that improperly detects/diagnoses a dead drive?
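One hint that the firmware may have read garbage rather than real SMART data: the implausibly large raw counters in the logs below decode to ASCII/control bytes when viewed in hex (my read, not confirmed). A quick sanity check from any shell:

    printf '0x%08X\n' 168626698    # -> 0x0A0D0A0A: LF/CR/LF/LF control bytes, not a plausible spin retry count
    printf '0x%08X\n' 1701265709   # -> 0x6567412D: "egA-" in ASCII, printable garbage
    printf '0x%08X\n' 65537        # -> 0x00010001: the same small value packed into two 16-bit halves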
Chronology of Events:
9/20 - Upgrade to 4.2.22
9/24 - First dead disk. Replaced the drive, resynced, and cross-shipped the RMA to Seagate.
9/26 - Second dead disk. Shut down remotely and disabled the power scheduler so I could deal with it when the first replacement drive arrived.
9/28 - Replacement drive arrived. Powered up the NAS; the second dead disk was no longer dead. Resync completed, array fully operational (the md sketch below shows how to watch this).
9/29 - Posted this experience to rule in/out a 4.2.22 firmware regression before RMA'ing another drive.
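For reference, the resync can be watched directly over SSH while the NAS recovers (a sketch; md2 and sdb5 are the device names from my logs and may differ on other units):

    cat /proc/mdstat                    # array states and resync progress
    mdadm --detail /dev/md2             # per-member status of the data volume
    # if a kicked member turns out to be healthy, md can often take it back:
    # mdadm /dev/md2 --re-add /dev/sdb5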
Here are the relevant log snippets and SMART reports for the two drive failures:
Abridged user log from the 9/22 upgrade -> 9/24 first failure (replace, resync) -> 9/26 second failure (with the resync on 9/28):
Sep 21 23:00:12 nas RAIDiator: System power-on scheduled for 09/22/2012 08:00
Sep 22 08:01:13 nas RAIDiator: System is up.
Sep 22 08:06:21 nas RAIDiator: Backup finished. [Job 002]
Sep 22 08:08:11 nas RAIDiator: Backup finished. [Job 003]
Sep 22 08:08:37 nas RAIDiator: Please reboot your ReadyNAS device to continue with the update process.
Sep 22 08:08:46 nas RAIDiator: Backup finished. [Job 004]
Sep 22 08:09:03 nas RAIDiator: Please close this browser session and use RAIDar to reconnect to the device. System rebooting...
Sep 22 08:09:03 nas RAIDiator: Rebooting device...
Sep 22 08:09:08 nas shutdown[4561]: shutting down for system reboot
Sep 22 08:11:56 nas RAIDiator: System Update Status (nas) : Your ReadyNAS device has been updated with a new firmware image. (RAIDiator-x86 4.2.22)
Sep 22 08:12:07 nas RAIDiator: System is up.
Sep 22 08:12:25 nas RAIDiator: Reallocated sector count has increased in the last day.\n\nDisk 0:\n Previous count: 51\n Current count: 93\n\nGrowing SMART errors indicate a disk that may fail soon. If the errors continue to increase, you should be prepared to replace the disk.
Sep 22 08:12:25 nas RAIDiator: New SMART disk errors detected! (nas) : Reallocated sector count has increased in the last day.\n\nDisk 0:\n Previous count: 51\n Current count: 93\n\nGrowing SMART errors indicate a disk that may fail soon. If the errors continue to increase, you should be prepared to replace the disk.
Sep 22 09:05:58 nas RAIDiator: Backup finished. [Job 002]
Sep 22 10:05:10 nas RAIDiator: Backup finished. [Job 003]
Sep 22 12:05:08 nas RAIDiator: Backup finished. [Job 005]
Sep 22 12:05:13 nas RAIDiator: Backup finished. [Job 001]
Sep 22 23:01:08 nas RAIDiator: System power-on scheduled for 09/23/2012 08:00
Sep 23 08:01:07 nas RAIDiator: nic.agent: "nice -n 5 /usr/sbin/apache-ssl -f /etc/frontview/apache/httpd.conf" failed to start.
Sep 23 08:01:17 nas RAIDiator: System is up.
Sep 23 08:05:47 nas RAIDiator: Backup finished. [Job 003]
Sep 23 21:05:42 nas RAIDiator: Backup finished. [Job 002]
Sep 23 22:00:13 nas RAIDiator: System power-on scheduled for 09/24/2012 07:00
Sep 24 07:01:14 nas RAIDiator: System is up.
Sep 24 07:05:40 nas RAIDiator: Backup finished. [Job 002]
Sep 24 07:18:28 nas RAIDiator: Detected increasing reallocated sector count on disk 2 [-1, 0]
Sep 24 07:18:28 nas RAIDiator: Reallocation event count: [-1, 0]
Sep 24 07:18:28 nas RAIDiator: Detected increasing spin retry count on disk 2 [168626698, 0]
Sep 24 07:18:28 nas RAIDiator: Detected increasing spin retry count[168626698] on disk 2 [ST32000542AS, 6XW1KLP1]. This often indicates an impending failure. Please be prepared to replace this disk to maintain data redundancy.
Sep 24 07:18:49 nas RAIDiator: Detected increasing end-to-end errors on disk 2 [-1, 0]
Sep 24 07:18:49 nas RAIDiator: Detected increasing command timeouts on disk 2 [-1, 0]
Sep 24 07:18:49 nas RAIDiator: Detected increasing pending sector count on disk 2 [-1, 0]
Sep 24 07:18:49 nas RAIDiator: Detected increasing uncorrectable errors on disk 2 [-1, 0]
Sep 24 07:18:49 nas RAIDiator: Current (temp, start_stop_cnt, power_on_hrs, power_cycle_cnt, load_cycle_cnt) = (-1, -1, -1, -1, -1)
Sep 24 08:01:33 nas RAIDiator: Disk failure detected.
Sep 24 08:01:33 nas RAIDiator: If the failed disk is used in a RAID level 1, 5, or X-RAID volume, please note that volume is now unprotected, and an additional disk failure may render that volume dead. If this disk is a part of a RAID 6 volume, your volume is still protected if this is your first failure. A 2nd disk failure will make your volume unprotected. If this disk is a part of a RAID 10 volume,your volume is still protected if more than half of the disks alive. But another failure of disks been marked may render that volume dead. It is recommended that you replace the failed disk as soon as possible to maintain optimal protection of your volume.
Sep 24 08:01:33 nas RAIDiator: Disk failure detected. (nas) : Disk failure detected.\n\nIf the failed disk is used in a RAID level 1, 5, or X-RAID volume, please note that volume is now unprotected, and an additional disk failure may render that volume dead. If this disk is a part of a RAID 6 volume, your volume is still protected if this is your first failure. A 2nd disk failure will make your volume unprotected. If this disk is a part of a RAID 10 volume,your volume is still protected if more than half of the disks alive. But another failure of disks been marked may render that volume dead. It is recommended that you replace the failed disk as soon as possible to maintain optimal protection of your volume.\n\n[Mon Sep 24 08:01:33 PDT 2012]
Sep 24 08:06:27 nas RAIDiator: Backup finished. [Job 002]
Sep 24 08:06:27 nas RAIDiator: Backup finished. [Job 003]
Sep 24 08:06:54 nas RAIDiator: Backup finished. [Job 004]
Sep 24 12:05:08 nas RAIDiator: Backup finished. [Job 005]
Sep 24 20:32:52 nas shutdown[28336]: shutting down for system halt
Sep 24 20:33:03 nas RAIDiator: System power-on scheduled for 09/25/2012 07:00
Sep 24 20:34:31 nas RAIDiator: nic.agent: "nice -n 5 /usr/sbin/apache-ssl -f /etc/frontview/apache/httpd.conf" failed to start.
Sep 24 20:34:41 nas RAIDiator: System is up.
Sep 24 20:35:02 nas RAIDiator: Reallocated sector count has increased in the last day.\n\nDisk 1:\n Previous count: 10\n Current count: 13\n\nGrowing SMART errors indicate a disk that may fail soon. If the errors continue to increase, you should be prepared to replace the disk.
Sep 24 20:35:03 nas RAIDiator: New SMART disk errors detected! (nas) : Reallocated sector count has increased in the last day.\n\nDisk 1:\n Previous count: 10\n Current count: 13\n\nGrowing SMART errors indicate a disk that may fail soon. If the errors continue to increase, you should be prepared to replace the disk.
Sep 24 20:35:50 nas RAIDiator: RAID sync started on volume C.
Sep 24 20:35:50 nas RAIDiator: RAID event detected. (nas) : RAID sync started on volume C.\n\n[Mon Sep 24 20:35:50 PDT 2012]
Sep 24 20:35:54 nas RAIDiator: Detected increasing reallocated sector count on disk 1 [13, 10]
Sep 24 20:35:54 nas RAIDiator: Reallocation event count: [13, 10]
Sep 24 20:35:54 nas RAIDiator: Current (temp, start_stop_cnt, power_on_hrs, power_cycle_cnt, load_cycle_cnt) = (34, 14809, 10758, 629, -1)
Sep 24 21:05:43 nas RAIDiator: Backup finished. [Job 002]
Sep 24 22:05:10 nas RAIDiator: Backup finished. [Job 003]
Sep 25 00:05:08 nas RAIDiator: Backup finished. [Job 005]
Sep 25 00:06:46 nas RAIDiator: Backup finished. [Job 004]
Sep 25 06:05:45 nas RAIDiator: Backup finished. [Job 002]
Sep 25 06:07:46 nas RAIDiator: RAID sync finished on volume C.
Sep 25 06:07:47 nas RAIDiator: RAID event detected. (nas) : RAID sync finished on volume C.\n\n[Tue Sep 25 06:07:42 PDT 2012]
Sep 25 07:06:09 nas RAIDiator: Backup finished. [Job 002]
Sep 25 08:05:37 nas RAIDiator: Backup finished. [Job 003]
Sep 25 21:05:46 nas RAIDiator: Backup finished. [Job 002]
Sep 25 22:00:57 nas RAIDiator: System power-on scheduled for 09/26/2012 07:00
Sep 25 22:08:31 nas RAIDiator: System is up.
Sep 25 23:05:58 nas RAIDiator: Backup finished. [Job 002]
Sep 26 06:05:16 nas RAIDiator: Backup finished. [Job 001]
Sep 26 06:25:02 nas RAIDiator: Volume C is approaching capacity:\n99% used\n121G available
Sep 26 07:06:15 nas RAIDiator: Backup finished. [Job 002]
Sep 26 08:05:49 nas RAIDiator: Backup finished. [Job 003]
Sep 26 08:07:12 nas RAIDiator: Backup finished. [Job 004]
Sep 26 12:05:40 nas RAIDiator: Backup finished. [Job 005]
Sep 26 20:07:14 nas RAIDiator: Backup finished. [Job 004]
Sep 26 21:06:22 nas RAIDiator: Detected increasing reallocated sector count on disk 1 [-1, 13]
Sep 26 21:06:22 nas RAIDiator: Reallocation event count: [-1, 13]
Sep 26 21:06:22 nas RAIDiator: Detected increasing spin retry count on disk 1 [1701265709, 0]
Sep 26 21:06:22 nas RAIDiator: Detected increasing spin retry count[1701265709] on disk 1 [ST32000542AS, 5XW1BJZ6]. This often indicates an impending failure. Please be prepared to replace this disk to maintain data redundancy.
Sep 26 21:06:43 nas RAIDiator: Detected increasing end-to-end errors on disk 1 [-1, 0]
Sep 26 21:06:43 nas RAIDiator: Detected increasing command timeouts on disk 1 [-1, 65537]
Sep 26 21:06:43 nas RAIDiator: Detected increasing pending sector count on disk 1 [-1, 0]
Sep 26 21:06:43 nas RAIDiator: Detected increasing uncorrectable errors on disk 1 [-1, 0]
Sep 26 21:06:43 nas RAIDiator: Current (temp, start_stop_cnt, power_on_hrs, power_cycle_cnt, load_cycle_cnt) = (-1, -1, -1, -1, -1)
Sep 26 21:07:04 nas RAIDiator: Backup finished. [Job 002]
Sep 26 21:07:10 nas RAIDiator: Disk failure detected.
Sep 26 21:07:10 nas RAIDiator: If the failed disk is used in a RAID level 1, 5, or X-RAID volume, please note that volume is now unprotected, and an additional disk failure may render that volume dead. If this disk is a part of a RAID 6 volume, your volume is still protected if this is your first failure. A 2nd disk failure will make your volume unprotected. If this disk is a part of a RAID 10 volume,your volume is still protected if more than half of the disks alive. But another failure of disks been marked may render that volume dead. It is recommended that you replace the failed disk as soon as possible to maintain optimal protection of your volume.
Sep 26 21:07:11 nas RAIDiator: Disk failure detected. (nas) : Disk failure detected.\n\nIf the failed disk is used in a RAID level 1, 5, or X-RAID volume, please note that volume is now unprotected, and an additional disk failure may render that volume dead. If this disk is a part of a RAID 6 volume, your volume is still protected if this is your first failure. A 2nd disk failure will make your volume unprotected. If this disk is a part of a RAID 10 volume,your volume is still protected if more than half of the disks alive. But another failure of disks been marked may render that volume dead. It is recommended that you replace the failed disk as soon as possible to maintain optimal protection of your volume.\n\n[Wed Sep 26 21:07:10 PDT 2012]
Sep 26 22:00:12 nas RAIDiator: System power-on scheduled for 09/27/2012 07:00
Sep 26 22:13:34 nas RAIDiator: Detected increasing reallocated sector count on disk 1 [13, -1]
Sep 26 22:13:34 nas RAIDiator: Reallocation event count: [13, -1]
Sep 26 22:13:34 nas RAIDiator: Detected increasing spin retry count on disk 1 [0, 1701265709]
Sep 26 22:13:34 nas RAIDiator: Detected increasing end-to-end errors on disk 1 [0, -1]
Sep 26 22:13:34 nas RAIDiator: Detected increasing command timeouts on disk 1 [131074, -1]
Sep 26 22:13:34 nas RAIDiator: Detected increasing command timeouts[131074] on disk 1 [ST32000542AS, 5XW1BJZ6]. This often indicates an impending failure. Please be prepared to replace this disk to maintain data redundancy.
Sep 26 22:13:35 nas RAIDiator: Detected increasing pending sector count on disk 1 [0, -1]
Sep 26 22:13:35 nas RAIDiator: Detected increasing uncorrectable errors on disk 1 [0, -1]
Sep 26 22:13:35 nas RAIDiator: Current (temp, start_stop_cnt, power_on_hrs, power_cycle_cnt, load_cycle_cnt) = (32, 14863, 10805, 631, -1)
Sep 26 22:19:12 nas RAIDiator: Please close this browser session and use RAIDar to reconnect to the device after it is powered back on. System powering off...
Sep 26 22:19:12 nas RAIDiator: Powering off device
Sep 26 22:19:22 nas shutdown[4917]: shutting down for system halt
Sep 28 19:06:03 nas RAIDiator: Backup finished. [Job 002]
Sep 28 20:05:15 nas RAIDiator: Backup finished. [Job 003]
Sep 28 20:07:31 nas RAIDiator: Backup finished. [Job 004]
Sep 29 00:05:07 nas RAIDiator: Backup finished. [Job 005]
Sep 29 02:25:37 nas RAIDiator: RAID sync finished on volume C.
Sep 29 02:25:37 nas RAIDiator: RAID event detected. (nas) : RAID sync finished on volume C.\n\n[Sat Sep 29 02:25:33 PDT 2012]
Sep 29 03:06:02 nas RAIDiator: Backup finished. [Job 002]
Sep 29 04:05:41 nas RAIDiator: Backup finished. [Job 003]
Sep 29 04:06:44 nas RAIDiator: Backup finished. [Job 004]
Sep 29 09:06:28 nas RAIDiator: Backup finished. [Job 002]
System Log for first drive failure on 9/24
Sep 24 07:16:45 nas noflushd[2186]: Spinning down disks.
Sep 24 07:17:11 nas noflushd[2186]: Disks spinning up after 0 minutes.
Sep 24 07:17:28 nas kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep 24 07:18:28 nas kernel: ata2.00: failed command: SMART
Sep 24 07:18:28 nas kernel: ata2.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/40 tag 0 pio 512 in
Sep 24 07:18:28 nas kernel: res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Sep 24 07:18:28 nas kernel: ata2.00: status: { DRDY }
Sep 24 07:18:28 nas kernel: ata2: hard resetting link
Sep 24 07:18:28 nas kernel: ata2: link is slow to respond, please be patient (ready=0)
Sep 24 07:18:28 nas kernel: ata2: COMRESET failed (errno=-16)
Sep 24 07:18:28 nas kernel: ata2: hard resetting link
Sep 24 07:18:28 nas kernel: ata2: link is slow to respond, please be patient (ready=0)
Sep 24 07:18:28 nas kernel: ata2: COMRESET failed (errno=-16)
Sep 24 07:18:28 nas kernel: ata2: hard resetting link
Sep 24 07:18:28 nas kernel: ata2: link is slow to respond, please be patient (ready=0)
Sep 24 07:18:28 nas kernel: ata2: COMRESET failed (errno=-16)
Sep 24 07:18:28 nas kernel: ata2: limiting SATA link speed to 1.5 Gbps
Sep 24 07:18:28 nas kernel: ata2: hard resetting link
Sep 24 07:18:28 nas kernel: ata2: COMRESET failed (errno=-16)
Sep 24 07:18:28 nas kernel: ata2: reset failed, giving up
Sep 24 07:18:28 nas kernel: ata2.00: disabled
Sep 24 07:18:28 nas kernel: ata2: EH complete
Sep 24 07:18:28 nas kernel: sd 1:0:0:0: [sdb] Unhandled error code
Sep 24 07:18:28 nas kernel: sd 1:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Sep 24 07:18:28 nas kernel: sd 1:0:0:0: [sdb] CDB: Write(10): 2a 00 00 00 00 48 00 00 02 00
Sep 24 07:18:28 nas kernel: end_request: I/O error, dev sdb, sector 72
Sep 24 07:18:28 nas kernel: end_request: I/O error, dev sdb, sector 72
Sep 24 07:18:28 nas kernel: md: super_written gets error=-5, uptodate=0
Sep 24 07:18:28 nas kernel: md/raid1:md0: Disk failure on sdb1, disabling device.
Sep 24 07:18:28 nas kernel: <1>md/raid1:md0: Operation continuing on 5 devices.
Sep 24 07:18:28 nas kernel: RAID1 conf printout:
Sep 24 07:18:28 nas kernel: --- wd:5 rd:6
Sep 24 07:18:28 nas kernel: disk 0, wo:0, o:1, dev:sda1
Sep 24 07:18:28 nas kernel: disk 1, wo:1, o:0, dev:sdb1
Sep 24 07:18:28 nas kernel: disk 2, wo:0, o:1, dev:sdc1
Sep 24 07:18:28 nas kernel: disk 3, wo:0, o:1, dev:sdd1
Sep 24 07:18:28 nas kernel: disk 4, wo:0, o:1, dev:sde1
Sep 24 07:18:28 nas kernel: disk 5, wo:0, o:1, dev:sdf1
Sep 24 07:18:28 nas kernel: RAID1 conf printout:
Sep 24 07:18:28 nas kernel: --- wd:5 rd:6
Sep 24 07:18:28 nas kernel: disk 0, wo:0, o:1, dev:sda1
Sep 24 07:18:28 nas kernel: disk 2, wo:0, o:1, dev:sdc1
Sep 24 07:18:28 nas kernel: disk 3, wo:0, o:1, dev:sdd1
Sep 24 07:18:28 nas kernel: disk 4, wo:0, o:1, dev:sde1
Sep 24 07:18:28 nas kernel: disk 5, wo:0, o:1, dev:sdf1
Sep 24 07:18:28 nas RAIDiator: Detected increasing reallocated sector count on disk 2 [-1, 0]
Sep 24 07:18:28 nas RAIDiator: Reallocation event count: [-1, 0]
Sep 24 07:18:28 nas RAIDiator: Detected increasing spin retry count on disk 2 [168626698, 0]
Sep 24 07:18:28 nas RAIDiator: Detected increasing spin retry count[168626698] on disk 2 [ST32000542AS, 6XW1KLP1]. This often indicates an impending failure. Please be prepared to replace this disk to maintain data redundancy.
Sep 24 07:18:49 nas RAIDiator: Detected increasing end-to-end errors on disk 2 [-1, 0]
Sep 24 07:18:49 nas RAIDiator: Detected increasing command timeouts on disk 2 [-1, 0]
Sep 24 07:18:49 nas RAIDiator: Detected increasing pending sector count on disk 2 [-1, 0]
Sep 24 07:18:49 nas RAIDiator: Detected increasing uncorrectable errors on disk 2 [-1, 0]
Sep 24 07:18:49 nas RAIDiator: Current (temp, start_stop_cnt, power_on_hrs, power_cycle_cnt, load_cycle_cnt) = (-1, -1, -1, -1, -1)
Sep 24 07:27:11 nas noflushd[2186]: Spinning down disks.
Sep 24 07:27:12 nas noflushd[2186]: Spindown for /dev/sdb failed: Input/output error
Sep 24 07:27:30 nas noflushd[2186]: Disks spinning up after 0 minutes.
Sep 24 07:32:33 nas noflushd[2186]: Spinning down disks.
Sep 24 07:32:34 nas noflushd[2186]: Spindown for /dev/sdb failed: Input/output error
Sep 24 07:32:51 nas noflushd[2186]: Disks spinning up after 0 minutes.
Sep 24 07:37:54 nas noflushd[2186]: Spinning down disks.
Sep 24 07:37:55 nas noflushd[2186]: Spindown for /dev/sdb failed: Input/output error
Sep 24 07:38:13 nas noflushd[2186]: Disks spinning up after 0 minutes.
Sep 24 07:43:16 nas noflushd[2186]: Spinning down disks.
Sep 24 07:43:17 nas noflushd[2186]: Spindown for /dev/sdb failed: Input/output error
Sep 24 07:43:34 nas noflushd[2186]: Disks spinning up after 0 minutes.
Sep 24 07:45:00 nas rsyncd[4686]: connect from miniduo1.home (192.168.0.103)
Sep 24 07:45:00 nas rsyncd[4686]: unknown module 'backup' tried from miniduo1.home (192.168.0.103)
Sep 24 07:48:37 nas noflushd[2186]: Spinning down disks.
Sep 24 07:48:38 nas noflushd[2186]: Spindown for /dev/sdb failed: Input/output error
Sep 24 07:48:55 nas noflushd[2186]: Disks spinning up after 0 minutes.
Sep 24 07:53:58 nas noflushd[2186]: Spinning down disks.
Sep 24 07:53:59 nas noflushd[2186]: Spindown for /dev/sdb failed: Input/output error
Sep 24 07:54:17 nas noflushd[2186]: Disks spinning up after 0 minutes.
Sep 24 07:59:20 nas noflushd[2186]: Spinning down disks.
Sep 24 07:59:21 nas noflushd[2186]: Spindown for /dev/sdb failed: Input/output error
Sep 24 07:59:38 nas noflushd[2186]: Disks spinning up after 0 minutes.
Sep 24 08:01:01 nas udhcpc[2728]: Sending renew...
Sep 24 08:01:01 nas udhcpc[2728]: Lease of 192.168.0.110 obtained, lease time 7200
Sep 24 08:01:29 nas kernel: sd 1:0:0:0: [sdb] Unhandled error code
Sep 24 08:01:29 nas kernel: sd 1:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Sep 24 08:01:29 nas kernel: sd 1:0:0:0: [sdb] CDB: Write(10): 2a 00 00 90 00 50 00 00 02 00
Sep 24 08:01:29 nas kernel: end_request: I/O error, dev sdb, sector 9437264
Sep 24 08:01:29 nas kernel: end_request: I/O error, dev sdb, sector 9437264
Sep 24 08:01:29 nas kernel: md: super_written gets error=-5, uptodate=0
Sep 24 08:01:29 nas kernel: md/raid:md2: Disk failure on sdb5, disabling device.
Sep 24 08:01:29 nas kernel: <1>md/raid:md2: Operation continuing on 5 devices.
Sep 24 08:01:29 nas kernel: RAID conf printout:
Sep 24 08:01:29 nas kernel: --- level:5 rd:6 wd:5
Sep 24 08:01:29 nas kernel: disk 0, o:1, dev:sda5
Sep 24 08:01:29 nas kernel: disk 1, o:0, dev:sdb5
Sep 24 08:01:29 nas kernel: disk 2, o:1, dev:sdc5
Sep 24 08:01:29 nas kernel: disk 3, o:1, dev:sdd5
Sep 24 08:01:29 nas kernel: disk 4, o:1, dev:sde5
Sep 24 08:01:29 nas kernel: disk 5, o:1, dev:sdf5
Sep 24 08:01:29 nas kernel: RAID conf printout:
Sep 24 08:01:29 nas kernel: --- level:5 rd:6 wd:5
Sep 24 08:01:29 nas kernel: disk 0, o:1, dev:sda5
Sep 24 08:01:29 nas kernel: disk 2, o:1, dev:sdc5
Sep 24 08:01:29 nas kernel: disk 3, o:1, dev:sdd5
Sep 24 08:01:29 nas kernel: disk 4, o:1, dev:sde5
Sep 24 08:01:29 nas kernel: disk 5, o:1, dev:sdf5
Sep 24 08:01:33 nas RAIDiator: Disk failure detected.
Sep 24 08:01:33 nas RAIDiator: If the failed disk is used in a RAID level 1, 5, or X-RAID volume, please note that volume is now unprotected, and an additional disk failure may render that volume dead. If this disk is a part of a RAID 6 volume, your volume is still protected if this is your first failure. A 2nd disk failure will make your volume unprotected. If this disk is a part of a RAID 10 volume,your volume is still protected if more than half of the disks alive. But another failure of disks been marked may render that volume dead. It is recommended that you replace the failed disk as soon as possible to maintain optimal protection of your volume.
Sep 24 08:01:33 nas RAIDiator: Disk failure detected. (nas) : Disk failure detected.\n\nIf the failed disk is used in a RAID level 1, 5, or X-RAID volume, please note that volume is now unprotected, and an additional disk failure may render that volume dead. If this disk is a part of a RAID 6 volume, your volume is still protected if this is your first failure. A 2nd disk failure will make your volume unprotected. If this disk is a part of a RAID 10 volume,your volume is still protected if more than half of the disks alive. But another failure of disks been marked may render that volume dead. It is recommended that you replace the failed disk as soon as possible to maintain optimal protection of your volume.\n\n[Mon Sep 24 08:01:33 PDT 2012]
Sep 24 08:06:02 nas kernel: sd 1:0:0:0: [sdb] Unhandled error code
Sep 24 08:06:02 nas kernel: sd 1:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Sep 24 08:06:02 nas kernel: sd 1:0:0:0: [sdb] CDB: Write(10): 2a 00 00 80 00 48 00 00 02 00
Sep 24 08:06:02 nas kernel: end_request: I/O error, dev sdb, sector 8388680
Sep 24 08:06:02 nas kernel: end_request: I/O error, dev sdb, sector 8388680
Sep 24 08:06:02 nas kernel: md: super_written gets error=-5, uptodate=0
Sep 24 08:06:02 nas kernel: md/raid:md1: Disk failure on sdb2, disabling device.
Sep 24 08:06:02 nas kernel: <1>md/raid:md1: Operation continuing on 5 devices.
Sep 24 08:06:02 nas kernel: RAID conf printout:
Sep 24 08:06:02 nas kernel: --- level:6 rd:6 wd:5
Sep 24 08:06:02 nas kernel: disk 0, o:1, dev:sda2
Sep 24 08:06:02 nas kernel: disk 1, o:0, dev:sdb2
Sep 24 08:06:02 nas kernel: disk 2, o:1, dev:sdc2
Sep 24 08:06:02 nas kernel: disk 3, o:1, dev:sdd2
Sep 24 08:06:02 nas kernel: disk 4, o:1, dev:sde2
Sep 24 08:06:02 nas kernel: disk 5, o:1, dev:sdf2
Sep 24 08:06:02 nas kernel: RAID conf printout:
Sep 24 08:06:02 nas kernel: --- level:6 rd:6 wd:5
Sep 24 08:06:02 nas kernel: disk 0, o:1, dev:sda2
Sep 24 08:06:02 nas kernel: disk 2, o:1, dev:sdc2
Sep 24 08:06:02 nas kernel: disk 3, o:1, dev:sdd2
Sep 24 08:06:02 nas kernel: disk 4, o:1, dev:sde2
Sep 24 08:06:02 nas kernel: disk 5, o:1, dev:sdf2
Sep 24 08:06:27 nas RAIDiator: Backup finished. [Job 002]
Sep 24 08:06:27 nas RAIDiator: Backup finished. [Job 003]
Sep 24 08:06:54 nas RAIDiator: Backup finished. [Job 004]
Sep 24 08:28:04 nas noflushd[2186]: Spinning down disks.
Sep 24 08:28:05 nas noflushd[2186]: Spindown for /dev/sdb failed: Input/output error
Sep 24 08:28:23 nas noflushd[2186]: Disks spinning up after 0 minutes.
Sep 24 08:33:25 nas noflushd[2186]: Spinning down disks.
Sep 24 08:33:26 nas noflushd[2186]: Spindown for /dev/sdb failed: Input/output error
Sep 24 08:33:43 nas noflushd[2186]: Disks spinning up after 0 minutes.
Sep 24 08:38:46 nas noflushd[2186]: Spinning down disks.
Sep 24 08:38:47 nas noflushd[2186]: Spindown for /dev/sdb failed: Input/output error
Sep 24 08:39:04 nas noflushd[2186]: Disks spinning up after 0 minutes.
Sep 24 08:44:17 nas noflushd[2186]: Spinning down disks.
Sep 24 08:44:18 nas noflushd[2186]: Spindown for /dev/sdb failed: Input/output error
Sep 24 08:44:35 nas noflushd[2186]: Disks spinning up after 0 minutes.
Sep 24 08:49:48 nas noflushd[2186]: Spinning down disks.
Sep 24 08:49:49 nas noflushd[2186]: Spindown for /dev/sdb failed: Input/output error
Sep 24 08:50:07 nas noflushd[2186]: Disks spinning up after 0 minutes.
Sep 24 08:55:09 nas noflushd[2186]: Spinning down disks.
Sep 24 08:55:10 nas noflushd[2186]: Spindown for /dev/sdb failed: Input/output error
Sep 24 08:55:28 nas noflushd[2186]: Disks spinning up after 0 minutes.
Sep 24 09:00:30 nas noflushd[2186]: Spinning down disks.
Sep 24 09:00:32 nas noflushd[2186]: Spindown for /dev/sdb failed: Input/output error
Sep 24 09:00:49 nas noflushd[2186]: Disks spinning up after 0 minutes.
First drive failure SMART log
***** Disk SMART log for channel 2 [sdb] *****
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.37.6.RNx86_64.2.4] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda LP
Device Model: ST32000542AS
Serial Number: 6XW1KLP1
LU WWN Device Id: 5 000c50 0228e9b48
Firmware Version: CC34
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Thu Sep 20 07:01:46 2012 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 643) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x103f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 119 099 006 Pre-fail Always - 228790855
3 Spin_Up_Time 0x0003 100 100 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 091 091 020 Old_age Always - 9786
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 044 044 030 Pre-fail Always - 15212863609781
9 Power_On_Hours 0x0032 093 093 000 Old_age Always - 6658
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 357
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 077 055 045 Old_age Always - 23 (Min/Max 22/23)
194 Temperature_Celsius 0x0022 023 045 000 Old_age Always - 23 (0 20 0 0 0)
195 Hardware_ECC_Recovered 0x001a 046 034 000 Old_age Always - 228790855
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 51973399256608
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 2100760496
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 2052664029
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 0 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
System Log for second drive failure on 9/26
Sep 26 20:35:26 nas noflushd[2202]: Spinning down disks.
Sep 26 21:05:13 nas noflushd[2202]: Disks spinning up after 29 minutes.
Sep 26 21:06:22 nas kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep 26 21:06:22 nas kernel: ata1.00: failed command: SMART
Sep 26 21:06:22 nas kernel: ata1.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/40 tag 0 pio 512 in
Sep 26 21:06:22 nas kernel: res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Sep 26 21:06:22 nas kernel: ata1.00: status: { DRDY }
Sep 26 21:06:22 nas kernel: ata1: hard resetting link
Sep 26 21:06:22 nas kernel: ata1: link is slow to respond, please be patient (ready=0)
Sep 26 21:06:22 nas kernel: ata1: COMRESET failed (errno=-16)
Sep 26 21:06:22 nas kernel: ata1: hard resetting link
Sep 26 21:06:22 nas kernel: ata1: link is slow to respond, please be patient (ready=0)
Sep 26 21:06:22 nas kernel: ata1: COMRESET failed (errno=-16)
Sep 26 21:06:22 nas kernel: ata1: hard resetting link
Sep 26 21:06:22 nas kernel: ata1: link is slow to respond, please be patient (ready=0)
Sep 26 21:06:22 nas kernel: ata1: COMRESET failed (errno=-16)
Sep 26 21:06:22 nas kernel: ata1: limiting SATA link speed to 1.5 Gbps
Sep 26 21:06:22 nas kernel: ata1: hard resetting link
Sep 26 21:06:22 nas kernel: ata1: COMRESET failed (errno=-16)
Sep 26 21:06:22 nas kernel: ata1: reset failed, giving up
Sep 26 21:06:22 nas kernel: ata1.00: disabled
Sep 26 21:06:22 nas kernel: ata1: EH complete
Sep 26 21:06:22 nas kernel: sd 0:0:0:0: [sda] Unhandled error code
Sep 26 21:06:22 nas kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Sep 26 21:06:22 nas kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 20 08 70 00 00 08 00
Sep 26 21:06:22 nas kernel: end_request: I/O error, dev sda, sector 2099312
Sep 26 21:06:22 nas kernel: md/raid1:md0: sda1: rescheduling sector 2097200
Sep 26 21:06:22 nas kernel: sd 0:0:0:0: [sda] Unhandled error code
Sep 26 21:06:22 nas kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Sep 26 21:06:22 nas kernel: sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 00 00 48 00 00 02 00
Sep 26 21:06:22 nas kernel: end_request: I/O error, dev sda, sector 72
Sep 26 21:06:22 nas kernel: end_request: I/O error, dev sda, sector 72
Sep 26 21:06:22 nas kernel: md: super_written gets error=-5, uptodate=0
Sep 26 21:06:22 nas kernel: md/raid1:md0: Disk failure on sda1, disabling device.
Sep 26 21:06:22 nas kernel: <1>md/raid1:md0: Operation continuing on 5 devices.
Sep 26 21:06:22 nas kernel: RAID1 conf printout:
Sep 26 21:06:22 nas kernel: --- wd:5 rd:6
Sep 26 21:06:22 nas kernel: disk 0, wo:1, o:0, dev:sda1
Sep 26 21:06:22 nas kernel: disk 1, wo:0, o:1, dev:sdb1
Sep 26 21:06:22 nas kernel: disk 2, wo:0, o:1, dev:sdc1
Sep 26 21:06:22 nas kernel: disk 3, wo:0, o:1, dev:sdd1
Sep 26 21:06:22 nas kernel: disk 4, wo:0, o:1, dev:sde1
Sep 26 21:06:22 nas kernel: disk 5, wo:0, o:1, dev:sdf1
Sep 26 21:06:22 nas kernel: RAID1 conf printout:
Sep 26 21:06:22 nas kernel: --- wd:5 rd:6
Sep 26 21:06:22 nas kernel: disk 1, wo:0, o:1, dev:sdb1
Sep 26 21:06:22 nas kernel: disk 2, wo:0, o:1, dev:sdc1
Sep 26 21:06:22 nas kernel: disk 3, wo:0, o:1, dev:sdd1
Sep 26 21:06:22 nas kernel: disk 4, wo:0, o:1, dev:sde1
Sep 26 21:06:22 nas kernel: disk 5, wo:0, o:1, dev:sdf1
Sep 26 21:06:22 nas kernel: md/raid1:md0: redirecting sector 2097200 to other mirror: sdf1
Sep 26 21:06:22 nas RAIDiator: Detected increasing reallocated sector count on disk 1 [-1, 13]
Sep 26 21:06:22 nas RAIDiator: Reallocation event count: [-1, 13]
Sep 26 21:06:22 nas RAIDiator: Detected increasing spin retry count on disk 1 [1701265709, 0]
Sep 26 21:06:22 nas RAIDiator: Detected increasing spin retry count[1701265709] on disk 1 [ST32000542AS, 5XW1BJZ6]. This often indicates an impending failure. Please be prepared to replace this disk to maintain data redundancy.
Sep 26 21:06:43 nas RAIDiator: Detected increasing end-to-end errors on disk 1 [-1, 0]
Sep 26 21:06:43 nas RAIDiator: Detected increasing command timeouts on disk 1 [-1, 65537]
Sep 26 21:06:43 nas RAIDiator: Detected increasing pending sector count on disk 1 [-1, 0]
Sep 26 21:06:43 nas RAIDiator: Detected increasing uncorrectable errors on disk 1 [-1, 0]
Sep 26 21:06:43 nas RAIDiator: Current (temp, start_stop_cnt, power_on_hrs, power_cycle_cnt, load_cycle_cnt) = (-1, -1, -1, -1, -1)
Sep 26 21:06:54 nas kernel: sd 0:0:0:0: [sda] Unhandled error code
Sep 26 21:06:54 nas kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Sep 26 21:06:54 nas kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 28 50 09 38 00 00 10 00
Sep 26 21:06:54 nas kernel: end_request: I/O error, dev sda, sector 676333880
Sep 26 21:06:54 nas kernel: sd 0:0:0:0: [sda] Unhandled error code
Sep 26 21:06:54 nas kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Sep 26 21:06:54 nas kernel: sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 28 50 09 40 00 00 08 00
Sep 26 21:06:54 nas kernel: end_request: I/O error, dev sda, sector 676333888
Sep 26 21:06:54 nas kernel: md/raid:md2: Disk failure on sda5, disabling device.
Sep 26 21:06:54 nas kernel: <1>md/raid:md2: Operation continuing on 5 devices.
Sep 26 21:06:54 nas kernel: md: recovery of RAID array md2
Sep 26 21:06:54 nas kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Sep 26 21:06:54 nas kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Sep 26 21:06:54 nas kernel: md: using 128k window, over a total of 1948792192 blocks.
Sep 26 21:06:54 nas kernel: md: resuming recovery of md2 from checkpoint.
Sep 26 21:06:54 nas kernel: md: md2: recovery done.
Sep 26 21:06:54 nas kernel: RAID conf printout:
Sep 26 21:06:54 nas kernel: --- level:5 rd:6 wd:5
Sep 26 21:06:54 nas kernel: disk 0, o:0, dev:sda5
Sep 26 21:06:54 nas kernel: disk 1, o:1, dev:sdb5
Sep 26 21:06:54 nas kernel: disk 2, o:1, dev:sdc5
Sep 26 21:06:54 nas kernel: disk 3, o:1, dev:sdd5
Sep 26 21:06:54 nas kernel: disk 4, o:1, dev:sde5
Sep 26 21:06:54 nas kernel: disk 5, o:1, dev:sdf5
Sep 26 21:06:54 nas kernel: RAID conf printout:
Sep 26 21:06:54 nas kernel: --- level:5 rd:6 wd:5
Sep 26 21:06:54 nas kernel: disk 0, o:0, dev:sda5
Sep 26 21:06:54 nas kernel: disk 1, o:1, dev:sdb5
Sep 26 21:06:54 nas kernel: disk 2, o:1, dev:sdc5
Sep 26 21:06:54 nas kernel: disk 3, o:1, dev:sdd5
Sep 26 21:06:54 nas kernel: disk 4, o:1, dev:sde5
Sep 26 21:06:54 nas kernel: disk 5, o:1, dev:sdf5
Sep 26 21:06:54 nas kernel: RAID conf printout:
Sep 26 21:06:54 nas kernel: --- level:5 rd:6 wd:5
Sep 26 21:06:54 nas kernel: disk 1, o:1, dev:sdb5
Sep 26 21:06:54 nas kernel: disk 2, o:1, dev:sdc5
Sep 26 21:06:54 nas kernel: disk 3, o:1, dev:sdd5
Sep 26 21:06:54 nas kernel: disk 4, o:1, dev:sde5
Sep 26 21:06:54 nas kernel: disk 5, o:1, dev:sdf5
Sep 26 21:07:04 nas RAIDiator: Backup finished. [Job 002]
Sep 26 21:07:10 nas RAIDiator: Disk failure detected.
Sep 26 21:07:10 nas RAIDiator: If the failed disk is used in a RAID level 1, 5, or X-RAID volume, please note that volume is now unprotected, and an additional disk failure may render that volume dead. If this disk is a part of a RAID 6 volume, your volume is still protected if this is your first failure. A 2nd disk failure will make your volume unprotected. If this disk is a part of a RAID 10 volume,your volume is still protected if more than half of the disks alive. But another failure of disks been marked may render that volume dead. It is recommended that you replace the failed disk as soon as possible to maintain optimal protection of your volume.
Sep 26 21:07:11 nas RAIDiator: Disk failure detected. (nas) : Disk failure detected.\n\nIf the failed disk is used in a RAID level 1, 5, or X-RAID volume, please note that volume is now unprotected, and an additional disk failure may render that volume dead. If this disk is a part of a RAID 6 volume, your volume is still protected if this is your first failure. A 2nd disk failure will make your volume unprotected. If this disk is a part of a RAID 10 volume,your volume is still protected if more than half of the disks alive. But another failure of disks been marked may render that volume dead. It is recommended that you replace the failed disk as soon as possible to maintain optimal protection of your volume.\n\n[Wed Sep 26 21:07:10 PDT 2012]
Sep 26 21:08:28 nas udhcpc[2807]: Sending renew...
Sep 26 21:08:28 nas udhcpc[2807]: Lease of 192.168.0.110 obtained, lease time 7200
Sep 26 21:35:16 nas noflushd[2202]: Spinning down disks.
Sep 26 21:35:16 nas noflushd[2202]: Spindown for /dev/sda failed: Input/output error
Sep 26 21:35:21 nas noflushd[2202]: Disks spinning up after 0 minutes.
Sep 26 21:42:47 nas noflushd[2202]: Spinning down disks.
Sep 26 21:42:47 nas noflushd[2202]: Spindown for /dev/sda failed: Input/output error
Sep 26 21:43:05 nas noflushd[2202]: Disks spinning up after 0 minutes.
Sep 26 21:48:08 nas noflushd[2202]: Spinning down disks.
Sep 26 21:48:08 nas noflushd[2202]: Spindown for /dev/sda failed: Input/output error
Sep 26 21:48:25 nas noflushd[2202]: Disks spinning up after 0 minutes.
Sep 26 21:53:28 nas noflushd[2202]: Spinning down disks.
Sep 26 21:53:28 nas noflushd[2202]: Spindown for /dev/sda failed: Input/output error
Sep 26 21:53:46 nas noflushd[2202]: Disks spinning up after 0 minutes.
Sep 26 22:00:01 nas init: Switching to runlevel: 0
Sep 26 22:00:03 nas kernel: ata2.00: configured for UDMA/133
Sep 26 22:00:03 nas kernel: ata2: EH complete
Sep 26 22:00:03 nas kernel: sd 1:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Sep 26 22:00:03 nas kernel: ata3.00: configured for UDMA/133
Sep 26 22:00:03 nas kernel: ata3: EH complete
Sep 26 22:00:03 nas kernel: sd 2:0:0:0: [sdc] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Sep 26 22:00:04 nas kernel: ata4.00: configured for UDMA/133
Sep 26 22:00:04 nas kernel: ata4: EH complete
Sep 26 22:00:04 nas kernel: sd 3:0:0:0: [sdd] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Sep 26 22:00:04 nas kernel: ata5.00: configured for UDMA/133
Sep 26 22:00:04 nas kernel: ata5: EH complete
Sep 26 22:00:04 nas kernel: sd 4:0:0:0: [sde] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Sep 26 22:00:04 nas kernel: ata6.00: configured for UDMA/133
Sep 26 22:00:04 nas kernel: ata6: EH complete
Sep 26 22:00:04 nas kernel: sd 5:0:0:0: [sdf] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Sep 26 22:00:06 nas kernel: sd 0:0:0:0: [sda] Unhandled error code
Sep 26 22:00:06 nas kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Sep 26 22:00:06 nas kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 80 07 c0 00 00 10 00
Sep 26 22:00:06 nas kernel: end_request: I/O error, dev sda, sector 8390592
Sep 26 22:00:06 nas kernel: sd 0:0:0:0: [sda] Unhandled error code
Sep 26 22:00:06 nas kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Sep 26 22:00:06 nas kernel: sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 80 07 c0 00 00 08 00
Sep 26 22:00:06 nas kernel: end_request: I/O error, dev sda, sector 8390592
Sep 26 22:00:06 nas kernel: md/raid:md1: Disk failure on sda2, disabling device.
Sep 26 22:00:06 nas kernel: <1>md/raid:md1: Operation continuing on 5 devices.
Sep 26 22:00:06 nas kernel: RAID conf printout:
Sep 26 22:00:06 nas kernel: --- level:6 rd:6 wd:5
Sep 26 22:00:06 nas kernel: disk 0, o:0, dev:sda2
Sep 26 22:00:06 nas kernel: disk 1, o:1, dev:sdb2
Sep 26 22:00:06 nas kernel: disk 2, o:1, dev:sdc2
Sep 26 22:00:06 nas kernel: disk 3, o:1, dev:sdd2
Sep 26 22:00:06 nas kernel: disk 4, o:1, dev:sde2
Sep 26 22:00:06 nas kernel: disk 5, o:1, dev:sdf2
Sep 26 22:00:06 nas kernel: RAID conf printout:
Sep 26 22:00:06 nas kernel: --- level:6 rd:6 wd:5
Sep 26 22:00:06 nas kernel: disk 1, o:1, dev:sdb2
Sep 26 22:00:06 nas kernel: disk 2, o:1, dev:sdc2
Sep 26 22:00:06 nas kernel: disk 3, o:1, dev:sdd2
Sep 26 22:00:06 nas kernel: disk 4, o:1, dev:sde2
Sep 26 22:00:06 nas kernel: disk 5, o:1, dev:sdf2
System log for the 9/28 boot-up and resync (i.e. I did not replace the second dead drive) is in the next post because I exceeded the character limit.
NOTE: On 10/1, a third drive failed within 10 days of my upgrading the Ultra6 to 4.2.22 - see recent post below.
44 Replies
- beisser1 (Tutor): The error messages have changed in 4.2.22, as far as I know. It now reports far more SMART values than before, when it only cared about reallocated sectors and ATA errors. The values it now reacts to are critical ones that may not necessarily mean your drive is dead (like the spin retry count), but you would certainly be wise to keep an eye on any drives throwing these errors. On some drives this may be caused by crappy drive firmware that reports false values, but in general the NAS won't report them unless they are running out of spec. A disk that repeatedly fails to start is by no means in good condition, and the NAS is just telling you that.
- HERBIEO (Aspirant): If you want some info about S.M.A.R.T. attributes, check this post:
viewtopic.php?f=24&t=66928#p373331
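Following on from beisser1's point about the broader attribute monitoring, the same raw attributes can be read by hand over SSH (a sketch; substitute the device under suspicion for /dev/sdb):

    smartctl -A /dev/sdb          # raw attribute table (spin retries, command timeouts, etc.)
    smartctl -l error /dev/sdb    # the drive's own ATA error log
    smartctl -t short /dev/sdb    # start a short self-test; read the result with: smartctl -l selftest /dev/sdb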
- Sandshark (Sensei): I have noticed an increased number of posts from users who believe drive failures can be attributed to a recent firmware upgrade - on all platforms.
I am certainly not aware of any reason for this to be the case other than pure coincidence. There will always be a number of users who experience drive failures at or shortly after a firmware upgrade.
These are the same users who are also then more likely to post about their woes - reinforcing the belief in other unlucky users that the firmware is to blame.
While I have had no failures (yet?), there is definitely something different about how 4.2.22 treats at least some of these drives. Mine have firmware CC32 and had none of the problems others have reported with that drive firmware in about a year and a half of running, yet they suddenly started having them. Some of the things I've noticed, like excessive clicking that Seagate says is some kind of auto-recalibration, might be enough to push a borderline drive over the edge.
- thomasx (Aspirant): I have had two drives fail and replaced by Seagate, and since then three more drives have failed, all in the last few weeks. I am really concerned about my data. So far only one drive has failed at a time, and the RAID levels kept the data safe, but I don't know where to start to repair the situation.
All drives are the exact same model (Seagate ST32000542AS, firmware CC34) from the HCL, some used in an Ultra 6 (which identifies itself as a Business Pro instead of an Ultra 6), some in an Ultra 2, and some in a Duo.
Just to rule out that it is the Seagates, I have ordered three WD Red drives. I find it hard to believe that all of a sudden these drives fail all at once, and not only here but at many other places.
I tried to open a support ticket but could not get past step 2, where I have to enter my country. Although all fields look okay to me, a popup tells me the starred fields are required.
Is there any hope of finding the problem and, after that, a solution?
with kind regards,
thomasx
- blacey (Aspirant): Quick update: I just had a 4th drive fail - in fact, drives are failing faster than I can RMA them with Seagate. A replacement drive has also since failed, in less than a month. This NAS has been in service for a couple of years and nothing changed other than the firmware upgrade, so if the recently rampant drive failures are not due to the firmware upgrade, I am open to other root causes.
Keep in mind that the drives may not in fact be bad even though 4.2.22 reports them as failed. As an experiment, I disabled the drive spindown feature, because the logs suggest the failures are correlated with increased command latencies and exceptions. With drive spindown disabled, everything was fine for a couple of weeks; after re-enabling it, however, I have had two more drives purportedly fail (one new failure and one that previously failed but recovered after a power cycle).
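For anyone repeating the spindown experiment outside of FrontView, the drive-side standby timer can be queried and disabled with hdparm (a sketch; note the noflushd daemon seen in the logs also drives spindown, so the Power settings page remains the supported route):

    hdparm -C /dev/sdb    # report current power state (active/idle vs. standby)
    hdparm -S 0 /dev/sdb  # disable the drive's internal standby (spindown) timer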
- prouser1 (Aspirant): Chiming in to indicate that my 4.2.22 setup might have a similar issue.
My setup:
* Four Seagate ST32000542AS drives, updated to firmware CC35
* An Ultra4 (instead of the OP's Ultra6)
* Updated to 4.2.22. For a few weeks everything was fine with this setup, until today, when the ReadyNAS claimed one of the drives was dead. The server powered itself off to prevent further damage (so it says).
What was done:
1. After powering the ReadyNAS back up, the drive was still reported as dead on the FrontView page.
2. Took the drive out and popped it back in. Still reported as dead.
3. Rebooted the ReadyNAS. Still reported as dead.
4. Took the drive out and ran Western Digital's Data Lifeguard Diagnostics. The SMART report passed. The quick test passed. Everything was good. Ran the write-zeros test (to the beginning and end sectors only) to clear out the existing partition (see the sketch after this list).
5. Downgraded the ReadyNAS to 4.2.21.
6. Rebooted the ReadyNAS. Verified 4.2.21 downgrade was complete.
7. Popped the drive back into the ReadyNAS.
8. The ReadyNAS said "New Drive detected" and started a re-sync. FrontView no longer reports the drive as dead, and the SMART statistics were all normal.
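A sketch of the zero-wipe in step 4, for anyone repeating it (DESTRUCTIVE; /dev/sdX is a placeholder for the pulled drive on a separate machine, never a live array member):

    dd if=/dev/zero of=/dev/sdX bs=1M count=100                                 # zero the first 100 MiB
    sectors=$(blockdev --getsz /dev/sdX)                                        # drive size in 512-byte sectors
    dd if=/dev/zero of=/dev/sdX bs=512 seek=$((sectors - 204800)) count=204800  # zero the last 100 MiB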
Not sure if downgrading fixed the issue or whether wiping the drive caused the ReadyNAS to reset something and detect the drive again. In any case, the unit is back to working, so I'm happy. Hope this information helps the next reader.
- prouser1 (Aspirant): Found a possible relationship with my issue. The "command timeout" is >0, and there are no reallocated sectors associated with the command timeout. Not really sure whether it is the drive or the ReadyNAS unit that had a problem with the drive. If anyone out there knows, please post a comment. Thanks.
SMART Information for Disk 3
Model: ST32000542AS
Serial: ********
Firmware: CC35
SMART Attribute (raw value)
Spin Up Time 0
Start Stop Count 2623
Reallocated Sector Count 0
Power On Hours 15281
Spin Retry Count 0
Power Cycle Count 36
Runtime Bad Block 0
End-to-End Error 0
Reported Uncorrect 0
Command Timeout 1
High Fly Writes 0
Airflow Temperature Cel 25
Temperature Celsius 25
Current Pending Sector 0
Offline Uncorrectable 0
UDMA CRC Error Count 0
Head Flying Hours 90726889176070
Total LBAs Written 2502792016
Total LBAs Read 931203149
ATA Error Count 0
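One way to catch the counter actually moving is to poll the attribute over SSH (a sketch; the device name and one-hour interval are assumptions, adjust for the disk in question):

    while true; do
      date >> /root/smart_watch.log
      smartctl -A /dev/sdd | grep -i command_timeout >> /root/smart_watch.log
      sleep 3600    # once an hour
    done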
- Mr__Bear (Aspirant): I am also seeing a similar issue with 4.2.22 on my ReadyNAS 2100. The 2100 was running without issue with 3x 2TB ST32000644NS and 1x 2TB ST2000NM0011 prior to 4.2.22; the odd ST2000NM0011 was a replacement from NETGEAR about a year ago for a failed ST32000644NS, a 6Gbps 2TB disk that is on the 2100 HCL.
As soon as 4.2.22 was applied, the ST2000NM0011 started throwing Command Timeout errors at an alarming rate (30+ timeouts in a day), especially when the unit was under load. While waiting for NETGEAR to RMA the drive, we replaced the ST2000NM0011 with an identical brand-new one (no errors whatsoever in SMART), and instantly the same Command Timeout errors were thrown. Looking at the logs, once the unit is under load it throws multiple errors similar to the one below:
Nov 7 07:15:40 emboar kernel: ata5.00: status: { DRDY }
Nov 7 07:15:40 emboar kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Nov 7 07:15:40 emboar kernel: ata5.00: cmd 61/08:88:d0:90:a2/00:00:08:00:00/40 tag 17 ncq 4096 out
Nov 7 07:15:40 emboar kernel: res 40/00:b0:c8:9c:a2/00:00:08:00:00/40 Emask 0x10 (ATA bus error)
Nov 7 07:15:40 emboar kernel: ata5: hard resetting link
Nov 7 07:15:40 emboar kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 7 07:15:40 emboar kernel: ata5.00: configured for UDMA/133
Nov 7 07:15:40 emboar kernel: ata5: EH complete
Sometimes it will attempt to reset the link at 3.0 Gbps, but that will always time out again and lead to another reset down to 1.5 Gbps, where things stabilize and the timeout errors stop. Once the unit is rebooted and the link is re-established at full speed, as it should be, the timeout errors quickly return, leading to another downgrade to 1.5 Gbps.
We then replaced the second ST2000NM0011 with NETGEAR's replacement ST32000644NS; the errors went away completely, and we can stress the unit and it stays at full speed without any errors. We also ran a full long test with SeaTools on the ST2000NM0011 without any problem.
I would say there is something in 4.2.22, at least in my case, that affects compatibility with certain drives. More detail @ http://www.readynas.com/forum/viewtopic.php?f=24&t=67736. A Level 2 support case was also opened: #19802848.
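For anyone chasing the same symptom, the negotiated SATA link speed can be checked without waiting for the next error (a sketch; the sysfs path assumes the ata_link transport class present in this 2.6.37 kernel line):

    dmesg | grep -i 'SATA link up'                        # speed negotiated at the last reset
    cat /sys/class/ata_link/link*/sata_spd 2>/dev/null    # current speed per link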
- ReadySECURE (Apprentice): I feel like it would be more of a chassis issue. Looks like they're replacing your chassis though.
- Mr__Bear (Aspirant):
readysecure1985 wrote: I feel like it would be more of a chassis issue.
Looks like they're replacing your chassis though.
Hi readysecure1985,
Are you referring to our case #19802848? I have not heard from Level 2 Support since last Friday; I would like to know whether the chassis is indeed going to be replaced. Thanks!