Disk in channel 1 (Internal) changed state from ONLINE to FAILED (RN202, 2x WD Red 3TB, OS 6.10.2)

liouk · 2020-01-20

Greetings,

For some time now I have been getting the error mentioned in the subject, and I am trying to find out whether the failing disk needs replacing or if there's anything I can do to fix this. Data is still there, as I'm in RAID-1 configuration, and the second disk is fine.

I've skimmed through the logs, and I do see I/O errors being reported, however I'm looking for a second - more experienced - opinion. I'm attaching a couple of examples -- can anyone make anything out of this? Please let me know if you need more logs! Thanks in advance 🙂

dmesg.log

[Mon Jan 20 23:49:02 2020] blk_update_request: I/O error, dev sda, sector 5860533160
[Mon Jan 20 23:49:02 2020] blk_update_request: I/O error, dev sda, sector 5860533160
[Mon Jan 20 23:49:02 2020] Buffer I/O error on dev sda, logical block 732566645, async page read
[Mon Jan 20 23:49:02 2020] blk_update_request: I/O error, dev sda, sector 5860533128
[Mon Jan 20 23:49:02 2020] blk_update_request: I/O error, dev sda, sector 5860533128
[Mon Jan 20 23:49:02 2020] Buffer I/O error on dev sda, logical block 732566641, async page read
[Mon Jan 20 23:49:02 2020] blk_update_request: I/O error, dev sda, sector 5860532992
[Mon Jan 20 23:49:02 2020] Buffer I/O error on dev sda, logical block 732566624, async page read
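For readers following along, the sector and logical block numbers in these lines can be cross-checked against each other. Below is a minimal sketch, assuming 512-byte sectors and 4096-byte buffer blocks (an assumption, though the numbers above are consistent with it) and taking the total sector count from the "Sectors:" field of the disk_info.log posted later in this thread. The helper names are made up for illustration.

```python
import re

# A subset of the dmesg lines quoted above (duplicates left out).
DMESG = """\
[Mon Jan 20 23:49:02 2020] blk_update_request: I/O error, dev sda, sector 5860533160
[Mon Jan 20 23:49:02 2020] Buffer I/O error on dev sda, logical block 732566645, async page read
[Mon Jan 20 23:49:02 2020] blk_update_request: I/O error, dev sda, sector 5860533128
[Mon Jan 20 23:49:02 2020] Buffer I/O error on dev sda, logical block 732566641, async page read
[Mon Jan 20 23:49:02 2020] blk_update_request: I/O error, dev sda, sector 5860532992
[Mon Jan 20 23:49:02 2020] Buffer I/O error on dev sda, logical block 732566624, async page read
"""

TOTAL_SECTORS = 5860533168  # "Sectors:" value from disk_info.log
SECTORS_PER_BLOCK = 8       # assumption: 4096-byte blocks / 512-byte sectors


def failing_sectors(text):
    """Unique sector numbers from the blk_update_request lines, sorted."""
    return sorted({int(s) for s in re.findall(r"I/O error, dev \w+, sector (\d+)", text)})


def failing_blocks(text):
    """Unique logical block numbers from the Buffer I/O error lines, sorted."""
    return sorted({int(b) for b in re.findall(r"logical block (\d+)", text)})


sectors = failing_sectors(DMESG)
blocks = failing_blocks(DMESG)

for block, sector in zip(blocks, sectors):
    # Each logical block maps to the sector reported alongside it.
    print(block, "* 8 =", block * SECTORS_PER_BLOCK, "| reported sector:", sector)
    print("  distance from end of disk:", TOTAL_SECTORS - sector, "sectors")
```

Under those assumptions, the three failing reads all fall within the last 176 sectors of sda. That locates the errors but does not by itself say why the reads failed.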

system.log

$ grep mdadm system.log
Jan 20 23:33:27 ReadyNAS mdadm[1851]: NewArray event detected on md device /dev/md0
Jan 20 23:33:27 ReadyNAS mdadm[1851]: DegradedArray event detected on md device /dev/md0
Jan 20 23:33:27 ReadyNAS mdadm[1851]: NewArray event detected on md device /dev/md1
Jan 20 23:33:27 ReadyNAS mdadm[1851]: DegradedArray event detected on md device /dev/md1
Jan 20 23:33:27 ReadyNAS mdadm[1851]: NewArray event detected on md device /dev/md127
Jan 20 23:33:36 ReadyNAS mdadm[1851]: DegradedArray event detected on md device /dev/md127
Jan 20 23:34:13 ReadyNAS mdadm[1851]: RebuildStarted event detected on md device /dev/md0, component device recovery
Jan 20 23:34:18 ReadyNAS mdadm[1851]: RebuildStarted event detected on md device /dev/md1, component device recovery
Jan 20 23:34:24 ReadyNAS mdadm[1851]: RebuildStarted event detected on md device /dev/md127, component device recovery
Jan 20 23:34:33 ReadyNAS mdadm[1851]: RebuildFinished event detected on md device /dev/md1, component device recovery
Jan 20 23:34:33 ReadyNAS mdadm[1851]: SpareActive event detected on md device /dev/md1, component device /dev/sda2
Jan 20 23:36:29 ReadyNAS mdadm[1851]: RebuildFinished event detected on md device /dev/md0, component device recovery
Jan 20 23:36:29 ReadyNAS mdadm[1851]: SpareActive event detected on md device /dev/md0, component device /dev/sda1
Jan 20 23:41:15 ReadyNAS mdadm[1851]: Fail event detected on md device /dev/md0, component device /dev/sda1
Jan 20 23:41:16 ReadyNAS mdadm[1851]: FailSpare event detected on md device /dev/md127, component device /dev/sda3
Jan 20 23:41:16 ReadyNAS mdadm[1851]: RebuildFinished event detected on md device /dev/md127, component device recovery
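The mdadm lines are easier to read when grouped per array. Here is a rough sketch that does the grouping and flags any array that logged a Fail or FailSpare event after its rebuild started. The regular expression and function names are illustrative only and are written against the exact line format shown above.

```python
import re
from collections import defaultdict

LOG = """\
Jan 20 23:33:27 ReadyNAS mdadm[1851]: NewArray event detected on md device /dev/md0
Jan 20 23:33:27 ReadyNAS mdadm[1851]: DegradedArray event detected on md device /dev/md0
Jan 20 23:33:27 ReadyNAS mdadm[1851]: NewArray event detected on md device /dev/md1
Jan 20 23:33:27 ReadyNAS mdadm[1851]: DegradedArray event detected on md device /dev/md1
Jan 20 23:33:27 ReadyNAS mdadm[1851]: NewArray event detected on md device /dev/md127
Jan 20 23:33:36 ReadyNAS mdadm[1851]: DegradedArray event detected on md device /dev/md127
Jan 20 23:34:13 ReadyNAS mdadm[1851]: RebuildStarted event detected on md device /dev/md0, component device recovery
Jan 20 23:34:18 ReadyNAS mdadm[1851]: RebuildStarted event detected on md device /dev/md1, component device recovery
Jan 20 23:34:24 ReadyNAS mdadm[1851]: RebuildStarted event detected on md device /dev/md127, component device recovery
Jan 20 23:34:33 ReadyNAS mdadm[1851]: RebuildFinished event detected on md device /dev/md1, component device recovery
Jan 20 23:34:33 ReadyNAS mdadm[1851]: SpareActive event detected on md device /dev/md1, component device /dev/sda2
Jan 20 23:36:29 ReadyNAS mdadm[1851]: RebuildFinished event detected on md device /dev/md0, component device recovery
Jan 20 23:36:29 ReadyNAS mdadm[1851]: SpareActive event detected on md device /dev/md0, component device /dev/sda1
Jan 20 23:41:15 ReadyNAS mdadm[1851]: Fail event detected on md device /dev/md0, component device /dev/sda1
Jan 20 23:41:16 ReadyNAS mdadm[1851]: FailSpare event detected on md device /dev/md127, component device /dev/sda3
Jan 20 23:41:16 ReadyNAS mdadm[1851]: RebuildFinished event detected on md device /dev/md127, component device recovery
"""

# Timestamp, event name, and array name from one syslog line.
PATTERN = re.compile(
    r"^(\w{3} +\d+ [\d:]+) \S+ mdadm\[\d+\]: (\w+) event detected on md device (/dev/md\d+)"
)


def events_by_array(text):
    """Map each md array to its (timestamp, event) pairs in log order."""
    timeline = defaultdict(list)
    for line in text.splitlines():
        match = PATTERN.match(line)
        if match:
            when, event, array = match.groups()
            timeline[array].append((when, event))
    return dict(timeline)


def failed_after_rebuild(events):
    """True if a Fail or FailSpare event follows the first RebuildStarted."""
    names = [name for _, name in events]
    if "RebuildStarted" not in names:
        return False
    start = names.index("RebuildStarted")
    return any(name in ("Fail", "FailSpare") for name in names[start:])


timeline = events_by_array(LOG)
for array, events in sorted(timeline.items()):
    print(array, "failed after rebuild start:", failed_after_rebuild(events))
```

For this log the sketch reports a failure for md0 and md127 but not for md1. In both failures the component is a partition on sda. Note that mdadm also emits RebuildFinished when a rebuild is aborted, so the last md127 line does not mean that rebuild succeeded.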

Sandshark · 2020-01-26

What you want to look at is the SMART stats, which I suspect will show significant issues. All of those rebuild events are creating quite a lot of activity on both of your drives, raising the chances of the second one failing and you losing all your data. It sounds like you don't have a backup of the data, and you should look into remedying that situation, as RAID alone is not enough to keep your data safe and you don't even currently have a sound RAID.

liouk · 2020-01-26

Hi @Sandshark ,

Thanks for your response! I'm not quite sure how to read the SMART stats, so I'm attaching two logfiles I could find relevant info in -- from the little I can gather, it doesn't look too bad.

Regarding your comments -- actually I'm using my ReadyNAS as a backup for everything else, and you're right, I don't have a backup of that. Can you elaborate on why my RAID is not good enough? Do you mean that using only 2 disks isn't sufficient, or is there something else in my configuration?

Again, thanks for the response!

StephenB · 2020-01-26

@liouk wrote:

Thanks for your response! I'm not quite sure how to read the SMART stats, so I'm attaching two logfiles I could find relevant info in -- from the little I can gather, it doesn't look too bad.

I don't like the command timeouts. Can you post disk_info.log? That will give the full SMART stats. You are definitely getting errors on disk 1 (sda); that's what your first set of btrfs errors is telling you.

@liouk wrote:

Can you elaborate on why my RAID is not good enough? Do you mean that using only 2 disks isn't sufficient, or is there something else in my configuration?

He's saying that RAID is never enough to keep your data (or anyone else's data) safe. RAID is helpful, but there are a lot of scenarios where it can fail, and you will lose your data if it does.

@liouk wrote:

Regarding your comments -- actually I'm using my ReadyNAS as a backup for everything else,

If everything on the NAS is stored on another device, then you aren't depending on RAID alone to keep it safe - since the primary copy is still on the original device. That's good.

If that's the case, then backing up the ReadyNAS itself might also be worth considering, as it gives you more recovery options if something catastrophic happens. I like to have three copies of everything I care about myself (including the original). A couple of times (before I had a ReadyNAS), I had PC hard disks fail, and then discovered that my USB backup had disk errors, so I lost some data. I haven't lost anything since I started keeping three copies.

liouk · 2020-01-27

@StephenB thanks for the reply!

Here's my disk_info.log as well:

Device:             sda
Controller:         0
Channel:            0
Model:              WDC_WD30EFRX-68EUZN0
Serial:             WD-WCC4N1XHR82L
Firmware:           82.00A82
Class:              SATA
Sectors:            5860533168
Pool:               data
PoolType:           RAID 1
PoolState:          3
PoolHostId:         1165483a
Health data 
  ATA Error Count:                0

Device:             sdb
Controller:         0
Channel:            1
Model:              WDC WD30EFRX-68EUZN0
Serial:             WD-WCC4N1NA90XX
Firmware:           82.00A82W
Class:              SATA
RPM:                5400
Sectors:            5860533168
Pool:               data
PoolType:           RAID 1
PoolState:          3
PoolHostId:         1165483a
Health data 
  ATA Error Count:                0
  Reallocated Sectors:            0
  Reallocation Events:            0
  Spin Retry Count:               0
  Current Pending Sector Count:   0
  Uncorrectable Sector Count:     0
  Temperature:                    31
  Start/Stop Count:               329
  Power-On Hours:                 13446
  Power Cycle Count:              329
  Load Cycle Count:               328

I understand the risks when using RAID and backups -- I actually use the NAS to back up data I already have on other devices, plus to store data that I do not mind losing, but don't want to permanently store on my primary PC. Thanks for the insights though! I'm now considering backing up my NAS once more, so that I end up with three copies as well.

StephenB · 2020-01-27

@liouk wrote:

Here's my disk_info.log as well:

Device:             sda
Controller:         0
Channel:            0
Model:              WDC_WD30EFRX-68EUZN0
Serial:             WD-WCC4N1XHR82L
Firmware:           82.00A82
Class:              SATA
Sectors:            5860533168
Pool:               data
PoolType:           RAID 1
PoolState:          3
PoolHostId:         1165483a
Health data 
  ATA Error Count:                0

This is all you see for disk 1? My guess is yes, as that is consistent with what you posted before.

Reformatting a line from the earlier pdf:

time                model                serial               realloc_sect realloc_evnt spin_retry_cnt ioedc      cmd_timeouts pending_sect uncorrectable_err ata_errors
------------------- -------------------- -------------------- ------------ ------------ -------------- ---------- ------------ ------------ ----------------- ----------
2020-01-20 23:39:10 WDC WD30EFRX-68EUZN0 WD-WCC4N1XHR82L                41            7              0         -1           -1            0                 0          0

You can see there were 41 reallocated sectors reported on the 20th, and that count was increasing regularly for some months.
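As a rough illustration of reading that row, the sketch below pairs the numeric columns with their header names and lists the counters that are above zero. Treating -1 as "not reported" is an assumption, not something the log states, and the function names are hypothetical.

```python
# Numeric column names, in the order they appear in the header above.
COLUMNS = [
    "realloc_sect", "realloc_evnt", "spin_retry_cnt", "ioedc",
    "cmd_timeouts", "pending_sect", "uncorrectable_err", "ata_errors",
]

ROW = "2020-01-20 23:39:10 WDC WD30EFRX-68EUZN0 WD-WCC4N1XHR82L 41 7 0 -1 -1 0 0 0"


def parse_row(row):
    """Pair the trailing numeric fields of the row with the column names."""
    fields = row.split()
    values = [int(v) for v in fields[-len(COLUMNS):]]
    return dict(zip(COLUMNS, values))


def nonzero_counters(stats):
    """Counters above zero; -1 is assumed to mean 'not reported' and is skipped."""
    return {name: value for name, value in stats.items() if value > 0}


stats = parse_row(ROW)
print(nonzero_counters(stats))  # {'realloc_sect': 41, 'realloc_evnt': 7}
```

A single nonzero reallocated-sector count is not always fatal. The concern here is the trend, since the count had been rising for months.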

I believe that disk 1 has failed. If you can connect it to a Windows PC (either with a USB adapter/dock or with SATA), you can test it with WD's Lifeguard program. FWIW, I'd replace it even if it passes Lifeguard.

If you installed it at the same time as disk 2, it is likely still covered by the manufacturer's warranty: the power-on hours suggest it's been installed for about 18 months, and the warranty is three years (though if the NAS is powered down a lot, the disks could be a lot older). If it is covered, you can get an RMA, but the replacement disk will be recertified (not new). Personally, I generally purchase a new disk and keep the replacement disk as an emergency spare.
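The "about 18 months" figure can be reproduced from the Power-On Hours value reported for sdb (disk 2) in disk_info.log, since sda reported no such attribute. A small sketch of the arithmetic, assuming the disk ran continuously, plus the reverse calculation for a disk that is powered down part of the time; the function names are made up for illustration.

```python
POWER_ON_HOURS = 13446          # "Power-On Hours" for sdb in disk_info.log
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365.25
DAYS_PER_MONTH = DAYS_PER_YEAR / 12


def months_if_always_on(hours):
    """Calendar months the disk would have run if it were never powered down."""
    return hours / HOURS_PER_DAY / DAYS_PER_MONTH


def average_hours_per_day(hours, age_years):
    """Average daily uptime for a disk of the given calendar age."""
    return hours / (age_years * DAYS_PER_YEAR)


print(round(months_if_always_on(POWER_ON_HOURS), 1))       # 18.4
print(round(average_hours_per_day(POWER_ON_HOURS, 4), 1))  # 9.2
```

The second line shows why power-on hours understate age for a NAS on a power schedule: a four-year-old disk with this counter would have averaged roughly 9 hours of uptime per day.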

liouk · 2020-01-27

@StephenB wrote:

This is all you see for disk 1? My guess is yes, as that is consistent with what you posted before.

Yes indeed, this is all there is in the log for disk 1.

@StephenB wrote:

If you installed it at the same time as disk 2, it is likely still covered by the manufacturer's warranty: the power-on hours suggest it's been installed for about 18 months, and the warranty is three years (though if the NAS is powered down a lot, the disks could be a lot older). If it is covered, you can get an RMA, but the replacement disk will be recertified (not new). Personally, I generally purchase a new disk and keep the replacement disk as an emergency spare.

It's actually much older than 18 months, 4+ years now -- but you're right, I'm powering it down frequently when it's not in use. Not sure if this is recommended, maybe this is an anti-pattern.

I've already purchased a new disk based on all your comments here -- if I get a chance I might run it through Lifeguard and see what happens.

Thanks for all the support @StephenB and @Sandshark !

StephenB · 2020-01-27

@liouk wrote:

It's actually much older than 18 months, 4+ years now -- but you're right, I'm powering it down frequently when it's not in use. Not sure if this is recommended, maybe this is an anti-pattern.

My main NAS is on 24x7, but my backups are all on a power schedule - generally on for an hour or two each day.
