
314 Degraded Volume

jannear
Aspirant

314 Degraded Volume

 

Afternoon. This is my third attempt at posting this topic. All appears to post OK and my post count increases, but don't ask me to try and find the post!

 

Long story short -

I have a 314, currently populated with four 6 TB WD Red drives.

I ran a volume defrag over the weekend.

Powered off and on the NAS a few days later, and I get a flashing LCD indicating that the volume is degraded.

Log into Web console. All 4 drives online. All show as healthy.

 

I've downloaded the logs. Checked smart_history.log.

2015-01-26 13:50:32  ST3000DM001-1CH166    W1F3HN47              -1            -1            -1              -1          -1            -1            -1                 186      
2015-01-27 16:12:17  unknown               unknown               -1            -1            -1              -1          -1            -1            -1                 0        
2015-01-28 16:48:55  WDC WD60EFRX-68MYMN1  WD-WX11D841V4R1       0             0             0               -1          -1            0             0                  0        
2015-09-09 12:36:35  WDC WD60EFRX-68MYMN1  WD-WX31D55DF4Y7       0             0             0               -1          -1            0             0                  0        
2015-09-10 07:26:30  WDC WD60EFRX-68MYMN1  WD-WX31D55A44UT       0             0             0               -1          -1            0             0                  0        
2015-09-11 17:57:35  WDC WD60EFRX-68MYMN1  WD-WX41D948Y69V       0             0             0               -1          -1            0             0                  0        
2019-06-10 14:24:09  WDC WD60EFRX-68MYMN1  WD-WX31D55DF4Y7       0             0             0               -1          -1            5             0                  0        
2019-06-11 16:03:15  WDC WD60EFRX-68MYMN1  WD-WX31D55DF4Y7       0             0             0               -1          -1            6             0   

 

 

WX31D55DF4Y7 appears to have been showing 6 pending sectors for the past 24 hours. (It's now 16:48, 12-6-19 in AU.)

Is the NAS telling me the degraded state is because of this drive?
I'd rather not replace a perfectly good drive, only to find out I didn't need to, as hard drives are expensive here!

Thanks

Model: RN31400|ReadyNAS 300 Series 4-Bay (Diskless)
Message 1 of 12
jannear
Aspirant

Re: 314 Degraded Volume

Excellent, this appears to have posted. Third time lucky!

PS: I used Internet Explorer instead of Chrome (two different PCs on two different Internet connections) this time.

Message 2 of 12
StephenB
Guru

Re: 314 Degraded Volume


@jannear wrote:

PS: I used Internet Explorer instead of Chrome (two different PCs on two different Internet connections) this time.


More likely to be the spam filter in the forum.

 

@jannear wrote:


Is the NAS telling me the degraded state is because of this drive?
I'd rather not replace a perfectly good drive, only to find out I didn't need to, as hard drives are expensive here!


Can you post mdstat.log? (cut and paste into a reply)

 

Also, look in disk_info.log for the details of the SMART stats, and look in system.log and kernel.log for disk I/O errors.
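
If it's easier, once the log zip is extracted something like this will pull the relevant lines out in one go (a rough sketch - it assumes the zip is extracted somewhere with grep available, e.g. a Linux box or Git Bash on Windows; the filenames are the ones in the ReadyNAS log download):

# Disk I/O errors reported in the kernel and system logs.
grep -i "i/o error" kernel.log system.log

# Pending / reallocated sector reports in the SMART details.
grep -iE "pending|reallocat" disk_info.log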

 

FWIW, I recently found that I had a failed WD60EFRX even though the SMART stats looked OK.  It was still under warranty, so I got a recertified replacement from Western Digital.

Message 3 of 12
jannear
Aspirant

Re: 314 Degraded Volume

 

System.log looks clean.

The kernel log is reporting Buffer I/O errors on dev sdc3, sectors 9437258, 9437260, etc.

disk_info.log is reporting 6 current pending sectors for sdc, which is the WD-WX31D55DF4Y7 drive.

 

 

 

Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
md126 : active raid5 sdc4[0] sda4[3] sdb4[2] sdd4[1]
      8790380736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      
md127 : active raid5 sda3[7] sdd3[4] sdb3[6]
      8776243968 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [UU_U]
      
md1 : active raid6 sdb2[0] sda2[3] sdd2[2] sdc2[1]
      1046528 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
      
md0 : active raid1 sda1[7] sdd1[4] sdb1[6]
      4192192 blocks super 1.2 [4/3] [UU_U]
      
unused devices: <none>
/dev/md/0:
           Version : 1.2
     Creation Time : Thu Nov 21 10:45:45 2002
        Raid Level : raid1
        Array Size : 4192192 (4.00 GiB 4.29 GB)
     Used Dev Size : 4192192 (4.00 GiB 4.29 GB)
      Raid Devices : 4
     Total Devices : 3
       Persistence : Superblock is persistent

       Update Time : Wed Jun 12 05:53:01 2019
             State : clean, degraded 
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : unknown

              Name : 5e27565a:0  (local to host 5e27565a)
              UUID : fc4bea7f:c4762704:10588945:1b4d6a74
            Events : 45420

    Number   Major   Minor   RaidDevice State
       7       8        1        0      active sync   /dev/sda1
       6       8       17        1      active sync   /dev/sdb1
       -       0        0        2      removed
       4       8       49        3      active sync   /dev/sdd1
/dev/md/1:
           Version : 1.2
     Creation Time : Fri Sep 11 17:55:47 2015
        Raid Level : raid6
        Array Size : 1046528 (1022.00 MiB 1071.64 MB)
     Used Dev Size : 523264 (511.00 MiB 535.82 MB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

       Update Time : Fri Aug 11 23:16:07 2017
             State : clean 
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : unknown

              Name : 5e27565a:1  (local to host 5e27565a)
              UUID : 079d487d:a029dc77:e943c5a5:ce30ec6b
            Events : 19

    Number   Major   Minor   RaidDevice State
       0       8       18        0      active sync   /dev/sdb2
       1       8       34        1      active sync   /dev/sdc2
       2       8       50        2      active sync   /dev/sdd2
       3       8        2        3      active sync   /dev/sda2
/dev/md/data-0:
           Version : 1.2
     Creation Time : Thu Nov 21 10:45:46 2002
        Raid Level : raid5
        Array Size : 8776243968 (8369.68 GiB 8986.87 GB)
     Used Dev Size : 2925414656 (2789.89 GiB 2995.62 GB)
      Raid Devices : 4
     Total Devices : 3
       Persistence : Superblock is persistent

       Update Time : Wed Jun 12 05:47:14 2019
             State : clean, degraded 
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 64K

Consistency Policy : unknown

              Name : 5e27565a:data-0  (local to host 5e27565a)
              UUID : 1d849ae1:0c6b45c5:82e18f85:3a0a3891
            Events : 15467

    Number   Major   Minor   RaidDevice State
       7       8        3        0      active sync   /dev/sda3
       6       8       19        1      active sync   /dev/sdb3
       -       0        0        2      removed
       4       8       51        3      active sync   /dev/sdd3
/dev/md/data-1:
           Version : 1.2
     Creation Time : Wed Sep  9 19:55:07 2015
        Raid Level : raid5
        Array Size : 8790380736 (8383.16 GiB 9001.35 GB)
     Used Dev Size : 2930126912 (2794.39 GiB 3000.45 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

       Update Time : Wed Jun 12 05:47:14 2019
             State : clean 
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 64K

Consistency Policy : unknown

              Name : 5e27565a:data-1  (local to host 5e27565a)
              UUID : b23891d4:abf49ff9:3ff6798d:b3451656
            Events : 16158

    Number   Major   Minor   RaidDevice State
       0       8       36        0      active sync   /dev/sdc4
       1       8       52        1      active sync   /dev/sdd4
       2       8       20        2      active sync   /dev/sdb4
       3       8        4        3      active sync   /dev/sda4
Message 4 of 12
jannear
Aspirant

Re: 314 Degraded Volume

OK, third time replying to this message. Whilst I appreciate the help, this medium is rubbish.

 

System log is clear.

Kernel log is reporting issues with sdc.

disk_info reports 6 pending sectors on that same disk.

 

mdstat:

               Total    Copied   Skipped  Mismatch    FAILED    Extras
    Dirs :     23436     23435     23434         0         0         0
   Files :    118377    118377         0         0         0         0
   Bytes :  90.971 g  90.971 g         0         0         0         0
   Times : 122:25:43  19:40:42                       0:00:00   4:57:17


   Speed :           3600979774 Bytes/sec.
   Speed :           206049.715 MegaBytes/min.
   Ended : Thursday, 13 June 2019 8:41:48 AM

Message 5 of 12
jannear
Aspirant

Re: 314 Degraded Volume

Fourth time... trying to reply to this message.

Appreciate the help, but this forum is rubbish.

Message 6 of 12
jannear
Aspirant

Re: 314 Degraded Volume

mdstat

 

               Total    Copied   Skipped  Mismatch    FAILED    Extras
    Dirs :     23436     23435     23434         0         0         0
   Files :    118377    118377         0         0         0         0
   Bytes :  90.971 g  90.971 g         0         0         0         0
   Times : 122:25:43  19:40:42                       0:00:00   4:57:17


   Speed :           3600979774 Bytes/sec.
   Speed :           206049.715 MegaBytes/min.
   Ended : Thursday, 13 June 2019 8:41:48 AM

Message 7 of 12
jannear
Aspirant

Re: 314 Degraded Volume

System log is clear.

Kernel log is showing sector errors for sdc.

disk_info is showing 6 pending sectors, as per the other info.

 

Message 8 of 12
StephenB
Guru

Re: 314 Degraded Volume


@jannear wrote:

 

The kernel log is reporting Buffer I/O errors on dev sdc3, sectors 9437258, 9437260, etc.

...

 

     
md127 : active raid5 sda3[7] sdd3[4] sdb3[6]
      8776243968 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/3] [UU_U]
/dev/md/data-0:  State : clean, degraded

md0 : active raid1 sda1[7] sdd1[4] sdb1[6]
      4192192 blocks super 1.2 [4/3] [UU_U]
/dev/md/0:  State : clean, degraded

The relevant bits of mdstat are above.

 

Your data volume has two RAID groups - md126 and md127.  That's because you vertically expanded your RAID array at some point (starting off with 3 TB drives and upgrading to 6 TB later). 
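
You can see that layering directly over ssh if you're curious - the data volume is btrfs sitting on top of both md groups. A quick sketch (assuming the stock OS 6 btrfs-on-md layout, and that the volume has the default name "data"):

# Lists every md device backing the btrfs data volume (md126 and md127 in your case).
btrfs filesystem show

# Optional: allocation of the mounted volume; /data assumes the default volume name.
btrfs filesystem df /data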

 

sdc is part of md126 and md1 (though I didn't quote those bits).  But it's not part of md0 (which is the NAS OS partition) and it's not part of md127 (half of your data volume) - and both are marked as degraded.  

 

So sdc is the culprit - some pending sectors, some kernel errors, and it's dropped out of two of the four RAID groups in your system.  It would be good to update your backup right away, since your data is at risk.

 

If you have ssh enabled (or are willing to do that), then you could try entering

# smartctl -x /dev/sdc

This will give you extended SMART status.  The interesting part is the queue of failed commands.  For instance, this snippet from one of my own WD60EFRX disks:

Error 12 [11] occurred at disk power-on lifetime: 36166 hours (1506 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 0c 27 df 40 40 00  Error: UNC at LBA = 0x0c27df40 = 203939648

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 80 00 c8 00 00 0c 27 df 40 40 08  1d+08:15:46.204  READ FPDMA QUEUED
  60 00 08 00 c0 00 01 06 34 36 98 40 08  1d+08:15:46.163  READ FPDMA QUEUED
  60 00 80 00 b8 00 00 0c 27 e4 40 40 08  1d+08:15:46.146  READ FPDMA QUEUED
  60 00 80 00 b0 00 00 0c 27 e9 40 40 08  1d+08:15:46.123  READ FPDMA QUEUED
  60 00 80 00 a8 00 00 0c 27 ee 40 40 08  1d+08:15:46.094  READ FPDMA QUEUED

This particular snippet shows an unrecoverable error (UNC). I suspect you'll see an error like this around the time of the buffer error (and perhaps more).  If the drive is still under warranty, you could RMA it.  I've sometimes returned drives that passed Lifeguard (and Seatools), and neither vendor challenged them.
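
If you want to confirm from the NAS side which arrays dropped sdc before you pull the drive, you can also read the RAID superblocks on its partitions (a sketch - it just loops over the standard four ReadyNAS partitions):

# The Array UUID ties each sdc partition to one of the md arrays; a stale Events
# counter (well behind the value in the --detail output) marks a member the array dropped.
for p in /dev/sdc1 /dev/sdc2 /dev/sdc3 /dev/sdc4; do
    echo "== $p =="
    mdadm --examine "$p" | grep -E 'Array UUID|Device Role|Array State|Events'
done

If the Events counters on sdc1 and sdc3 sit well behind the ones in your mdstat.log --detail output, that's md telling you it dropped those members.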

 

I'd suggest removing the drive and testing it in a Windows PC with WD's Lifeguard program.  If the long generic test passes, then I suggest following up with the destructive write-zeros test.  I've found that the write-zeros test will sometimes pick up issues that the non-destructive test misses (and vice versa).  If the disk passes, then check the SMART stats.  If they look good, then you could try adding it to the array again (though if you do that, keep a close eye on it).
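
If you do end up putting the disk back in, a simple way to keep that close eye on it during and after the resync (again a sketch, over ssh; note the drive letter can change after a reinsert):

# Rebuild progress for all arrays.
cat /proc/mdstat

# Re-check the counters that matter on the re-added disk.
smartctl -A /dev/sdc | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'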

Message 9 of 12
jannear
Aspirant

Re: 314 Degraded Volume

Thanks Stephen.

 

My volume is in bad shape. From /dev/sdc, I'm seeing errors such as this:

Error 6799 [6] occurred at disk power-on lifetime: 18909 hours (787 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 00 90 00 49 40 00  Error: UNC at LBA = 0x00900049 = 9437257

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 07 00 20 00 00 00 90 00 49 40 08     04:42:30.908  READ FPDMA QUEUED
  60 00 01 00 10 00 00 00 90 00 48 40 08     04:42:30.908  READ FPDMA QUEUED
  ef 00 10 00 02 00 00 00 00 00 00 a0 08     04:42:30.907  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 00 00 00 00 00 e0 08     04:42:30.907  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 00 00 00 00 00 a0 08     04:42:30.907  IDENTIFY DEVICE

I decided to check sda, sdb and sdd as well (which I assume covers all 4 physical drives). This appears for sdd:

Error 35 [10] occurred at disk power-on lifetime: 15466 hours (644 days + 10 hours)
  When the command that caused the error occurred, the device was in standby mode.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 00 f6 f1 38 40 00  Error: UNC at LBA = 0x00f6f138 = 16183608

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 08 00 90 00 00 04 3f 0c e0 40 08     03:53:07.269  READ FPDMA QUEUED
  e5 00 00 00 00 00 00 00 00 00 00 00 08     03:52:34.053  CHECK POWER MODE
  e5 00 00 00 00 00 00 00 00 00 00 00 08     03:52:34.052  CHECK POWER MODE
  e5 00 00 00 00 00 00 00 00 00 00 00 08     03:50:34.033  CHECK POWER MODE
  e5 00 00 00 00 00 00 00 00 00 00 00 08     03:50:34.032  CHECK POWER MODE

I suspect that both drives (3&4) will need replacing?
Any suggestions as to how best to do this? This is a dump volume; I store anything and everything on it. Anything of critical importance is backed up to another Netgear NAS running JBOD.

Message 10 of 12
StephenB
Guru

Re: 314 Degraded Volume


@jannear wrote:

 

Error 35 [10] occurred at disk power-on lifetime: 15466 hours (644 days + 10 hours)
  When the command that caused the error occurred, the device was in standby mode.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 00 f6 f1 38 40 00  Error: UNC at LBA = 0x00f6f138 = 16183608

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 08 00 90 00 00 04 3f 0c e0 40 08     03:53:07.269  READ FPDMA QUEUED
  e5 00 00 00 00 00 00 00 00 00 00 00 08     03:52:34.053  CHECK POWER MODE
  e5 00 00 00 00 00 00 00 00 00 00 00 08     03:52:34.052  CHECK POWER MODE
  e5 00 00 00 00 00 00 00 00 00 00 00 08     03:50:34.033  CHECK POWER MODE
  e5 00 00 00 00 00 00 00 00 00 00 00 08     03:50:34.032  CHECK POWER MODE

I suspect that both drives (3&4) will need replacing?


I'd replace sdc, since you are seeing operational issues with it in addition to the errors.

 

I'm not so sure about sdd.  Certainly there is evidence that it might be failing - so you should at least test it.  Power down the NAS when you do that (since you don't want to risk resyncs).

 

 

FWIW, I had some file access issues a couple months ago (file explorer started timing out, also my media player was misbehaving).  I had one disk (sdc) showing errors in kernel.log at the time the media player failed, and three disks showing the UNCs (including sdc).  One disk was completely clean.

 

I tested sdc with Lifeguard - it failed, so I replaced it.  The usual process worked - there was no volume collapse during the resync.

 

I then replaced a second one - though when I tested it, it passed the Lifeguard tests I mentioned above. I've set that disk aside - I'm not sure if I can trust it or not.  But I didn't want two disks with those errors in the same NAS.

 

I've left the third disk in service for now (the error snippet I posted was from that drive).  After all, it had survived two recent resyncs (and also a scrub) without causing volume loss or a degraded volume.  The UNC rate is about once every 3 months. If it shows another one, I'll replace it too.
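
One way to keep an eye on that over time (a sketch, nothing ReadyNAS-specific - run it from cron, say once a day, and a new pending sector or reallocation stands out; the log path is just an arbitrary choice):

#!/bin/sh
# Append a dated one-line summary of the critical SMART counters for each disk.
for d in /dev/sd[a-d]; do
    printf '%s %s ' "$(date '+%F %T')" "$d"
    smartctl -A "$d" | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector' \
        | awk '{printf "%s=%s ", $2, $10}'
    echo
done >> /var/log/smart-trend.log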

 

 

 

Message 11 of 12
jannear
Aspirant

Re: 314 Degraded Volume

Thanks very much, Stephen, for your assistance.

 

As recommended, I'll replace sdc and consider replacing sdd. If sdd tests OK, at least I'll have a spare drive on hand for any future issues. At present I potentially have two failing drives, which is a precarious position to be in, so I'm keen to get back to having 4 working drives.

Message 12 of 12
Discussion stats
  • 11 replies
  • 1614 views
  • 0 kudos
  • 2 in conversation