BTRFS DMESG error on Readynas Pro

sandroid · ‎2020-04-21

I've gotten a lot of mileage out of my hacked-up ReadyNAS Pro since moving it up to OS 6, and I've been very happy overall. I'm now moving from 6TB WD Reds to 8TB WD Reds, and I recently noticed an error in dmesg.

One of the volumes has a write error of some sort, and the wr count keeps incrementing:

[1195510.074643] BTRFS error (device md125): bdev /dev/md125 errs: wr 10614, rd 0, flush 0, corrupt 0, gen 0

Here is the layout of my device.

root@qubert:~/incoming# btrfs fi usage /data/
Overall:
    Device size:                  29.08TiB
    Device allocated:             24.98TiB
    Device unallocated:            4.11TiB
    Device missing:                  0.00B
    Used:                         24.89TiB
    Free (estimated):              4.20TiB      (min: 2.14TiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,single: Size:24.92TiB, Used:24.83TiB
   /dev/md124     34.00GiB
   /dev/md125      7.88TiB
   /dev/md126      3.96TiB
   /dev/md127     13.04TiB

Metadata,RAID1: Size:30.00GiB, Used:28.78GiB
   /dev/md125      9.00GiB
   /dev/md126     26.00GiB
   /dev/md127     25.00GiB

System,RAID1: Size:32.00MiB, Used:2.72MiB
   /dev/md125     32.00MiB
   /dev/md126     32.00MiB

Unallocated:
   /dev/md124      1.79TiB
   /dev/md125      1.20TiB
   /dev/md126    573.87GiB
   /dev/md127    573.46GiB

I suspect some of this may be because the existing setup was more than 90% used, and it may resolve itself. Is this something I should be concerned about, and would a defrag or balance help here?

And is there a guide on the best way to upgrade the device? I have a 6-disk RAID5 and am replacing the 6TB drives with 8TB ones, one by one. I've replaced two so far.

TIA!!!

StephenB · ‎2020-04-21

I'd definitely be concerned about the error message. Can you look at the underlying SMART stats for the disks?

I suggest using

# smartctl -x /dev/sda

and so on for the rest of the disks.

The -x will give you access to an error history - which turned up some UNCs on a couple of my own WD60EFRX drives that I wasn't aware of.

Maybe also look for disk errors in the log. Once you locate the disk, you can replace it next. If it happens to be one of the new ones, exchange it with the seller.

sandroid · ‎2020-04-21

Hey StephenB,

Thanks for replying. I can't seem to figure out how to isolate the problem at the actual disk level, only at the array level. The smartctl output is undecipherable to me, as I don't see any fields which indicate problems.

Poking around under the covers, it looks like these arrays are made when new disks are added, but can live on for a while. Is there a way to force the removal of old arrays? The other question is whether any of the maintenance tasks would aggravate (or potentially fix) these problems.

I have a back up of the data so I'm thinking about just continuing the upgrade, in the event the problem is being caused by one of the older drives which is on the way out anyway.

Thanks for your help!

StephenB · ‎2020-04-21

@sandroid wrote:

Thanks for replying. I can't seem to figure out how to isolate the problem at the actual disk level, only at the array level. The smartctl output is undecipherable to me, as I don't see any fields which indicate problems.

Then look in system.log and kernel.log for btrfs errors. Likely you will see a disk i/o error nearby.
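For anyone following along, here is a minimal Python sketch of that log search. It pulls the counters out of BTRFS error lines like the one in the first post and collects any lines mentioning "I/O error" within a few lines on either side. The function names and the 20-line window are illustrative choices, and the "I/O error" substring match is an assumption about how disk errors are worded in these logs.

```python
import re

# Matches the counter summary btrfs prints, as in the dmesg line from the first post.
BTRFS_ERRS = re.compile(
    r"BTRFS error \(device (?P<dev>\w+)\): bdev \S+ errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_btrfs_errs(line):
    """Return (device, counters) for a BTRFS error summary line, else None."""
    match = BTRFS_ERRS.search(line)
    if match is None:
        return None
    counters = {k: int(v) for k, v in match.groupdict().items() if k != "dev"}
    return match.group("dev"), counters


def io_errors_near(lines, window=20):
    """For each BTRFS error line, list nearby lines that mention 'I/O error'.

    Returns a list of (line_index, nearby_lines) tuples. The substring match
    is an assumption; adjust it to whatever your kernel.log actually shows.
    """
    hits = []
    for i, line in enumerate(lines):
        if parse_btrfs_errs(line) is None:
            continue
        lo, hi = max(0, i - window), min(len(lines), i + window + 1)
        nearby = [l for l in lines[lo:hi] if "i/o error" in l.lower()]
        hits.append((i, nearby))
    return hits
```

To use it, read kernel.log (or saved dmesg output) into a list of lines and pass that to `io_errors_near`. An empty `nearby` list for every hit would match what sandroid reports below: BTRFS errors with no disk-level error next to them.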

sandroid · ‎2020-04-21

I searched the logs and found no sdX errors there. I have Docker installed, which generates a lot of networking-related error messages, and that limits how far back the logs go. I'll try the defrag/balance/scrub and see if anything else shakes loose before adding the next disk.

Thanks for your help.

StephenB · ‎2020-04-21

@sandroid wrote:

The smartctl command is undecipherable to me as I don't see any fields which indicate problems.

Of course there are the usual stats - command timeouts, pending sectors, reallocated sectors.

-x gives you an error log that starts with

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 14
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

You might find stuff like this in that section:

Error 14 [13] occurred at disk power-on lifetime: 44522 hours (1855 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 01 4b 81 cb 40 40 00  Error: UNC at LBA = 0x14b81cb40 = 5561764672

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 01 80 00 e8 00 01 4b 81 cb 40 40 08  8d+00:41:33.721  READ FPDMA QUEUED
  60 01 80 00 e0 00 01 4b 81 c9 40 40 08  8d+00:41:33.721  READ FPDMA QUEUED
  60 01 80 00 d8 00 01 4b 81 c7 40 40 08  8d+00:41:33.720  READ FPDMA QUEUED
  60 01 80 00 d0 00 01 4b 81 c5 40 40 08  8d+00:41:33.720  READ FPDMA QUEUED
  60 01 80 00 c8 00 01 4b 81 c3 40 40 08  8d+00:41:33.719  READ FPDMA QUEUED

This particular error is a UNC (short for uncorrectable).
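If reading that section by eye is painful, a short Python sketch like the following can pull out the error count and the logged LBAs from saved `smartctl -x` output. It only relies on the two line formats quoted above ("Device Error Count: N" and "Error: UNC at LBA = 0x... = ..."); the function name and the returned dictionary layout are made up for illustration, and it is not a full parser for smartctl output.

```python
import re

# Line formats taken from the smartctl -x excerpt quoted in this post.
COUNT_LINE = re.compile(r"Device Error Count:\s*(\d+)")
ERROR_LINE = re.compile(
    r"Error: (?P<kind>\w+) at LBA = 0x(?P<hex>[0-9a-fA-F]+) = (?P<dec>\d+)"
)


def summarize_error_log(text):
    """Summarize the extended error log section of smartctl -x output.

    device_error_count is None when no 'Device Error Count' line is present.
    errors is a list of (kind, lba) tuples, e.g. ("UNC", 5561764672).
    """
    count = COUNT_LINE.search(text)
    errors = [
        (m.group("kind"), int(m.group("dec"))) for m in ERROR_LINE.finditer(text)
    ]
    return {
        "device_error_count": int(count.group(1)) if count else None,
        "errors": errors,
    }
```

Save the output of `smartctl -x /dev/sda` to a file for each disk and pass the file contents to `summarize_error_log`. A disk with a non-empty `errors` list is the one to look at first.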

sandroid · ‎2020-04-21

Here's a pastebin of the output from all the drives: https://pastebin.com/KqnUukd9

Maybe you can spot the problem, but nothing like you showed me jumped out.

Thanks!

StephenB · ‎2020-04-21

@sandroid wrote:

Here's a pastebin of the output from all the drives: https://pastebin.com/KqnUukd9

Nothing is showing up for me either - everything looks healthy there.

Sandshark · ‎2020-04-22

@sandroid wrote:

Poking around under the covers, it looks like these arrays are made when new disks are added, but can live on for a while. Is there a way to force the removal of old arrays? The other question is whether any of the maintenance tasks would aggravate (or potentially fix) these problems.

What the NAS does as you incrementally upgrade the drives is "stack" arrays on top of each other to make the volume larger. Those "old arrays" are still in use. The only way to be rid of them is to destroy the volume or factory default and start over, recovering the data from backup. With the kind of error you are seeing, that could be a good idea anyway. The process of syncing the array may trigger some SMART or other errors that are easier to decipher.

As far as living with it for a while, you are on shaky ground. Your volume may completely fail, so make sure you keep your backup up to date. But if the files are becoming corrupt while the volume stays intact, you may be backing up corrupt files.

If you need a better illustration of how the NAS "stacks" arrays, let's say you start with 4 x 1TB drives in a 4-bay NAS (4 drives horizontally, 1TB vertically). If you replace two drives with 4TB, it adds a 2 x 3TB array on top of the 4 x 1TB one. As you replace more, it horizontally expands that second layer. If you then start replacing with 6TB, it adds another 2TB high layer. It never vertically expands or deletes an MDADM array. Since BTRFS volumes can span more than one MDADM array, it's unnecessary. So, while the volume expands vertically, the arrays don't; the arrays get added to.
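The stacking described above can be replayed with a small Python model. This is a sketch of the explanation in this post, not of the actual ReadyNAS expansion code: the names `layers` and `replace_drive` are made up, sizes are in whole TB, and it assumes a new layer only appears once at least two drives are taller than the current top layer.

```python
def layers(sizes, boundaries):
    """Return (width, height) for each stacked array, bottom layer first.

    boundaries holds the cumulative layer tops, starting at 0. A drive takes
    part in a layer if it is at least as tall as that layer's top.
    """
    return [
        (sum(1 for s in sizes if s >= hi), hi - lo)
        for lo, hi in zip(boundaries, boundaries[1:])
    ]


def replace_drive(sizes, boundaries, slot, new_size):
    """Swap one drive and add a new layer if two drives now exceed the top.

    Existing boundaries are never changed or removed, mirroring the point
    that an array is never vertically expanded or deleted.
    """
    sizes[slot] = new_size
    taller = sorted(s for s in sizes if s > boundaries[-1])
    if len(taller) >= 2:
        # The new layer can only be as tall as the second-largest drive.
        boundaries.append(taller[-2])
```

Starting from `sizes = [1, 1, 1, 1]` and `boundaries = [0, 1]`, replacing two drives with 4TB gives layers of 4 x 1 and 2 x 3. Replacing the other two widens the second layer to 4 x 3, and two 6TB drives then add a third layer of 2 x 2.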

sandroid · ‎2020-04-22

Thanks for the explanation, Sandshark! I have 2 active ReadyNAS boxes, a Pioneer Pro and an Ultra4 (a poor old NV+ sits maxed out for emergencies). I back up to the Ultra4, so I can re-create the data if needed. I haven't kept up with Replicate/DR; I only manually rsync individual volumes, but I suppose I could do something fancier if I need to rebuild the main box to get rid of this bad volume.

Update on

I ran the defrag and balance, which both ran for a minute and didn't do anything, then I patched (from 6.9.5 hotfix 1 to 6.9.6) and rebooted. The volume which was having the problem changed id. I'm not sure if the id changed on the same volume or the errors moved to a different volume; I think the former.

Now I'm running a scrub, and the error which was incrementing by one every 5 minutes has stopped incrementing. We'll see what btrfs says at the end of the scrub, and then decide if erasing the main box and restoring from backup is the way to go.

Thanks for your help.

sandroid · ‎2020-04-23

Well, that didn't work. The scrub stopped the incrementing errors temporarily, but crashed the system when it got to a certain point, and the errors remain. I'm going to reset the box and copy over the backups, since that seems like the best way to clear everything in the long run, and it would take a week or two to expand the volume at this rate anyway.

I'd like to do one last backup, but I'm afraid I will overwrite good files with junk (although there's no evidence of problems). I wish there was a btrfs tool which would tell me which files were involved in the failed io writes.

Sandshark · ‎2020-04-23

If you have room for a separate backup from your main one, make that new one and use a utility to do a file compare. You'll have to use one that compares more than just name and date. Then you can decide (most likely based on date) which are genuinely new or modified and which of the new ones are likely corrupt.
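As one way to do that compare, here is a minimal Python sketch that hashes file contents in both backups rather than trusting name and date. The function names and the result layout are illustrative, and it assumes both backups are mounted as ordinary directory trees. A file whose content differs while its modification time is unchanged is flagged, since that is the pattern you would expect from silent corruption rather than a real edit.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def relative_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def compare_trees(old_root, new_root):
    """Compare two backups by content.

    'changed' lists (relative_path, same_mtime) for files whose content
    differs; same_mtime=True means the date did not change but the data did.
    """
    old_files, new_files = relative_files(old_root), relative_files(new_root)
    changed = []
    for rel in sorted(old_files & new_files):
        old_path = os.path.join(old_root, rel)
        new_path = os.path.join(new_root, rel)
        if sha256_of(old_path) != sha256_of(new_path):
            same_mtime = int(os.path.getmtime(old_path)) == int(
                os.path.getmtime(new_path)
            )
            changed.append((rel, same_mtime))
    return {
        "changed": changed,
        "only_old": sorted(old_files - new_files),
        "only_new": sorted(new_files - old_files),
    }
```

Call `compare_trees` with the older backup as the first argument and the fresh one as the second. Entries in `changed` with `same_mtime` set to `True` are the ones to restore from the older copy, while the rest are probably genuine modifications.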
