Forum Discussion
q3d
Jun 20, 2021 · Aspirant
RN316 Corrupting files on write
I have a similar issue to RN316-Corrupt-Files (1855098#M39773) and want to avoid factory resetting the NAS. Random files written to the NAS seem to be getting corrupted when re-checking them after tr...
q3d
Jun 25, 2021 · Aspirant
Thanks, found it. The kernel logs have several different types of errors.
There are a few entries from a 'mount -a' command with errors (a network drive not mounting because of a network connection issue, or the network device being off at the time, etc.). However, could a 'mount -a' affect the already-existing RAID mount in any way?
It's also possible one of the network cables is/was faulty and some mid-writes aborted due to timeouts - could transfers aborted by mid-write network timeouts also cause BTRFS write issues?
I also ran 'btrfs scrub' with -r (read-only) on the RAID, and found 123 read errors within the first few GB of the scan, but that's all.
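For reference, a read-only scrub is typically started and monitored like this (the /data mount point is the usual OS 6 default, but treat the path as an assumption):

btrfs scrub start -r /data   # read-only pass: reports errors but does not fix them
btrfs scrub status /data     # shows progress and per-type error counts while it runs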
I'll check for other errors in the logs. However, before doing anything further, I'll do some physical/hardware cleanups - reseat the drives, swap out cables, etc. - and monitor it over a few days. The RN316 is several years old and remotely isolated, but probably needs some physical checkups.
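Alongside the physical checks, SMART data for each disk is worth a look; smartctl is generally available over SSH on OS 6 (the /dev/sda device name is an assumption - repeat for each member disk):

smartctl -a /dev/sda | grep -Ei 'reallocated|pending|uncorrect'   # key failure indicators
smartctl -t short /dev/sda                                        # optionally kick off a short self-test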
StephenB
Jun 25, 2021 · Guru - Experienced User
q3d wrote:
It's also possible one of the network cables is/was faulty and some mid-writes aborted due to timeouts - could transfers aborted by mid-write network timeouts also cause BTRFS write issues?
Connection timeouts couldn't cause BTRFS write issues, though of course they could result in lost writes.
You are using both network connections? If so, what form of network aggregation are you using (LACP, etc.)?
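For what it's worth, on a Linux-based NAS the active bonding mode can usually be read straight from /proc (bond0 is an assumed interface name):

cat /proc/net/bonding/bond0   # shows e.g. "Bonding Mode: adaptive load balancing" plus per-slave link status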
q3d wrote:
I also ran 'btrfs scrub' with -r (read-only) on the RAID, and found 123 read errors within the first few GB of the scan, but that's all.
What volume were you scrubbing? The OS partition (md0) is also BTRFS.
Read errors obviously would account for it. Either there are disk errors involved, or the file system contains entries that somehow point to non-existent disk sectors.
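Since md0 is mounted at the root on OS 6, the same read-only scrub can be pointed at it (mount point assumed):

btrfs scrub start -r /   # scrub the OS partition (md0)
btrfs scrub status /     # check the result once it completes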
- q3d · Jun 25, 2021 · Aspirant
StephenB wrote:
q3d wrote: It's also possible one of the network cables is/was faulty and some mid-writes aborted due to timeouts - could transfers aborted by mid-write network timeouts also cause BTRFS write issues?
Connection timeouts couldn't cause BTRFS write issues, though of course they could result in lost writes.
You are using both network connections? If so, what form of network aggregation are you using (LACP, etc.)?
Lost writes? As in, the file transfer would abort completely, e.g. clean aborts?
I'm using both NICs - I changed the bonding a few times over the last several months (the BTRFS errors started around April '21), and I'm currently using Adaptive Load Balancing. I have one NAS NIC going to a router, and the other going through a managed switch which is managed by the same router - i.e. same /24 network range, etc. Not sure if going through different devices could cause issues with Adaptive Load Balancing.
I also noticed a few SMB failures over IPv6 - not sure why IPv6 DHCP is enabled (the LAN is IPv4-focused).
../source3/smbd/smb2_read.c:258(smb2_sendfile_send_data) smb2_sendfile_send_data: sendfile failed for file (Input/output error) for client ipv6:fe80::4425:12a8:14fd:b7fb:54365. Terminating
StephenB wrote:
q3d wrote: I also ran 'btrfs scrub' with -r (read-only) on the RAID, and found 123 read errors within the first few GB of the scan, but that's all.
What volume were you scrubbing? The OS partition (md0) is also BTRFS.
Read errors obviously would account for it. Either there are disk errors involved, or the file system contains entries that somehow point to non-existent disk sectors.
The scrub was on /dev/md127 - it appears in Frontview while running, but the console lets me capture live data as it scrubs. Once done, I can also check /dev/md0 just in case.
Any idea what md126 is? I get a stack of the following in kernel.log (it says the BTRFS warnings come from device md126, but the I/O errors are on /dev/md127):
Jun 24 14:06:09 NAS kernel: BTRFS warning (device md126): i/o error at logical 68479287296 on dev /dev/md127, sector 135862144, root 272, inode 1215727, offset 57511936, length 4096, links 1 (path: *****)
Jun 24 14:06:09 NAS kernel: BTRFS error (device md126): bdev /dev/md127 errs: wr 61280, rd 690, flush 0, corrupt 0, gen 0
...
Jun 24 14:06:09 NAS kernel: BTRFS error (device md126): bdev /dev/md127 errs: wr 61280, rd 694, flush 0, corrupt 0, gen 0
Jun 24 14:06:09 NAS kernel: BTRFS error (device md126): bdev /dev/md127 errs: wr 61280, rd 695, flush 0, corrupt 0, gen 0
...
Also, just prior to the above, I get this:
Jun 24 12:45:30 NAS kernel: nfsd: last server has exited, flushing export cache
Jun 24 12:45:30 NAS kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Jun 24 12:45:30 NAS kernel: NFSD: starting 90-second grace period (net ffffffff88d74240)
Jun 24 13:58:10 NAS kernel: CIFS VFS: Send error in QFSUnixInfo = -13
Jun 24 14:01:11 NAS kernel: CIFS VFS: Send error in QFSAttributeInfo = -13
Jun 24 14:04:11 NAS kernel: CIFS VFS: cifs_mount failed w/return code = -13
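As an aside, the per-device counters in those BTRFS messages (wr/rd/flush/corrupt/gen) can also be dumped on demand, which makes it easier to watch whether they are still growing (the /data mount point is an assumption):

btrfs device stats /data      # cumulative error counters for each member device
btrfs device stats -z /data   # same, but resets the counters after printing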
- q3d · Jun 25, 2021 · Aspirant
Any idea on how to check if the filesystem points to non-existent disk sectors without having to count and compare each sector? ;)
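For what it's worth, the usual tool for this is a read-only consistency check with btrfs check, which walks the metadata and reports broken references without touching the disk. The volume must be unmounted first, which on a ReadyNAS generally means booting into a maintenance/tech-support mode (device name assumed):

btrfs check --readonly /dev/md127   # verify metadata only; makes no changes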
- StephenB · Jun 25, 2021 · Guru - Experienced User
q3d wrote:
Lost writes? As in, the file transfer would abort completely, e.g. clean aborts?
If the connection drops, then anything already cached in the NAS would still be written. But the transfer from the client device would obviously stop. Any recovery would be up to the client software. I believe Windows would not restart the transfer, so the new file would simply be truncated. A rewritten file would also have old content in whatever sectors weren't copied over.
But you wouldn't see any BTRFS errors from that.
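One way to hunt for files damaged by interrupted transfers is a checksum comparison against the source copies, e.g. with rsync in checksum mode (the paths are placeholders):

rsync -rcn --itemize-changes /path/on/client/ /data/share/   # -c: compare checksums, -n: dry run; lists mismatched files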
q3d wrote:
not sure why IPv6 DHCP is enabled (the LAN is IPv4-focused).
The PCs likely still have IPv6 link-local addresses allocated, and can use those to write to the NAS. I suggest disabling IPv6 on the NAS if your network doesn't use it.
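On ReadyNAS OS 6 that is normally done per interface from the admin UI (Network settings). The generic Linux equivalent, which may not persist across reboots, is:

sysctl -w net.ipv6.conf.all.disable_ipv6=1   # disable IPv6 on all interfaces until reboot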
q3d wrote:
The scrub was on /dev/md127 - it appears in Frontview while running, but the console lets me capture live data as it scrubs. Once done, I can also check /dev/md0 just in case. Any idea what md126 is? I get a stack of the following in kernel.log (it says the BTRFS warnings come from device md126, but the I/O errors are on /dev/md127).
You have two RAID groups (md126 and md127). At some point in the past you vertically expanded your initial volume. When you did that, the NAS created new partitions (using the new "vertical" space), then created a second RAID group. The two groups are concatenated into a single volume, so this is transparent to NAS users.
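A minimal sketch of how to see this layout for yourself, assuming the volume is mounted at /data:

cat /proc/mdstat              # lists the md RAID groups (md126, md127, the OS md0, etc.) with their member partitions
btrfs filesystem show /data   # shows both md devices as members of the single BTRFS volume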
If the btrfs errors in the scrub are not associated with disk errors, then it's clear that the file system has somehow gotten corrupted. Personally I'd create a new one, and then restore the files from backup. But you can attempt a btrfs repair if you prefer.
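If you do try the repair route, a cautious first step is a dry run of btrfs restore, which lists what is recoverable without writing anything, before reaching for the riskier btrfs check --repair (device name assumed; run against the unmounted volume):

btrfs restore -D /dev/md127 /tmp/restore   # -D: dry run, only lists the files it would recover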
- q3d · Jun 29, 2021 · Aspirant
StephenB wrote:
q3d wrote: Any idea what md126 is? I get a stack of the following in kernel.log (it says the BTRFS warnings come from device md126, but the I/O errors are on /dev/md127).
You have two RAID groups (md126 and md127). At some point in the past you vertically expanded your initial volume. When you did that, the NAS created new partitions (using the new "vertical" space), then created a second RAID group. The two groups are concatenated into a single volume, so this is transparent to NAS users.
Yes, the drives were expanded from 4 TB to 8 TB. I'm a little concerned that md126 is showing errors, especially if it's an old RAID group. Would errors on md126 cause issues with md127, or with the way they're concatenated?
Jun 28 22:54:12 NAS kernel: btrfs_dev_stat_print_on_error: 113 callbacks suppressed
Jun 28 22:54:12 NAS kernel: BTRFS error (device md126): bdev /dev/md126 errs: wr 738, rd 744, flush 0, corrupt 92, gen 0
Jun 28 22:54:12 NAS kernel: BTRFS error (device md126): bdev /dev/md126 errs: wr 739, rd 744, flush 0, corrupt 92, gen 0
...etc.
Also, the /dev/md0 scrub had no errors/issues.
I also noticed that at some point there was an out-of-memory error (which killed a running app I was using to test speeds over the NIC). I'm a little concerned that this may cause BTRFS corruption (there have been prior reports of BTRFS corruption tied to memory issues; I'm not sure what OS 6.10+ has in the way of BTRFS patches, etc.).
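The OOM kill should be timestamped in the kernel log, which makes it easy to correlate with the first BTRFS errors (the log path is an assumption; on OS 6 the downloadable logs include kernel.log):

grep -Ei 'out of memory|oom-killer' /var/log/kernel.log*   # find OOM events and their timestamps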
I've run several memory tests and no memory errors were found, so something else must have caused these BTRFS write errors. Before I do anything drastic and fix/reformat/wipe the NAS, I'd like to find the cause, since if it still happens afterwards I won't be happy.
Are you aware of any prior reports of BTRFS corruption on RN3xx/OS 6.10+, and was any root cause identified? Has there been any input from the NG team on this issue, or is it suspected to be a non-NG-related issue (i.e. a CPE issue only)?