
Forum Discussion

EMF2
Aspirant
Aug 27, 2024

ReadyNAS 4312X SMB shares intermittently hang/freeze

ReadyNAS 4312X, running firmware version 6.10.10. 

Disks are 6TB 7200rpm SATA drives, a mix of WD and Toshiba.

Network connection is bond0 over eth4/eth5 (the two 10Gbps NICs), configured LACP 802.3ad Layer 2.  Upstream switches are Meraki MS355s, running aggregation. 

Joined to Active Directory. SMBv3 encryption was set to "Desired"; I changed it to "Disabled" and saw no difference, so I will likely set it back to "Desired".

 

Every so often and quite randomly, the SMB share becomes unreachable.  For example, Python code attempting to read from or write to the NAS raises "OSError: [Errno 22] Invalid argument"; another tool reports "Can't find the file" for a file it read successfully on the previous iteration.
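
One way to narrow this down is to log exactly when each failure happens, so the timestamps can be compared against the NAS and switch logs. A minimal sketch, assuming the share is reachable as an ordinary file path; `read_with_retry`, its parameters, and the retry counts are hypothetical illustration and not the code from the job described above:

```python
import errno
import time
from datetime import datetime


def read_with_retry(path, attempts=5, delay=2.0, log=print, sleep=time.sleep):
    """Read a file, retrying on OSError and logging when each failure occurred.

    attempts and delay are arbitrary illustration values, not tuned settings.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with open(path, "rb") as handle:
                return handle.read()
        except OSError as exc:
            last_error = exc
            # Map the numeric errno (e.g. 22) to its symbolic name (e.g. EINVAL).
            name = errno.errorcode.get(exc.errno, "unknown")
            log(
                f"{datetime.now().isoformat()} attempt {attempt} failed: "
                f"errno={exc.errno} ({name}) path={path}"
            )
            if attempt < attempts:
                sleep(delay)
    # Every attempt failed: re-raise the last error so the caller still sees it.
    raise last_error
```

If the logged failures from several clients cluster into the same few seconds, that would point at the NAS or the network rather than at any one client.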

top/iostat on the system return reasonable iowaits (<4%), tps in the mid-teens at peak per disk, and read/write rates per disk well under SATA maximums. Truncated sample:

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           4.48   6.93     4.05     0.68    0.00  83.86

Device:    tps  kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda      15.40     121.60     669.40      608     3347
sdb      15.20     115.20     669.40      576     3347

... (x12 disks)

 

I don't see any errors in the /var/log/samba directory on the NAS, and the switches are not reporting flapping or interface issues.  Drop counters on the NICs are 0; the rx_no_dma_resources counter on each NIC is in the hundreds of thousands, though it does not seem to be rising.  I've already raised the ring buffer sizes through ethtool.
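
To confirm whether rx_no_dma_resources really stays flat across one of the drops, the counters could be snapshotted and diffed rather than eyeballed. A rough sketch, assuming `ethtool -S <interface>` prints its usual `name: value` lines on this firmware; the function names are made up for illustration:

```python
import subprocess


def parse_ethtool_stats(text):
    """Turn `ethtool -S` output into a {counter_name: int} dictionary."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        # Skip the "NIC statistics:" header and anything without a numeric value.
        if sep and value.lstrip("-").isdigit():
            stats[name.strip()] = int(value)
    return stats


def counter_deltas(before, after):
    """Return only the counters whose value changed between two snapshots."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


def read_stats(interface):
    """Run `ethtool -S` for one interface (needs ethtool and enough privileges)."""
    result = subprocess.run(
        ["ethtool", "-S", interface],
        capture_output=True, text=True, check=True,
    )
    return parse_ethtool_stats(result.stdout)
```

Taking `read_stats("eth4")` before and after a hang and passing both snapshots to `counter_deltas` would show which counters, if any, moved during the outage.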

 

Looking for ideas at this point, or if there's more information needed to help, I'll grab it...

8 Replies

  • Further information:
    When the connection drops, it's down for five or ten seconds for all clients at the same time.  If a client isn't actively trying to read or write to the share, they don't even notice.

    Date and time are configured to time-sync to the DCs and match.

    There are no other apps on the device; NFS and Rsync are enabled for support of some backup jobs, but almost completely unused.

    Audit logs are turned on, but they contain only login/logoff events, which don't seem to correlate with the drops.


    This network configuration predates the trouble with the dropping share, but might still somehow be relevant:
    bond0 runs subinterfaces for access segmentation and service control.
    bond0:   (native VLAN 6) IP address 192.168.6.X/255.255.255.0/192.168.6.1   DNS nas1.sec.domain.local
    bond0.2: (trunk VLAN 2)  IP address 192.168.2.X/255.255.255.0/192.168.2.1   DNS nas1.domain.local
    bond0.7: (trunk VLAN 7)  IP address 192.168.7.X/255.255.255.0/192.168.7.1   DNS nas1.dev.domain.local

    SSH/HTTPS/etc. are access-list blocked at the switches for all but VLAN 6, and you can only reach that VLAN through a firewalled management portal.  
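
Since the drops last five to ten seconds and hit every client at once, it might help to measure the outage windows directly and to check whether all three interfaces go quiet together. A minimal sketch, assuming TCP port 445 on the NAS is reachable from the probing machine; the function names, the one-second interval, and the sample count are hypothetical:

```python
import socket
import time


def smb_port_reachable(host, port=445, timeout=1.0):
    """Return True if a TCP connection to the SMB port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def collect_samples(host, count=60, interval=1.0):
    """Probe the host `count` times and return a list of (timestamp, ok) pairs."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), smb_port_reachable(host)))
        time.sleep(interval)
    return samples


def outage_windows(samples):
    """Collapse (timestamp, ok) samples into (start, end) outage windows.

    start is the first failed probe and end is the first successful probe
    after it; end is None if the host was still down at the last sample.
    """
    windows = []
    start = None
    for timestamp, ok in samples:
        if not ok and start is None:
            start = timestamp
        elif ok and start is not None:
            windows.append((start, timestamp))
            start = None
    if start is not None:
        windows.append((start, None))
    return windows
```

Running `collect_samples` against nas1.sec.domain.local, nas1.domain.local, and nas1.dev.domain.local in parallel and comparing the resulting windows would show whether the whole bond drops or only one VLAN. A successful TCP connect only proves the port answers, so a hang inside smbd itself could still go unnoticed by this probe.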


     

    • StephenB
      Guru - Experienced User

      Just wondering - is this happening when the client is starting to communicate with the NAS?  Or does it appear to be happening mid-flight?

      • EMF2
        Aspirant

        I know this is going to seem like a non-answer, but it's not: Yes to both, sort of.

        Sometimes it happens in the middle of a large file.  But the Python process I mentioned is processing literally tens of thousands of 2.5MB files, and when it breaks, it breaks for the next few seconds of attempts.  One could be driving the other, though: if I'm processing the small files while a large transfer is running, the large transfer will *also* fail.

        Leading the witness, but this seems like one of two things. It could be a load issue or service crash on the NAS, where the system just can't keep up with demand, though nothing in the logs or iostat/vmstat/netstat output supports that. Or it could be a network issue involving some sort of renegotiation of the connection. Nothing in the switch logs shows flapping, and the NAS logs don't mention service interruptions, but I freely admit I might not know where to look, especially on the NAS.
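
One place on the NAS that might show a renegotiation is the Linux bonding driver's status file, which keeps a link failure count for each slave interface. A sketch, assuming ReadyNAS OS 6 exposes the standard /proc/net/bonding/bond0 file (not confirmed in this thread); the function names are hypothetical:

```python
def parse_bond_slaves(text):
    """Extract per-slave status fields from bonding driver status text."""
    wanted = ("MII Status", "Link Failure Count", "Aggregator ID")
    slaves = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            # A new per-slave section starts here (e.g. eth4 or eth5).
            current = value
            slaves[current] = {}
        elif current is not None and key in wanted:
            slaves[current][key] = value
    return slaves


def read_bond_status(path="/proc/net/bonding/bond0"):
    """Read and parse the status file; the default path is an assumption."""
    with open(path, "r", encoding="utf-8") as handle:
        return parse_bond_slaves(handle.read())
```

A Link Failure Count that climbs between hangs, or two slaves reporting different Aggregator ID values, would suggest the LACP bond is worth a closer look even though the switches report no flapping.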
