ReadyNAS Duo crashes (or hangs) on big file copy

My ReadyNAS Duo crashes consistently while copying a large file. The first time, I copied a video file (300 MB) from one place on the NAS to another place on the NAS, using an NFS share and Nautilus on my Ubuntu PC. To eliminate NFS, I then copied the file on the ReadyNAS command line from the same origin to the same destination. During the copy I lose all connections from my PC to the NAS, so it is not clear whether the system crashes or hangs in some way. The ReadyNAS still responds correctly to the power button on the front of the Duo, but I cannot ping the NAS from anywhere, so the network is completely lost. The only thing I can do is power down with the button.

In the messages file the following line is present:


May 28 20:32:38 netgear kernel: rxS1(00ca0000),rxS0(008005ee)

I have seen the same message several times in the messages file. Sometimes it occurs just before the system hangs, but not always.
Once I saw this message appear (in tail -f /var/log/messages) immediately followed by the hang (or crash). At that time there was hardly any network activity, so it seems clear to me that the file copy, which had started just before, is the cause of the hang.
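
To see how often the message appears and when, the log can be searched for it. The following is only a minimal sketch: the function names `count_rxs` and `last_rxs` are made up for illustration, and the search pattern is simply the `kernel: rxS` text from the line quoted above.

```shell
#!/bin/sh
# Hypothetical helpers (names invented for this sketch).
# The pattern comes from the quoted log line: "kernel: rxS1(...),rxS0(...)".

# Print how many lines in the given log file contain the rxS kernel message.
count_rxs() {
    grep -c 'kernel: rxS' "$1"
}

# Print the most recent rxS lines (default: 5), timestamps included.
last_rxs() {
    grep 'kernel: rxS' "$1" | tail -n "${2:-5}"
}

# Example (on the NAS): count_rxs /var/log/messages
```

Comparing the timestamps printed by `last_rxs` with the times of the hangs would show whether the message really coincides with them.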

The SMART data of the two disks shows the following information:


Model: 	SAMSUNG HD103UJ
Serial: 	S13PJ9AS705549
Firmware: 	1AA01118

SMART Attribute
	
Raw Read Error Rate	0
Spin Up Time	9690
Start Stop Count	12119
Reallocated Sector Count	0
Seek Error Rate	0
Seek Time Performance	0
Power On Hours	39613
Spin Retry Count	0
Calibration Retry Count	0
Power Cycle Count	25
Read Soft Error Rate	0
Runtime Bad Block	0
End-to-End Error	0
Reported Uncorrect	0
Command Timeout	0
Airflow Temperature Cel	34
Temperature Celsius	36
Hardware ECC Recovered	43898143
Reallocated Event Count	0
Current Pending Sector	0
Offline Uncorrectable	0
UDMA CRC Error Count	0
Multi Zone Error Rate	0
Soft Read Error Rate	0
	
ATA Error Count	0

Extended Attribute
	
Hot-add events	0
Hot-remove events	0
Lp stat events	0
Power glitches	0
Hard disk resets	0
Retries	0
Repaired sectors	0

and the following for the second disk:


Model: 	SAMSUNG HD103UJ
Serial: 	S13PJ90S761084
Firmware: 	1AA01118

SMART Attribute
	
Raw Read Error Rate	34
Spin Up Time	9950
Start Stop Count	12205
Reallocated Sector Count	0
Seek Error Rate	0
Seek Time Performance	0
Power On Hours	39603
Spin Retry Count	0
Calibration Retry Count	0
Power Cycle Count	23
Read Soft Error Rate	0
Runtime Bad Block	0
End-to-End Error	0
Reported Uncorrect	0
Command Timeout	0
Airflow Temperature Cel	33
Temperature Celsius	35
Hardware ECC Recovered	19397
Reallocated Event Count	0
Current Pending Sector	0
Offline Uncorrectable	0
UDMA CRC Error Count	0
Multi Zone Error Rate	1
Soft Read Error Rate	0
	
ATA Error Count	0

Extended Attribute
	
Hot-add events	0
Hot-remove events	0
Lp stat events	0
Power glitches	0
Hard disk resets	0
Retries	0
Repaired sectors	0

The first disk has a high Hardware ECC Recovered value (it always has been high).
The second disk now shows a nonzero Raw Read Error Rate (34), where it used to be 0 (I think).

Otherwise I don't see any strange things in the log files.

Does anybody know what could be the cause of my problems?
In other posts this behaviour is also described, but I think I can rule out any network-related problem. To me it seems to be a disk problem.
What can I do to confirm and remedy this problem? After a resync and deleting some files, I can no longer reproduce it.

StephenB
Guru - Experienced User
May 29, 2014
I don't see anything concerning in the SMART stats. Reallocated and Pending Sector counts are 0, and they are the most useful stats for determining drive health. ATA errors have multiple causes and are worth a closer look, but you don't have any.

Hardware ECC Recovered and Raw Read Error Rate have vendor-specific formats, and without knowing the specific way Samsung formats those parameters it is not possible to analyze their significance.

In any event, if there had been a write error to the drive, the reallocated sector count would have increased. If there had been a read error from the drive, then the pending sector count would have increased. Neither happened.

If you have SSH installed, you can try copying from the command shell - which might help you isolate the problem if it begins to happen again.
tony359
Apprentice
May 29, 2014
Do you have a switch between the PC and the NAS? Can you try connecting the NAS directly and see what happens?
rocus
Aspirant
May 30, 2014
As I tried to explain in my post, I did the copy on the Netgear's command line (the first thing I did when I bought my NAS was to install ssh). During the copy everything happened on the NAS, and there was hardly any network traffic except for the shell output and a tail -f messages in another shell window. So there is hardly any doubt that the file copy (reading and writing on the NAS disks), and not network traffic, was the cause of the crash.

I did a few things on the NAS and the error is now not reproducible.
Does the error message mean anything to anybody?
Is there a way to check the two disks at a low level?
StephenB
Guru - Experienced User
May 30, 2014
What I was trying to suggest was to do copies exclusively on the NAS (from one share to another) using ssh, in order to exercise the disks apart from the network. There is no evidence in your previous posts that you tried this.

You can also try copying the C volume to /dev/null. Checking the file system (either manually or with a reboot) might also flush out an issue.
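
As a rough sketch of the read test, assuming the data volume is mounted at /c: the helper below reads every regular file under a directory and throws the data away, so the disks are exercised without writing anything or using the network. The name `read_all` is made up for illustration, and an older `find` may not support `-exec ... +`, in which case `\;` can be used instead.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this sketch).
# Reads every regular file under a directory and discards the data.
# Assumption: the data volume is mounted at /c.
read_all() {
    dir="${1:-/c}"
    # -xdev keeps find on one filesystem; cat reports any file it cannot read.
    find "$dir" -xdev -type f -exec cat {} + > /dev/null
}

# Example (on the NAS): read_all /c && echo "read pass finished"
```

If the NAS hangs during a pure read pass like this, writes and the network can probably be ruled out as triggers.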

Another possibility is that the PSU is failing - that when you put the NAS under sustained load, it is not delivering enough power.

BTW, you might also check for OS partition fullness. That can create strange NAS failures, and is easy to do with ssh.

The disks themselves look ok. You can test them with vendor tools - I think Seagate's SeaTools diag now recognizes Samsung drives (since they acquired Samsung's disk division), but I am not positive. That would need to be done on a Windows PC.
rocus
Aspirant
May 30, 2014
Then I explained myself poorly. At first the problem appeared when I copied files from one place on the NAS to another place on the NAS. I did this with Nautilus (file manager) on Ubuntu with cut and paste. Because the file then goes to and from the PC, this generates heavy network traffic. I then got the idea to do the same copy on the command line of the NAS (thereby eliminating network traffic, and also much faster of course). I got the same error (hang/crash). Then I did the copy while watching the tail of the messages file in another shell. It was then clear that the error message (rxS ...) coincides with the hang/crash. As the power button still works, I think the system is still running but the network connection has failed. (I don't see a connection between the copy and the network, except that both involve the kernel of course.)

My filesystems are not full (by far).

I am thinking of running a script that logs the processes, so that I can see something when the system hangs (or the network goes down).
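
Such a logging script could look roughly like the sketch below. It is only an illustration: the function name `log_snapshot` and the log path in the example are invented, and it assumes /proc is available on the NAS. Each snapshot is followed by a sync so the last entries survive a forced power-off, which may slow the copy a little.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this sketch).
# Appends a timestamped snapshot of load, memory and the process list to a
# log file, then syncs so the data is on disk before a hard power-off.
log_snapshot() {
    logfile="$1"
    {
        echo "=== $(date) ==="
        cat /proc/loadavg
        cat /proc/meminfo
        # Fall back to plain ps if this ps does not accept "ax".
        ps ax || ps
    } >> "$logfile" 2>&1
    sync
}

# Example (on the NAS, stop with Ctrl-C; the log path is an assumption):
#   while true; do log_snapshot /c/hang-snapshots.log; sleep 10; done
```

After a hang, the last snapshot in the log shows what was running and how much memory was free shortly before the network went away.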

I will swap the PSU with another netgear PSU and see what happens.

I will try to find the disk tool.

Thanks very much.
StephenB
Guru - Experienced User
May 30, 2014
The OS partition is not the data volume (some people do seem to get confused on that). It has a 2 GB limit on the v1 platforms, and it can get full. Losing network connections during a file transfer is one symptom of a filling OS partition - which is why I suggested checking that, if only to rule it out.
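
A quick way to check this over ssh, as a sketch only: the helper below prints how full the filesystem holding a given path is, as a bare number. The name `fs_use_pct` is made up for illustration, and the 90 percent threshold in the example is an arbitrary assumption, not a NETGEAR figure.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this sketch).
# Prints the use percentage of the filesystem holding the given path
# (default: /) as a bare integer, based on POSIX-format df output.
fs_use_pct() {
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Example (on the NAS; the 90 threshold is an arbitrary assumption):
#   [ "$(fs_use_pct /)" -ge 90 ] && echo "OS partition is nearly full"
```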
rocus
Aspirant
Jun 01, 2014
The root partition is only half full. (df)

Filesystem   1k-blocks      Used Available Use% Mounted on
/dev/hdc1      2031872    933376   1098496  46% /
tmpfs               16         0        16   0% /USB
/dev/c/c     970385648 571441264 398944384  59% /c

Of course I googled the error message and came to the same posts as you. They all seem to focus on network issues. I think I have ruled that out, because the problem also occurred with hardly any network traffic.

I see a suggestion about memory. How do I do a memory test?

The next time the problem occurs, and is reproducible, what should I do/try to narrow down the problem?
StephenB
Guru - Experienced User
Jun 01, 2014
rocus wrote:
How do I do a memory test?
http://www.downloads.netgear.com/files/GDC/RND2110/Duov1_NV+v1_HW_en_06Dec11.pdf pages 15-16

The next time the problem occurs, and is reproducible, what should I do/try to narrow down the problem?
Check the SMART stats and run fsck to start with.
gregenz
Aspirant
Nov 23, 2014
Hi folks, did you sort this one out?
I have a similar case and am also wondering if it is memory. I don't believe it used to do this, but now it works until you queue up a whole lot of files, and then it hangs just as you have described on big files, as far as I can tell. It always comes back after a power cycle, but quite quickly quits again with the same problem. Something is 'hitting a limit' somewhere that triggers a major event.

It would be great to know if the cause has been found. I tried a few changes to the network settings (manual link negotiation etc.), but that has not helped much. It almost feels to me like it's the network I/F that falls over, but I'm not sure why I think that. But if memory had some errors high up, it could do this, I guess.
Greg E
mdgm-ntgr
NETGEAR Employee Retired
Nov 23, 2014
gregenz did you check the fullness of the OS partition?

# df -h
# df -i

This is happening on your Duo, right?