Aspirant

ReadyNAS Pro Business Spontaneous Reboots 16779147/19770493

Submitted to the forum in case anyone has any thoughts.

This ReadyNAS Pro Business appears to reboot sporadically. The symptoms are as follows:

1. Backup jobs originating on this NAS stop and show a status of "cancelled" on the backup schedule page.
2. Two other ReadyNAS devices that share the same UPS and use network monitoring briefly report "UPS Communication error", followed by "Communication with UPS OK" five seconds or so later.
3. The volume enters a Resync. Each time this happens, the Resync restarts.

Steps I have taken to diagnose:

1. Review Hard Disk SMART errors. All drives have a handful of errors similar to this one:

SMART Information for Disk 1

Model: Hitachi HDS723030ALA640
Serial: MK0331YHGUL57A
Firmware: MKAOA5C0
SMART Attribute
Raw Read Error Rate 0
Throughput Performance 86
Spin Up Time 613
Start Stop Count 19
Reallocated Sector Count 0
Seek Error Rate 0
Seek Time Performance 26
Power On Hours 187
Spin Retry Count 0
Power Cycle Count 10
Power-Off Retract Count 21
Load Cycle Count 21
Temperature Celsius 41
Reallocated Event Count 0
Current Pending Sector 0
Offline Uncorrectable 0
UDMA CRC Error Count 1
ATA Error Count 1

2. Run Hard Disk tests from Boot Menu. No errors.
3. Run Memory tests from Boot Menu. No errors.
4. Reseat existing memory module. Situation persists.
5. Replace memory module. Situation persists.
6. Perform a factory reset. Situation persists.

My guess is that there is either a motherboard failure, or an unreported hard drive failure.

In any event, this NAS only accepts backups of backups, so it's not mission critical.
Message 1 of 7
Aspirant

Re: ReadyNAS Pro Business Spontaneous Reboots (16779147)

Typical response, which I will try, along with a proof-of-purchase request.

Going over your logs I see that all 6 of your drives have ATA errors. This could be the cause of your issue; we recommend replacing the drives.

The registration of your device reflects that this ReadyNAS was purchased as a diskless chassis. In order to fix the registration and to avoid future confusion, please upload your proof of purchase to this case. Alternatively, send your proof of purchase to attachment@netgear-support.com, using your case number as the subject line. Update this webticket once the proof of purchase is sent and I will adjust the registration accordingly.
Message 2 of 7
Aspirant

Re: ReadyNAS Pro Business Spontaneous Reboots 16779147/19770

So, the problems persist with this ReadyNAS Pro Business. I know it can't be trusted, so I only use it for backups. I finally had time to work on it and also needed to expand the array, so I was able to further diagnose what's going on -- in short, it's clearly not the RAM and it's not the hard drives. I've tried more than 20 unique combinations of drives and can still reproduce the problem. The only scenario that doesn't leave me with a dead array is starting with 2 new drives and expanding the array one drive at a time.

Error Message or Problem: 1!!Fri Oct 26 05:26:15 CDT 2012!!root!!Volume scan failed to run properly.

Array Destroyed
SW version: RAIDiator 4.2.22
This system intermittently reboots within 2-7 hours while resyncing after replacing a drive. The error message after reboot is:

1!!Fri Oct 26 05:26:15 CDT 2012!!root!!Volume scan failed to run properly.

THE ARRAY AND ALL DATA IS LOST
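
For anyone who wants to line these alerts up against the UPS messages from the other units, here is a minimal Python sketch that splits a "!!"-delimited alert line like the one quoted above into fields. The field names (`code`, `when`, `tz`, `user`, `message`) are assumptions made for illustration; the thread does not document what the leading number means, and `parse_log_line` is a hypothetical helper, not part of RAIDiator.

```python
from datetime import datetime

# The alert line quoted in the case text above.
SAMPLE = "1!!Fri Oct 26 05:26:15 CDT 2012!!root!!Volume scan failed to run properly."

def parse_log_line(line):
    """Split a '!!'-delimited alert line into four fields.

    Field names are assumed for illustration; the meaning of the
    leading number is not documented in this thread.
    """
    code, stamp, user, message = line.split("!!", 3)
    parts = stamp.split()  # e.g. ['Fri', 'Oct', '26', '05:26:15', 'CDT', '2012']
    # Parse without the timezone abbreviation, since strptime's %Z
    # support for names like CDT varies by platform.
    when = datetime.strptime(" ".join(parts[:4] + parts[5:]),
                             "%a %b %d %H:%M:%S %Y")
    return {
        "code": int(code),
        "when": when,       # naive local time as printed in the log
        "tz": parts[4],     # timezone abbreviation kept as plain text
        "user": user,
        "message": message,
    }

if __name__ == "__main__":
    print(parse_log_line(SAMPLE))
```

With the timestamps parsed, reboot alerts can be sorted and compared against the times at which the other NAS units reported "UPS Communication error".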

Steps taken to diagnose:

- Ran memory diagnostics overnight with no errors
- Replaced RAM
- Replaced all hard drives with drives that have no ATA errors
- Attempted 3-drive array instead of six
- Attempted 3-drive array in slots 1-3 and also 4-6 to try to ID a backplane issue

None of these have resolved the issue.

I have also moved drives to another known good ReadyNAS Pro Business and that NAS will build the drives into an array with no problems.

The system may also occasionally reboot when it's not resyncing a new drive, but aside from a UPS offline report coming from the other NAS units, it's difficult to tell when this happens or why.
Message 3 of 7
NETGEAR Moderator

Re: ReadyNAS Pro Business Spontaneous Reboots 16779147/19770

I have looked into part of the case.
It does look like you were having some disk issues, which does not help the RAID state. I am having a support agent contact you to see if we are able to get your data volume online.
Message 4 of 7
Aspirant

Re: ReadyNAS Pro Business Spontaneous Reboots 16779147/19770

OOM-9 wrote:
I have looked into part of the case.
It does look like you were having some disk issues, which does not help the RAID state. I am having a support agent contact you to see if we are able to get your data volume online.

Thank you for looking into it. I do appreciate it.

I know you guys are skeptical about the state of the drives I'm using, but I have been through dozens of drives on this one (really -- at least two dozen). Anyway, just got off the phone and think the resolution is the correct one -- I'll report back.
Message 5 of 7
NETGEAR Moderator

Re: ReadyNAS Pro Business Spontaneous Reboots 16779147/19770

I am glad that we were able to get things moving in the correct direction.

The hard thing with the drive failures you were experiencing is that drives have a higher chance of failing than the backplane. It seemed like there were other parts of the logs that may have hinted at the chassis. I hope the new unit will treat you better. :)
Message 6 of 7
Aspirant

Re: ReadyNAS Pro Business Spontaneous Reboots 16779147/19770

So, the new box arrived -- thanks very much for that! Unfortunately, it has a different problem: it hangs (freezes) while doing the resync. I have tried this three times: once with the same array I had in the old box, once with two new Seagate 3TB drives I bought last week to help diagnose the problems with the old box, and once with two older Seagate 2TB drives that don't have ATA errors.

I guess I'll try again with a few other drives, but I think I will be creating a new case tomorrow. :(

Here are the stats for the two new drives:
***** Disk SMART log from 2012/10/31 *****


***** Disk SMART log for channel 4 [sda] *****


smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.37.6.RNx86_64.2.4] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model: ST3000DM001-9YN166
Serial Number: W1F0WE74
LU WWN Device Id: 5 000c50 052ea2cbf
Firmware Version: CC4C
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Wed Oct 31 16:27:26 2012 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 575) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 119 099 006 Pre-fail Always - 213257008
3 Spin_Up_Time 0x0003 094 092 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 46
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 069 060 030 Pre-fail Always - 10000500
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 144
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 17
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 6
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 059 057 045 Old_age Always - 41 (Min/Max 41/43)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 11
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 46
194 Temperature_Celsius 0x0022 041 043 000 Old_age Always - 41 (0 24 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 134780368715901
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 266911830671573
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 128153686178861

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 143 -
# 2 Short offline Completed without error 00% 143 -
# 3 Short offline Completed without error 00% 142 -
# 4 Short offline Completed without error 00% 141 -
# 5 Short offline Completed without error 00% 141 -
# 6 Short offline Completed without error 00% 53 -
# 7 Short offline Completed without error 00% 43 -
# 8 Short offline Completed without error 00% 40 -
# 9 Short offline Completed without error 00% 39 -
#10 Short offline Completed without error 00% 29 -
#11 Short offline Completed without error 00% 22 -
#12 Short offline Completed without error 00% 15 -
#13 Short offline Completed without error 00% 15 -
#14 Short offline Completed without error 00% 5 -
#15 Short offline Completed without error 00% 5 -
#16 Short offline Completed without error 00% 0 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.




***** Disk SMART log for channel 6 [sdb] *****


smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.37.6.RNx86_64.2.4] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model: ST3000DM001-9YN166
Serial Number: W1F0XAJQ
LU WWN Device Id: 5 000c50 052ee742c
Firmware Version: CC4C
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Wed Oct 31 16:27:31 2012 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 592) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 102 099 006 Pre-fail Always - 4841304
3 Spin_Up_Time 0x0003 094 092 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 14
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 066 060 030 Pre-fail Always - 3800173
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 53
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 14
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 1
189 High_Fly_Writes 0x003a 096 096 000 Old_age Always - 4
190 Airflow_Temperature_Cel 0x0022 061 058 045 Old_age Always - 39 (Min/Max 38/40)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 9
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 14
194 Temperature_Celsius 0x0022 039 042 000 Old_age Always - 39 (0 20 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 142730353180725
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 164790645897131
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 28888824079096

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 53 -
# 2 Short offline Completed without error 00% 53 -
# 3 Short offline Completed without error 00% 52 -
# 4 Short offline Completed without error 00% 51 -
# 5 Short offline Completed without error 00% 51 -
# 6 Short offline Completed without error 00% 0 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
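
Comparing dumps like these by eye gets tedious, so here is a minimal Python sketch that pulls the raw values out of the attribute table and lists the watched counters that are non-zero. The `WATCHED` set is an illustrative choice rather than an official list of failure indicators, and `parse_attributes` and `nonzero_watched` are hypothetical helper names. The sample rows are copied from the [sda] dump above.

```python
import re

# Counters to report when non-zero. This selection is an assumption
# made for illustration, not a vendor-defined list.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Command_Timeout",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]{4})\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)

def parse_attributes(text):
    """Return {attribute_name: raw_value} for each attribute row found."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            # Only the leading integer of RAW_VALUE is kept; trailing
            # text such as "(Min/Max 41/43)" is ignored.
            attrs[match.group(2)] = int(match.group(10))
    return attrs

def nonzero_watched(attrs):
    """Return the watched attributes whose raw value is not zero."""
    return {name: raw for name, raw in attrs.items()
            if name in WATCHED and raw != 0}

# A few rows copied from the [sda] dump above.
SAMPLE_SDA = """\
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 6
190 Airflow_Temperature_Cel 0x0022 059 057 045 Old_age Always - 41 (Min/Max 41/43)
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""

if __name__ == "__main__":
    print(nonzero_watched(parse_attributes(SAMPLE_SDA)))
```

Raw values for some attributes are vendor-specific encodings, so a non-zero number here is a prompt to look closer rather than proof of a failing drive.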
Message 7 of 7