
Re: NAS not accessible - possible disk failure during resync

berillio
Aspirant

NAS not accessible - possible disk failure #24263236

Hi there, I need help again.
ReadyNAS NV+ v2, 4x 3TB disks, 4TB free space
I was away for a few months and had issues accessing my NAS from abroad (I tried to deal with this in a different post,
viewtopic.php?f=21&t=77992), but that is not related to this issue.

When I returned I checked the NAS and everything seemed fine; moreover, as 5.3.11 had just been released, I downloaded and installed it.
"Your ReadyNAS device has been updated with a new firmware image. (5.3.11)" (25/10/2014)

Then I started getting errors. These are the email alerts (timestamps taken from the headers):

Date: Sat, 1 Nov 2014 08:09:05 +0000 (UTC)
"Detected increasing uncorrectable errors[80] on disk 3 [ST3000DM001-1CH166, Z1F337XB]. This often indicates an impending failure. Please be prepared to replace this disk to maintain data redundancy."

Date: Sat, 1 Nov 2014 08:13:07 +0000 (UTC)
"Detected increasing uncorrectable errors[112] on disk 3 [ST3000DM001-1CH166, Z1F337XB]. This often indicates an impending failure. Please be prepared to replace this disk to maintain data redundancy."

Date: Sat, 1 Nov 2014 10:25:43 +0000 (UTC)
"Detected increasing uncorrectable errors[96] on disk 3 [ST3000DM001-1CH166, Z1F337XB]. This often indicates an impending failure. Please be prepared to replace this disk to maintain data redundancy."

Date: Sat, 1 Nov 2014 10:29:48 +0000 (UTC)
"Detected increasing uncorrectable errors[80] on disk 3 [ST3000DM001-1CH166, Z1F337XB]. This often indicates an impending failure. Please be prepared to replace this disk to maintain data redundancy."

A similar story the next day:

Date: Sun, 2 Nov 2014 06:09:15 +0000 (UTC)
"Detected increasing uncorrectable errors[72] on disk 3 [ST3000DM001-1CH166, Z1F337XB]. This often indicates an impending failure. Please be prepared to replace this disk to maintain data redundancy."

Date: Sun, 2 Nov 2014 06:13:20 +0000 (UTC)
"Detected increasing uncorrectable errors[56] on disk 3 [ST3000DM001-1CH166, Z1F337XB]. This often indicates an impending failure. Please be prepared to replace this disk to maintain data redundancy."

Date: Sun, 2 Nov 2014 07:06:25 +0000 (UTC)
"Detected increasing uncorrectable errors[48] on disk 3 [ST3000DM001-1CH166, Z1F337XB]. This often indicates an impending failure. Please be prepared to replace this disk to maintain data redundancy."
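Read in order, the bracketed counts in these alerts do not keep rising: they jump from 80 to 112 and then fall steadily to 48, even though every message says "increasing". The short Python sketch below is only an illustration, assuming the alert texts are copied in exactly as quoted above (trimmed to the part that holds the count); `ALERTS` and `error_counts` are hypothetical names, not anything the ReadyNAS provides. It extracts each count and prints the change from the previous alert:

```python
import re

# Alerts as quoted in the post (timestamp, start of message).
# Only the part holding the bracketed count is needed.
ALERTS = [
    ("Sat, 1 Nov 2014 08:09:05", "Detected increasing uncorrectable errors[80] on disk 3"),
    ("Sat, 1 Nov 2014 08:13:07", "Detected increasing uncorrectable errors[112] on disk 3"),
    ("Sat, 1 Nov 2014 10:25:43", "Detected increasing uncorrectable errors[96] on disk 3"),
    ("Sat, 1 Nov 2014 10:29:48", "Detected increasing uncorrectable errors[80] on disk 3"),
    ("Sun, 2 Nov 2014 06:09:15", "Detected increasing uncorrectable errors[72] on disk 3"),
    ("Sun, 2 Nov 2014 06:13:20", "Detected increasing uncorrectable errors[56] on disk 3"),
    ("Sun, 2 Nov 2014 07:06:25", "Detected increasing uncorrectable errors[48] on disk 3"),
]

# Matches the number inside "uncorrectable errors[...]".
COUNT_RE = re.compile(r"uncorrectable errors\[(\d+)\]")


def error_counts(alerts):
    """Return (timestamp, count) pairs for every alert that contains a count."""
    result = []
    for stamp, text in alerts:
        match = COUNT_RE.search(text)
        if match:
            result.append((stamp, int(match.group(1))))
    return result


if __name__ == "__main__":
    previous = None
    for stamp, count in error_counts(ALERTS):
        # Show the difference from the previous alert, if there is one.
        change = "" if previous is None else f" ({count - previous:+d})"
        print(f"{stamp}  {count}{change}")
        previous = count
```

Run as-is, it lists each count with its change (+32, then -16, -16, -8, -16, -8), which makes the pattern easy to see at a glance.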

Then a few days without issues, then:

Date: Sun, 9 Nov 2014 06:06:44 +0000 (UTC)
"ATA error count has increased in the last day.
Disk 3:
Previous count: 0
Current count: 12
Growing SMART errors indicate a disk that may fail soon. If the errors continue to increase, you should be prepared to replace the disk."


Note: my NAS is rebooted weekly on Sunday morning (off at 5:00, on at 6:00); otherwise it is on 24/7.

There were no further alert messages from the NAS, which has been used regularly to watch recorded movies and back up recorded data. Possibly the last data copied to the NAS was the Formula 1 GP last Sunday, 23/11/2014 (or in the early hours of Monday).
Yesterday (24/11), while sorting out some modem and network issues, I tried to access the NAS from another PC without success. I checked the NAS (4 blue LEDs on, steady), but when I pressed the ON button once, the display window did NOT come up. I did not investigate further, as I was sorting out other issues.
This afternoon I had time to look into this. The NAS does NOT respond to the ON/OFF button (neither a single press nor a double press to reboot).

I opened RAIDar and searched for the NAS, which was NOT found. On the other hand, in the RAIDar LED legend, the last three LEDs are blinking:

- Awaiting resync (blinks if resyncing)
This disk is waiting to sync to the RAID volume.  If the LED is blinking, this disk is currently syncing.  During sync process, volume will be in degraded mode and the ReadyNAS performance will be affected by the sync process. Another disk failure in the volume will render it dead unless the volume has dual redundancy (X-RAID2 with dual redundancy setting or RAID 6).

- Life support mode
The volume has encountered multiple disk failures and is in the state of being marked dead.  However, the ReadyNAS has blocked it from being marked dead in the event that someone may have accidentally pulled out the wrong disk during runtime.  If the wrong disk was pulled out, shutdown the ReadyNAS immediately, reconnect the disk, and power-on the ReadyNAS.  If you reconnect the disk during runtime, the ReadyNAS will mark it as a newly added disk and you will no longer be able to access the data on it.

- Background task active
A lengthy background task such as a system update is in progress.

RAIDar seems to refresh every 15 seconds or so.
I do not know if the blinking LEDs in the legend are actually significant: after all, a legend should explain some symbols, not be a diagnostic tool.

Note that on the NAS itself all 4 LEDs are lit and steady, none blinking.
According to http://www.readynas.com/download/docume ... AS-LED.pdf, this should mean:
"Boot finishes, normal operating mode. ACT indicates disk access. Disk LED indicates corresponding disk is healthy. Example shows 4 healthy disks."

But in fact the NAS is not accessible on the network ("Network path cannot be found"), the Dashboard does not pass authentication, RAIDar does not show the NAS, and the NAS itself appears to be almost asleep apart from the activity light: normally the fan ramps up every so often, but this hasn't happened for almost the whole day.

Note also that the NAS has had two more reboots (16/11 and 23/11) since the last SMART errors were registered, without any further warnings.

Further info:

HD in channel 1 Z1F3362S Power_On_Hours 3377
HD in channel 2 W1F0TZER Power_On_Hours 16532
HD in channel 3 Z1F337XB Power_On_Hours 8170
HD in channel 4 W1F0WXJW Power_On_Hours 16399

The NAS was originally fitted with two HDs (now in Ch.2 and Ch.4); then two more disks were purchased, but only one was fitted, in Ch.3. Later on, a few days after I came back from abroad, the disk in Ch.1 apparently failed, so it was removed for testing. Checking the logs, I discovered that while I was away both of the other disks had already failed individually, but they both came back after the weekly reboot and resynced fine: this suggested that the failures were a sort of "red herring". This was corroborated by the fact that the HD I removed had passed all diagnostic tests.
Nevertheless, the unused disk was fitted in Ch.1 (hence it shows the lowest Power_On_Hours reading), while the "failed" disk was reformatted and fitted in Ch.4. The issue with the reported failures was solved by following the advice from Support (Case #23396545) to disable disk spin-down. All this was described in my post viewtopic.php?f=24&t=76431 (thank you again, StephenB).

Incidentally, the disk in Ch.3 was also tested at that time and passed the SeaTools for DOS tests.

--------------- SeaTools for DOS v2.23 ---------------
Device 1 is Seagate Device ST3000DM001-1CH166 Z1F337XB On Generic PCI ATA
Max Native Address 5860533167
Device is 48 Bit Addressed - Number of LBAs 5860533167 ( 3000.593 GB )
This drive supports Security Features
SMART Is Supported And ENABLED
SMART Has NOT Been Tripped
DST Is Supported
Logging Feature Set Is Supported
POH 5202 Current Temp 25

Started Short DST 6/21/2014 @ 13:35.12
DST Completed Without Error
Short DST PASSED 6/21/2014 @ 13:36.12


--------------- SeaTools for DOS v2.23 ---------------
Device 1 is Seagate Device ST3000DM001-1CH166 Z1F337XB On Generic PCI ATA
.....(as above)
Logging Feature Set Is Supported
POH 5202 Current Temp 31

Started Long Test 6/21/2014 @ 13:48.17
DST Completed Without Error
Short DST PASSED 6/21/2014 @ 18:09.02
Long Test PASSED 6/21/2014 @ 18:09.03
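The capacity line in the SeaTools output can be checked with simple arithmetic. This is a sketch under two assumptions that the output itself does not state: each LBA addresses a 512-byte logical sector, and SeaTools quotes decimal gigabytes (10^9 bytes). `capacity_gb` is a hypothetical helper name:

```python
# Sketch only: assumes 512-byte logical sectors and decimal gigabytes.
SECTOR_BYTES = 512          # assumption: logical sector size of the drive
LBA_COUNT = 5860533167      # "Number of LBAs" as printed by SeaTools


def capacity_gb(lba_count: int, sector_bytes: int = SECTOR_BYTES) -> float:
    """Capacity in decimal gigabytes (1 GB = 10**9 bytes)."""
    return lba_count * sector_bytes / 10**9


if __name__ == "__main__":
    print(f"{capacity_gb(LBA_COUNT):.3f} GB")  # prints 3000.593 GB
```

With those assumptions the LBA count works out to the 3000.593 GB that SeaTools reports, i.e. the full nominal 3 TB of the ST3000DM001.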



I am not sure what I should do.
I could unplug the NAS, reboot, and see if it starts resyncing or simply starts up OK.
On the other hand, I could unplug the NAS, remove the HD in Ch.3, and then reboot.
I presume that this would force a resync (since one disk is missing), which would take almost two days; meanwhile I could test this disk again. But, as it happens, in the past all testing showed the disks as perfectly healthy.
Or I could unplug the NAS, remove the disk in Ch.3, run the SHORT test, and see how it fares before continuing.

Any suggestions?
Many thanks in advance, berillio

p.s. Unfortunately, although I downloaded the logs, that was done on 30/10/2014, 2 days before the fault showed up. For some reason I was convinced that the logs had been downloaded after the fault appeared, but I was mistaken.
mdgm-ntgr
NETGEAR Employee Retired

Re: NAS not accessible - possible disk failure during resync

Do you have a backup?

When a NAS enters life support mode you should contact support for paid assistance.
berillio
Aspirant

Re: NAS not accessible - possible disk failure during resync

a) No, I do not have a backup. I never had the time to get it organised.
b) do you mean that the "Legend" in RAIDar is actually a diagnostic tool?

you don't sound too hopeful... 😞
mdgm-ntgr
NETGEAR Employee Retired

Re: NAS not accessible - possible disk failure during resync

a) Well, some would say you should do that from the start.
b) Yes

I do know that if you try to fix it yourself you will probably only succeed in making it worse. How you proceed depends on whether you value your data.

Data recovery may or may not work.
berillio
Aspirant

Re: NAS not accessible - possible disk failure during resync

well... Thanks 😞
I'll contact support ...
berillio
Aspirant

Re: NAS not accessible - possible disk failure during resync

Later on Tuesday 25.11 (actually it was the early hours of Wednesday 26.11), I contacted Support and had a live chat with a nice and helpful guy. I pasted into the live chat window the same post I had published on the forum earlier, and while he was reading it, I forwarded him the logs taken two days before the fault first manifested itself (logs: 30.10.2014, fault: 1.11.2014). He said that the disks seemed OK. Then, under his instruction, I did the following:
a) unplugged the unresponsive NAS (I hate doing that, but there was no other way), removed the Ch.3 disk while the NAS was off, then rebooted.
b) the NAS came up online fine and I seemed to have access to all the files. I checked a couple of movies, which did not seem to open properly on this PC (I think there was an issue with VLC; I had no video), but played fine on the netbook in the other room.
c) unfortunately, a couple of minutes later, while I was trying to download the logs, the NAS simply disappeared from the network.
d) the NAS appeared to be frozen again: normally the display alternates between two messages; it had shown "Disk 3 failed" earlier on, but now it was stuck on "C: unprotected".
e) I unplugged it again and went to the other room to load Safari on the netbook (the last time I downloaded the logs I was on the netbook, and I think I MIGHT have used Safari to download them), then came back to this room and rebooted the NAS.
f) the NAS took much longer than before to boot up (or at least it seemed that way). When I saw RAIDar showing the NAS, I went to the other room to download the logs, but the Dashboard did not seem to come up, and then the "network not found" window popped up again.
g) according to RAIDar, the NAS was back in life support mode, so we switched it off. This time the NAS could be switched off using the button.
h) the Support person decided to raise the fault level to 3 (?? I think; I lost the extract of the live chat) and then we closed the live chat session.

Later on (Wednesday early afternoon) I started testing Disk #3 (using a StarTech USB adapter) with a variety of testing software. All of it reported (in different ways) a number of SMART warnings; the disk passed some short tests (including the Western Digital DLDIAG), but it failed the WD DLDIAG long test BADLY: the test was interrupted with "too many bad sectors" after ~30 mins (after testing ~5% of the disk). I haven't run SeaTools yet, but that may be immaterial after seeing the WD results.

Additional notes:
when the live chat session ended it was 6:00 am, after being up the entire night, so I copied the live chat into a WordPad document (where I draft my posts), wrote a quick update for the forum, and tried to get some sleep.
Later in the day, while I was plugging in the StarTech USB adapter, another cable flipped the (very sensitive) switch on the power extension strip, switching off the PC. After rebooting I could no longer find the live chat notes, so obviously I had copied them into the document but NOT saved it afterwards.
Similarly, I could not find the quick update about the live chat among my forum postings either, so I guess that I drafted and previewed it but did not submit it.

Further comments
I was expecting to get an email from NETGEAR with the live chat transcript when the chat concluded, but I haven't received anything. Therefore I posted a request for it on the ticket page, but I have had no reply yet. I remember SOME of the advice and the options open to me, but I can't be sure I remember it all exactly.
berillio
Aspirant

Re: NAS not accessible - possible disk failure #24263236

Note: this thread continued briefly over here,
viewtopic.php?f=65&t=76863&hilit=berillio#p442158
before returning to this original place.
berillio
Aspirant

Re: NAS not accessible - possible disk failure #24263236

I ran 3 tests, which were successful and lasted EXACTLY 4 hours (the clock goes 3:59:59 >>> SUCCESS). I also ran a 4th test, which was successful, but as I did NOT press the "record" button on the camcorder, I do not know how long it took. Then I ran a 5th test, which also took exactly 4 hours like the others. (I checked whether the test clock itself might be faulty by measuring the actual length of the recording with the timestamps of the eight ~30-minute videos, and it was correct: 4h:00:03.)

Previously I had difficulties downloading the logs with my installed version of the Avant browser, so I installed the Safari for Windows browser (which seems to be able to download logs) on the Athlon64 PC (same room as the NAS), and I tested it by successfully downloading the logs from another NAS (RN104).
As the ReadyNAS NV+ v2 seemed to pass the memory tests, I refitted the three good disks in the original order and powered it up. The NAS booted up fine, but on three different attempts to download the logs it failed, going "offline" during the attempts and disappearing from RAIDar. Unlike other times, though, the NAS did not hang, and I could reboot it normally by double-pressing the on/off button. Twice, on reboot, the NAS's display briefly showed the "Checking Root FS" message.
As the NAS had been working more or less fine (bar accessibility issues from other locations) for a few months with FW 5.3.10 (installed 24 June 2014) until the upgrade to FW 5.3.11 (installed 25 October 2014, a week before HD#3 started showing uncorrectable errors), I tried to reinstall FW 5.3.10.
After the firmware downgrade, the NAS's display showed the message "Checking C: 0%....2%...etc". The volume check lasted ~90 minutes before reaching 100%; then the message "Checking Quotas" was displayed for ~64 minutes, then "Installing Addons" (~20 s) and "Booting" (~30 s), before the normal ReadyNAS messages (IP on top and an alternating "Disk 3 failed" / "C: unprotected") appeared on the display.
I then tried to download the logs, which was successful, and switched the NAS off, as it is in an "unprotected" state (not for long, as the failed disk, under warranty, has been sent back to Seagate). But the next day I found the NAS on again (? by a network call? I don't know), so I switched it off and UNPLUGGED IT (otherwise it would have booted up on Sunday morning at 6:00).