Forum Discussion
Paulaus
Jan 12, 2021 · Aspirant
No Volume exists
Firmware 6.10.3, status healthy. It has 4 disks. I got messages that a disk was degraded, from redundant to eventually dead. I got a message that a resync was performed. Subsequently I get the message 'No vo...
rn_enthusiast
Jan 14, 2021 · Virtuoso
Hi Paulaus
Thanks for the logs.
Yea, so it is pretty much as first expected. A disk dropped out of the raid and back in. This prompted a raid sync (resilver). During that sync another drive dropped out and your raid is now broken. It concerns disks 2 and 3. On paper, the disks aren't actually looking that bad.
Disk 2: 12 Pending Sector Errors
Disk 3: 1 Pending Sector Error
However, disk 3 is spewing errors in the kernel log. It looks like the NAS is having great difficulty communicating with that disk. An example is below, and it is repeated over and over.
[Mon Jan 11 12:01:51 2021] do_marvell_9170_recover: ignoring PCI device (8086:3a22) at PCI#0
[Mon Jan 11 12:01:51 2021] ata3.00: exception Emask 0x0 SAct 0x10000 SErr 0x0 action 0x0
[Mon Jan 11 12:01:51 2021] ata3.00: irq_stat 0x40000008
[Mon Jan 11 12:01:51 2021] ata3.00: failed command: READ FPDMA QUEUED
[Mon Jan 11 12:01:51 2021] ata3.00: cmd 60/08:80:48:00:80/00:00:00:00:00/40 tag 16 ncq 4096 in
                           res 41/40:00:49:00:80/00:00:00:00:00/40 Emask 0x409 (media error) <F>
[Mon Jan 11 12:01:51 2021] ata3.00: status: { DRDY ERR }
[Mon Jan 11 12:01:51 2021] ata3.00: error: { UNC }
[Mon Jan 11 12:01:51 2021] ata3.00: configured for UDMA/133
[Mon Jan 11 12:01:51 2021] sd 2:0:0:0: [sdc] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Mon Jan 11 12:01:51 2021] sd 2:0:0:0: [sdc] tag#16 Sense Key : Medium Error [current] [descriptor]
[Mon Jan 11 12:01:51 2021] sd 2:0:0:0: [sdc] tag#16 Add. Sense: Unrecovered read error - auto reallocate failed
[Mon Jan 11 12:01:51 2021] sd 2:0:0:0: [sdc] tag#16 CDB: Read(16) 88 00 00 00 00 00 00 80 00 48 00 00 00 08 00 00
[Mon Jan 11 12:01:51 2021] blk_update_request: I/O error, dev sdc, sector 8388681
[Mon Jan 11 12:01:51 2021] Buffer I/O error on dev sdc2, logical block 1, async page read
[Mon Jan 11 12:01:51 2021] ata3: EH complete
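If you want to keep an eye on this yourself over SSH, lines like the ones above can be fished out of the kernel log with a small filter. A quick sketch in Python (assuming dmesg is readable on your box; the ata3/sdc names are just taken from your log excerpt):

# Sketch: pull the ata3 / sdc error lines (like the excerpt above)
# out of the kernel ring buffer. Assumes dmesg is available and readable.
import re
import subprocess

log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
pattern = re.compile(r"ata3\.00|dev sdc|I/O error")

for line in log.splitlines():
    if pattern.search(line):
        print(line)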
It is a case of a dual disk failure in a RAID5, which leaves the raid in a broken state. I feel that you were unlucky here, to be honest. There were no prior signs that these disks were going to cause you trouble. It all happened rather suddenly. I do feel that this situation is salvageable. RAIDs can be saved from such a scenario, but it might include cloning of disks if the current ones are too bad to work with, and it will definitely involve manual reassembly of the raid.
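Just to give you an idea of what manual reassembly means, it is along these lines. This is a sketch only, not a recipe: the md device name and member partitions below are assumptions, the real ones have to be confirmed with mdadm --examine first, and it should only ever be run against clones if the disks are in bad shape:

# Hedged sketch of a forced md RAID5 reassembly. /dev/md127 and the
# member partitions are assumed names; confirm them from the disks first.
import subprocess

# Assumed member partitions; verify each with `mdadm --examine`.
members = ["/dev/sda3", "/dev/sdb3", "/dev/sdc3", "/dev/sdd3"]

# Inspect every candidate member's RAID superblock before touching anything.
for dev in members:
    subprocess.run(["mdadm", "--examine", dev], check=False)

# Force-assemble from the members; --force is needed when event counts
# disagree after a failed resync, which is exactly this situation.
subprocess.run(["mdadm", "--assemble", "--force", "/dev/md127", *members],
               check=True)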
You should contact Netgear support and discuss a data recovery contract. Have their Level 3 team assess the situation and then take it from there. It won't be a free service, I am sure, but if the data matters and you have no backups, it is the best (and likely cheapest) way to go.
Cheers
StephenB
Jan 14, 2021 · Guru - Experienced User
rn_enthusiast wrote:
There were no prior signs that these disks were going to cause you trouble. It all happened rather suddenly.
One factor here is that the system won't detect a bad sector until it tries to read or write it - so a problem can lurk undetected for a long time. That's why I schedule disk tests (and RAID scrubs) in the maintenance schedule.
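On a plain Linux md setup, a scheduled scrub boils down to asking the array to read-check every stripe; the ReadyNAS maintenance schedule does the equivalent for you. A minimal sketch, assuming an array named md0 and root privileges:

# Minimal sketch of what a RAID scrub does on Linux md, assuming an
# array named md0. Requires root; progress shows up in /proc/mdstat.
from pathlib import Path

md = Path("/sys/block/md0/md")

# Kick off a background read-check of every stripe.
(md / "sync_action").write_text("check\n")

# Once the check finishes, mismatch_cnt reports inconsistent sectors.
print("mismatches:", (md / "mismatch_cnt").read_text().strip())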
FWIW, Backblaze reported some time ago (2016) that about 25% of their disk failures occur with no warning (and good SMART stats reported before the failure).
I've seen this myself - most recently last week. That particular disk had no apparent issues reading or writing to the volume, and had good SMART stats. But it repeatedly failed the disk test in the NAS, and it also failed the vendor diag in a PC. Even after the failed tests, the SMART stats still looked good - no idea why, since the vendor diag reported "too many bad sectors".
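If anyone wants to spot-check their own disks, the pending-sector counts discussed in this thread come from SMART attribute 197 (Current_Pending_Sector). A rough sketch using smartmontools (the device name /dev/sdc is just an assumption for illustration; needs root):

# Rough sketch: read the Current_Pending_Sector (attribute 197) raw
# count via smartctl. /dev/sdc is an assumed device name; needs root
# and the smartmontools package.
import subprocess

out = subprocess.run(["smartctl", "-A", "/dev/sdc"],
                     capture_output=True, text=True).stdout

for line in out.splitlines():
    if "Current_Pending_Sector" in line:
        # RAW_VALUE is the last column of the attribute table.
        print("pending sectors:", line.split()[-1])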
You really do need a backup on another device - RAID simply isn't enough.
rn_enthusiast wrote:
I do feel that this situation is salvageable.
I hope that is the case.
rn_enthusiast wrote:
You should contact Netgear support and discuss a data recovery contract. Have their Level 3 team assess the situation and then take it from there. It won't be a free service, I am sure, but if the data matters and you have no backups, it is the best (and likely cheapest) way to go.
If you can connect the disks to a Windows PC, you could also try using RAID recovery software. ReclaiMe is one option that folks here have used with some success. It is expensive (but should be cheaper than a data recovery service).
- rn_enthusiast · Jan 14, 2021 · Virtuoso
StephenB wrote:
rn_enthusiast wrote:
There were no prior signs that these disks were going to cause you trouble. It all happened rather suddenly.
One factor here is that the system won't detect a bad sector until it tries to read or write it - so a problem can lurk undetected for a long time. That's why I schedule disk tests (and RAID scrubs) in the maintenance schedule.
Good point. Definitely something OP should consider doing in the future. I do a disk test task every 3 months, myself.
- Sandshark · Jan 14, 2021 · Sensei - Experienced User
While the problem would likely have still happened if the first drive had been replaced (since it would still need to do a sync that included the other), what I don't understand is why the ReadyNAS suddenly decides on its own that a drive that was previously dead should be re-introduced to the RAID and a re-sync started. It should take a conscious action by the admin to do that, so they can ensure the backup is up to date before trying it, or choose not to and go straight for a new drive.