
Re: 6.2.4 high cpu => lockup => reboot => boot failed

kheno
Aspirant

6.2.4 high cpu => lockup => reboot => boot failed

Hi,

I've had the same problem as mentioned here:

http://www.readynas.com/forum/viewtopic.php?f=160&t=81294

readynas ultra 4
os 6.2.4 (as I recall)

After a while ssh was not accessible.
My only option, too, was an unsafe power-off of the device.

Except that now it keeps saying "booting", followed by "boot failed, retry boot".

I've already tried the boot mem check which resulted in no errors.

Raidar only shows "system starting up..."
ssh, web, ... are not accessible.

I'm a little cautious about what to do next.

Should I try the boot os install?
Message 1 of 15
kheno
Aspirant

Re: 6.2.4 high cpu => lockup => reboot => boot failed

update:

After the mem check, I rebooted again and waited.

Raidar reports: management service is offline.

So it boots again, but the CPU load is probably so high again that admin and ssh are not accessible (or were only for a short time).
Downloading logs, etc. failed.
Message 2 of 15
kheno
Aspirant

Re: 6.2.4 high cpu => lockup => reboot => boot failed

I could download the logs. I can PM them on request.

First look:
systemd-journal.log

Jun 19 01:02:16 bigdisk kernel: btrfs: corrupt leaf, slot offset bad: block=2306986344448,root=1, slot=238
(the same line repeats 16 more times with the same timestamp)
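To see how many distinct blocks are affected without scrolling through repeats, something like the following might help. It is only a minimal sketch: the function name `summarize_corrupt_leaves` is made up for illustration, and it assumes the messages look exactly like the kernel line quoted above.

```shell
# Hypothetical helper (not part of ReadyNAS OS): collapse repeated
# btrfs "corrupt leaf" kernel messages into one line per distinct
# block/root/slot, prefixed with how often each one occurred.
# Reads journal text on stdin.
summarize_corrupt_leaves() {
    grep 'btrfs: corrupt leaf' |
        sed 's/.*corrupt leaf, //' |   # drop timestamp, host and prefix
        sort | uniq -c | sort -rn      # count identical messages
}

# Example use on the log file downloaded from the NAS:
#   summarize_corrupt_leaves < systemd-journal.log
```

A single output line with a large count, as here, suggests one damaged metadata block being hit over and over rather than widespread damage, though that is an interpretation and not something the log states.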
Message 3 of 15
mdgm-ntgr
NETGEAR Employee Retired

Re: 6.2.4 high cpu => lockup => reboot => boot failed

Can you run the memory test boot menu option?: http://kb.netgear.com/app/answers/detail/a_id/21104

Run at least a few passes of this test.

If the memory passes then run the disk test boot menu option

Please send me the logs you have downloaded (there is an email address mentioned on the page linked to by the Sending Logs link in my sig)
Message 4 of 15
kheno
Aspirant

Re: 6.2.4 high cpu => lockup => reboot => boot failed

Hi!

logs sent.

Memory test was run, but running again now.
Will run the disk test after that.

And keep you posted.

Thanks!
Message 5 of 15
mdgm-ntgr
NETGEAR Employee Retired

Re: 6.2.4 high cpu => lockup => reboot => boot failed

Thanks for the update.

Looking at your logs, one of your disks has 6 ATA errors, but those errors were back in April. So I don't think that is the problem here, though that disk might be failing.

Will be interested to hear the result of the memory test.

Do you have a backup?
Message 6 of 15
kheno
Aspirant

Re: 6.2.4 high cpu => lockup => reboot => boot failed

memtest came clean. no errors.

disk test went from 0% to 100%

But now the LCD indicates "Testing Disks", and the power button and hard-disk numbers have been blinking for more than an hour already.
I'm not 100% sure, but I think only hard-disk number 3 is blinking.

Does it take that long? I've read somewhere it should not take more than 10 minutes.

Oh, I have a partial offline backup; all important folders and files are backed up every night to an offline site.
So, I prefer getting my data back over restoring 🙂 but it's not the end of the world when something goes wrong.

Thanks
Message 7 of 15
kheno
Aspirant

Re: 6.2.4 high cpu => lockup => reboot => boot failed

Hi,

It rebooted after a long time and landed on the "failed to boot" status.

I shut down the NAS and figured it would also be good to take out the disks, carefully clean the contacts, and put them back.

Just in case.

Booted again.

And downloaded the logs.

Where would I look for the result of the disk scan?

Accessing it through the SMB shares results in shares that can't be opened or that show incomplete contents...
Message 8 of 15
mdgm-ntgr
NETGEAR Employee Retired

Re: 6.2.4 high cpu => lockup => reboot => boot failed

If you SSH in and look at the smartctl output for the disks you will see the result of the extended tests.
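A small sketch of how that could look from the shell. `smartctl -l selftest` prints only the self-test log; the helper name `selftest_rows` and the `/dev/sd[a-d]` device range are assumptions for a 4-bay unit, not something confirmed for this NAS.

```shell
# Hypothetical helper: keep only the result rows ("# 1 Extended offline ...")
# from smartctl self-test log output supplied on stdin.
selftest_rows() {
    grep -E '^# *[0-9]+ '
}

# Example use on the NAS (device names are an assumption):
#   for d in /dev/sd[a-d]; do
#       echo "== $d"
#       smartctl -l selftest "$d" | selftest_rows
#   done
```

The most recent test is row `# 1`; a status other than "Completed without error" there would be the thing to look for.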
Message 9 of 15
kheno
Aspirant

Re: 6.2.4 high cpu => lockup => reboot => boot failed

The thing is that after the testing it rebooted and is now in "boot failed" mode, where admin and ssh are not available...

I was looking into buying a new RN314 by the end of the summer.

I don't think that buying one now and swapping the drives will make my data accessible?

Is the data recoverable? Should I look for an expert and pay him to have a look?
If there is a way to just get data recovered and copy it to external disks ...

I've used the Linux command line for a while, but I'm not experienced enough to play with the btrfs tools and the ReadyNAS setup, ...

As far as I can see in the logs downloaded through RAIDar, there are no real disk errors to be found except those from April.
btrfs mentions problems. I presume I would need to be able to log in via ssh and probably fix /c with the btrfs tools (restore to another external drive, or mount -o recovery,ro).

I could log in via tech support mode, but I don't know where to start from there.
Message 10 of 15
kheno
Aspirant

Re: 6.2.4 high cpu => lockup => reboot => boot failed

ok,

so I figured out the tech support mode, how to mount, and ...

So I am now able to ssh in after a normal boot.

smartctl output:


smartctl -a /dev/sda
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.0.101.RNx86_64.3] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model: WDC WD30EZRX-00MMMB0
Serial Number: WD-WCAWZ1418318
LU WWN Device Id: 5 0014ee 25b9acf8e
Firmware Version: 80.00A80
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Sat Jun 20 20:53:02 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (50760) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 488) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always   -           0
  3 Spin_Up_Time            0x0027   158   148   021    Pre-fail  Always   -           9091
  4 Start_Stop_Count        0x0032   091   091   000    Old_age   Always   -           9149
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always   -           0
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always   -           22174
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always   -           0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always   -           0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always   -           1281
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always   -           30
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always   -           784723
194 Temperature_Celsius     0x0022   123   111   000    Old_age   Always   -           29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always   -           0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always   -           0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline  -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 22161 -
# 2 Short offline Completed without error 00% 1 -
# 3 Short offline Completed without error 00% 0 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


----------------------------


smartctl -a /dev/sdb
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.0.101.RNx86_64.3] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model: WDC WD30EZRX-00MMMB0
Serial Number: WD-WCAWZ1375860
LU WWN Device Id: 5 0014ee 20645a879
Firmware Version: 80.00A80
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Sat Jun 20 20:53:35 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (51660) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 496) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always   -           0
  3 Spin_Up_Time            0x0027   161   146   021    Pre-fail  Always   -           8933
  4 Start_Stop_Count        0x0032   091   091   000    Old_age   Always   -           9086
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always   -           0
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always   -           22174
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always   -           0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always   -           0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always   -           1281
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always   -           31
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always   -           787547
194 Temperature_Celsius     0x0022   122   109   000    Old_age   Always   -           30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always   -           0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always   -           0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline  -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 22161 -
# 2 Short offline Completed without error 00% 1 -
# 3 Short offline Completed without error 00% 0 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



---------------------------------------------------------------------------------------




smartctl -a /dev/sdc
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.0.101.RNx86_64.3] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model: WDC WD30EZRX-00DC0B0
Serial Number: WD-WCC1T1741134
LU WWN Device Id: 5 0014ee 2b3f2ee8f
Firmware Version: 80.00A80
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Sat Jun 20 20:54:16 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (38400) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 385) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x70b5) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always   -           0
  3 Spin_Up_Time            0x0027   180   173   021    Pre-fail  Always   -           5983
  4 Start_Stop_Count        0x0032   097   097   000    Old_age   Always   -           3444
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always   -           0
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always   -           8736
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always   -           0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always   -           0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           459
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always   -           18
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always   -           801457
194 Temperature_Celsius     0x0022   122   109   000    Old_age   Always   -           28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always   -           0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always   -           0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline  -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 8734 -
# 2 Short offline Completed without error 00% 0 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
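For comparing several disks at a glance, the handful of attributes usually watched for media or cabling trouble can be pulled out of the full output. A minimal sketch, assuming the ten-column attribute layout shown above; the helper name `smart_key_attrs` and the choice of attributes (IDs 5, 197, 198, 199) are illustrative, not an official health check.

```shell
# Hypothetical helper: from `smartctl -A` (or `-a`) output on stdin, print
# NAME=RAW_VALUE for reallocated, pending, uncorrectable and CRC error counts.
# The FLAG check ($3 starts with 0x) skips other rows that merely start
# with the same number, such as the selective self-test spans.
smart_key_attrs() {
    awk 'NF >= 10 && $3 ~ /^0x/ &&
         ($1 == 5 || $1 == 197 || $1 == 198 || $1 == 199) {
             print $2 "=" $10
         }'
}

# Example use (device name is an assumption):
#   smartctl -A /dev/sda | smart_key_attrs
```

For the three disks listed here all four raw values are 0, which matches the "No Errors Logged" lines.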



so I went on searching:
btrfs scrub start /dev/md0
btrfs scrub status /dev/md0
dmesg

Nothing remarkable, except that it only scrubbed 1.45GB?!
And the btrfs errors reported previously.

then checked raid setup:


mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Fri May 23 00:24:48 2014
Raid Level : raid1
Array Size : 4190208 (4.00 GiB 4.29 GB)
Used Dev Size : 4190208 (4.00 GiB 4.29 GB)
Raid Devices : 4
Total Devices : 3
Persistence : Superblock is persistent

Update Time : Sat Jun 20 21:24:08 2015
State : clean, degraded
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0

Name : 37c0a64a:0 (local to host 37c0a64a)
UUID : 2b80a109:d01d1b7e:ed3cb495:04521fe4
Events : 562

Number   Major   Minor   RaidDevice State
   0       8        1        0      active sync   /dev/sda1
   1       8       17        1      active sync   /dev/sdb1
   4       0        0        4      removed
   3       8       49        3      active sync   /dev/sdd1


It seems degraded?
Even when degraded, it should still hold all the data, right?

I'm not an expert, but the strange thing is that the disk seems connected and fine while the RAID setup sees it as removed.

Oh, and even with a failing disk, the data should not be corrupted.
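The degraded state can be read off mechanically by comparing the "Raid Devices" and "Active Devices" counts in the `mdadm --detail` output. A minimal sketch; the helper name `md_health` is made up, and it only looks at those two fields.

```shell
# Hypothetical helper: read `mdadm --detail` output on stdin and report
# whether fewer members are active than the array was created with.
md_health() {
    awk -F' *: *' '
        /Raid Devices/   { raid = $2 + 0;   have_raid = 1 }
        /Active Devices/ { active = $2 + 0; have_active = 1 }
        END {
            if (!have_raid || !have_active) { print "unknown"; exit }
            state = (active < raid) ? "degraded" : "ok"
            print state ": " active " of " raid " members active"
        }'
}

# Example use; `cat /proc/mdstat` lists which md arrays exist at all:
#   mdadm --detail /dev/md0 | md_health
```

Fed the output above, this reports 3 of 4 members active, which is the same information as the "clean, degraded" state line.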

So I was looking into how to check with btrfs. I think I need to do something like:

umount /dev/md0

btrfs check --repair /dev/sda (destructive)
btrfs restore /dev/sda /mnt/restore (nondestructive restore to external disk)

Only I can't unmount: the device is busy (used by a lot of processes).

Since I'm not really used to doing these things I would hope you could give me some advice/directions.

Thanks
Message 11 of 15
StephenB
Guru

Re: 6.2.4 high cpu => lockup => reboot => boot failed

/dev/md0 is the OS partition, not the data partition.

On my RN102 (which is jbod) /dev/md126 and /dev/md127 are the two data partitions.
Message 12 of 15
mdgm-ntgr
NETGEAR Employee Retired

Re: 6.2.4 high cpu => lockup => reboot => boot failed

I would not suggest running commands at random when you are not familiar with them, especially destructive ones.

Cloning your disks before doing something destructive would be recommended.

If you are going to use btrfs restore you would do e.g.

# btrfs restore /dev/md127 /mnt/restore

If a USB disk is mounted at /mnt

You can if you like run it verbosely and send the data nowhere just to test if it can find anything

btrfs restore -v /dev/md127 /dev/null
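Put together, the sequence might be reviewed on paper before anything is run. The sketch below only prints the commands. The helper name `restore_plan` and the USB partition name are placeholders, and it assumes the installed btrfs-progs supports the `-D` (dry run, list only) and `-i` (ignore errors) options of `btrfs restore`, which is worth checking with `btrfs restore --help` first.

```shell
# Hypothetical helper: print (not execute) a cautious restore sequence.
#   $1 = source array (e.g. /dev/md127, the data array mentioned above)
#   $2 = partition on the USB disk (placeholder, depends on the system)
#   $3 = destination directory on the USB disk
restore_plan() {
    src=$1; usb_part=$2; dest=$3
    printf 'mount %s /mnt\n' "$usb_part"                 # attach the USB disk
    printf 'mkdir -p %s\n' "$dest"                       # create the target dir
    printf 'btrfs restore -D -v %s %s\n' "$src" "$dest"  # dry run: list files only
    printf 'btrfs restore -v -i %s %s\n' "$src" "$dest"  # real copy, skip errors
}

# Example:
#   restore_plan /dev/md127 /dev/sdX1 /mnt/restore
```

`btrfs restore` only reads from the source device, so this path leaves the array untouched, unlike `btrfs check --repair`.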
Message 13 of 15
kheno
Aspirant

Re: 6.2.4 high cpu => lockup => reboot => boot failed

recovering! 😄

Had some errors while writing files.
I thought it might be an incompatibility with the NTFS file system, so I formatted the external disk as ext4.
Problems solved.

The only question I have, and for which I couldn't find a hint anywhere:

My destination disks are too small to fit 5TB. Can I split the recovery over multiple destination drives?
When a drive is full, I suspect the recovery will end with an error and just stop.

Any idea?

Btw, thanks for your help!
Message 14 of 15
mdgm-ntgr
NETGEAR Employee Retired

Re: 6.2.4 high cpu => lockup => reboot => boot failed

Sent you a PM. If you recall your directory structure there should be a way to do this.
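The content of the PM is not in this thread, so the following is only one possible way to do it: `btrfs restore` has a `--path-regex` option that limits the restore to matching paths, which allows one top-level folder per destination drive. The regex has to match every parent directory as well, hence the nested form. The helper name `path_regex` and the folder names in the example are made up for illustration.

```shell
# Hypothetical helper: build the nested regex that `btrfs restore
# --path-regex` needs for a given directory, e.g.
#   /home/username/Desktop -> ^/(|home(|/username(|/Desktop(|/.*))))$
# Assumes directory names contain no regex metacharacters.
path_regex() {
    rest=${1#/}          # drop the leading slash
    rest=${rest%/}       # drop a trailing slash, if any
    regex='^/(|'
    closers=''
    sep=''
    while [ -n "$rest" ]; do
        part=${rest%%/*}             # next path component
        case $rest in
            */*) rest=${rest#*/} ;;  # more components follow
            *)   rest='' ;;          # that was the last one
        esac
        regex="$regex$sep$part(|"
        closers="$closers)"
        sep='/'
    done
    printf '%s/.*)%s$\n' "$regex" "$closers"
}

# Example, one share per destination drive (share names are assumptions):
#   btrfs restore -v --path-regex "$(path_regex /Documents)" /dev/md127 /mnt/restore
#   (swap the USB disk, mount it at /mnt again)
#   btrfs restore -v --path-regex "$(path_regex /Videos)" /dev/md127 /mnt/restore
```

This only works if each selected folder fits on its destination drive, so the directory structure and rough sizes need to be known in advance.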
Message 15 of 15