Re: 6.2.4 high cpu => lockup => reboot => boot failed

kheno · ‎2015-06-18

Hi,

I've had the same problem as mentioned here:

http://www.readynas.com/forum/viewtopic.php?f=160&t=81294

readynas ultra 4
os 6.2.4 (as I recall)

After a while ssh was no longer accessible.
I also had no option but to do an unsafe power-off of the device.

Except that now it keeps saying: booting
followed by boot failed, retry boot.

I've already tried the boot mem check which resulted in no errors.

Raidar only shows "system starting up..."
ssh, web, ... are not accessible.

I'm a little cautious about what to do next.

Should I try the boot os install?

kheno · ‎2015-06-18

update:

After the mem check, I rebooted again and waited.

Raidar reports: management service is offline.

So it boots again, but the cpu is probably so high again that admin and ssh are not accessible (or were only for a short time).
Downloading logs, ... failed.

kheno · ‎2015-06-18

Could download the logs. I can PM them on request.

First look:
systemd-journal.log

Jun 19 01:02:16 bigdisk kernel: btrfs: corrupt leaf, slot offset bad: block=2306986344448,root=1, slot=238
Jun 19 01:02:16 bigdisk kernel: btrfs: corrupt leaf, slot offset bad: block=2306986344448,root=1, slot=238
Jun 19 01:02:16 bigdisk kernel: btrfs: corrupt leaf, slot offset bad: block=2306986344448,root=1, slot=238
Jun 19 01:02:16 bigdisk kernel: btrfs: corrupt leaf, slot offset bad: block=2306986344448,root=1, slot=238
Jun 19 01:02:16 bigdisk kernel: btrfs: corrupt leaf, slot offset bad: block=2306986344448,root=1, slot=238
Jun 19 01:02:16 bigdisk kernel: btrfs: corrupt leaf, slot offset bad: block=2306986344448,root=1, slot=238
Jun 19 01:02:16 bigdisk kernel: btrfs: corrupt leaf, slot offset bad: block=2306986344448,root=1, slot=238
Jun 19 01:02:16 bigdisk kernel: btrfs: corrupt leaf, slot offset bad: block=2306986344448,root=1, slot=238
Jun 19 01:02:16 bigdisk kernel: btrfs: corrupt leaf, slot offset bad: block=2306986344448,root=1, slot=238
Jun 19 01:02:16 bigdisk kernel: btrfs: corrupt leaf, slot offset bad: block=2306986344448,root=1, slot=238
Jun 19 01:02:16 bigdisk kernel: btrfs: corrupt leaf, slot offset bad: block=2306986344448,root=1, slot=238
Jun 19 01:02:16 bigdisk kernel: btrfs: corrupt leaf, slot offset bad: block=2306986344448,root=1, slot=238
Jun 19 01:02:16 bigdisk kernel: btrfs: corrupt leaf, slot offset bad: block=2306986344448,root=1, slot=238
Jun 19 01:02:16 bigdisk kernel: btrfs: corrupt leaf, slot offset bad: block=2306986344448,root=1, slot=238
Jun 19 01:02:16 bigdisk kernel: btrfs: corrupt leaf, slot offset bad: block=2306986344448,root=1, slot=238
Jun 19 01:02:16 bigdisk kernel: btrfs: corrupt leaf, slot offset bad: block=2306986344448,root=1, slot=238
Jun 19 01:02:16 bigdisk kernel: btrfs: corrupt leaf, slot offset bad: block=2306986344448,root=1, slot=238
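
Since the same kernel message repeats many times, it can help to condense the log and check whether more than one block or slot is reported. A minimal sketch, assuming the journal has been saved as a plain-text file; the function name is only illustrative and not part of any ReadyNAS tool:

```shell
# Summarize btrfs "corrupt leaf" messages in a saved log file.
# Prints one line per distinct (block, slot) pair, prefixed by its count.
# Usage: summarize_corrupt_leaf systemd-journal.log
summarize_corrupt_leaf() {
    grep 'btrfs: corrupt leaf' "$1" \
        | sed 's/^.*block=\([0-9]*\),root=[0-9]*, slot=\([0-9]*\).*$/block=\1 slot=\2/' \
        | sort | uniq -c | sed 's/^ *//'
}
```

A single output line would suggest that one damaged tree leaf is being hit over and over, rather than damage spread across many blocks.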

mdgm-ntgr · ‎2015-06-18

Can you run the memory test boot menu option?: http://kb.netgear.com/app/answers/detail/a_id/21104

Run at least a few passes of this test.

If the memory passes, then run the disk test boot menu option.

Please send me the logs you have downloaded (there is an email address mentioned on the page linked to by the Sending Logs link in my sig).

kheno · ‎2015-06-19

Hi!

logs sent.

Memory test was run, but running again now.
Will run the disk test after that.

And keep you posted.

Thanks!

mdgm-ntgr · ‎2015-06-19

Thanks for the update.

Looking at your logs, one of your disks has 6 ATA errors, but those errors were back in April. So I don't think that is the problem here, though that disk might be failing.

Will be interested to hear the result of the memory test.

Do you have a backup?

kheno · ‎2015-06-19

memtest came clean. no errors.

disk test went from 0% to 100%

But now the lcd indicates "Testing Disks", and the power button and hard-disk numbers have been blinking for more than an hour already.
I'm not 100% sure, but I think it is only hard-disk number 3 that blinks.

Does it take that long? I've read somewhere it should not take more than 10 minutes.

Oh, I have a partial offline backup; all important folders and files are backed up every night to an offline site.
So, I prefer getting my data back over restoring 🙂 but it's not the end of the world if something goes wrong.

Thanks

kheno · ‎2015-06-20

Hi,

It rebooted after a long time to land on the "failed to boot" status.

I shut down the NAS and figured it would also be good to take out the disks, carefully clean the contacts, and put them back.

Just in case.

Booted again.

And downloaded the logs.

Where would I look for the result of the disk scan?

Accessing it through the smb shares results in shares that can't be opened or that show incomplete contents... :shock:

mdgm-ntgr · ‎2015-06-20

If you SSH in and look at the smartctl output for the disks you will see the result of the extended tests.
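
A minimal sketch of that check, assuming ssh access works and smartmontools is installed on the unit. The helper name and the device list in the usage comment are assumptions; `-l selftest` is the smartctl option that prints only the self-test log:

```shell
# Print only the self-test log for each disk passed as an argument.
show_selftest_logs() {
    for disk in "$@"; do
        echo "== $disk =="
        smartctl -l selftest "$disk"
    done
}
# Example (device names are an assumption for a 4-bay unit):
#   show_selftest_logs /dev/sda /dev/sdb /dev/sdc /dev/sdd
```

An "Extended offline" entry with status "Completed without error" is the result of the long disk test for that drive.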

kheno · ‎2015-06-20

The thing is that after the testing it rebooted and is now in "boot failed" mode.

Where the admin and ssh are not available...

I was looking into buying a new rn314 by the end of the summer.

I don't think that buying one now and swapping drives will make my data accessible?

Is the data recoverable? Should I look for an expert and pay him to have a look?
If there is a way to just get data recovered and copy it to external disks ...

I've used the linux command line for a while, but I'm not experienced enough to play with btrfs tools and the readynas setup, ...

As far as I can see in the logs downloaded through raidar, there are no real disk errors to be found except those from April.
btrfs mentions problems. I presume I would need to be able to log in over ssh and probably fix /c with btrfs tools (restore to another external drive or mount -o recovery,ro).

I could log in in tech support mode, but do not know where to start from there.
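
The read-only `mount -o recovery,ro` idea mentioned above could be sketched as follows. This is only a sketch under assumptions: the device path and mount point are placeholders supplied by the caller, and whether the kernel on the unit accepts the `recovery` option has not been confirmed in this thread.

```shell
# Try a read-only mount with btrfs's "recovery" option, which asks the
# filesystem to fall back to an older tree root if the newest one is bad.
# $1 = block device holding the data volume (placeholder, not confirmed)
# $2 = empty directory to use as the mount point
mount_btrfs_readonly() {
    dev=$1
    target=$2
    mkdir -p "$target" || return 1
    mount -t btrfs -o ro,recovery "$dev" "$target"
}
```

Because the mount is read-only, a failed attempt should not change anything on the disks, which makes it a reasonable thing to try before any repair command.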

kheno · ‎2015-06-20

ok,

so I figured out the tech support mode, how to mount, and ...

So I am now able to ssh in when in normal boot.

smartctl output:


smartctl -a /dev/sda
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.0.101.RNx86_64.3] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model:     WDC WD30EZRX-00MMMB0
Serial Number:    WD-WCAWZ1418318
LU WWN Device Id: 5 0014ee 25b9acf8e
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Jun 20 20:53:02 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (50760) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 488) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   158   148   021    Pre-fail  Always       -       9091
  4 Start_Stop_Count        0x0032   091   091   000    Old_age   Always       -       9149
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       22174
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1281
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       30
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       784723
194 Temperature_Celsius     0x0022   123   111   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     22161         -
# 2  Short offline       Completed without error       00%         1         -
# 3  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


----------------------------


smartctl -a /dev/sdb
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.0.101.RNx86_64.3] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model:     WDC WD30EZRX-00MMMB0
Serial Number:    WD-WCAWZ1375860
LU WWN Device Id: 5 0014ee 20645a879
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Jun 20 20:53:35 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (51660) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 496) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   161   146   021    Pre-fail  Always       -       8933
  4 Start_Stop_Count        0x0032   091   091   000    Old_age   Always       -       9086
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       22174
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1281
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       31
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       787547
194 Temperature_Celsius     0x0022   122   109   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     22161         -
# 2  Short offline       Completed without error       00%         1         -
# 3  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



----------------------------




smartctl -a /dev/sdc
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.0.101.RNx86_64.3] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model:     WDC WD30EZRX-00DC0B0
Serial Number:    WD-WCC1T1741134
LU WWN Device Id: 5 0014ee 2b3f2ee8f
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Jun 20 20:54:16 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (38400) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 385) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x70b5) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   180   173   021    Pre-fail  Always       -       5983
  4 Start_Stop_Count        0x0032   097   097   000    Old_age   Always       -       3444
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       8736
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       459
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       18
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       801457
194 Temperature_Celsius     0x0022   122   109   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      8734         -
# 2  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

so I went on searching:
btrfs scrub start /dev/md0
btrfs scrub status /dev/md0
dmesg

No remarks, other than that it only scrubbed 1.45GB?!
Plus the btrfs errors reported previously.

then checked raid setup:


mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Fri May 23 00:24:48 2014
     Raid Level : raid1
     Array Size : 4190208 (4.00 GiB 4.29 GB)
  Used Dev Size : 4190208 (4.00 GiB 4.29 GB)
   Raid Devices : 4
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Sat Jun 20 21:24:08 2015
          State : clean, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

           Name : 37c0a64a:0  (local to host 37c0a64a)
           UUID : 2b80a109:d01d1b7e:ed3cb495:04521fe4
         Events : 562

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       4       0        0        4      removed
       3       8       49        3      active sync   /dev/sdd1

It seems degraded?
Even when degraded, it should still hold all data?

I'm not an expert, but the strange thing is that the disk seems connected and fine while the raid setup sees it as removed?

Oh, and even with a failing disk, data should not be corrupted.
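
On the degraded question: in a raid1 array every member holds a full copy, so an array with 3 of 4 members active still contains all of its data. A small sketch for pulling the relevant fields out of `mdadm --detail` output is shown here; the function name is illustrative, and it only reads text on standard input:

```shell
# Condense `mdadm --detail` output to the array state and member count.
# Usage: mdadm --detail /dev/md0 | summarize_md_detail
summarize_md_detail() {
    awk -F' : ' '
        /^ *State :/          { state  = $2 }
        /^ *Raid Devices :/   { raid   = $2 }
        /^ *Active Devices :/ { active = $2 }
        END { printf "state=%s active=%s/%s\n", state, active, raid }'
}
```

For the output pasted above this would report `state=clean, degraded active=3/4`, meaning one of the four slots is empty while the remaining three are in sync.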

So I was looking into how to check with btrfs. I think I need to do something like:

umount /dev/md0

btrfs check --repair /dev/sda (destructive)
btrfs restore /dev/sda /mnt/restore (nondestructive restore to external disk)

Only I can't unmount: device is busy (in use by a lot of processes).

Since I'm not really used to doing these things I would hope you could give me some advice/directions.

Thanks

StephenB · ‎2015-06-20

/dev/md0 is the OS partition, not the data partition.

On my RN102 (which is jbod) /dev/md126 and /dev/md127 are the two data partitions.
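
To see which md arrays exist on a given unit, `/proc/mdstat` can be condensed to one line per array. A minimal sketch follows; the function name is an assumption, and array numbering differs between units, so the output has to be read for the specific NAS:

```shell
# List each md array with its RAID level (or "inactive").
# Reads /proc/mdstat by default; pass a file name to inspect a saved copy.
list_md_arrays() {
    awk '/^md[0-9]+ :/ {
        level = $3
        if ($3 == "active") {
            # Skip a status note such as "(auto-read-only)" if present.
            level = ($4 ~ /^\(/) ? $5 : $4
        }
        print $1, level
    }' "${1:-/proc/mdstat}"
}
```

The 4 GiB raid1 array shown earlier is the OS partition, so the data volume would be on a different, much larger array.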

mdgm-ntgr · ‎2015-06-21

I would not suggest running commands randomly when you are not familiar with them, especially destructive ones.

Cloning your disks before doing something destructive would be recommended.

If you are going to use btrfs restore, you would do e.g.:


# btrfs restore /dev/md127 /mnt/restore

If a USB disk is mounted at /mnt.

If you like, you can run it verbosely and send the data nowhere, just to test if it can find anything:


btrfs restore -v /dev/md127 /dev/null
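
The restore step can be wrapped with a guard so that files only land on the external disk. This is a sketch under assumptions: the function name is made up, `/dev/md127` is the example device from this post and may differ on other units, and `mountpoint` is assumed to be available on the NAS.

```shell
# Restore files from a damaged btrfs volume onto an external disk.
# Refuses to run unless the destination really is a mounted filesystem,
# so a forgotten USB mount does not fill up the small OS partition.
# $1 = source device (e.g. /dev/md127), $2 = mount point of the USB disk
restore_to_external() {
    src=$1
    mnt=$2
    if ! mountpoint -q "$mnt"; then
        echo "error: $mnt is not a mount point" >&2
        return 1
    fi
    mkdir -p "$mnt/restore" || return 1
    # Newer btrfs-progs also accept -D (--dry-run) here to only list files;
    # whether the version on the unit supports it is not confirmed.
    btrfs restore -v "$src" "$mnt/restore"
}
```

Usage would be along the lines of `restore_to_external /dev/md127 /mnt` once the USB disk is mounted at /mnt.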

kheno · ‎2015-06-22

recovering! 😄

Had some errors while writing files.
I thought it might be an incompatibility with the ntfs file system, so I formatted the external disk as ext4.
Problem solved.

The only question I have, and I could not find a hint anywhere:

My destination disks are too small to fit 5TB. Can I split the recovery over multiple destination drives?
When a drive is full I suspect recovery will end with an error and just stop.

Any idea?

Btw, thanks for your help!

mdgm-ntgr · ‎2015-06-22

Sent you a PM. If you recall your directory structure there should be a way to do this.
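
The contents of the PM are not shown here. One approach that fits the hint about directory structure is the `--path-regex` option of `btrfs restore`, which limits a run to paths matching a regex, so each top-level folder could be restored to a different drive. The sketch below builds the nested pattern that option expects; it assumes the btrfs-progs on the unit supports `--path-regex`, and the helper name and the "Photos" folder are hypothetical.

```shell
# Build a --path-regex pattern that matches one directory, its parents,
# and everything below it. Regex metacharacters in names are NOT escaped.
# Runs in a subshell so the IFS and globbing changes stay local.
build_path_regex() (
    set -f                      # no filename globbing while splitting
    path=${1#/}                 # drop one leading slash
    path=${path%/}              # drop one trailing slash
    regex='^/(|'
    close=')'
    sep=''
    IFS=/
    for part in $path; do
        [ -n "$part" ] || continue
        regex="${regex}${sep}${part}(|"
        close="${close})"
        sep='/'
    done
    printf '%s\n' "${regex}/.*${close}\$"
)
# Example (hypothetical folder name, device as used earlier in the thread):
#   btrfs restore -v --path-regex "$(build_path_regex Photos)" \
#       /dev/md127 /mnt/restore
```

Running the restore once per folder, with a different destination disk mounted each time, would split the 5TB across several drives without relying on a run stopping cleanly when a disk fills up.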