Re: Disk test running 10 days already

stevevandegaer2 · ‎2019-08-11

Hello,

I'm running a ReadyNAS with firmware 6.10.1.

On August 1 I started a disk test, and it is still running now, on the 11th.

I guess this isn't normal. What should I do? Is it safe to reboot?

Kind regards

Steve

stevevandegaer2 · ‎2019-08-11

Last night I had to reboot after the device became unresponsive when I activated file search.

Should I try the disk test again or do something else first?

StephenB · ‎2019-08-12

@stevevandegaer2 wrote:

Last night I had to reboot after the device became unresponsive when I activated file search.

Should I try the disk test again or do something else first?

One possible reason for the very long test time is that you might have a failing disk. I suggest downloading the log zip file and looking in disk_info.log. Also look for disk-related errors in kernel.log.
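As a rough sketch of that log search, assuming the log zip has been extracted on a machine with a POSIX shell: the helper name disk_errors is hypothetical, and the pattern list is only my guess at messages that often accompany disk trouble, not an official or complete set.

```shell
# Hypothetical helper: print lines from a kernel log that often point at disk trouble.
# The patterns are an assumption; a clean result does not prove the disks are healthy.
disk_errors() {
    grep -iE 'I/O error|medium error|ata[0-9.]+: (exception|error|failed)' "$1"
}

# Example, run in the folder where the log zip was extracted:
# disk_errors kernel.log
```

No output means none of these patterns matched, which is worth knowing but is not a guarantee.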

You might also want to log in with SSH and run smartctl -x.

stevevandegaer2 · ‎2019-08-12

Not seeing any error in either log file.

I started the disk check again, hoping I will get some result this time.

stevevandegaer2 · ‎2019-08-12

smartctl -x

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.4.178.x86_64.1] (local build)

ERROR: smartctl requires a device name as the final command-line argument.

Use smartctl -h to get a usage summary

StephenB · ‎2019-08-12

@stevevandegaer2 wrote:

ERROR: smartctl requires a device name as the final command-line argument.

Sorry, I assumed more knowledge than you have.

You need to include the device name for each disk. So for the first disk you use

# smartctl -x /dev/sda

Repeat using sdb for disk 2, and so forth.
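If you would rather not type the command once per disk, something like this might do it. It is only a sketch: the function name smart_all is made up, and it assumes you are logged in as root and that the disks show up as /dev/sda, /dev/sdb, and so on.

```shell
# Hypothetical helper: run "smartctl -x" on every device given as an argument.
smart_all() {
    for dev in "$@"; do
        echo "=== $dev ==="     # label each report so the disks don't get mixed up
        smartctl -x "$dev"
    done
}

# Example (assumes the disks are sda, sdb, ...):
# smart_all /dev/sd[a-z] > smart_all.txt
```

The "=== /dev/sdX ===" header before each report makes it clear which disk each section belongs to when you post the output.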

One thing to look for is the "SMART Extended Comprehensive Error Log" section. Here's a snippet from my own system. "UNC" means "uncorrectable error". Not all disks will have this section, but it is helpful when they do.

root@NAS:~# smartctl -x /dev/sdc
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.4.178.x86_64.1] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
...
=== START OF READ SMART DATA SECTION ===
...
SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 12
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 12 [11] occurred at disk power-on lifetime: 36166 hours (1506 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 0c 27 df 40 40 00  Error: UNC at LBA = 0x0c27df40 = 203939648

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 80 00 c8 00 00 0c 27 df 40 40 08  1d+08:15:46.204  READ FPDMA QUEUED
  60 00 08 00 c0 00 01 06 34 36 98 40 08  1d+08:15:46.163  READ FPDMA QUEUED
  60 00 80 00 b8 00 00 0c 27 e4 40 40 08  1d+08:15:46.146  READ FPDMA QUEUED
  60 00 80 00 b0 00 00 0c 27 e9 40 40 08  1d+08:15:46.123  READ FPDMA QUEUED
  60 00 80 00 a8 00 00 0c 27 ee 40 40 08  1d+08:15:46.094  READ FPDMA QUEUED

Errors in this section don't necessarily mean the disk needs to be replaced. But if you see errors that happened during your disk test, they might help you isolate the problem. If you do see some, the next step is to power down the NAS and test the disk(s) in a Windows PC using the vendor's tools: Lifeguard for Western Digital, SeaTools for Seagate. If they pass, put them back into the NAS (in the same slots) before you power it up.

stevevandegaer2 · ‎2019-08-12

I guess I should replace this first one?

SMART Extended Comprehensive Error Log Version: 1 (2 sectors)
Device Error Count: 1
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 [0] occurred at disk power-on lifetime: 9633 hours (401 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  84 -- 51 00 08 00 00 79 cc 70 00 40 00  Error: ICRC, ABRT at LBA = 0x79cc7000 = 2043441152

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 d0 00 10 00 00 79 cc 70 30 40 08     03:48:00.465  READ FPDMA QUEUED
  60 00 d0 00 10 00 00 79 cc 70 30 40 08     03:48:00.465  READ FPDMA QUEUED
  60 00 d0 00 08 00 00 79 cc 70 28 40 08     03:48:00.465  READ FPDMA QUEUED
  60 00 d0 00 10 00 00 79 cc 70 18 40 08     03:48:00.465  READ FPDMA QUEUED
  60 00 d0 00 20 00 00 79 cc 6f f8 40 08     03:48:00.465  READ FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (2 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Interrupted (host reset) 90% 59582 -
# 2 Short offline Completed without error 00% 0 -

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Self_test_in_progress [60% left] (0-65535)
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 20109 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version: 3
SCT Version (vendor specific): 258 (0x0102)
SCT Support Level: 1
Device State: DST executing in background (3)
Current Temperature: 36 Celsius
Power Cycle Min/Max Temperature: 30/36 Celsius
Lifetime Min/Max Temperature: 2/41 Celsius
Under/Over Temperature Limit Count: 0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SMART Extended Comprehensive Error Log Version: 1 (2 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (2 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Interrupted (host reset) 90% 59581 -
# 2 Short offline Completed without error 00% 0 -

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Self_test_in_progress [70% left] (0-65535)
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version: 2
SCT Version (vendor specific): 256 (0x0100)
SCT Support Level: 1
Device State: Active (0)
Current Temperature: 36 Celsius
Power Cycle Min/Max Temperature: 30/36 Celsius
Lifetime Min/Max Temperature: 16/66 Celsius
Under/Over Temperature Limit Count: 0/0

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 50994 -
# 2 Short offline Completed without error 00% 0 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version: 3
SCT Version (vendor specific): 258 (0x0102)
SCT Support Level: 1
Device State: DST executing in background (3)
Current Temperature: 37 Celsius
Power Cycle Min/Max Temperature: 32/37 Celsius
Lifetime Min/Max Temperature: 2/44 Celsius
Under/Over Temperature Limit Count: 0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 50994 -
# 2 Short offline Completed without error 00% 0 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version: 3
SCT Version (vendor specific): 258 (0x0102)
SCT Support Level: 1
Device State: DST executing in background (3)
Current Temperature: 37 Celsius
Power Cycle Min/Max Temperature: 32/37 Celsius
Lifetime Min/Max Temperature: 2/44 Celsius
Under/Over Temperature Limit Count: 0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 17086 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version: 3
SCT Version (vendor specific): 258 (0x0102)
SCT Support Level: 1
Device State: DST executing in background (3)
Current Temperature: 36 Celsius
Power Cycle Min/Max Temperature: 31/36 Celsius
Lifetime Min/Max Temperature: 2/41 Celsius
Under/Over Temperature Limit Count: 0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

StephenB · ‎2019-08-12

@stevevandegaer2 wrote:

I guess I should replace this first one?

...
84 -- 51 00 08 00 00 79 cc 70 00 40 00 Error: ICRC, ABRT at LBA = 0x79cc7000 = 2043441152

ICRC is an interface CRC error (the data was corrupted in transit between the disk and the controller), and ABRT means the request was aborted. That might not be a problem with the disk itself. In general I wouldn't be concerned about a single abort error. I have a couple I created accidentally when I was testing a drive with Lifeguard. Also, a single aborted request doesn't really explain the long-running test.

Still, you might look at how long ago that error happened. You can compute that by subtracting the power-on lifetime in the error message (e.g. 9633 hours) from the current power-on hours in the SMART stats. The result is powered-on time, so treating it as calendar time assumes the disk is powered 24x7.
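Here is the subtraction as a small sketch. The function name hours_since_error is made up. The current value normally appears as the Power_On_Hours attribute in the smartctl output; below I use 59582, the lifetime shown on the interrupted self-test entry you posted, on the assumption that it belongs to the same disk as the error at 9633 hours.

```shell
# Hypothetical helper: powered-on time between an error and now.
# $1 = current power-on hours, $2 = power-on lifetime from the error entry.
hours_since_error() {
    diff=$(( $1 - $2 ))
    echo "$diff hours (about $(( diff / 24 )) days + $(( diff % 24 )) hours of powered-on time)"
}

# 59582 is assumed to be this disk's current power-on hours (see note above).
hours_since_error 59582 9633
```

With those numbers it prints "49949 hours (about 2081 days + 5 hours of powered-on time)", which would put the error years before the current disk test.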

stevevandegaer2 · ‎2019-08-12

So at first glance I don't have any faulty disks. My new disk test started about 6 hours ago. I'll check it again tomorrow and see if it is still running.

StephenB · ‎2019-08-12

@stevevandegaer2 wrote:

So at first glance I don't have any faulty disks. My new disk test started about 6 hours ago. I'll check it again tomorrow and see if it is still running.

Note that the smartctl -x command was giving you the completion status on each disk.

SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Self_test_in_progress [60% left] (0-65535)

You can reduce the clutter with

# smartctl -x /dev/sda | grep -i self_test

That would let you monitor progress, and give you some idea if it is locking up or just running very slowly.
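To track just the number over time, a sketch along these lines could work. The helper name percent_left is made up, and it assumes the progress line looks like the one quoted above ("Self_test_in_progress [60% left]").

```shell
# Hypothetical helper: read smartctl output on stdin and print the "% left"
# figure from the Self_test_in_progress line (prints nothing if no test is running).
percent_left() {
    sed -n 's/.*Self_test_in_progress \[\([0-9]*\)% left].*/\1/p'
}

# Example: print the remaining percentage for /dev/sda every 10 minutes.
# while true; do smartctl -x /dev/sda | percent_left; sleep 600; done
```

If the number keeps dropping, the test is just slow. If it stays the same for hours, the test is probably stuck.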

stevevandegaer2 · ‎2019-08-12

StephenB, thank you very much for your help. This morning the disk test was complete, and there were no errors in the logs.