jedas (Aspirant)
Dec 02, 2014
Disk spin-down problem, RN104, fw 6.2.0
Hello,
I'm using an RN104 with 2 x 4TB WD RED drives in JBOD/Flex mode (I don't need RAID). No apps are installed, and the samba and dlna services have been stopped to debug this issue. Http/https/ssh are enabled. I've enabled disk spin-down after 5 minutes from the GUI. The system log shows that the disks are stopped, but then immediately started again after 4-8 seconds:
Disk [0] going to standby...
Disk [0] spinning up...
Disk [0] going to standby...
Disk [0] spinning up...
After enabling write logging (echo 1 > /proc/sys/vm/block_dump) I see this in dmesg:
md0_raid1(694): WRITE block 8 on sda1 (1 sectors)
md0_raid1(694): WRITE block 8 on sdb1 (1 sectors)
jbd2/md0-8(719): WRITE block 3684640 on md0 (8 sectors)
jbd2/md0-8(719): WRITE block 3684648 on md0 (8 sectors)
jbd2/md0-8(719): WRITE block 3684656 on md0 (8 sectors)
jbd2/md0-8(719): WRITE block 3684664 on md0 (8 sectors)
jbd2/md0-8(719): WRITE block 3684672 on md0 (8 sectors)
jbd2/md0-8(719): WRITE block 3684680 on md0 (8 sectors)
jbd2/md0-8(719): WRITE block 3684688 on md0 (8 sectors)
jbd2/md0-8(719): WRITE block 3684696 on md0 (8 sectors)
leafp2p(1282): WRITE block 775816 on md0 (8 sectors)
md0_raid1(694): WRITE block 8 on sda1 (1 sectors)
md0_raid1(694): WRITE block 8 on sdb1 (1 sectors)
I believe these writes are what wakes the disks up, because they happen every 4-8 seconds. I've tried killing the leafp2p process, but it doesn't seem to be the source of the problem. I suspect these writes are made by the kernel, but I don't have any knowledge about the RAID internals. The same behaviour occurs when I type "hdparm -y /dev/sda": the disk spins down and wakes up again a second later. Please advise.
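For anyone trying to reproduce this diagnosis over ssh, the steps described above can be combined roughly as follows (a sketch only; /dev/sda is the first disk as in this post, and the block_dump output lands in the kernel log):

    # log every block-layer write in the kernel log (echo 0 to turn it off again)
    echo 1 > /proc/sys/vm/block_dump

    # force an immediate spin-down, then poll the drive's power state;
    # querying the state with -C does not itself wake the drive
    hdparm -y /dev/sda
    while true; do hdparm -C /dev/sda | grep 'drive state'; sleep 2; done

    # in a second session, see which processes issued the writes that woke it
    dmesg | tail -n 20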
23 Replies
- dsm1212 (Apprentice): Ah, ok, I think what it does then is read from both mirrors, and if it gets an IO error from one side of the mirror it will rewrite the data successfully read from the other side. md supports this automatically; you just need to trigger a read from both sides and it will happen. So bitrot protection will wake up the disks unless the program driving it checks to see if the disks are in standby.
steve
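One way to force such a read of both mirror legs on Linux md is the "check" action in sysfs (a sketch; it assumes the affected volume is md0, as in the dmesg output above, and a root ssh session):

    # read every block of both legs; sectors that return read errors are rewritten from the good copy
    echo check > /sys/block/md0/md/sync_action
    # watch progress
    cat /proc/mdstat
    # afterwards: count of sectors found to disagree between the copies during the check
    cat /sys/block/md0/md/mismatch_cnt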
- mdgm-ntgr (NETGEAR Employee Retired): Bitrot protection does more than what md raid can do by itself. It can fix some corruption that md raid by itself couldn't do anything about.
- StephenB (Guru - Experienced User):
dsm1212 wrote: "Ah, ok, I think what it does then is read from both the mirrors and if it gets an IO error from one side of the mirror it will rewrite the data successfully read from the other side. md supports this automatically, you just need to trigger a read from both sides and it will happen. So bitrot protection will wake up the disks unless the program driving it checks to see if the disks are in standby."
Well, normal RAID handles read IO errors, so there's no need to add more protection for that. Bitrot protection deals with the case where the data was "silently" corrupted without an IO error to trigger the recovery. And the bitrot might only affect the checksums themselves, so that possibility needs to be covered.
So bitrot recovery has to kick in after the btrfs checksum fails but no read errors have occurred. Once you know the checksum is bad, there are several strategies you could attempt to use. The simplest I can think of is to simulate read errors on each of the data blocks covered by the checksum, and see if substituting the recovered blocks results in a good checksum. We are just speculating of course, since Netgear isn't saying.
- dsm1212 (Apprentice): Right, a read will fix it, but if the leg with the problem is not read it sits there like a time bomb; then the drive with the good leg fails for some other reason, and now your last copy has this latent read error. You avoid this by forcing a read on both legs. Really, I would bet this is all it does, as this is a pretty well understood problem. There is another way they may be doing this that is much more efficient, but it's disk-vendor dependent and I doubt Netgear would go there.
steve
- StephenB (Guru - Experienced User): Anything your approach could fix is also fixed by a normal scrub. Per mdgm's comment, bitrot protection goes beyond what md raid can fix.
The weakness in all RAID is that it recovers from missing (unreadable) blocks, but has no way to recover from wrong blocks that are readable. It can only provide erasure correction, not error correction.
Blending in the checksum allows you to identify blocks that might be bad and then pretend they were erased, so they can be rebuilt. I've seen that approach used in other contexts; it is a well-understood technique if you are into this kind of stuff.
The main difference between our approaches is that I was simulating a read error/erasure, while you were assuming there must already be one. It's a small change, but it enables repairs that otherwise can't be made. For instance, suppose you attempt to fix an array failure by cloning bad disks to good ones: in that case there are scattered blocks that are wrong, but readable. My approach can fix most of those errors; yours cannot.
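A toy illustration of that substitution idea for the simple two-copy (RAID-1) case, purely to make the argument concrete - the partition names, block offset and recorded checksum below are hypothetical, and this is not a claim about how ReadyNAS OS actually implements its bitrot protection:

    # read the same 4 KiB block from each mirror leg and report which copy
    # still matches a previously recorded good checksum
    OFFSET=123456                         # hypothetical block number, in 4 KiB units
    GOOD="<previously recorded sha256>"   # hypothetical stored checksum
    for leg in /dev/sda3 /dev/sdb3; do    # hypothetical data partitions of the two legs
        sum=$(dd if=$leg bs=4096 skip=$OFFSET count=1 2>/dev/null | sha256sum | awk '{print $1}')
        [ "$sum" = "$GOOD" ] && echo "matching copy on $leg"
    done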
But again, Netgear isn't saying what they actually did - probably because they want to retain it as a competitive advantage.
- dsm1212 (Apprentice): Ok, I think I see what you are suggesting, but I don't understand what the case you are trying to prevent has to do with bit-rot. If a bit changes on the disk, then the sector won't pass the internal checksums done by the disk firmware and it will be an IO error. The chances of some bits flipping and the internal checksum still being good are pretty obscure (like not in our lifetimes). Statistically, a good read of the wrong data means either that someone wrote it or that there was a silent failed write that left the old data intact. Neither of those cases should be labeled "bit-rot". They happen either because someone wrote directly to a leg, or the RAID sw/hw itself was buggy, or there was a deferred write error.
steve
- StephenB (Guru - Experienced User): I think a lot depends on what you think bit rot is. Here's one article that uses the silent-corruption definition: http://arstechnica.com/information-tech ... lesystems/
From the article: "As a test, I set up a virtual machine with six drives. One has the operating system on it, two are configured as a simple btrfs-raid1 mirror, and the remaining three are set up as a conventional raid5. I saved Finn's picture on both the btrfs-raid1 mirror and the conventional raid5 array, and then I took the whole system offline and flipped a single bit - yes, just a single bit from 0 to 1 - in the JPG file saved on each array."
He then goes on to say that RAID-5 didn't find the error, but that the experimental btrfs raid feature did find and fix it. No read errors are happening; the data just went wrong somehow. (BTW, a RAID-5 scrub would have detected it, but it wouldn't know whether the parity block or the data block was wrong. In principle a RAID-6 scrub could be written which would find and repair the error, but I don't know if a normal RAID-6 scrub would do so.)
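On a btrfs data volume (which is what ReadyNAS OS 6 uses), this kind of checksum verification - and, on redundant profiles, repair from the surviving good copy - can also be kicked off by hand with a scrub (a sketch; the mount point shown is an assumption):

    btrfs scrub start /data     # verify checksums; on raid1/dup profiles, bad copies are rewritten from good ones
    btrfs scrub status /data    # progress and counts of checksum/read errors found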
Most of the bit-rot discussions on this forum were triggered by that article, so I think posters here are generally aligned with this understanding of the definition.
How often this happens in practice, and what the mechanisms for this "rot" are, is an interesting question, and I suspect there are different views. Personally, I don't think disks are likely to return wrong data when they fail. Like you, I believe it is far more likely that the disk's own error checks will uncover the problem and it will return a read error.
Cloning a bad disk is one obvious way it can happen in real life. Software errors, or crashes with uncompleted writes in the queue, could of course do it as well. Memory failures that corrupt the data in the cache (before it is written) could do it. But personally I don't believe it happens spontaneously on the disk itself.
- algiam (Aspirant): I have an RN104 with firmware 6.2.1; spindown worked fine prior to 6.2.0.
I have a problem with spindown: only 2 of the 4 disks spin down.
Disk 1: JBOD, 500GB
Disk 2: JBOD, 2TB
Disk 3: RAID 1, 1TB
Disk 4: RAID 1, 1TB
Only disks 1 and 3 spin down - why?
There might be some reason for disk 2, but why disk 4, which is in the RAID 1 with disk 3?
I have stopped all applications, I don't have anything installed, and I even stopped SMB and DLNA for disk 2, and it is the same:
Dec 23 05:56:14 Optimus noflushd[12881]: Spinning up disk 1 (/dev/sdd) after 2:58:17.
Dec 23 05:56:15 Optimus noflushd[12881]: Spinning up disk 3 (/dev/sdb) after 3:08:16.
Dec 23 06:07:02 Optimus noflushd[12881]: Spinning down disk 3 (/dev/sdb).
Dec 23 06:07:05 Optimus noflushd[12881]: Spinning down disk 1 (/dev/sdd).
Dec 23 06:25:15 Optimus noflushd[12881]: Spinning up disk 3 (/dev/sdb) after 0:18:10.
Dec 23 06:25:20 Optimus noflushd[12881]: Spinning up disk 1 (/dev/sdd) after 0:18:13.
Dec 23 06:35:22 Optimus noflushd[12881]: Spinning down disk 3 (/dev/sdb).
Dec 23 06:35:24 Optimus noflushd[12881]: Spinning down disk 1 (/dev/sdd).
Dec 23 11:07:53 Optimus noflushd[12881]: Spinning up disk 1 (/dev/sdd) after 4:32:26.
Dec 23 11:08:13 Optimus noflushd[12881]: Spinning up disk 3 (/dev/sdb) after 4:32:49.
Dec 23 11:15:13 Optimus noflushd[12881]: Quitting on signal...
Dec 23 11:17:20 Optimus noflushd[1419]: Enabling spindown for disk 3 [sdb,0:2:ST31000340NS:9QJ3D9PF:300F:7200]
Dec 23 11:17:20 Optimus noflushd[1419]: Enabling spindown for disk 4 [sda,0:3:ST31000340NS:9QJ2ZMYG:300F:7200]
Dec 23 11:17:20 Optimus noflushd[1419]: Enabling spindown for disk 2 [sdc,0:1:WDC_WD20EZRX-00D8PB0:WD-WMC4M0310978:80.00A08:]
Dec 23 11:17:20 Optimus noflushd[1419]: Enabling spindown for disk 1 [sdd,0:0:ST3500418AS:6VMDW5LD:HP34:7200]
Dec 23 11:28:11 Optimus noflushd[1419]: Spinning down disk 3 (/dev/sdb).
Dec 23 11:39:16 Optimus noflushd[1419]: Spinning up disk 3 (/dev/sdb) after 0:11:02.
Dec 23 11:49:17 Optimus noflushd[1419]: Spinning down disk 3 (/dev/sdb).
Dec 23 11:49:35 Optimus noflushd[1419]: Spinning down disk 1 (/dev/sdd).
Dec 23 11:56:44 Optimus noflushd[1419]: Spinning up disk 3 (/dev/sdb) after 0:07:24.
Dec 23 11:56:44 Optimus noflushd[1419]: Spinning up disk 1 (/dev/sdd) after 0:07:07.
Dec 23 12:01:40 Optimus noflushd[1419]: Quitting on signal...
Dec 23 12:01:40 Optimus noflushd[4067]: Enabling spindown for disk 3 [sdb,0:2:ST31000340NS:9QJ3D9PF:300F:7200]
Dec 23 12:01:40 Optimus noflushd[4067]: Enabling spindown for disk 4 [sda,0:3:ST31000340NS:9QJ2ZMYG:300F:7200]
Dec 23 12:01:40 Optimus noflushd[4067]: Enabling spindown for disk 2 [sdc,0:1:WDC_WD20EZRX-00D8PB0:WD-WMC4M0310978:80.00A08:]
Dec 23 12:01:40 Optimus noflushd[4067]: Enabling spindown for disk 1 [sdd,0:0:ST3500418AS:6VMDW5LD:HP34:7200]
Dec 23 12:28:19 Optimus noflushd[4067]: Spinning down disk 3 (/dev/sdb).
Dec 23 12:28:21 Optimus noflushd[4067]: Spindown of disk 3 (/dev/sdb) cancelled.
Dec 23 12:33:22 Optimus noflushd[4067]: Spinning down disk 3 (/dev/sdb).
Dec 23 12:33:25 Optimus noflushd[4067]: Spinning down disk 1 (/dev/sdd).
Dec 23 12:39:08 Optimus noflushd[4067]: Spinning up disk 1 (/dev/sdd) after 0:05:41.
Dec 23 12:39:23 Optimus noflushd[4067]: Spinning up disk 3 (/dev/sdb) after 0:05:58.
Dec 23 12:44:25 Optimus noflushd[4067]: Spinning down disk 3 (/dev/sdb).
Dec 23 12:44:27 Optimus noflushd[4067]: Spinning down disk 1 (/dev/sdd).
Dec 23 13:08:08 Optimus noflushd[4067]: Spinning up disk 1 (/dev/sdd) after 0:23:38.
Dec 23 13:08:34 Optimus noflushd[4067]: Spinning up disk 3 (/dev/sdb) after 0:24:07.
Dec 23 13:13:25 Optimus noflushd[4067]: Spinning down disk 1 (/dev/sdd).
Dec 23 13:13:58 Optimus noflushd[4067]: Spinning down disk 3 (/dev/sdb).
Dec 23 13:31:27 Optimus noflushd[4067]: Spinning up disk 3 (/dev/sdb) after 0:17:27.
Dec 23 13:31:27 Optimus noflushd[4067]: Spinning up disk 1 (/dev/sdd) after 0:17:59.
Dec 23 13:36:40 Optimus noflushd[4067]: Spinning down disk 3 (/dev/sdb).
Dec 23 13:37:32 Optimus noflushd[4067]: Spinning down disk 1 (/dev/sdd).
Dec 23 14:58:32 Optimus noflushd[4067]: Spinning up disk 1 (/dev/sdd) after 1:20:57.
Dec 23 14:58:43 Optimus noflushd[4067]: Spinning up disk 3 (/dev/sdb) after 1:22:01.
Can someone help me?
Thank you.
- StephenB (Guru - Experienced User): If you are comfortable with ssh, you can try this: viewtopic.php?f=154&t=77842&hilit=skywalker#p435474
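To see at a glance which drives are actually spun down at any given moment (complementing the noflushd log above), the power state can be queried directly over ssh; the query itself does not wake the drive. A sketch using the device names from the log:

    for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
        echo -n "$d: "
        hdparm -C $d | grep 'drive state'   # "standby" = spun down, "active/idle" = spinning
    done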
- mdgm-ntgr (NETGEAR Employee Retired): Disk 2 is a WD Green disk. Have you tried disabling the WDIDLE3 timer on that?
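For reference, the WD Green head-parking (idle3) timer can be read and disabled from Linux with idle3-tools, if that package is available - a sketch, assuming disk 2 is /dev/sdc as in the noflushd log above; the drive needs a full power-off afterwards for the change to take effect:

    idle3ctl -g /dev/sdc    # show the current idle3 timer setting
    idle3ctl -d /dev/sdc    # disable the timer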