RN104 immediately "out of memory 390" error after 6.2.2 -> 6.2.5 -> 6.4.2 upgrade

iany · ‎2016-09-21

I decided to upgrade my 2+ year old 6.2.2 firmware to a newer stable release. The latest in the 6.4 branch seemed OK: it was about 9 months old, so I thought it had been proven. I wasn't wrong, I just missed all the problems others have been reporting...

You know, at work, I'm a responsible sysadmin: I read the whole release notes and all procedures, and test everything in the lab before trying it in production. But at home--well, I'm just a better-class home user. I read the release notes for known issues--there is no such section in Netgear's. So I tried the forums, found not much there, and went ahead. Come on, Netgear! How can you expect home users to dig through your KB and/or forums for hours to learn everything they may or may not need for a successful firmware upgrade? I have dug in now, because this now has priority! But before, I had my kids' homework to help with etc. I also do not have another system at home to run tests on...

Now, my system with 5+ TB (4 x 2 TB disks) of data goes into the __out_of_memory_390 error about 1 or 2 minutes after I boot it. All the posts here reiterate one and only one solution:

1. Back up the config and all the data (this should be possible even in my situation using the "Volume read only" option from the Boot menu -> see the how-to at http://kb.netgear.com/app/answers/detail/a_id/22891/~/how-do-i-access-the-boot-menu-on-my-readynas-1...).

2. Do a factory reset.

3. Set up the NAS from scratch or restore the config.

4. Restore your data (may be difficult if you use home folders--as far as I understand it, all individual users need to restore their data themselves, and not even the admin will be able to restore permissions correctly--weird to me, but it's also unconfirmed).

I don't like this kind of solution, but enough complaining... I only used the NAS as NFS storage; I wasn't using snapshots on the NFS share or any other feature. Nevertheless, I believe my problem may be caused by too many snapshots. I believe this article _must_be_ linked from all release notes and firmware upgrade procedures:

http://kb.netgear.com/app/answers/detail/a_id/30033

So far I have tried "OS reinstall" from the Boot menu without any success. This morning, I put in a spare 500 GB scratch disk and upgraded the firmware to 6.5.0. The machine is running nicely, but my goal is to put my 4 disks back in this evening (it'll upgrade the firmware on the disks to what's in the NAS's on-board flash, i.e. 6.5.0) in the hope that:-

- either the error will not appear at all, or

- at least I'll have enough time to delete some snapshots before it appears.

If not, I'll try to upgrade to 6.5.1 using the scratch disk and try again. If that doesn't succeed, I'll have to take the system to the work lab in order to back up, factory reset and restore. I've got no other place where I can find 6 TB of space...

I'll let you know my results.

Now I want to pass on my message to:-

- the community = schedule defrag/scrub/balance of your arrays + disable and delete snapshots if you don't need them + don't trust the vendor that a firmware upgrade is painless and risk-free,

- Netgear = no home user is interested in products for which you need 10+ years of IT experience to set up and maintain, and 1 working day of documentation research just to upgrade the firmware and back up the data off them just in case.

iany · ‎2016-09-23

I'm not out of the woods yet, but I've made considerable progress.

I was able to boot in "volume read-only" mode, which was reassuring. At least I'd be able to back up the data...

So, I tried to find out what's actually happening in the system. I logged in via SSH and watched the system utilization. As soon as the data volume was mounted, a btrfs-cleaner process appeared. It pegged the CPU at 100%, which wouldn't be a problem in itself. But it also consumed more and more memory, and when the system started to swap, it ground to a halt. I made another test--booted in "volume read-only" mode, logged in as root and executed `mount -o remount,rw /data`. The outcome was the same--btrfs-cleaner ate all memory, swapping started, the system halted. Again, the finger pointed at either snapshots (too many to process etc.) or quotas (a newly introduced feature).
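
To watch memory and swap drain while btrfs-cleaner runs, something like the following can be left running in the SSH session. It's only a sketch: `mem_summary` is a made-up helper name, and it assumes the standard `MemFree` and `SwapFree` fields of `/proc/meminfo` (values in kB).

```shell
# Sketch: summarise free RAM and free swap from /proc/meminfo-style input on stdin.
# mem_summary is a hypothetical helper, not something shipped with the NAS.
mem_summary() {
    awk '
        $1 == "MemFree:"  { mem = $2 }     # second field is the value in kB
        $1 == "SwapFree:" { swap = $2 }
        END { printf "MemFree=%dMiB SwapFree=%dMiB\n", mem / 1024, swap / 1024 }
    '
}
```

Run it as `while sleep 5; do mem_summary < /proc/meminfo; done` and stop it with Ctrl-C. Both numbers falling steadily right after the volume is mounted read-write would match the behaviour described above.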

After another "volume read-only" mode boot-up, I executed this:

mount -o remount,rw /data;btrfs quota disable /data

and the system stayed OK: no btrfs-cleaner, nothing. So I remounted all the other btrfs filesystems (get the list with `mount | grep btrfs`). In my case:
- the list of btrfs volumes:-

/dev/md127 on /data type btrfs (rw,noatime,nodiratime,nodatasum,nospace_cache,subvolid=5,subvol=/)
/dev/md127 on /apps type btrfs (rw,noatime,nodiratime,nodatasum,nospace_cache,subvolid=257,subvol=/.apps)
/dev/md127 on /home type btrfs (rw,noatime,nodiratime,nodatasum,nospace_cache,subvolid=256,subvol=/home)
/dev/md127 on /var/ftp/home type btrfs (rw,noatime,nodiratime,nodatasum,nospace_cache,subvolid=256,subvol=/home)
/dev/md127 on /run/nfs4/data/Shared type btrfs (rw,noatime,nodiratime,nodatasum,nospace_cache,subvolid=275,subvol=/Shared)
/dev/md127 on /run/nfs4/home type btrfs (rw,noatime,nodiratime,nodatasum,nospace_cache,subvolid=256,subvol=/home)

- the commands to remount and disable quotas:-

mount -o remount,rw /home;btrfs quota disable /home
mount -o remount,rw /apps;btrfs quota disable /apps
mount -o remount,rw /var/ftp/home;btrfs quota disable /var/ftp/home
mount -o remount,rw /run/nfs4/home;btrfs quota disable /run/nfs4/home
mount -o remount,rw /run/nfs4/data/Shared;btrfs quota disable /run/nfs4/data/Shared

Note that /run/nfs4/data/Shared is my NFS share, so you almost certainly won't have it. Your list may be very different.
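
Since the list differs per system, the commands can be generated from the `mount` output instead of typed by hand. A minimal sketch, assuming mount points contain no spaces; `print_quota_disable_cmds` is a hypothetical helper name, and it only prints the commands so they can be reviewed before anything is run:

```shell
# Sketch: read `mount` output on stdin and print (not run) a remount +
# quota-disable command for every btrfs mount point.
# Assumption: mount points contain no spaces.
print_quota_disable_cmds() {
    # `mount` lines look like: <device> on <mountpoint> type <fstype> (<options>)
    awk '$2 == "on" && $4 == "type" && $5 == "btrfs" { print $3 }' |
    while read -r mp; do
        echo "mount -o remount,rw $mp;btrfs quota disable $mp"
    done
}
```

Usage: `mount | print_quota_disable_cmds`. Note that this also lists /data itself, which I had already handled separately above.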

Still no btrfs-cleaner, and the system was still running OK. So I removed all snapshots (note: I was not using them at all; your setup might be different, but you may need to do so too in order to restore the system).
How to remove snapshots--well, first you need a list of configs--they're actually numbers, and their count corresponds with the btrfs volume count:-

ls /etc/snapper/configs

You can also find out which config is for which volume:-

root@kostka:~# grep VOLUME /etc/snapper/configs/*
0:SUBVOLUME="/data/Backup"
1:SUBVOLUME="/data/Documents"
2:SUBVOLUME="/data/Music"
3:SUBVOLUME="/data/Pictures"
4:SUBVOLUME="/data/Shared"
5:SUBVOLUME="/data/Videos"

Then execute for each config:

snapper -c <config> list

It'll show you something like:

Type   | #   | Pre # | Date                             | User | Cleanup | Description | Userdata             
-------+-----+-------+----------------------------------+------+---------+-------------+----------------------
single | 0   |       |                                  | root |         | current     |                      
single | 13  |       | Thu 31 Mar 2016 12:00:41 AM CEST | root |         |             | snapshot=c_1459375241
single | 43  |       | Sat 30 Apr 2016 12:00:25 AM CEST | root |         |             | snapshot=c_1461967225
single | 74  |       | Tue 31 May 2016 12:00:49 AM CEST | root |         |             | snapshot=c_1464645648
single | 104 |       | Thu 30 Jun 2016 12:00:25 AM CEST | root |         |             | snapshot=c_1467237624
single | 134 |       | Sat 30 Jul 2016 12:00:20 AM CEST | root |         |             | snapshot=c_1469829619
single | 135 |       | Sun 31 Jul 2016 12:00:51 AM CEST | root |         |             | snapshot=c_1469916051
single | 141 |       | Sat 06 Aug 2016 12:00:58 AM CEST | root |         |             | snapshot=c_1470434458
single | 148 |       | Sat 13 Aug 2016 12:00:05 AM CEST | root |         |             | snapshot=c_1471039205
single | 155 |       | Sat 20 Aug 2016 12:00:25 AM CEST | root |         |             | snapshot=c_1471644025
single | 158 |       | Tue 23 Aug 2016 12:00:08 AM CEST | root |         |             | snapshot=c_1471903208
..
cut = there were daily snapshots
..
single | 181 |       | Thu 15 Sep 2016 12:00:21 AM CEST | root |         |             | snapshot=c_1473890421

And now you delete all of them except the current one. The command is `snapper -c <config> delete 1-<no. of the latest snapshot>`, in my case e.g. `snapper -c 0 delete 1-181`. They were all zero-size, so it took less than a second.
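
With several configs, the number of the latest snapshot can be read out of the `snapper list` table rather than by eye. A rough sketch, assuming the table layout shown above (snapshot number in the second `|`-separated column); `last_snapshot` and `print_delete_cmd` are hypothetical helper names, and the delete command is only printed, not run:

```shell
# Sketch: read `snapper -c <config> list` output on stdin and print the highest
# snapshot number. The header and separator lines are skipped because their
# second column is not a number; snapshot 0 ("current") is never reported.
last_snapshot() {
    awk -F'|' '
        { n = $2; gsub(/ /, "", n) }                   # strip padding around the number
        n ~ /^[0-9]+$/ && n + 0 > max { max = n + 0 }  # remember the largest one seen
        END { if (max > 0) print max }
    '
}

# Print the delete command for one config, e.g. "snapper -c 0 delete 1-181".
print_delete_cmd() {
    config=$1
    last=$(last_snapshot)
    if [ -n "$last" ]; then
        echo "snapper -c $config delete 1-$last"
    fi
}
```

Usage: `snapper -c 0 list | print_delete_cmd 0`, repeated for each config number found in /etc/snapper/configs.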

I then rebooted to normal mode. The current state of the system is:-
- Snapshots deleted and all snapshotting disabled.
- Quotas disabled. This is surely unsupported by Netgear. Some things may be broken. A firmware upgrade may enable them or fail because they are disabled. Etc.
- I'm now balancing, defragging and scrubbing the volume. I haven't found out in which order it's best to do this; the guidance in Netgear's KB #26941 is to scrub rarely, defrag occasionally and balance regularly. My 6 TB volume took 10.5 hours to balance and 4.5 hours to defrag, and it took 13 hours to scrub to 30%.
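
As a back-of-the-envelope estimate, and assuming scrub progress is linear (which it may well not be), 13 hours for 30% extrapolates to roughly 43 hours for a full scrub. `estimate_total_hours` below is just an illustrative helper:

```shell
# Sketch: extrapolate the total run time in whole hours from the hours elapsed
# and the percentage completed, assuming progress is linear.
estimate_total_hours() {
    elapsed_hours=$1
    percent_done=$2
    echo $(( elapsed_hours * 100 / percent_done ))   # integer division, rounds down
}

estimate_total_hours 13 30   # prints 43
```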

Next steps:-
- Verify data integrity.
- Enable quotas and see if the system comes up normally.

iany · ‎2016-09-21

From one of the more helpful posts (5th update at https://community.netgear.com/t5/Using-your-ReadyNAS/Readynas-104-won-t-boot-Error-354-out-of-memory...):

The Guide also mentions some common reasons why problems might be encountered:

Systems that are completely full.
Systems that have high filesystem fragmentation.
Systems that have large quantities of hourly, daily, monthly snapshots.

The first and last of these should be easy for you to verify before you update the firmware. The middle one may usually (but not always) be somewhat related to the other two, but advanced users could get a good indication by looking at the metadata usage in btrfs.log. If the metadata usage is huge then this would suggest that the way the system was configured and/or used was far from ideal.

It appears that all systems encountering this problem are affected by one or more of the issues described in these bullet points.

Some suggestions going forward would be to keep volume usage under 80%, run regular scheduled volume maintenance (defrag & balance) and to only use bit-rot protection and snapshots on shares which are suited to using those, not on every share.
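
The 80% guideline is easy to check from the shell before an upgrade. A minimal sketch, assuming the data volume is mounted at /data and that `df -P` prints the usual POSIX layout with the capacity in the fifth column; `usage_percent` is a hypothetical helper, and the figure `df` reports for btrfs should be treated as approximate:

```shell
# Sketch: read `df -P <mountpoint>` output on stdin and print the usage
# percentage as a bare number (the header is line 1, the data is line 2).
usage_percent() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

Usage: `[ "$(df -P /data | usage_percent)" -lt 80 ] || echo "volume is 80% full or more"`.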

bedlam1 · ‎2016-09-21

If you click the "Downloads" link at the top of this page, enter your model no. and tick firmware (or all), you will find a whole bunch of release notes.

mdgm-ntgr · ‎2016-09-22

The firmware release notes already link to KB article #26212, and the very first link on 26212 links you to the 30033 article you mentioned.

iany · ‎2016-09-23

Well, I know I overlooked it, but "always" is unfortunately a bit of an overestimation in this case. Instead of one link to #26212, you get two links in the 6.4.1 and 6.4.2 release notes which point to non-existent documents--#23113 and #22804. And you need to search for the 6.4.2 release notes, as these are missing from the Downloads page.

Given the fact that this procedure may crash the machine, I'd really suggest more aggressive wording. IMPORTANT or WARNING are such good words to attract attention... Or bold red lettering is also good. Something for people like me... "Take into consideration" is too mild in my opinion.

iany · ‎2016-09-25

I enabled quotas again. N.B.: You don't need to set them up again; they were already set up and I had only disabled them, not removed their configuration.

btrfs quota enable /data
btrfs quota enable /home
btrfs quota enable /apps
btrfs quota enable /var/ftp/home
btrfs quota enable /run/nfs4/home
btrfs quota enable /run/nfs4/data/Shared

Then I rebooted and watched for unusual/unwanted processes, e.g. btrfs-cleaner eating my memory etc. 😉 Long story short, the machine has been running for 30 hours now without any problems.

The last step was to set up a volume maintenance schedule:-

- Disk test seems to be an extended offline test, the kind you run with `smartctl -t long /dev/sdX`. I run this weekly.

- Balance will run monthly.

- Defrag will run quarterly.

- I don't run scrub as I don't use snapshots.

My last update here, I hope 🙂

mdgm-ntgr · ‎2016-09-25

Scrubbing is there for if you use bit-rot protection not snapshots.

StephenB · ‎2016-09-26

@mdgm wrote:

Scrubbing is there for if you use bit-rot protection not snapshots.

Scrubbing still reads all the data on the disks even if bit-rot protection is off. So it does provide some assurance that the drives and the file system are ok.

I schedule each of the functions - disk test, balance, defrag, and scrub - once every three months (spreading them out over the quarter).
