Script for making remote NAS diagnostics available locally

Sandshark · ‎2020-12-24

@StephenB suggested starting this. Obviously, it's for advanved users with at least some Linux experience.

I have one main NAS, two local backup NAS, and a remote backup NAS. Since the backup devices are not on all the time, I have a script I run that creates some diagnostic data files and then RSYNC's them to my main NAS. That lets me review the status of those NAS without having to turn them on. For the remote NAS, I use ZeroTier so all I have to do is a normal rsync to the main NAS's ZeroTier IP. If somebody wants to add details for rsync over SSH instead, please do.

This is for local backup NAS "RN4200B" and my main NAS is located at 192.168.0.42:

#!/bin/sh
date >/data/hdsentinel/hdinfo_RN4200B.txt
echo >>/data/hdsentinel/hdinfo_RN4200B.txt
/apps/hdsentinel/HDSentinel -solid >>/data/hdsentinel/hdinfo_RN4200B.txt
echo  >>/data/hdsentinel/hdinfo_RN4200B.txt
rnutil get_disk_info >>/data/hdsentinel/hdinfo_RN4200B.txt
/apps/hdsentinel/HDSentinel -r /data/hdsentinel/hdsreport_RN4200B.html -html
rnutil create_system_log -o /data/hdsentinel/log_RN4200B.zip
rsync -a /data/hdsentinel/* rsync://192.168.0.42/hdsentinel

I use HDSentinel (Linux version is free) and the Netgear program rnutil to do this. The shares in question have rsync enabled, of course. I run get_disk_info separately just for convenience, the result is the same as in the log zip. I run something similar on my main NAS and have a regular backup job that copies all the local and remote data to a backup so i can get to it if the main NAS crashes. Just make sure it doesn't copy to the same directory as is used for the local data on that backup, or you'll create a vicious loop (unless you time stamp, as shown below). I could have added the rsync to the script to make sure I'm getting the latest copy on the backup right away, but I didn't bother.

This will overwrite each time. If you want to change that, just add a time/date stamp to the file name as in the following example:

rnutil create_system_log -o /data/hdsentinel/log_RN4200B_$(date +%Y%m%d%H%M%S).zip

Of course, now those will begin to stack up, so you can create another script to trim them (every 5 days in the following example) and run it on the NAS that collects the files, or just add to the script already collecting the data for the main NAS:

#!/bin/sh
find /data/hdsentinel/* -mtime +5 -exec rm {} \;

Then, put them in the appropriate /etc/cron.xxx directory and set them to executable. I put the backup NAS script in cron.hourly to insure it runs each time the unit powers on. Sometimes, that does mean you'd get more than one (especially during a scrub). I recommend you run them manually as a test before putting them in the cron directory.

There may be some other commands that could be useful here that others may want to suggest.

StephenB · ‎2020-12-25

I haven't installed hdsentinal, so I'll need to take a look at that.

You can get quite a bit of info with smartctl (including stuff that isn't in the log zip), and I was thinking about using that.

This gives you a lot of info on the installed disks:

for i in a b c d e f g h i j k l m n; do smartctl -a -x -l defects /dev/sd${i} | egrep -v "local build|No such device|smartmontools"; done >>smart.log

Though if you want to be more selective on the smart stats you can also do something like

for i in a b c d e f g h i j k l m n; do smartctl -a /dev/sd${i} | egrep -i "Device Model|Serial Number|Reallocated_sec|ATA Er|offline_uncorrect|current_pending_sector|Power_on"; done >>smart.log

tailoring the egrep string to include the specific parameters you want to track. @mdgm suggested this version to me a few years ago now (it is also handy in tech support mode).

Sandshark · ‎2020-12-25

@StephenB wrote:
I haven't installed hdsentinal, so I'll need to take a look at that.

I find it to be quite helpful. While I'm sure you can get the same information from smartctl (and more, as it does not show a completely missing drive as your command does), it has a very noce presentation. It's designed to be monitored by a PC running a (paid) version, but it works stand-alone just fine. The line with HDSentinel -solid gives a very brief overview and then HDSentinel -r -html gives a much more detailed report. It's especially useful to me on my main 12-drive NAS and 24-drive (not yet full) external chassis, especially since the external has SAS drives in it. But I sill recommend you take a look.

HDSentinel does have a shortcoming that it only reports use on one partition. That turns out to be the system partition on a ReadyNAS. And while it would be nice to have a report on all partitions, a report on the one that the GUI does not report and can cause catastrophy if it fills is nice.

StephenB · ‎2020-12-25

I downloaded it, and have it on my main NAS and one of the backup ones. It does look useful.

I've developed a starting point for my own script. That script will use rnutil to make the system log on OS 6 systems, otherwise it will zip /var/logs. It will then run smartctl, and will run HDSentinel if it is present. The logs are stored on a local NAS share, and if the NAS is a backup it will also rsync to the main NAS. It should also be possible to combine this with rsync backup jobs (backing up the main share to the local log share).

I still need to test it on one of my 4.1.16 systems, and also I still need to test retention. Though I'm thinking I won't need retention in the script, I can simply delete old files from time to time on the main NAS, and the rsync backups will propagate the deletions to the backups.

StephenB · ‎2021-01-01

I've been working on this over the holiday break, and I have something reasonable that so far is working ok.

Overall, the goal is to capture daily logs from both the main and various backup NAS, and consolidate those in a Logs share on the main NAS. The overall organization is to create a $(hostname) folder for each NAS in the share. Within that there is a folder for each year (e.g., 2021), and each month (2021-01, 2021-02, etc). When run on the main NAS, the script will write the logs directly to the consolidated log share. On the backup NAS, it writes the logs to a local share (LocalLogs), and then rsyncs that to the main NAS. The idea there is to allow me to back up the consolidated log share w/o having any contention (due to the backup jobs running at the same time as the script).

The script applies retention of 7 days to LocalLogs, but does not apply retention to the consolidated log. The idea there is that I don't want to have to manually log into the backup NAS regularly to clean LocalLogs (they are on a power schedule, so that is inconvenient). But it is ok for me to manually prune the consolidated logs. I might do something along the lines of the snapshot thinning used in SMART snaphots later on - not sure.

The system log naming convention is a bit different from the normal ReadyNAS names - I changed it in order to make it sort better (putting the HDSentinel log, the SMART log, and the system log for a given date together).

I designed the script to run both on my legacy 4.1.x NAS and OS-6 - and deliberately used old-school syntax to limit any compatibility issues with the older linux on 4.1.x. Probably over-did that. OS-6 is detected by looking for rnutil. Main vs Backup is detected by looking at the IP address (which is 10.0.0.15 on my main NAS). All my OS-6 NAS have a data volume (even the one running jbod FlexRaid), and of course all legacy NAS have a C volume. So I use those volume names.

The script itself is

#!/bin/sh
#
# set up some useful variables
#
MainNasIP=10.0.0.15
NasIP=`exec hostname -i | awk -F " " '{print $NF}'`
RemoteShareName=Logs
test "$MainNasIP" != "$NasIP" \
   && { ShareName=LocalLogs; \
      Retention=7; } \
   || ShareName=Logs; \
test -e /usr/bin/rnutil && LogShare=/data/$ShareName || LogShare=/c/$ShareName
LogFolder=$LogShare/$(hostname)
HDSentinel=/apps/HDSentinel/HDSentinel
timestamp="$(date +%Y%m%d_%H%M%S)"
RsyncFilter="--include=$(date +%Y)/ --include=$(date +%Y-%m)/*** --exclude=*";
test -e /usr/bin/rnutil && RshParm=-rsh=rsh
#
# make output folder if not there
#
test -d $LogFolder || mkdir $LogFolder
#
# Save Logs in /Logs/hostname/year/year-month
# build the longer folder name in two steps, so mkdir works
#
LogFolder=$LogFolder/$(date +%Y);
test -d $LogFolder || mkdir $LogFolder;
LogFolder=$LogFolder/$(date +%Y-%m);
test -d $LogFolder || mkdir $LogFolder;
#
# get system logs with rnutil on OS-6, otherwise zip /var/logs
# rnutil will create an empty file named "1" in its folder; which is harmless. But let's delete it anyway
# get smartctl data (somewhat different command for OS-6 than OS-4)
#
test -e /usr/bin/rnutil \
   && { rnutil create_system_log -o $LogFolder/$(hostname)-$timestamp-System.zip;\
      rm ./1;\
      for i in a b c d e f g h i j k l m n; do smartctl -a -x -l defects /dev/sd${i} | egrep -v "local build|No such device|smartmontools"; done >>$LogFolder/$(hostname)-$timestamp-Smart.log; }\
   || { /apps/Scripts/diag >/tmp/diagnostics.log;\
      /apps/Scripts/90_CreateLogs;\
      zip -r -j $LogFolder/$(hostname)-$timestamp-System.zip /ramfs/log_zip/*;\
      test -d /ramfs/log_zip && rm -rf /ramfs/log_zip;\
      test -e /tmp/diagnostics.log && rm /tmp/diagnostics.log;\
      for i in a b c d e f g h i j k l m n; do smartctl -a -x /dev/hd${i} | egrep -v "local build|No such device|smartmontools"; done >>/$LogFolder/$(hostname)-$timestamp-Smart.log; }
#
# log HDsentinel info if present
#
test -e $HDSentinel && $HDSentinel -r $LogFolder/$(hostname)-$timestamp-HDSentinel
#
# Apply retention limits if variable set
#
test "$Retention" != "" && find $LogShare/$(hostname)/* -mtime +$Retention -exec rm {} \;
test "$Retention" != "" && find $LogShare/$(hostname) -type d -empty -delete
#
# rsync logs to the main NAS if this is a backup NAS
# this requires that rsync be enabled as read-write on the destination share.
# retention is not being applied to the destination share
#
test "$MainNasIP" != "$NasIP" && rsync $RshParm -amv $RsyncFilter $LogShare/$(hostname)/* $MainNasIP::$RemoteShareName/$(hostname)
exit 0

On OS-6 I chose to run this as a service, and not in a cron job. To do this, you need to put a service and a timer specification into /var/systemd/system. The files I am using are below.

update_logs.service:

[Unit]
Description=Capture Logs Service
After=network-online.target multi-user.target

[Service]
Type=oneshot
RemainAfterExit=no
ExecStart=/apps/Scripts/update_logs

[Install]
WantedBy=multi-user.target

update_logs.timer:

[Unit]
Description=Capture Logs Service

[Timer]
OnCalendar=*-*-* 00:04:00
Persistent=true
Unit=update_logs.service

[Install]
WantedBy=multi-user.target

The services are set up by entering

systemctl enable update_logs
systemctl start update_logs
systemctl enable update_logs.timer
systemctl start update_logs.timer

The timer setting for Persistent is supposed to detect that the service wasn't run because the NAS was off, and run it at the next boot when that is detected. I haven't tested that.

Note that the exit 0 at the end of the script is intentional. If the final test is false, then the script returns an error status. Also, if the rsync fails because the main NAS is down, then the script would also return an error. There are apparently scenarios when systemd will stop running services that repeatedly fail. I don't know for sure if that can happen with a one-shot service, but it seemed best to avoid it.

I'll describe how I am building the system log for the legacy NAS in the next post.

StephenB · ‎2021-01-01

Of course legacy NAS don't support systemctl, so you need to run the main script as a cron job. One aspect here is that the 4.1.16 uses a system-wide cron approach. It will let you create a user cron job using crontab <filename> - but it doesn't actually run that cron job.

Another aspect is that if you look at /var/cron.log you will find that the system always skips the @reboot jobs when the system boots. That is an old debian bug ( /var/run/crond.reboot isn't deleted when the system reboots). I guess you could delete it yourself in the update_logs script, but I decided not to try.

So you end up needing to add an entry in /etc/crontab that specifies the time of day that you want the script to run.

Building the legacy system log was a bit tricky. Originally I was thinking that I'd just zip up /var/log on the legacy 4.1.x NAS. But after looking at that more, I decided it would be better to try to match the actual log file created by the web ui. That took a bit of research. The result only applies to 4.1.x - I have no way to check it on 4.2 or 5.3 NAS.

It turns out that when you download all logs from frontview, the legacy NAS creates a dynamic script to consolidate the logs, and then runs a second script to clean up after the log zip is downloaded. The script that consolidates the logs is around for long enough that you can see it, and copy it. It's called 90_CreateLogs, and it is created in /var/spool/frontview. The script does depend somewhat on how the NAS is set up - in particular, my NV+ (using XRAID) and my Duo (using FlexRAID) have somewhat different files for the RAID configuration. So I began with grabbing the dynamic scripts used on both of those systems.

FWIW, the script consolidates the files into the ram filesystem (I'd expected it to use something in /tmp).

When I ran that script, I discovered that there is one file in the log that couldn't be found - diagnostics.log. That apparently is also run by the web ui before the 90_CreateLogs is run, but I wasn't able to grab the command that creates it. The Netgear version doesn't have enough information to tell exactly what it is running.

Netgear's diagnostic.log:

Disks
-------------------------------
Passed diagnostics.

Memory
-------------------------------
Passed diagnostics.

Network
-------------------------------
Passed diagnostics.

Performance
-------------------------------
* Jumbo frames are disabled on interface 1.  If both your switch and clients support jumbo frames, you can enhance your write performance by enabling jumbo frames on this interface.

Volume
-------------------------------
Passed diagnostics.

So I don't really know what Netgear is running (and testing the volume isn't really useful for me, since I am writing the zip file to the volume). I could just dropped that file from my log zip, but instead I thought it would be useful to have something roughly comparable.

After a bit more sleuthing, I discovered an old manufacturing test on the system, called quicktest. This was definitely old (it is hard-coded to fail if it doesn't find 4 disks, so it was never used for the Duo). But I used it as a starting point - stripping out some things, and adapting the disk test - the original uses badblocks, which won't work on an operational disk. I substituted a cached read test.

What I ended up with outputs this:

================================================================
Infrant Manufacturing Test, Version 1.18
Running on kernel 2.6.17.14ReadyNAS #1 Wed Jun 20 20:08:20 PDT 2012
================================================================

================================================================
Testing onboard network interface...
================================================================
Testing DHCP and ping...done
ONBOARD NIC TEST.......... PASSED

================================================================
Testing Memory...
================================================================
Running quick memory check...done
MEMORY TEST .............. PASSED (Found 256 MB)

================================================================
Testing hard disks...
================================================================
Running quick check on hdc.../dev/hdc: Timing O_DIRECT cached reads: 106 MB in 2.00 seconds = 53.03 MB/sec
Running quick check on hde.../dev/hde: Timing O_DIRECT cached reads: 62 MB in 2.01 seconds = 30.78 MB/sec
Running quick check on hdg.../dev/hdg: Timing O_DIRECT cached reads: 124 MB in 2.02 seconds = 61.27 MB/sec
Running quick check on hdi.../dev/hdi: Timing O_DIRECT cached reads: 106 MB in 2.82 seconds = 37.65 MB/sec
DISK TEST ................ PASSED

================================================================
Testing hardware monitoring...
================================================================
temp 0:22.5
fan 0:2027

HARDWARE MONITOR TEST .... PASSED

================================================================
Checking RTC...
================================================================
RTC TEST ................. PASSED


================================================================
Final Test Summary:
================================================================
RTC TEST ................. PASSED
HARDWARE MONITOR TEST .... PASSED
DISK TEST ................ PASSED
MEMORY TEST .............. PASSED (Found 256 MB)
ONBOARD NIC TEST.......... PASSED

================================================================
TEST RESULT:   PASS
================================================================

I don't know how good these tests are at detecting failures, but it still seemed to give some useful information. I can post my version of this script if there is interest.

Script for making remote NAS diagnostics available locally

Script for making remote NAS diagnostics available locally

Re: Script for making remote NAS diagnostics available locally

Re: Script for making remote NAS diagnostics available locally

Re: Script for making remote NAS diagnostics available locally

Re: Script for making remote NAS diagnostics available locally

Re: Script for making remote NAS diagnostics available locally