Re: Earlier failure detection of disks?

alaeth · ‎2013-11-11

It seems that every time I have a failed drive, I discover the failure in one of two ways:
1. I'm in the wiring closet, see the OLED panel lit up, and investigate, finding the "Vol C: is unprotected" message.
2. I log into Frontview and see the yellow icon next to "Volume" and "Disk".
(hint: I rarely do either)

Take this weekend for example:
Disk 6 decided to die (understandable, it's one of the "original 6"). My first indication was the OLED panel.

After I started the disk replacement, the Frontview status updated with the error:

Sun Nov 10 06:54:33 MST 2013	Disk failure detected.

(an email was sent at the same time)

I expect the emails to be more timely! The system was aware of the disk failure almost 16 hours earlier (from /var/log/syslog):

Nov  9 15:00:32 ReadyNAS01 kernel: ata6.00: disabled

Short of installing Splunk and adding an alert for "Disk failure", does anyone have a script or cron job that checks syslog for failures and sends out an alert? The current implementation seems to be horribly delayed.

fastfwd · ‎2013-11-11

For what it's worth, I'm running OS4.2.24 on my Pro Pioneer with Gmail as my mail server, and I get email notifications of all events instantly.

For example: When I replaced a disk the other day I received a "Disk removal detected" email within a few seconds, "Disk failure detected" and "Disk failure detected - automatic shutdown in 30 minutes" a minute later, and "New disk detected" a minute or two after inserting the new disk.

alaeth · ‎2013-11-11

I'm not talking about the delay between frontview errors/events and the emails; rather, the "disk failure" detection logic itself seems inferior.
The delay on emails is very small... once Frontview has detected an issue. But the OS-level detection seems to happen much earlier.

I have a somewhat hackish solution...

I edited /etc/rc.local to add:

tail -f /var/log/syslog | grep --line-buffered ": failed command: SMART" >> /var/log/SMART.log

(parses the syslog on the fly and dumps any instances of "SMART errors" to a new file)

Then, in a personal cron job, I check the md5sum of the SMART.log file and email myself if it changes.
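
A minimal sketch of that md5sum check, for anyone who wants to try the same approach. The function name `log_changed` and the state-file path in the usage comment are hypothetical, not taken from the setup above, and the sketch assumes `md5sum` from GNU coreutils is available on the box.

```shell
#!/bin/bash
# Sketch only: log_changed and the paths below are hypothetical names.

# log_changed LOGFILE STATEFILE
# Returns 0 if LOGFILE's checksum differs from the one stored in STATEFILE
# (and stores the new checksum). Returns 1 if the file is unchanged, is not
# readable, or this is the first run (which only records a baseline).
log_changed() {
    local logfile=$1
    local statefile=$2
    local current previous

    [ -r "$logfile" ] || return 1
    current=$(md5sum < "$logfile" | cut -d' ' -f1)

    if [ ! -e "$statefile" ]; then
        echo "$current" > "$statefile"   # first run: baseline, no alert
        return 1
    fi

    previous=$(cat "$statefile")
    if [ "$current" != "$previous" ]; then
        echo "$current" > "$statefile"
        return 0
    fi
    return 1
}

# Example cron usage (the address is a placeholder):
# if log_changed /var/log/SMART.log /var/tmp/SMART.log.md5; then
#     printf 'To: ADMIN\nSubject: SMART.log changed\n\n%s\n' \
#         "$(tail -n 20 /var/log/SMART.log)" | /usr/sbin/sendmail admin@example.com
# fi
```

A zero exit status means SMART.log changed since the previous run, so the caller can decide what to send. Separately, with GNU tail, `-F` instead of `-f` keeps following the file after syslog is rotated, and ending the rc.local line with `&` keeps it from blocking the rest of that script.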

fastfwd · ‎2013-11-11

alaeth wrote:
I'm not talking about the delay between frontview errors/events

Right, I wasn't either. But I guess my point was obscured by the list of events I gave. Please imagine that I simply wrote, "When I pulled a disk the other day, I received a 'Disk failure detected' email a minute later."

alaeth · ‎2013-11-12

I see your point: when I started the actual drive replacement, the emails related to "new drive detected" and rebuilding the volume were almost immediate... so I do get quick turn-around on some types of errors/events detected.

But my frustration is with the initial failure detection... there seems to be a huge delay between the OS-level detection (indicated in syslog) and the Frontview software layer emailing me. As far as I know, I have never gotten an email about a drive failure before detecting it myself (either by observing the front panel or by logging into the admin interface and seeing the error).

My goal was to reach out to the community and see if my situation is unique (which warrants a call to technical support), or if it's just "the way it is". And if it's not unique, what work-arounds has the community tried to improve the response time - specific to drives dying.

StephenB · ‎2013-11-12

I've certainly gotten SMART error alerts via email well before the drive failed. My first indication of failure has never been the front panel.

fastfwd · ‎2013-11-12

alaeth wrote:
I do get quick turn-around on some types of errors/events detected. .... My goal was to reach out to the community and see if my situation is unique (which warrants a call to technical support), or if it's just "the way it is". And if it's not unique, what work-arounds has the community tried to improve the response time - specific to drives dying.

The only "failure detected" emails I've ever received have been from removing a drive -- never a spontaneous failure -- so I can't tell whether your experience in that particular situation is unique. But surely that can't be "just the way it is"; it makes no sense to deliberately delay disk-failure emails.

If I were in your position, I'd have already contacted technical support.

alaeth · ‎2013-11-12

fastfwd, you make a good argument. I shall call them when I'm back in town and in front of the device.

In the meantime, I have (what I hope is) a much more robust solution than the default.

Inspired by this post:
http://community.spiceworks.com/scripts ... t-rollover

I've whipped up the following script that runs as a cron job every hour:

#!/bin/bash

# This is a script that will grep a log file and send an email when a specified pattern is encountered.
# Original Author: Salman Bayat
# http://community.spiceworks.com/scripts/show/225-email-notifications-gernerated-via-monitoring-of-logs-that-rollover
#

errors=$(grep -wE ' Disk failure on sd[a-g][0-6], disabling device.|SMART' /var/log/syslog)
#errors=$(grep "System Error Pattern Here" /var/log/yourlogfilehere.log)
echo "${errors}" > /tmp/current-errors.log

# On the first run, create a baseline file so diff has something to compare against.
if [ ! -e "/tmp/prior-errors.log" ]; then
   echo "" > /tmp/prior-errors.log
fi

# Keep only the lines diff marks as added ("+..."), skipping the "+++" file header.
newentries=$(diff --suppress-common-lines -u /tmp/prior-errors.log /tmp/current-errors.log | grep '^+[^+]')

if [ -n "$newentries" ] && [ -z "$errors" ]; then
   # The files differ but the log currently has no matches (e.g. after a rollover): nothing to send.
   :
elif [ -n "$newentries" ]; then
   echo -e "To: ADMIN\nFrom: READYNAS\nSubject: new SYSLOG events detected!\n\nThe following syslog errors have been detected matching your alert pattern:\n\n${errors}" | /usr/sbin/sendmail <my_email_address>
   echo "$errors" > /tmp/prior-errors.log
fi

The beauty of this script is I can customize it to email me whenever anything interesting occurs in the syslog.
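
As one illustration of that customization, the sketch below adds an alternative for the "ata6.00: disabled" kernel line quoted at the start of the thread, which the pattern in the script would not match. The names `ALERT_PATTERN` and `match_events` are hypothetical, and the extra alternative is an assumption about what such lines look like for other ports.

```shell
#!/bin/bash
# Sketch only: ALERT_PATTERN and match_events are hypothetical names.
# The first two alternatives come from the script above; the third is an
# assumption modelled on the "ata6.00: disabled" syslog line.
ALERT_PATTERN='Disk failure on sd[a-g][0-6], disabling device|SMART|ata[0-9]+\.[0-9]+: disabled'

# match_events LOGFILE
# Prints every line of LOGFILE that matches any alternative in ALERT_PATTERN.
match_events() {
    local logfile=$1
    grep -E -- "$ALERT_PATTERN" "$logfile"
}

# Example: errors=$(match_events /var/log/syslog)
```

Keeping the alternatives in one variable means a new event type is a one-line change, and the rest of the script can stay as it is.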