My Samsung 870 EVO 2TB SSD is dying after 13 months of basic workstation operation. It looks like a problem with a large batch, because I found many other users complaining on forums. I am going for an RMA. Fortunately, I restored from my backup.

Lesson learned: SMART needs to be monitored on my home servers. This is not the first time this has happened, and I was lucky enough to see the errors in the system journal in advance.

How to do that? There are multiple options. A shell script ships with the smartmontools package, but I could not get it working, so I ended up with a simple solution:

dnf install smartmontools ssmtp

Now, chances are you have a different MTA already installed, but my SMTP server does not work with esmtp, so I want to use ssmtp instead, which is much more robust and has better logging.

alternatives --set mta /usr/sbin/sendmail.ssmtp

Configure the client in /etc/ssmtp/ssmtp.conf; set Debug=YES to troubleshoot issues if needed:

root=postmaster
mailhub=smtp.example.com:587
RewriteDomain=example.com
Hostname=example.com
UseTLS=NO
UseSTARTTLS=YES
TLS_CA_File=/etc/pki/tls/certs/ca-bundle.crt
Debug=NO
AuthUser=lzap@example.com
AuthPass=xxxxxxxxxxxx
AuthMethod=PLAIN

Send a test email:

echo -e "Subject: Test\n\nHi!" | sendmail -v lzap@example.com
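How `echo -e` handles `\n` depends on the shell, so here is a minimal sketch of the same test message built with `printf`. The helper name `make_test_mail` is made up for illustration and is not part of ssmtp. It prints the headers, one blank line, and then the body, which is the layout sendmail expects.

```shell
# make_test_mail is a hypothetical helper, not part of ssmtp or smartmontools.
# Arguments: recipient, subject, body.
# Output: headers, one blank line, then the body.
make_test_mail() {
    printf 'To: %s\nSubject: %s\n\n%s\n' "$1" "$2" "$3"
}

# Usage (with -t, sendmail takes the recipient from the To: header):
# make_test_mail lzap@example.com "Test" "Hi!" | sendmail -v -t
```

With `-t` the recipient is read from the message itself, so the address only needs to be written once.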

The From address must match the sending account in order to pass the MTA's anti-spam filters. You can set this in /etc/ssmtp/revaliases:

root:lzap@example.com

A dirty shell script will do. Note that the smartctl utility returns its exit status as a bit mask, so finding out whether a drive is healthy is a bit tricky. Luckily, the man page contains an example:

cat /etc/cron.weekly/smart

Something like:

#!/bin/bash
for DRIVE in sda sdb sdc sdd; do
  smartctl -H "/dev/$DRIVE" &>/dev/null
  # Bit 3 (value 8) of the exit status means the drive reports DISK FAILING
  dying=$(($? & 8))
  if [[ $dying -ne 0 ]]; then
    echo "Subject: SMART problem $DRIVE on $(hostname)" | sendmail -v me@example.com
  fi
done
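To make the bit mask less opaque, here is a small sketch that prints every bit set in a smartctl exit status. The function name `decode_smartctl_status` is made up for this post, and the descriptions are my paraphrase of the EXIT STATUS section of the smartctl man page, so check your local man page for the exact wording.

```shell
# decode_smartctl_status is a hypothetical helper for illustration.
# It prints one line per bit set in the given smartctl exit status.
# Descriptions are paraphrased from the EXIT STATUS section of smartctl(8).
decode_smartctl_status() {
    code=$1
    bit=0
    for name in \
        "command line did not parse" \
        "device open failed" \
        "SMART or ATA command failed" \
        "DISK FAILING" \
        "prefail attributes at or below threshold" \
        "attributes at or below threshold in the past" \
        "error log contains errors" \
        "self-test log contains errors"
    do
        # Test bit number $bit by masking with 2^bit
        if [ $(( $code & (1 << $bit) )) -ne 0 ]; then
            echo "bit $bit: $name"
        fi
        bit=$(( $bit + 1 ))
    done
}

# Example: 24 = 8 + 16, so bits 3 and 4 are reported.
decode_smartctl_status 24
```

A status of 0 prints nothing, which is the healthy case.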

It’s dirty, but it should work. Here is a slightly more elaborate version:

#!/bin/bash
RECIPIENT="lzap@example.com"
HOSTNAME=$(hostname)
SMARTCTL="/usr/sbin/smartctl"
SENDMAIL="/usr/sbin/sendmail"

DRIVES=$(lsblk -dno NAME | grep -E '^(sd|nvme)')
for DRIVE in $DRIVES; do
    DEVICE="/dev/$DRIVE"
    DRIVER_FLAG="-d auto"

    if [[ "$HOSTNAME" == "yuki.internal" && "$DEVICE" == "/dev/sda" ]]; then
        DRIVER_FLAG="-d sntjmicron"
    fi

    # Run SMART check (Health -H and All Attributes -A)
    OUTPUT=$($SMARTCTL $DRIVER_FLAG -H -A "$DEVICE" 2>&1)
    EXIT_CODE=$?

    # smartctl exit status bits, see EXIT STATUS in smartctl(8):
    # Bit 0: Command line did not parse
    # Bit 1: Device could not be opened
    # Bit 2: SMART command failed
    # Bit 3: DISK FAILING (Critical) [Value 8]
    # Bit 4: Prefail attributes at or below threshold [Value 16]
    # Bit 5: SMART status OK, but attributes were at or below threshold in the past [Value 32]
    # Bit 6: Error log has errors [Value 64]
    # Bit 7: Self-test log has errors [Value 128] (not checked here)

    # We want to alert if bits 3, 4, 5, or 6 are set (Sum = 120)
    if (( EXIT_CODE & 120 )); then

        # Assign from least to most severe so the most severe label wins
        STATUS_MSG="SMART ALERT"
        (( EXIT_CODE & 64 )) && STATUS_MSG="LOG ERRORS DETECTED"
        (( EXIT_CODE & 32 )) && STATUS_MSG="PAST THRESHOLD WARNING"
        (( EXIT_CODE & 16 )) && STATUS_MSG="PRE-FAIL WARNING"
        (( EXIT_CODE & 8 ))  && STATUS_MSG="CRITICAL FAILURE"

        (
            echo "Subject: [$STATUS_MSG] $DEVICE on $HOSTNAME"
            echo "To: $RECIPIENT"
            echo "MIME-Version: 1.0"
            echo "Content-Type: text/plain; charset=UTF-8"
            echo ""
            echo "Attention: The health monitor detected issues with $DEVICE."
            echo ""
            echo "Detected Bits:"
            (( EXIT_CODE & 8 ))  && echo " - [BIT 3] DISK IS FAILING"
            (( EXIT_CODE & 16 )) && echo " - [BIT 4] PRE-FAIL ATTRIBUTES BELOW THRESHOLD"
            (( EXIT_CODE & 32 )) && echo " - [BIT 5] ATTRIBUTES WERE BELOW THRESHOLD IN THE PAST"
            (( EXIT_CODE & 64 )) && echo " - [BIT 6] ERROR LOG CONTAINS ERRORS"
            echo ""
            echo "--- FULL SMARTCTL OUTPUT ---"
            echo "$OUTPUT"
        ) | $SENDMAIL -t

        echo "Disk Alert for $DEVICE (Exit Code $EXIT_CODE). Email sent." | logger -t disk-monitor
    fi
done
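Since a typo in this kind of script only shows up when a disk is already failing, it may be worth checking the bit handling with made-up exit codes first. The sketch below is one way to do that. The helper names `should_alert` and `alert_label` and the generic "WARNING" label are my assumptions for illustration, not something the script above defines.

```shell
# should_alert and alert_label are hypothetical helpers for a dry run.
# should_alert succeeds when any of bits 3-6 (8 + 16 + 32 + 64 = 120) is set.
should_alert() {
    [ $(( $1 & 120 )) -ne 0 ]
}

# Pick one label for the subject line, most severe condition first.
alert_label() {
    if   [ $(( $1 & 8 )) -ne 0 ];  then echo "CRITICAL FAILURE"
    elif [ $(( $1 & 16 )) -ne 0 ]; then echo "PRE-FAIL WARNING"
    else echo "WARNING"
    fi
}

# Feed a few fake smartctl exit codes through both helpers.
for code in 0 4 8 24 64; do
    if should_alert "$code"; then
        echo "$code: $(alert_label "$code")"
    else
        echo "$code: no alert"
    fi
done
```

Codes 0 and 4 should print "no alert", because bit 2 only says the SMART command failed, while 8, 24 and 64 should each produce a label.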

Update 2025: I had a typo in the script; special thanks to François Le Nalio, who spotted it and reported back. I actually had the typo in my original script, so this could have been another disaster. As if I had not had enough SSD failures in the last three years :-)

Update 2026: Added alternatives and a more elaborate script.