How to Alert When Your Scheduled Task Fails
Scheduled tasks are the silent workhorses of modern applications and infrastructure. Whether it's a nightly data sync, a daily report generation, or an hourly cache refresh, these jobs run tirelessly in the background, often out of sight and out of mind. That is, until they fail. And when they do, the consequences can range from stale data and missed insights to critical service outages and unhappy customers.
The problem with "out of sight, out of mind" is that you only realize a task has failed after the fact, typically when someone notices missing data or a system error. At that point, you're reacting to a problem that's already had an impact. As engineers, our goal is to be proactive: to know about a failure the moment it happens, or even better, the moment it should have happened but didn't.
This article will walk you through robust strategies for monitoring your scheduled tasks, focusing on the powerful "heartbeat" approach. We'll explore how to implement this effectively, discuss common pitfalls, and show you how to ensure you're always in the loop when your silent workhorses stumble.
The Problem with Traditional Monitoring
When you set up a cron job or a scheduled task, you might think you're covered if it logs its output or if you've set MAILTO in your crontab. While logs are essential for debugging after a failure, relying solely on them for alerting is reactive. You have to actively check logs, or build complex log parsing and alerting pipelines, just to know if something went wrong.
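For reference, cron's built-in notification looks something like this (it relies on a working mail transfer agent on the host, which many servers don't have configured):

MAILTO="ops@example.com"
0 2 * * * /usr/local/bin/my_backup_script.sh

cron only sends mail when a job produces output, so a script that fails without printing anything, or whose output is redirected to a log file, generates no mail at all.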
Consider a simple cron entry:
0 2 * * * /usr/local/bin/my_backup_script.sh >> /var/log/my_backup.log 2>&1
If my_backup_script.sh fails, its error output goes into /var/log/my_backup.log. But what if:
* The script never starts due to a misconfigured cron entry?
* The server itself goes down before the job can run?
* The script hangs indefinitely, consuming resources but never finishing?
* The log file rotation deletes the relevant error before you see it?
In all these scenarios, your log file might show nothing, or only partial information. You're left blind, assuming everything is fine until a critical system depends on that backup, only to find it's days or weeks old. Exit codes are a fundamental mechanism for indicating success or failure, but they only tell you if the script itself reported a problem; they don't tell you if the script ever ran.
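You can capture the exit code yourself in a small wrapper and alert on failure. Here's a minimal sketch, where the mail command stands in for whatever notification channel you actually use:

#!/bin/bash
# Hypothetical wrapper: run the real job and alert on a non-zero exit code.
/usr/local/bin/my_backup_script.sh >> /var/log/my_backup.log 2>&1
status=$?

if [ "$status" -ne 0 ]; then
    # Swap in your real notification command (mail, pager, chat webhook...).
    echo "my_backup_script.sh exited with status $status" \
        | mail -s "Backup FAILED" ops@example.com
fi

But this wrapper shares the same blind spot: if cron never invokes it, or the host is down, no alert fires. That's exactly the gap the heartbeat approach closes.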
The Heartbeat Approach: Negative Monitoring
The most robust way to monitor scheduled tasks is through a "heartbeat" mechanism. Instead of waiting for a failure notification (which might never come if the job doesn't even start), you set up your task to actively "report in" when it successfully completes.
Here's the core idea:
1. You configure a monitoring system to expect a signal (a "heartbeat") from your scheduled task at a regular interval.
2. Your scheduled task, upon successful completion, sends this heartbeat signal.
3. If the monitoring system does not receive the expected heartbeat within its configured timeframe, it assumes the job has failed and triggers an alert.
This approach is sometimes called "negative monitoring" because you're alerted by the absence of a signal rather than the presence of an error signal. It elegantly solves the problems we discussed:
* Job failed: No heartbeat is sent, so you're alerted.
* Job didn't start: No heartbeat is sent, so you're alerted.
* Job got stuck or hung: No heartbeat arrives within the expected window, so you're alerted.
* Server down: No heartbeat is sent, so you're alerted.
This gives you comprehensive coverage, ensuring you're notified whenever your task doesn't behave exactly as expected.
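To make this concrete, here's a minimal self-hosted sketch of the monitoring side: the job touches a marker file only on success, and a separate watchdog script (run from cron, ideally on another host) compares the file's age against the expected interval. The paths, threshold, and alert command are all illustrative, and GNU coreutils are assumed:

#!/bin/bash
# Hypothetical watchdog: alert if the heartbeat file is older than expected.
# The job's crontab touches the marker only on success, e.g.:
#   0 3 * * * /usr/local/bin/process_daily_data.sh && touch /var/run/daily_data_job.heartbeat

HEARTBEAT_FILE="/var/run/daily_data_job.heartbeat"
MAX_AGE_SECONDS=$((26 * 3600))  # daily job, plus a couple of hours of slack

now=$(date +%s)
last=$(stat -c %Y "$HEARTBEAT_FILE" 2>/dev/null || echo 0)  # 0 if the file doesn't exist yet

if [ $((now - last)) -gt "$MAX_AGE_SECONDS" ]; then
    # Swap in your real notification command.
    echo "No heartbeat from daily_data_job since $(date -d @"$last")" \
        | mail -s "daily_data_job missed its heartbeat" ops@example.com
fi

Note that if the watchdog runs on the same server as the job, an outage silences them both, which is one reason hosted heartbeat services are a popular alternative.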
Implementing Heartbeats Manually
You can implement heartbeats using simple shell commands or by integrating them directly into your application code. The key is to send the heartbeat only when the job has successfully completed its work.
Basic Scripting with curl
For shell scripts or cron jobs, the easiest way to send a heartbeat is using curl. You'll need a unique URL for each job you want to monitor.
Let's say you have a daily data processing script, /usr/local/bin/process_daily_data.sh, that should run every day at 3 AM. You've obtained a unique heartbeat URL, https://your-monitoring-service.com/heartbeat/daily_data_job_id, from your monitoring system.
You can modify your crontab entry like this:
0 3 * * * /usr/local/bin/process_daily_data.sh && curl -fsS --retry 3 --retry-delay 5 "https://your-monitoring-service.com/heartbeat/daily_data_job_id"
Let's break this down:
* 0 3 * * *: The job runs daily at 3 AM.
* /usr/local/bin/process_daily_data.sh: This is your actual script.
* &&: This is a crucial shell operator. The curl command will only execute if the preceding command (process_daily_data.sh) exits with a status code of 0 (indicating success). If process_daily_data.sh fails, the curl command is skipped, and no heartbeat is sent.
* curl -fsS:
* -f: Fail with a non-zero exit code on HTTP errors (4xx/5xx) instead of printing the server's error page.
* -s: Silent mode (don't show progress meter or error messages).
* -S: Show an error message when curl itself fails, even with -s active (useful for diagnosing problems reaching the heartbeat URL; -f covers HTTP-level errors).
* --retry 3 --retry-delay 5: These options make the curl command more resilient to transient network issues. It will retry up to 3 times, waiting 5 seconds between each attempt. This prevents a temporary network glitch from mistakenly marking your job as failed.
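Some monitoring services also accept an explicit failure signal alongside the success heartbeat, which alerts you immediately instead of waiting for the heartbeat window to lapse. Assuming a hypothetical /fail endpoint on the same service, a small wrapper could report both outcomes:

#!/bin/bash
# Hypothetical wrapper: ping a success or failure endpoint based on the job's exit code.
BASE_URL="https://your-monitoring-service.com/heartbeat/daily_data_job_id"

if /usr/local/bin/process_daily_data.sh; then
    curl -fsS --retry 3 --retry-delay 5 "$BASE_URL"
else
    # The /fail suffix is illustrative; check your service's documentation.
    curl -fsS --retry 3 --retry-delay 5 "$BASE_URL/fail"
fi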
What if your script has multiple steps?
If your process_daily_data.sh script itself contains multiple commands, include set -e at the top of the script. This makes the script exit immediately when any command fails, so later steps don't run against incomplete data and the script can't exit with status 0, which would let the && curl in cron report success after a partial failure.
#!/bin/bash
set -e # Exit immediately if a command exits with a non-zero status.
echo "Starting daily data processing..."
# Step 1: Download data
wget -q -O /tmp/data.csv "https://example.com/data.csv"
# Step 2: Process data
python /usr/local/bin/process_csv.py /tmp/data.csv
# Step 3: Upload results (batch mode via -b - makes a failed put abort with a non-zero exit)
sftp -b - user@remote.server <<< "put /tmp/processed_data.csv /uploads/"
echo "Daily data processing completed successfully."
# The script will only reach here if all previous commands succeeded.
With set -e, if wget fails, the script exits with a non-zero status, the && curl in cron is skipped, no heartbeat is sent, and your monitoring system alerts you.
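One caveat worth knowing: set -e does not catch a failure in the middle of a pipeline, because bash takes a pipeline's status from its last command. If your script pipes data between commands, add set -o pipefail as well:

#!/bin/bash
set -e
set -o pipefail  # a failure anywhere in a pipeline now fails the whole script

# Without pipefail, a failed download would still count as success here,
# because tail (the last command in the pipeline) exits 0 on empty input.
wget -qO- "https://example.com/data.csv" | tail -n +2 > /tmp/data.csv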