Bash cron monitoring without external dependencies
Cron jobs are the silent workhorses of many systems, handling everything from daily backups and log rotations to critical data synchronization and certificate renewals. When they run smoothly, you barely notice them. But when they fail silently, the consequences can range from minor annoyances to catastrophic data loss or service outages. That's why monitoring your cron jobs is non-negotiable for any robust system.
You might be thinking: "I need to monitor my cron jobs, but I want to keep things simple. Can I do it without introducing new tools or external services? Just plain old Bash and the tools already on my server?"
The answer is yes, you can implement a basic form of cron job monitoring using only Bash and standard Unix utilities. However, it comes with significant caveats and limitations, especially when it comes to detecting the most insidious failure: when a job simply doesn't run at all. This article will walk you through several dependency-free approaches, highlighting their strengths, weaknesses, and the inherent trade-offs.
Why Monitor Cron Jobs?
Before diving into the "how," let's quickly reiterate the "why." You monitor cron jobs to ensure:
- Data Integrity: Critical data pipelines complete successfully.
- Service Availability: Regular tasks like cache invalidation or service restarts happen as expected.
- Security: Certificate renewal jobs run before expiration, keeping your services secure.
- Resource Management: Cleanup scripts prevent disk space exhaustion.
- Operational Efficiency: Catching issues early prevents larger problems and reduces manual intervention.
A cron job that fails silently is a ticking time bomb. You need to know when something goes wrong, and ideally, before it impacts your users or data.
The Core Idea: Self-Reporting Jobs
The fundamental principle behind dependency-free cron monitoring is that the job itself, or a closely related script, is responsible for reporting its status. There's no external observer actively checking in on the job; instead, the job signals its state using local files, logs, or simple network checks.
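As a minimal sketch of the self-reporting idea, a job can be run through a small wrapper that records its own outcome in a local file. The function name `run_and_report`, the `STATE_DIR` default, and the `.status` file layout are assumptions made for illustration, not part of any standard tool.

```shell
#!/bin/bash
# Hypothetical helper: run a command and record its outcome locally.
# STATE_DIR and the "<job>.status" layout are illustrative assumptions.
STATE_DIR="${STATE_DIR:-/tmp/cron_state}"

run_and_report() {
    local job_name="$1"
    shift
    local status
    mkdir -p "$STATE_DIR" || return 1
    # Run the real job; everything after the job name is the command line.
    "$@"
    status=$?
    # One line per job: epoch timestamp of the last run, then its exit status.
    printf '%s %s\n' "$(date +%s)" "$status" > "$STATE_DIR/${job_name}.status"
    return "$status"
}

# Example call from a crontab wrapper (the backup script path is hypothetical):
#   run_and_report nightly_backup /usr/local/bin/backup.sh
```

Anything that later reads the status file can tell when the job last ran and whether it succeeded, but only if the job ran at all.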
Method 1: File-Based Heartbeats
One of the simplest ways to implement a "heartbeat" is by having your cron job touch a file. A separate monitoring script then checks the modification time of this file to determine if the job is running or has recently completed.
How it works:
- Your cron job starts by touch-ing a "start" file.
- It then performs its work.
- Upon successful completion, it touch-es a "success" file.
- A separate cron job (the monitor) periodically checks the "success" file's age. If it's too old, it triggers an alert.
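The age check in the last step could be sketched roughly as follows. The function name `heartbeat_is_fresh` is hypothetical, and the sketch assumes GNU `stat` (the `-c %Y` form); BSD and macOS `stat` use different flags.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit 0) if the heartbeat file was modified
# within max_age seconds, fail (exit 1) if it is older or missing.
heartbeat_is_fresh() {
    local file="$1" max_age="$2"
    local mtime now
    # A missing file counts as stale: the job has never reported success.
    [ -e "$file" ] || return 1
    # GNU stat: %Y prints the modification time as seconds since the epoch.
    mtime=$(stat -c %Y "$file") || return 1
    now=$(date +%s)
    [ $(( now - mtime )) -le "$max_age" ]
}

# Example (path and threshold are illustrative):
#   heartbeat_is_fresh /var/run/cron_monitor/my_queue_processor.success 900 \
#       || echo "heartbeat is stale"
```

Treating a missing file as stale only helps when the monitor already knows which file names to expect; a check that merely scans the directory never sees a job that has not yet written anything.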
Example Implementation:
Let's say you have a cron job that runs every 10 minutes to process a queue, and you expect it to finish within 5 minutes.
Your job (/usr/local/bin/process_queue.sh):
#!/bin/bash
JOB_NAME="my_queue_processor"
STATE_DIR="/var/run/cron_monitor"
mkdir -p "$STATE_DIR"
TOUCH_FILE="$STATE_DIR/${JOB_NAME}.success"
LOCK_FILE="$STATE_DIR/${JOB_NAME}.lock"
# Simple lock to prevent multiple instances
if ( set -o noclobber; echo "$$" > "$LOCK_FILE") 2> /dev/null; then
    trap "rm -f '$LOCK_FILE'; exit" INT TERM EXIT
    echo "$(date): $JOB_NAME started." >> /var/log/${JOB_NAME}.log
    # Perform the actual work
    /usr/bin/php /var/www/my_app/process_queue.php >> /var/log/${JOB_NAME}.log 2>&1
    STATUS=$?
    if [ $STATUS -eq 0 ]; then
        echo "$(date): $JOB_NAME finished successfully." >> /var/log/${JOB_NAME}.log
        touch "$TOUCH_FILE" # Update success timestamp
    else
        echo "$(date): $JOB_NAME failed with status $STATUS." >> /var/log/${JOB_NAME}.log
        # Optionally, touch a separate failure file or send an immediate alert
    fi
    rm -f "$LOCK_FILE"
else
    echo "$(date): $JOB_NAME is already running. Exiting." >> /var/log/${JOB_NAME}.log
    exit 1
fi
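One caveat with the lock above: a trap cannot run if the process is killed with SIGKILL or the machine loses power, so the lock file can be left behind and every later run will exit with "already running." A minimal sketch of a staleness check, assuming the lock file holds the PID written by the job (the function name `lock_is_stale` is hypothetical):

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit 0) if lock_file exists but the PID
# stored in it no longer belongs to a running process.
lock_is_stale() {
    local lock_file="$1" pid
    # No lock file at all: nothing to call stale.
    [ -f "$lock_file" ] || return 1
    pid=$(cat "$lock_file" 2>/dev/null)
    # An empty or non-numeric lock cannot identify a process; treat as stale.
    case "$pid" in
        ''|*[!0-9]*) return 0 ;;
    esac
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    # Caveat: it also fails for a live process owned by another user, and
    # PIDs can be reused, so this is a heuristic rather than a guarantee.
    ! kill -0 "$pid" 2>/dev/null
}

# Example (path is illustrative):
#   lock_is_stale /var/run/cron_monitor/my_queue_processor.lock \
#       && rm -f /var/run/cron_monitor/my_queue_processor.lock
```

Whether to remove a stale lock automatically or only alert on it is a judgment call; automatic removal is convenient but can hide a job that keeps crashing.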
Your monitoring cron job (/etc/cron.d/cron_monitor):
# Run every 5 minutes to check jobs
*/5 * * * * root /usr/local/bin/check_cron_heartbeats.sh
Your monitoring script (/usr/local/bin/check_cron_heartbeats.sh):
```bash
#!/bin/bash
STATE_DIR="/var/run/cron_monitor"
ALERT_EMAIL="admin@example.com"
ALERT_THRESHOLD_MINUTES=15 # If a job hasn't touched its file in this many minutes
find "$STATE_DIR" -maxdepth 1 -name "*.success" -print0 | while IFS= read -r -d $'\0' file; do
    JOB_NAME=$(basename "$file" .success)
    LAST_TOUCH=$(stat -c %Y "$file")
    CURRENT_TIME=$(date +%s)
    AGE_SECONDS=$((CURRENT