Skip to content

add simple periodic task watchdog#7423

Open
escattone wants to merge 2 commits intomozilla:mainfrom
escattone:watchdog-2784
Open

add simple periodic task watchdog#7423
escattone wants to merge 2 commits intomozilla:mainfrom
escattone:watchdog-2784

Conversation

@escattone
Copy link
Copy Markdown
Contributor

@escattone escattone commented Apr 10, 2026

mozilla/sumo#2784

This PR provides an option for a simple but effective "watchdog" of all Celery beat tasks -- except for any that are explicitly excluded (WATCHDOG_EXCLUDED_TASKS) -- that alerts a list of email recipients (WATCHDOG_EMAIL_RECIPIENTS) of all tasks that are currently missing more than a configurable number of scheduled runs (WATCHDOG_ALLOWED_MISSED_RUNS). By default, a task can be included in the list of alerts only once per day (WATCHDOG_ALERT_COOLDOWN_SECONDS).

Any new Celery beat tasks added will automatically be "watched" unless explicitly excluded.

@escattone escattone force-pushed the watchdog-2784 branch 12 times, most recently from 1513359 to 1194ccf Compare April 13, 2026 22:07
@escattone escattone marked this pull request as ready for review April 13, 2026 22:33
@escattone escattone requested a review from akatsoulas April 13, 2026 22:52
Comment thread kitsune/sumo/watchdog.py Outdated
overdue = []

for task_name, task_config in beat_schedule.items():
if task_name == "watchdog":
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need both here? Can't we add watchdog in excluded?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread kitsune/sumo/watchdog.py Outdated
except PeriodicTask.DoesNotExist:
continue

last_run_at = periodic_task.last_run_at
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be updated when the task is dispatched. Probably what it won't tell us is if the task has crashed right after dispatching. This way we can end up with a monitoring solution that reports that everything is working fine but the tasks are crashing. Should we add the task results into the solution?

Copy link
Copy Markdown
Contributor Author

@escattone escattone May 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch. That's a huge hole in this current iteration. I'll look into resolving this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been resolved with the new approach based on Celery's task_success signal and the new TaskHealth model.

Comment thread kitsune/sumo/watchdog.py Outdated
expected_next_run: datetime


def compute_period(schedule):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if a task has a non uniform schedule like the cleanup_old_spam?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed that we had a task with a non-uniform schedule. In the case of cleanup_old_spam, it'll return either 2 days or 3 days depending on when it's run. Given that our default value for WATCHDOG_ALLOWED_MISSED_RUNS allows a grace period of 1 skipped run, it should still work. This function was one of the trickiest to work out, but I'll take another look.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new next_run_after and nth_run_after functions are simpler and work with non-uniform crontab schedules.

Comment thread kitsune/sumo/watchdog.py Outdated
continue

try:
periodic_task = PeriodicTask.objects.get(name=task_name)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although the number of periodic tasks is small, this introduces a N+1 query. Could we use here in_bulk queryset/

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been resolved in the new code as well.

Comment thread kitsune/sumo/watchdog.py Outdated
)

message = "\n\n".join(lines)
send_mail(subject, message, settings.DEFAULT_FROM_EMAIL, recipients)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why we are skipping here post office?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're still using django-post-office because Django's send_mail gets it connection through our configured email backend of post_office.EmailBackend, which in turn uses django_ses.SESBackend.

Comment thread kitsune/sumo/tasks.py Outdated

try:
redis_conn = redis_client("default")
except RedisError:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need a test for this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

Comment thread kitsune/sumo/tests/test_watchdog.py Outdated
def test_sends_alert_for_overdue_task(
self, MockPeriodicTask, mock_redis_client, mock_send_email, mock_try_alert
):
from kitsune.sumo.tasks import watchdog
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move all imports top level unless there's a circular dependency

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@escattone escattone force-pushed the watchdog-2784 branch 2 times, most recently from a4991e6 to ff8c90c Compare May 7, 2026 23:12
@escattone
Copy link
Copy Markdown
Contributor Author

Thanks for the review @akatsoulas! This is ready for another pass. It's so much better now:

  • Uses Celery's task_success signal to record periodic task completions ONLY if they succeed.
  • Can handle non-uniform crontab schedules.
  • Thorough tests.
  • Provides a read-only admin interface.
  • Re-organized as its own app.

@escattone escattone requested a review from akatsoulas May 7, 2026 23:21
@mozilla mozilla deleted a comment from 03637213 May 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants