add simple periodic task watchdog by escattone · Pull Request #7423 · mozilla/kitsune

escattone · 2026-04-10T22:00:19Z

This PR adds an optional, simple but effective "watchdog" for all Celery beat tasks, except those explicitly excluded (WATCHDOG_EXCLUDED_TASKS). It emails a list of recipients (WATCHDOG_EMAIL_RECIPIENTS) about every task that is currently missing more than a configurable number of scheduled runs (WATCHDOG_ALLOWED_MISSED_RUNS). By default, a task can appear in an alert only once per day (WATCHDOG_ALERT_COOLDOWN_SECONDS).

Any new Celery beat tasks added will automatically be "watched" unless explicitly excluded.
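
As a rough illustration of how the four settings described above could interact, here is a minimal sketch. It is not the code from this PR: the `tasks_to_alert` function, the setting values, the `example_excluded_task` name, and the in-memory `last_alerted` dict (standing in for whatever store tracks the cooldown) are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical values for the settings named in the description.
WATCHDOG_EXCLUDED_TASKS = {"example_excluded_task"}
WATCHDOG_ALLOWED_MISSED_RUNS = 1
WATCHDOG_ALERT_COOLDOWN_SECONDS = 24 * 60 * 60  # at most one alert per day


def tasks_to_alert(missed_runs, last_alerted, now):
    """Return the sorted task names that should appear in this alert.

    missed_runs: task name -> number of scheduled runs currently missed.
    last_alerted: task name -> datetime of the previous alert (updated here).
    """
    cooldown = timedelta(seconds=WATCHDOG_ALERT_COOLDOWN_SECONDS)
    alerts = []
    for name, missed in sorted(missed_runs.items()):
        if name in WATCHDOG_EXCLUDED_TASKS:
            continue  # explicitly excluded from watching
        if missed <= WATCHDOG_ALLOWED_MISSED_RUNS:
            continue  # still within the grace allowance
        previous = last_alerted.get(name)
        if previous is not None and now - previous < cooldown:
            continue  # already alerted within the cooldown window
        last_alerted[name] = now
        alerts.append(name)
    return alerts


now = datetime(2026, 5, 7, 12, 0, tzinfo=timezone.utc)
missed = {"a": 3, "b": 1, "example_excluded_task": 5}
print(tasks_to_alert(missed, {}, now))
```

Because every task is considered unless it is in the excluded set, a newly added beat task is covered without any extra configuration, which matches the behavior described above.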

akatsoulas · 2026-05-06T12:41:54Z

+    overdue = []
+
+    for task_name, task_config in beat_schedule.items():
+        if task_name == "watchdog":


Why do we need both here? Can't we add watchdog to the excluded tasks?

akatsoulas · 2026-05-06T12:43:49Z

+        except PeriodicTask.DoesNotExist:
+            continue
+
+        last_run_at = periodic_task.last_run_at


last_run_at is updated when the task is dispatched, so it won't tell us whether the task crashed right after dispatching. We could end up with a monitoring solution that reports everything is working fine while the tasks are actually crashing. Should we add the task results into the solution?

Great catch. That's a huge hole in this current iteration. I'll look into resolving this.

This has been resolved with the new approach based on Celery's task_success signal and the new TaskHealth model.
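
A minimal sketch of the signal-based idea, for readers unfamiliar with it. Celery's `task_success` signal is dispatched only when a task finishes successfully, with the task object as `sender`. Everything else here is an assumption: the `record_success` handler name and the in-memory `LAST_SUCCESS` dict, which merely stands in for the TaskHealth model whose fields are not shown in this thread.

```python
from datetime import datetime, timezone

# In-memory stand-in for the TaskHealth model (hypothetical shape).
LAST_SUCCESS = {}


def record_success(sender=None, **kwargs):
    """Record the completion time of a task that finished successfully.

    Intended as a receiver for Celery's task_success signal, where
    `sender` is the task object that ran.
    """
    name = getattr(sender, "name", None)
    if name is None:
        return
    LAST_SUCCESS[name] = datetime.now(timezone.utc)


# In a real project the handler would be connected to the signal:
#   from celery.signals import task_success
#   task_success.connect(record_success)
```

Since the handler never fires for a task that raises, a task that is dispatched on schedule but crashes every time simply stops updating its record and eventually shows up as overdue.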

akatsoulas · 2026-05-06T12:45:10Z

+    expected_next_run: datetime
+
+
+def compute_period(schedule):


What happens if a task has a non-uniform schedule, like cleanup_old_spam?

I missed that we had a task with a non-uniform schedule. In the case of cleanup_old_spam, it'll return either 2 days or 3 days depending on when it's run. Given that our default value for WATCHDOG_ALLOWED_MISSED_RUNS allows a grace period of 1 skipped run, it should still work. This function was one of the trickiest to work out, but I'll take another look.

The new next_run_after and nth_run_after functions are simpler and work with non-uniform crontab schedules.
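
To illustrate how a pair of helpers like these can cope with uneven gaps, here is a minimal sketch. It is not the PR's implementation: the signatures, the `weekdays` and `hour` parameters, and the Monday/Thursday schedule are assumptions made for the example, and it ignores time zones and DST.

```python
from datetime import datetime, timedelta


def next_run_after(moment, weekdays, hour):
    """Return the first scheduled run strictly after `moment`.

    Hypothetical schedule: fires at `hour`:00 on each weekday in `weekdays`
    (Monday is 0, as in datetime.weekday()).
    """
    candidate = moment.replace(hour=hour, minute=0, second=0, microsecond=0)
    # Eight candidates cover the case where this week's only run has passed.
    for _ in range(8):
        if candidate > moment and candidate.weekday() in weekdays:
            return candidate
        candidate += timedelta(days=1)
    raise ValueError("schedule never fires")


def nth_run_after(moment, n, weekdays, hour):
    """Return the n-th scheduled run strictly after `moment` (n >= 1)."""
    run = moment
    for _ in range(n):
        run = next_run_after(run, weekdays, hour)
    return run


last_success = datetime(2026, 5, 5, 10, 0)  # a Tuesday
print(next_run_after(last_success, {0, 3}, 4))
print(nth_run_after(last_success, 2, {0, 3}, 4))
```

With runs on Mondays and Thursdays the gaps alternate between three and four days, yet walking the schedule forward run by run still yields a definite deadline. One plausible overdue rule is then to compare the current time against the run at position WATCHDOG_ALLOWED_MISSED_RUNS + 1 after the last success, with no need to compute a single "period".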

akatsoulas · 2026-05-06T13:06:32Z

+            continue
+
+        try:
+            periodic_task = PeriodicTask.objects.get(name=task_name)


Although the number of periodic tasks is small, this introduces an N+1 query. Could we use the queryset's in_bulk here?

This has been resolved in the new code as well.

akatsoulas · 2026-05-06T13:07:30Z

+        )
+
+    message = "\n\n".join(lines)
+    send_mail(subject, message, settings.DEFAULT_FROM_EMAIL, recipients)


Any reason why we are skipping post office here?

We're still using django-post-office, because Django's send_mail gets its connection through our configured email backend, post_office.EmailBackend, which in turn uses django_ses.SESBackend.
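
For context, a backend chain like the one described is typically expressed with two settings. The fragment below is a sketch of that shape, based only on the backend names mentioned in the reply and on django-post-office's documented `POST_OFFICE["BACKENDS"]` option. It is not copied from the project's settings.

```python
# Django settings sketch (assumed shape, not the project's actual file).

# send_mail() obtains its connection from this backend, so messages are
# handed to django-post-office instead of being sent directly.
EMAIL_BACKEND = "post_office.EmailBackend"

# django-post-office then delivers queued mail through its own backend.
POST_OFFICE = {
    "BACKENDS": {
        "default": "django_ses.SESBackend",
    },
}
```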

akatsoulas · 2026-05-06T13:08:01Z

+
+    try:
+        redis_conn = redis_client("default")
+    except RedisError:


Do we need a test for this?

akatsoulas · 2026-05-06T13:08:41Z

+    def test_sends_alert_for_overdue_task(
+        self, MockPeriodicTask, mock_redis_client, mock_send_email, mock_try_alert
+    ):
+        from kitsune.sumo.tasks import watchdog


Let's move all imports to the top level unless there's a circular dependency.

escattone · 2026-05-07T23:21:38Z

Thanks for the review @akatsoulas! This is ready for another pass. It's so much better now:

Uses Celery's task_success signal to record periodic task completions ONLY if they succeed.
Can handle non-uniform crontab schedules.
Thorough tests.
Provides a read-only admin interface.
Re-organized as its own app.

escattone force-pushed the watchdog-2784 branch 12 times, most recently from 1513359 to 1194ccf (April 13, 2026 22:07)

escattone marked this pull request as ready for review April 13, 2026 22:33

escattone force-pushed the watchdog-2784 branch from 1194ccf to ae6bbac (April 13, 2026 22:44)

escattone requested a review from akatsoulas April 13, 2026 22:52

escattone force-pushed the watchdog-2784 branch from ae6bbac to 4ed0cb5 (April 28, 2026 20:27)

akatsoulas requested changes May 6, 2026

add simple periodic task watchdog

1d92a46

escattone force-pushed the watchdog-2784 branch 2 times, most recently from a4991e6 to ff8c90c (May 7, 2026 23:12)

significantly improved watchdog (now an app)

e2d296a

escattone force-pushed the watchdog-2784 branch from ff8c90c to e2d296a (May 7, 2026 23:15)

escattone requested a review from akatsoulas May 7, 2026 23:21

mozilla deleted a comment from 03637213 May 8, 2026
