add simple periodic task watchdog #7423
Conversation
force-pushed from 1513359 to 1194ccf
force-pushed from 1194ccf to ae6bbac
force-pushed from ae6bbac to 4ed0cb5
```python
overdue = []

for task_name, task_config in beat_schedule.items():
    if task_name == "watchdog":
```
Why do we need both here? Can't we add `watchdog` to the excluded tasks?
```python
except PeriodicTask.DoesNotExist:
    continue

last_run_at = periodic_task.last_run_at
```
This will be updated when the task is dispatched. What it won't tell us is whether the task crashed right after being dispatched, so we could end up with a monitoring solution that reports everything is working fine while the tasks are crashing. Should we add the task results into the solution?
Great catch. That's a huge hole in this current iteration. I'll look into resolving this.
This has been resolved with the new approach based on Celery's task_success signal and the new TaskHealth model.
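A minimal, self-contained sketch of the idea behind the `task_success`-based approach: record the last *successful* completion time rather than the dispatch time. Here `health_store` is a plain dict standing in for the `TaskHealth` model; the real signal wiring (shown in the trailing comment) is an assumption about the implementation, not the actual project code.

```python
from datetime import datetime, timezone

def record_success(health_store, task_name, finished_at=None):
    """Record the last *successful* completion time for a task.

    Celery's task_success signal fires only after a task finishes
    successfully, so this timestamp (unlike last_run_at, which is set
    at dispatch) also catches tasks that crash after being dispatched.
    """
    health_store[task_name] = finished_at or datetime.now(timezone.utc)
    return health_store[task_name]

# The real handler would be wired roughly like this (an assumption;
# TaskHealth fields here are illustrative):
#
#   from celery.signals import task_success
#
#   @task_success.connect
#   def on_task_success(sender=None, **kwargs):
#       TaskHealth.objects.update_or_create(
#           task_name=sender.name,
#           defaults={"last_success_at": timezone.now()},
#       )
```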
```python
expected_next_run: datetime
```

```python
def compute_period(schedule):
```
What happens if a task has a non-uniform schedule, like `cleanup_old_spam`?
I missed that we had a task with a non-uniform schedule. In the case of cleanup_old_spam, it'll return either 2 days or 3 days depending on when it's run. Given that our default value for WATCHDOG_ALLOWED_MISSED_RUNS allows a grace period of 1 skipped run, it should still work. This function was one of the trickiest to work out, but I'll take another look.
The new next_run_after and nth_run_after functions are simpler and work with non-uniform crontab schedules.
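A self-contained sketch of how helpers like these can handle non-uniform schedules: step one run at a time instead of assuming a fixed period. The `remaining_estimate(dt)` method is an assumption borrowed from Celery's `crontab` schedule API; the project's actual functions may differ.

```python
from datetime import datetime, timedelta

def next_run_after(schedule, dt):
    """Return the first scheduled run after `dt`.

    Fixed intervals (timedelta) are handled directly; crontab-style
    schedules are assumed to expose a Celery-like remaining_estimate(dt)
    method, which naturally copes with non-uniform gaps.
    """
    if isinstance(schedule, timedelta):
        return dt + schedule
    return dt + schedule.remaining_estimate(dt)

def nth_run_after(schedule, dt, n):
    """Return the n-th run after `dt`, stepping one run at a time so
    that alternating gaps (e.g. 2 days, then 3 days) are respected."""
    for _ in range(n):
        dt = next_run_after(schedule, dt)
    return dt
```

Stepping run-by-run is what makes a schedule like `cleanup_old_spam` (alternating 2-day/3-day gaps) work without a single "period" value.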
```python
continue

try:
    periodic_task = PeriodicTask.objects.get(name=task_name)
```
Although the number of periodic tasks is small, this introduces an N+1 query. Could we use the `in_bulk` queryset method here?
This has been resolved in the new code as well.
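A sketch of the suggested `in_bulk` fix: fetch every matching `PeriodicTask` row in one query keyed by name, so the per-task `.get()` / `DoesNotExist` handling becomes a dict lookup. The function below takes the model as a parameter so it stays self-contained; the real code would use `PeriodicTask` directly.

```python
def load_periodic_tasks(periodic_task_model, task_names):
    """Replace the per-task .get()/DoesNotExist loop with one query.

    Django's in_bulk(field_name="name") issues a single
    SELECT ... WHERE name IN (...) and returns {name: row}; names with
    no row are simply absent from the dict, which replaces the
    DoesNotExist/continue branch with a .get() check.
    """
    return periodic_task_model.objects.in_bulk(task_names, field_name="name")

# Usage in the watchdog loop would look roughly like:
#
#   tasks_by_name = load_periodic_tasks(PeriodicTask, list(beat_schedule))
#   for task_name in beat_schedule:
#       periodic_task = tasks_by_name.get(task_name)
#       if periodic_task is None:
#           continue
```

Note that `in_bulk(field_name=...)` requires the field to be unique, which holds for `PeriodicTask.name`.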
```python
)

message = "\n\n".join(lines)
send_mail(subject, message, settings.DEFAULT_FROM_EMAIL, recipients)
```
Any reason why we are skipping post office here?
We're still using django-post-office because Django's send_mail gets its connection through our configured email backend of post_office.EmailBackend, which in turn uses django_ses.SESBackend.
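For context, the backend chain described above would be wired in settings roughly like this. This is a sketch of the standard django-post-office configuration pattern, not the project's exact settings file:

```python
# settings.py (illustrative sketch, assuming django-post-office + django-ses)

# Django's send_mail() resolves its connection via EMAIL_BACKEND,
# so plain send_mail() calls still go through post-office's queue.
EMAIL_BACKEND = "post_office.EmailBackend"

# post-office then delegates actual delivery to the SES backend.
POST_OFFICE = {
    "BACKENDS": {
        "default": "django_ses.SESBackend",
    },
}
```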
```python
try:
    redis_conn = redis_client("default")
except RedisError:
```
Do we need a test for this?
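A test for the Redis-failure branch could simulate an outage with a mock. The sketch below is self-contained (it defines its own `RedisError` stand-in and a hypothetical `get_cooldown_conn` helper); the real test would patch `redis_client` in the watchdog module instead.

```python
from unittest import mock

class RedisError(Exception):
    """Stand-in for redis.exceptions.RedisError, so this sketch runs
    without the redis package installed."""

def get_cooldown_conn(redis_client):
    """Hypothetical shape of the guarded call under review: if Redis is
    unavailable, return None so the watchdog alerts without cooldown
    bookkeeping rather than crashing itself."""
    try:
        return redis_client("default")
    except RedisError:
        return None

# A test along the lines the reviewer suggests: simulate an outage.
failing_client = mock.Mock(side_effect=RedisError("connection refused"))
assert get_cooldown_conn(failing_client) is None
```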
```python
def test_sends_alert_for_overdue_task(
    self, MockPeriodicTask, mock_redis_client, mock_send_email, mock_try_alert
):
    from kitsune.sumo.tasks import watchdog
```
Let's move all imports top level unless there's a circular dependency
force-pushed from a4991e6 to ff8c90c
Thanks for the review @akatsoulas! This is ready for another pass. It's so much better now:
mozilla/sumo#2784
This PR provides an option for a simple but effective "watchdog" of all Celery beat tasks -- except for any that are explicitly excluded (`WATCHDOG_EXCLUDED_TASKS`) -- that alerts a list of email recipients (`WATCHDOG_EMAIL_RECIPIENTS`) of all tasks that are currently missing more than a configurable number of scheduled runs (`WATCHDOG_ALLOWED_MISSED_RUNS`). By default, a task can be included in the list of alerts only once per day (`WATCHDOG_ALERT_COOLDOWN_SECONDS`). Any new Celery beat tasks added will automatically be "watched" unless explicitly excluded.
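The settings named in the description might be configured along these lines. The setting names come from the PR; the values below are illustrative only:

```python
# Illustrative watchdog configuration (values are examples, not defaults
# confirmed by the PR, except where noted in the review thread).
WATCHDOG_EXCLUDED_TASKS = ["watchdog"]  # never watch the watchdog itself
WATCHDOG_EMAIL_RECIPIENTS = ["oncall@example.com"]
WATCHDOG_ALLOWED_MISSED_RUNS = 1  # grace period of one skipped run
WATCHDOG_ALERT_COOLDOWN_SECONDS = 60 * 60 * 24  # at most one alert per task per day
```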