
fix: scope downloaded_files per crawl to prevent race condition (#1889) #1891

Open
hafezparast wants to merge 1 commit into unclecode:develop from hafezparast:fix/maysam-download-race-condition-1889

Conversation

@hafezparast
Contributor

Summary

  • Fixes a race condition where downloaded_files from URL A leak into URL B's CrawlResult when downloads are slow
  • Root cause: self._downloaded_files was shared mutable state with fire-and-forget asyncio.create_task download handlers — no synchronization between concurrent crawls
  • Fix: replace the shared list with a dict[crawl_id, list] keyed by a unique ID per _crawl_web invocation, passed into _handle_download via closure

Fixes #1889

The Race (before)

URL A navigation → browser triggers download
URL A _crawl_web returns → downloaded_files = [] (download still in progress)
URL B _crawl_web starts → self._downloaded_files = [] (reset)
URL A's download completes → appends to self._downloaded_files
URL B _crawl_web returns → downloaded_files = [URL_A's_file] ← WRONG
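The sequence above can be reproduced with a minimal, self-contained asyncio model. Note this is a hypothetical simplification for illustration: `BuggyStrategy`, its method names, the timings, and the file names are invented here and are not the real crawl4ai code.

```python
import asyncio

class BuggyStrategy:
    """Hypothetical simplified model of the pre-fix behavior."""

    def __init__(self):
        self._downloaded_files = []   # shared mutable state on self
        self._tasks = set()           # keep strong refs to fire-and-forget tasks

    async def _handle_download(self, url, delay):
        await asyncio.sleep(delay)                    # download still in flight
        self._downloaded_files.append(f"{url}.bin")   # appends to whatever list is *current*

    async def crawl(self, url, download_delay, crawl_time):
        self._downloaded_files = []                   # reset clobbers in-flight state
        task = asyncio.create_task(self._handle_download(url, download_delay))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        await asyncio.sleep(crawl_time)               # stand-in for page navigation
        return list(self._downloaded_files)

async def main():
    s = BuggyStrategy()
    a = await s.crawl("urlA", download_delay=0.05, crawl_time=0.01)  # slow download
    b = await s.crawl("urlB", download_delay=0.0, crawl_time=0.1)    # A's file lands here
    return a, b

result_a, result_b = asyncio.run(main())
print(result_a)  # [] -- URL A's download missed its own crawl
print(result_b)  # contains 'urlA.bin' leaked into URL B's result
```

Because URL A's handler resolves `self._downloaded_files` only when the download completes, the append lands on the fresh list that URL B's crawl just created.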

After Fix

URL A gets crawl_id="abc" → downloads scoped to _downloads_by_crawl["abc"]
URL B gets crawl_id="xyz" → downloads scoped to _downloads_by_crawl["xyz"]
Late download for "abc" → goes to "abc"'s list (or logged if already popped)
No cross-contamination possible
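The scoped-dict pattern can be sketched in the same simplified model (again hypothetical names; the actual change lives in crawl4ai/async_crawler_strategy.py):

```python
import asyncio
import uuid

class FixedStrategy:
    """Hypothetical simplified model of the post-fix behavior."""

    def __init__(self):
        self._downloads_by_crawl = {}  # crawl_id -> list of downloaded files
        self._tasks = set()            # keep strong refs to fire-and-forget tasks

    async def _handle_download(self, crawl_id, url, delay):
        await asyncio.sleep(delay)
        bucket = self._downloads_by_crawl.get(crawl_id)
        if bucket is None:
            # crawl already finished and popped its list: log, don't contaminate
            print(f"late download for crawl {crawl_id}: {url}.bin (dropped)")
        else:
            bucket.append(f"{url}.bin")

    async def crawl(self, url, download_delay, crawl_time):
        crawl_id = uuid.uuid4().hex                   # unique per invocation
        self._downloads_by_crawl[crawl_id] = []
        task = asyncio.create_task(
            self._handle_download(crawl_id, url, download_delay)
        )
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        await asyncio.sleep(crawl_time)
        return self._downloads_by_crawl.pop(crawl_id)  # scope ends with the crawl

async def main():
    s = FixedStrategy()
    a = await s.crawl("urlA", download_delay=0.05, crawl_time=0.01)  # download is late
    b = await s.crawl("urlB", download_delay=0.0, crawl_time=0.1)
    return a, b

result_a, result_b = asyncio.run(main())
print(result_a)  # [] -- A's late file is logged, not misattributed
print(result_b)  # ['urlB.bin'] only
```

Each crawl owns its own list, so a late completion can at worst be dropped with a warning, never attached to another crawl's result.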

Changes

  • crawl4ai/async_crawler_strategy.py: Replace self._downloaded_files (list) with self._downloads_by_crawl (dict keyed by crawl_id). _handle_download now takes a crawl_id parameter. Late downloads that complete after their crawl finished are logged with a warning instead of silently contaminating the next crawl.
  • tests/test_download_race_condition_1889.py: 9 tests covering isolation, late downloads, concurrent race simulation

Test plan

  • 9 unit tests passing (pytest tests/test_download_race_condition_1889.py -v)
  • Two crawls with isolated downloads verified
  • Late download after crawl finishes — logged, not attached to wrong result
  • Concurrent race simulation — slow download on A, fast B, no cross-contamination
  • Integration test with real browser downloads (needs Playwright)

🤖 Generated with Claude Code

…tamination (unclecode#1889)

_downloaded_files was shared mutable state on self with no synchronization.
When a slow download from URL A completed after URL B started, the file
appeared on URL B's CrawlResult instead of URL A's.

Replace the shared list with a dict keyed by crawl invocation ID:
- Each _crawl_web call gets a unique crawl_id
- Downloads are scoped to their originating crawl_id via closure
- Late downloads (after the crawl popped its list) are logged but don't
  contaminate other crawl results

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@discopops left a comment


Approved. The regression harness change is minimal, the lockfile churn was cleaned, and the race-condition fix validates with pytest -q tests/test_download_race_condition_1889.py.
