
fix: scope downloaded_files per crawl to prevent race condition (#1889) #1891

Open
hafezparast wants to merge 1 commit into unclecode:develop from hafezparast:fix/maysam-download-race-condition-1889

Conversation

@hafezparast
Contributor

Summary

  • Fixes a race condition where downloaded_files from URL A leak into URL B's CrawlResult when downloads are slow
  • Root cause: self._downloaded_files was shared mutable state with fire-and-forget asyncio.create_task download handlers — no synchronization between concurrent crawls
  • Fix: replace the shared list with a dict[crawl_id, list] keyed by a unique ID per _crawl_web invocation, passed into _handle_download via closure

Fixes #1889

The Race (before)

URL A navigation → browser triggers download
URL A _crawl_web returns → downloaded_files = [] (download still in progress)
URL B _crawl_web starts → self._downloaded_files = [] (reset)
URL A's download completes → appends to self._downloaded_files
URL B _crawl_web returns → downloaded_files = [URL_A's_file] ← WRONG
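The sequence above can be reproduced with a minimal, self-contained asyncio model. Note this is a hypothetical simplification for illustration: `BuggyStrategy`, its method names, the timings, and the file names are invented here and are not the real crawl4ai code.

```python
import asyncio

class BuggyStrategy:
    """Hypothetical simplified model of the pre-fix behavior."""

    def __init__(self):
        self._downloaded_files = []   # shared mutable state on self
        self._tasks = set()           # keep strong refs to fire-and-forget tasks

    async def _handle_download(self, url, delay):
        await asyncio.sleep(delay)                    # download still in flight
        self._downloaded_files.append(f"{url}.bin")   # appends to whatever list is *current*

    async def crawl(self, url, download_delay, crawl_time):
        self._downloaded_files = []                   # reset clobbers in-flight state
        task = asyncio.create_task(self._handle_download(url, download_delay))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        await asyncio.sleep(crawl_time)               # stand-in for page navigation
        return list(self._downloaded_files)

async def main():
    s = BuggyStrategy()
    a = await s.crawl("urlA", download_delay=0.05, crawl_time=0.01)  # slow download
    b = await s.crawl("urlB", download_delay=0.0, crawl_time=0.1)    # A's file lands here
    return a, b

result_a, result_b = asyncio.run(main())
print(result_a)  # [] -- URL A's download missed its own crawl
print(result_b)  # contains 'urlA.bin' leaked into URL B's result
```

Because URL A's handler resolves `self._downloaded_files` only when the download completes, the append lands on the fresh list that URL B's crawl just created.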

After Fix

URL A gets crawl_id="abc" → downloads scoped to _downloads_by_crawl["abc"]
URL B gets crawl_id="xyz" → downloads scoped to _downloads_by_crawl["xyz"]
Late download for "abc" → goes to "abc"'s list (or logged if already popped)
No cross-contamination possible
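The scoped-dict pattern can be sketched in the same simplified model (again hypothetical names; the actual change lives in crawl4ai/async_crawler_strategy.py):

```python
import asyncio
import uuid

class FixedStrategy:
    """Hypothetical simplified model of the post-fix behavior."""

    def __init__(self):
        self._downloads_by_crawl = {}  # crawl_id -> list of downloaded files
        self._tasks = set()            # keep strong refs to fire-and-forget tasks

    async def _handle_download(self, crawl_id, url, delay):
        await asyncio.sleep(delay)
        bucket = self._downloads_by_crawl.get(crawl_id)
        if bucket is None:
            # crawl already finished and popped its list: log, don't contaminate
            print(f"late download for crawl {crawl_id}: {url}.bin (dropped)")
        else:
            bucket.append(f"{url}.bin")

    async def crawl(self, url, download_delay, crawl_time):
        crawl_id = uuid.uuid4().hex                   # unique per invocation
        self._downloads_by_crawl[crawl_id] = []
        task = asyncio.create_task(
            self._handle_download(crawl_id, url, download_delay)
        )
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        await asyncio.sleep(crawl_time)
        return self._downloads_by_crawl.pop(crawl_id)  # scope ends with the crawl

async def main():
    s = FixedStrategy()
    a = await s.crawl("urlA", download_delay=0.05, crawl_time=0.01)  # download is late
    b = await s.crawl("urlB", download_delay=0.0, crawl_time=0.1)
    return a, b

result_a, result_b = asyncio.run(main())
print(result_a)  # [] -- A's late file is logged, not misattributed
print(result_b)  # ['urlB.bin'] only
```

Each crawl owns its own list, so a late completion can at worst be dropped with a warning, never attached to another crawl's result.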

Changes

  • crawl4ai/async_crawler_strategy.py: Replace self._downloaded_files (list) with self._downloads_by_crawl (dict keyed by crawl_id). _handle_download now takes a crawl_id parameter. Late downloads that complete after their crawl finished are logged with a warning instead of silently contaminating the next crawl.
  • tests/test_download_race_condition_1889.py: 9 tests covering isolation, late downloads, concurrent race simulation

Test plan

  • 9 unit tests passing (pytest tests/test_download_race_condition_1889.py -v)
  • Two crawls with isolated downloads verified
  • Late download after crawl finishes — logged, not attached to wrong result
  • Concurrent race simulation — slow download on A, fast B, no cross-contamination
  • Integration test with real browser downloads (needs Playwright)

🤖 Generated with Claude Code

…tamination (unclecode#1889)

_downloaded_files was shared mutable state on self with no synchronization.
When a slow download from URL A completed after URL B started, the file
appeared on URL B's CrawlResult instead of URL A's.

Replace the shared list with a dict keyed by crawl invocation ID:
- Each _crawl_web call gets a unique crawl_id
- Downloads are scoped to their originating crawl_id via closure
- Late downloads (after the crawl popped its list) are logged but don't
  contaminate other crawl results

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@discopops left a comment


Approved. The regression harness change is minimal, the lockfile churn was cleaned, and the race-condition fix validates with pytest -q tests/test_download_race_condition_1889.py.
