feat: DocumentExtractionStrategy for binary documents (#1890) by hafezparast · Pull Request #1896 · unclecode/crawl4ai

hafezparast · 2026-04-03T09:03:37Z

Addresses #1890

…clecode#1890)

Add a pluggable pipeline stage that detects binary documents (PDF, DOCX, XLSX, etc.) after browser navigation but before HTML content scraping. When configured, the strategy's detect() method checks the AsyncCrawlResponse (headers, downloaded_files, status code). If a document is detected, extract() runs instead of the HTML pipeline, producing a CrawlResult with markdown content directly from the document.

- New abstract base: DocumentExtractionStrategy with detect() + extract()
- New dataclass: DocumentExtractionResult (content, content_type, metadata)
- New param: CrawlerRunConfig.document_extraction_strategy (default None)
- Integration point in arun() before aprocess_html()
- No breaking changes: defaults to None, existing behavior unchanged
- No new dependencies: users bring their own extraction backend

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
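For reviewers who want a picture of the detect()/extract() flow described above, here is a minimal, self-contained sketch. It does not import crawl4ai and is not the code in this PR. `StubResponse`, `PdfHeaderStrategy`, and `process` are hypothetical names for illustration; the attribute names on the stub, the synchronous method signatures, and the field types are all assumptions, since the description only names the classes, the two methods, and the three result fields.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class DocumentExtractionResult:
    # Field names come from the PR description; the types are assumed.
    content: str
    content_type: str
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class StubResponse:
    # Hypothetical stand-in for AsyncCrawlResponse; attribute names are assumed.
    status_code: int = 200
    headers: Dict[str, str] = field(default_factory=dict)
    downloaded_files: Optional[List[str]] = None


class DocumentExtractionStrategy(ABC):
    # Sketch of the abstract base; the real signatures may differ (e.g. async).
    @abstractmethod
    def detect(self, response: StubResponse) -> bool:
        ...

    @abstractmethod
    def extract(self, response: StubResponse) -> DocumentExtractionResult:
        ...


class PdfHeaderStrategy(DocumentExtractionStrategy):
    # Hypothetical strategy: treats a response as a PDF based on the
    # Content-Type header or a downloaded file ending in ".pdf".
    def detect(self, response: StubResponse) -> bool:
        if response.status_code != 200:
            return False
        # Header names are case-insensitive, so compare lowercased keys.
        ctype = next(
            (v for k, v in response.headers.items() if k.lower() == "content-type"),
            "",
        )
        if "application/pdf" in ctype.lower():
            return True
        files = response.downloaded_files or []
        return any(path.lower().endswith(".pdf") for path in files)

    def extract(self, response: StubResponse) -> DocumentExtractionResult:
        # A real backend would parse the document here; this returns a placeholder.
        return DocumentExtractionResult(
            content="# Placeholder markdown\n",
            content_type="application/pdf",
            metadata={"files": list(response.downloaded_files or [])},
        )


def process(response: StubResponse,
            strategy: Optional[DocumentExtractionStrategy] = None) -> str:
    # Mirrors the described integration point: with no strategy (the default),
    # or when detect() returns False, the HTML path runs unchanged.
    if strategy is not None and strategy.detect(response):
        return strategy.extract(response).content
    return "html-pipeline"
```

Calling `process(response)` without a strategy always takes the HTML branch, which is the "defaults to None, existing behavior unchanged" property the description claims.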

hafezparast wants to merge 1 commit into unclecode:develop from hafezparast:feat/maysam-document-extraction-strategy-1890
