feat: DocumentExtractionStrategy for binary documents (#1890) #1896

Open
hafezparast wants to merge 1 commit into unclecode:develop from hafezparast:feat/maysam-document-extraction-strategy-1890

@hafezparast
Contributor

Addresses #1890

…clecode#1890)

Add a pluggable pipeline stage that detects binary documents (PDF, DOCX,
XLSX, etc.) after browser navigation but before HTML content scraping.

When configured, the strategy's detect() method checks the AsyncCrawlResponse
(headers, downloaded_files, status code). If a document is detected, extract()
runs instead of the HTML pipeline, producing a CrawlResult with markdown
content directly from the document.

- New abstract base: DocumentExtractionStrategy with detect() + extract()
- New dataclass: DocumentExtractionResult (content, content_type, metadata)
- New param: CrawlerRunConfig.document_extraction_strategy (default None)
- Integration point in arun() before aprocess_html()
- No breaking changes — defaults to None, existing behavior unchanged
- No new dependencies — users bring their own extraction backend
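
A minimal, self-contained sketch of how the pieces listed above could fit together. It does not import crawl4ai and is not the code from this PR: `FakeResponse` is a hypothetical stand-in for AsyncCrawlResponse, its field names are assumptions, `PdfHeaderStrategy` and `process()` are illustrative names, and both detect() and extract() are assumed to be synchronous here for brevity.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class FakeResponse:
    """Hypothetical stand-in for AsyncCrawlResponse (field names assumed)."""
    status_code: int = 200
    headers: dict[str, str] = field(default_factory=dict)
    downloaded_files: list[str] = field(default_factory=list)


@dataclass
class DocumentExtractionResult:
    """Fields taken from the PR description: content, content_type, metadata."""
    content: str
    content_type: str
    metadata: dict[str, Any] = field(default_factory=dict)


class DocumentExtractionStrategy(ABC):
    """Abstract base with the two methods named in the PR description."""

    @abstractmethod
    def detect(self, response: FakeResponse) -> bool:
        """Return True if the response looks like a binary document."""

    @abstractmethod
    def extract(self, response: FakeResponse) -> DocumentExtractionResult:
        """Turn the document into markdown-like content."""


class PdfHeaderStrategy(DocumentExtractionStrategy):
    """Illustrative strategy: detects PDFs by their Content-Type header."""

    def detect(self, response: FakeResponse) -> bool:
        # Header names are case-insensitive, so normalize the keys first.
        headers = {k.lower(): v for k, v in response.headers.items()}
        content_type = headers.get("content-type", "").lower()
        return response.status_code == 200 and content_type.startswith("application/pdf")

    def extract(self, response: FakeResponse) -> DocumentExtractionResult:
        # A real backend (brought by the user) would parse the file here.
        path = response.downloaded_files[0] if response.downloaded_files else ""
        return DocumentExtractionResult(
            content=f"# Extracted from {path}",
            content_type="application/pdf",
            metadata={"source": path},
        )


def process(response: FakeResponse,
            strategy: Optional[DocumentExtractionStrategy] = None) -> str:
    """Mimics the described branch: document path first, HTML path otherwise."""
    if strategy is not None and strategy.detect(response):
        return strategy.extract(response).content
    return "html pipeline"  # placeholder for the existing HTML processing
```

With `strategy=None` the function always takes the HTML branch, which mirrors the "defaults to None, existing behavior unchanged" point in the list above.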

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>