fix: resolve dict attribute error in MarkdownGenerator deserialization#1887
fix: resolve dict attribute error in MarkdownGenerator deserialization#1887MTahirKleem wants to merge 1 commit intounclecode:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR aims to make CrawlResult resilient to cached/JSON hydration cases where the markdown payload is a raw dict, preventing attribute errors during markdown access/serialization.
Changes:
- Add a
markdown_datafield (aliased from"markdown") to improve Pydantic hydration robustness. - Add a
field_validatorto convert rawdictinputs intoMarkdownGenerationResult. - Add a unit test validating dict-to-model conversion and string-compatible access behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
crawl4ai/models.py |
Introduces markdown_data alias + validator + post-init syncing to prevent markdown deserialization issues. |
tests/unit/test_crawl_result_markdown_validator.py |
Adds a unit test ensuring dict markdown input is converted and serialized correctly. |
Comments suppressed due to low confidence (1)
crawl4ai/models.py:185
- The NOTE block at lines 181–191 appears dedented to column 0 within
CrawlResult. If those lines are truly unindented, that would end the class body and make the subsequent indenteddef model_post_init(...)anIndentationError(module won’t import). Keep the NOTE block indented at the same level as other class members (or move it fully outside the class along with the following methods).
# NOTE: The StringCompatibleMarkdown class, model_post_init hook, property getters/setters,
# and model_dump override all exist to support a smooth transition from markdown as a string
# to markdown as a MarkdownGenerationResult object, while maintaining backward compatibility.
#
# This allows code that expects markdown to be a string to continue working, while also
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if isinstance(v, dict): | ||
| # This converts a raw dictionary (from cache/JSON) | ||
| # into the structured Pydantic object | ||
| return MarkdownGenerationResult(**v) |
There was a problem hiding this comment.
validate_markdown only converts dicts. If input markdown is a plain string (which can happen with legacy cache hydration), Pydantic will try to validate that string as MarkdownGenerationResult for markdown_data and raise a validation error. Consider handling str here (e.g., wrap it into a MarkdownGenerationResult with raw_markdown set and other fields empty) to preserve backward compatibility.
| return MarkdownGenerationResult(**v) | |
| return MarkdownGenerationResult(**v) | |
| if isinstance(v, str): | |
| # Legacy backward compatibility: markdown provided as a plain string | |
| # Wrap it into MarkdownGenerationResult, using raw_markdown | |
| return MarkdownGenerationResult(raw_markdown=v) |
| def test_crawl_result_converts_markdown_dict_input(): | ||
| result = CrawlResult( | ||
| url="https://example.com", | ||
| html="<html></html>", | ||
| success=True, | ||
| markdown={ | ||
| "raw_markdown": "# Hello", | ||
| "markdown_with_citations": "# Hello", | ||
| "references_markdown": "", | ||
| "fit_markdown": "Hello", | ||
| "fit_html": "<p>Hello</p>", | ||
| }, | ||
| ) |
There was a problem hiding this comment.
This test covers dict input, but it doesn’t cover legacy string hydration (e.g., CrawlResult(..., markdown="# Hello")). Adding that case would protect backward compatibility and catch the validation regression mentioned in the model validator.
| model_config = ConfigDict(arbitrary_types_allowed=True) | ||
|
|
||
| # NOTE: The StringCompatibleMarkdown class, custom __init__ method, property getters/setters, | ||
| @field_validator("markdown_data", mode="before") |
There was a problem hiding this comment.
The PR metadata links/mentions Issue #123 (“Timeout setting”), but the changes here are focused on CrawlResult markdown deserialization. If this PR is intended to fix #123, the code changes don’t appear to address timeouts; otherwise consider removing/clarifying the issue reference in the PR description.
- Added 'markdown_data' alias field to CrawlResult for robust Pydantic hydration. - Implemented field_validator to convert raw dicts to MarkdownGenerationResult. - Used model_post_init to sync internal private state. - Maintained backward compatibility for string-based markdown access. - Added unit test for validation guard.
f7a699f to
7344c05
Compare
Fixes #1880
This PR addresses an AttributeError that occurred when CrawlResult was initialized from a dictionary (e.g., from cache or JSON), where the markdown field remained a raw dict instead of being hydrated into a MarkdownGenerationResult object.
List of files changed and why
crawl4ai/models.py: Updated CrawlResult with a Pydantic alias and validator to ensure proper object hydration while maintaining the backward-compatible string-like interface.
tests/unit/test_crawl_result_markdown_validator.py: New test file to verify that passing a dictionary to the markdown parameter correctly results in a MarkdownGenerationResult object.
How Has This Been Tested?
Manual Verification: Created a reproduction script that initialized CrawlResult with a raw dictionary and verified that property access (e.g., .raw_markdown) worked without error.
Unit Testing: Ran the newly created test case to confirm the validator intercepts and converts dictionary inputs correctly.
Serialization Test: Verified that model_dump() still correctly exports the markdown field for storage.
Checklist:
[x] My code follows the style guidelines of this project
[x] I have performed a self-review of my own code
[x] I have commented my code, particularly in hard-to-understand areas
[x] I have made corresponding changes to the documentation
[x] I have added/updated unit tests that prove my fix is effective or that my feature works
[x] New and existing unit tests pass locally with my changes