Skip to content

fix: resolve dict attribute error in MarkdownGenerator deserialization#1887

Open
MTahirKleem wants to merge 1 commit intounclecode:mainfrom
MTahirKleem:fix/issue-1880-markdown-deserialization
Open

fix: resolve dict attribute error in MarkdownGenerator deserialization#1887
MTahirKleem wants to merge 1 commit intounclecode:mainfrom
MTahirKleem:fix/issue-1880-markdown-deserialization

Conversation

@MTahirKleem
Copy link
Copy Markdown

@MTahirKleem MTahirKleem commented Mar 30, 2026

Fixes #1880

This PR addresses an AttributeError that occurred when CrawlResult was initialized from a dictionary (e.g., from cache or JSON), where the markdown field remained a raw dict instead of being hydrated into a MarkdownGenerationResult object.

List of files changed and why
crawl4ai/models.py: Updated CrawlResult with a Pydantic alias and validator to ensure proper object hydration while maintaining the backward-compatible string-like interface.

tests/unit/test_crawl_result_markdown_validator.py: New test file to verify that passing a dictionary to the markdown parameter correctly results in a MarkdownGenerationResult object.

How Has This Been Tested?
Manual Verification: Created a reproduction script that initialized CrawlResult with a raw dictionary and verified that property access (e.g., .raw_markdown) worked without error.

Unit Testing: Ran the newly created test case to confirm the validator intercepts and converts dictionary inputs correctly.

Serialization Test: Verified that model_dump() still correctly exports the markdown field for storage.

Checklist:
[x] My code follows the style guidelines of this project

[x] I have performed a self-review of my own code

[x] I have commented my code, particularly in hard-to-understand areas

[x] I have made corresponding changes to the documentation

[x] I have added/updated unit tests that prove my fix is effective or that my feature works

[x] New and existing unit tests pass locally with my changes

Copilot AI review requested due to automatic review settings March 30, 2026 20:26
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to make CrawlResult resilient to cached/JSON hydration cases where the markdown payload is a raw dict, preventing attribute errors during markdown access/serialization.

Changes:

  • Add a markdown_data field (aliased from "markdown") to improve Pydantic hydration robustness.
  • Add a field_validator to convert raw dict inputs into MarkdownGenerationResult.
  • Add a unit test validating dict-to-model conversion and string-compatible access behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
crawl4ai/models.py Introduces markdown_data alias + validator + post-init syncing to prevent markdown deserialization issues.
tests/unit/test_crawl_result_markdown_validator.py Adds a unit test ensuring dict markdown input is converted and serialized correctly.
Comments suppressed due to low confidence (1)

crawl4ai/models.py:185

  • The NOTE block at lines 181–191 appears dedented to column 0 within CrawlResult. If those lines are truly unindented, that would end the class body and make the subsequent indented def model_post_init(...) an IndentationError (module won’t import). Keep the NOTE block indented at the same level as other class members (or move it fully outside the class along with the following methods).
# NOTE: The StringCompatibleMarkdown class, model_post_init hook, property getters/setters,
# and model_dump override all exist to support a smooth transition from markdown as a string
# to markdown as a MarkdownGenerationResult object, while maintaining backward compatibility.
# 
# This allows code that expects markdown to be a string to continue working, while also

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

if isinstance(v, dict):
# This converts a raw dictionary (from cache/JSON)
# into the structured Pydantic object
return MarkdownGenerationResult(**v)
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

validate_markdown only converts dicts. If input markdown is a plain string (which can happen with legacy cache hydration), Pydantic will try to validate that string as MarkdownGenerationResult for markdown_data and raise a validation error. Consider handling str here (e.g., wrap it into a MarkdownGenerationResult with raw_markdown set and other fields empty) to preserve backward compatibility.

Suggested change
return MarkdownGenerationResult(**v)
return MarkdownGenerationResult(**v)
if isinstance(v, str):
# Legacy backward compatibility: markdown provided as a plain string
# Wrap it into MarkdownGenerationResult, using raw_markdown
return MarkdownGenerationResult(raw_markdown=v)

Copilot uses AI. Check for mistakes.
Comment on lines +4 to +16
def test_crawl_result_converts_markdown_dict_input():
result = CrawlResult(
url="https://example.com",
html="<html></html>",
success=True,
markdown={
"raw_markdown": "# Hello",
"markdown_with_citations": "# Hello",
"references_markdown": "",
"fit_markdown": "Hello",
"fit_html": "<p>Hello</p>",
},
)
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test covers dict input, but it doesn’t cover legacy string hydration (e.g., CrawlResult(..., markdown="# Hello")). Adding that case would protect backward compatibility and catch the validation regression mentioned in the model validator.

Copilot uses AI. Check for mistakes.
Comment on lines 170 to +172
model_config = ConfigDict(arbitrary_types_allowed=True)

# NOTE: The StringCompatibleMarkdown class, custom __init__ method, property getters/setters,
@field_validator("markdown_data", mode="before")
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR metadata links/mentions Issue #123 (“Timeout setting”), but the changes here are focused on CrawlResult markdown deserialization. If this PR is intended to fix #123, the code changes don’t appear to address timeouts; otherwise consider removing/clarifying the issue reference in the PR description.

Copilot uses AI. Check for mistakes.
- Added 'markdown_data' alias field to CrawlResult for robust Pydantic hydration.
- Implemented field_validator to convert raw dicts to MarkdownGenerationResult.
- Used model_post_init to sync internal private state.
- Maintained backward compatibility for string-based markdown access.
- Added unit test for validation guard.
@MTahirKleem MTahirKleem force-pushed the fix/issue-1880-markdown-deserialization branch from f7a699f to 7344c05 Compare March 30, 2026 20:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Docker API: 'dict' object has no attribute 'generate_markdown'

2 participants