Releases: raintree-technology/docpull

v2.3.0 — Framework-aware extraction, LLM chunking, Python MCP, agent fast path

24 Apr 21:47

[2.3.0] - 2026-04-24

Sharpened positioning around the agent / RAG use case, plus real bug fixes
surfaced by validation against Next.js, Supabase, Anthropic, FastAPI, Tailwind,
and Drizzle documentation sites.

Added

  • Framework-specific fast extractors: Next.js __NEXT_DATA__, Mintlify,
    OpenAPI / Swagger JSON rendered directly to Markdown, plus source-type
    tagging for Docusaurus and Sphinx. Runs before the generic extractor.
  • Next.js App Router detection via self.__next_f.push, router state tree,
    and /_next/static/ path markers — no longer relies on __NEXT_DATA__,
    which is absent on modern App Router pages.
  • SPA detection (pre- and post-conversion): pages that produce only
    Loading... shells are skipped with a clear reason. --strict-js-required
    turns this into a hard error for agents that want to route elsewhere.
  • Trafilatura extractor as an optional alternative content extractor
    (pip install docpull[trafilatura], then --extractor trafilatura).
  • Token-aware Markdown chunking: --max-tokens-per-file N splits pages
    on heading then paragraph boundaries. Exact counts with tiktoken,
    character-estimate fallback otherwise.
  • NDJSON output format (--format ndjson) for streaming one record per
    page or per chunk. --stream writes to stdout for live pipeline consumption.
  • llm profile: bundles NDJSON + 4k-token chunks + rich metadata + dedup.
  • --single / fetch_one(url): fast single-page path with no discovery,
    designed for AI-agent tool loops.
  • Python MCP server (docpull mcp): exposes fetch_url, ensure_docs,
    list_sources, list_indexed, and grep_docs tools over stdio. Install
    via pip install docpull[mcp].

Fixed

  • robots.txt redirect handling: Cloudflare/HTTP-2 responses send
    lowercase header names, but the Location lookup was case-sensitive,
    causing 301/308 redirects to be treated as errors. This blocked
    docs.anthropic.com and any other site whose robots.txt was redirected.
  • html2text link escape artifacts: cleaned up mangled links of the form
    [text](prefix/<https:/real.url>) in the post-processing pass; handles
    both text and image-only (empty-text) links.
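
The robots.txt redirect fix comes down to looking up header names without regard to case. A minimal sketch, assuming headers arrive as a plain dict; get_header and redirect_target are hypothetical helper names, not docpull functions:

```python
from typing import Optional

REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def get_header(headers: dict[str, str], name: str) -> Optional[str]:
    """Return a header value regardless of the case used for its name."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return None


def redirect_target(status: int, headers: dict[str, str]) -> Optional[str]:
    """Return the redirect URL for a redirect response, else None."""
    if status in REDIRECT_STATUSES:
        # HTTP/2 responses carry lowercase names such as "location".
        return get_header(headers, "Location")
    return None
```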

Removed

  • Dead dependencies: requests (replaced by aiohttp in v2.0) and
    gitpython (never used in v2+).

Changed

  • ContentFilterConfig gains extractor, enable_special_cases, and
    strict_js_required fields. OutputConfig gains max_tokens_per_file,
    tokenizer, emit_chunks, and ndjson_filename.

v2.2.1 - Security Hardening

15 Apr 21:40

Security Fixes

  • ILIKE wildcard DoS: % and _ metacharacters in grep_docs MCP tool input are now escaped, preventing expensive full-table scans
  • CRLF header injection: --user-agent and --auth-header now reject CR, LF, and null bytes at both the Pydantic config layer and the HTTP client transport layer
  • Dead code removal: Removed IntegrationConfig (containing post_process_hook: Path, a command-injection sink if ever wired up), plus unused ARCHIVE_CREATED and GIT_COMMITTED event types
  • Proxy SSRF warning: Logs a warning when proxy mode bypasses the DNS-pinning resolver
  • .gitignore hardening: Added patterns for .env.*, *.pem, *.key, *.p12, *.pfx, *.crt
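
The first two fixes are both input-hardening checks, and a sketch of each may help. The function names escape_like and reject_control_chars are hypothetical, and the backslash escape character is an assumption; a real query would also need to declare the same escape character to the database.

```python
def escape_like(term: str, escape: str = "\\") -> str:
    """Escape LIKE/ILIKE metacharacters so they match literally."""
    # Escape the escape character first so later replacements are not doubled.
    return (
        term.replace(escape, escape + escape)
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


def reject_control_chars(value: str) -> str:
    """Reject header values that could smuggle extra header lines."""
    if any(ch in value for ch in ("\r", "\n", "\x00")):
        raise ValueError("header value must not contain CR, LF, or null bytes")
    return value
```

Without the escaping, a search term consisting only of % matches every row, which is what makes unescaped input a denial-of-service vector.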

Breaking Changes

  • IntegrationConfig has been removed from the public API. The fields git_commit, git_message, archive, archive_format, and post_process_hook are no longer accepted in configuration. These were never implemented (dead code).
  • YAML config files containing an integration: block will now fail validation.

Testing

  • 12 new regression tests for CRLF injection and dead code removal
  • All 157 tests pass

Audit Report

Full attack surface map available at security/01-attack-surface.md.

v2.2.0: Resume, Auth, JSON/SQLite output

15 Dec 21:00

New Features

  • Resume capability (--resume): Continue interrupted fetches
  • URL preview mode (--preview-urls): See discovered URLs before fetching
  • Authentication support: --auth-bearer, --auth-basic, --auth-cookie, --auth-header
  • Env var expansion for auth tokens ($VAR and ${VAR} syntax)
  • Adaptive rate limiting (--adaptive-rate-limit): Auto-adjust based on 429 responses
  • JSON output (--format json): Stream documents to single JSON file
  • SQLite output (--format sqlite): Save to SQLite database
  • Skip reason tracking: Better progress feedback
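
The $VAR and ${VAR} expansion for auth tokens can be illustrated with a small substitution function. This is a sketch, not docpull's code: expand_env is a hypothetical name, and raising on an unset variable is an assumption about the desired behavior.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${NAME} (group 1) or $NAME (group 2).
_VAR = re.compile(r"\$(?:\{([A-Za-z_][A-Za-z0-9_]*)\}|([A-Za-z_][A-Za-z0-9_]*))")


def expand_env(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace $VAR and ${VAR} references with values from the environment."""
    source = os.environ if env is None else env

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1) or match.group(2)
        if name not in source:
            # Assumption: fail loudly rather than send a literal "$TOKEN".
            raise KeyError(f"environment variable {name} is not set")
        return source[name]

    return _VAR.sub(substitute, value)
```

Keeping the token in the environment means it does not have to appear in shell history or in a committed config file.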

Breaking Changes

  • Requires Python 3.10+ (dropped 3.9 support)

Install

pip install docpull --upgrade

v2.0.0 - Complete Architecture Rewrite

29 Nov 23:26

Breaking Changes

  • New Python API: Fetcher class with async context manager and streaming events
  • src/ layout: PEP 517/518 compliant package structure
  • Pydantic models: Configuration via DocpullConfig instead of dictionaries
  • Removed v1.x modules: All deprecated code removed

New Features

  • Streaming Event API: AsyncIterator[FetchEvent] for real-time progress
  • Pipeline Architecture: Composable steps (Validate, Fetch, Convert, Dedup, Save)
  • CacheManager: O(1) lookups with batched writes and TTL eviction
  • StreamingDeduplicator: Real-time content deduplication via SHA-256
  • JavaScript Rendering: Browser-based fetching via Playwright
  • Profile Presets: RAG, MIRROR, QUICK for common use cases
  • Rate Limiting: Per-host concurrent request limits
  • Security: robots.txt respect and URL validation
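
Content deduplication via SHA-256 can be sketched as a set of digests checked as each page arrives. The class below is a hypothetical illustration named SeenContent, not the StreamingDeduplicator itself, whose interface is not shown in these notes.

```python
import hashlib
from typing import Optional


class SeenContent:
    """Tracks content digests and reports the first URL seen for each."""

    def __init__(self) -> None:
        self._first_url: dict[str, str] = {}

    def check(self, url: str, content: str) -> Optional[str]:
        """Return the URL of an earlier identical page, or None if new."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in self._first_url:
            return self._first_url[digest]
        self._first_url[digest] = url
        return None
```

Because each page is hashed once and looked up in a dict, the check works on a stream without holding page bodies in memory.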

Quick Start

```bash
# CLI
docpull https://docs.example.com --profile rag
```

```python
# Python API
import asyncio

from docpull import Fetcher, DocpullConfig, ProfileName, EventType


async def main() -> None:
    config = DocpullConfig(url="https://docs.example.com", profile=ProfileName.RAG)
    async with Fetcher(config) as f:
        # Events are streamed as the fetch progresses
        async for event in f.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}")


asyncio.run(main())
```

Full Changelog

See CHANGELOG.md

v1.5.0

29 Nov 03:55

Release v1.5.0: Major Simplification and Modernization

Breaking Changes

  • Removed legacy profile system (stripe-specific profiles)
  • Removed deprecated requirements.txt (use pyproject.toml instead)

Changes

  • Simplified architecture: Consolidated utils into main package
  • Reorganized documentation: Moved CONTRIBUTING.md and SECURITY.md to .github/
  • Added GitHub issue templates configuration
  • Cleaner fetcher architecture: Removed stripe-specific fetcher
  • Updated tests for new structure

Removed Files

  • CHANGELOG.md - Deprecated in favor of GitHub releases
  • MANIFEST.in - No longer needed with modern packaging
  • TROUBLESHOOTING.md - Content moved to README
  • requirements.txt - Dependencies now in pyproject.toml
  • Legacy profile system files
  • Legacy utils directory

Installation

pip install docpull

Or install from source:

pip install git+https://github.com/raintree-technology/docpull.git

v1.3.0: Rich Metadata Extraction & Simplified Profiles

20 Nov 19:30

Highlights

docpull v1.3.0 adds rich structured metadata extraction for enhanced AI/RAG integration and simplifies the profile system by focusing on the excellent generic fetcher.

New Features

Rich Metadata Extraction

  • Structured Metadata: Extract Open Graph, JSON-LD, and microdata during fetch
  • Enhanced Frontmatter: Adds author, description, keywords, images, publish dates, and more
  • AI/RAG Ready: Richer context for embeddings and retrieval systems
  • Opt-in Feature: Enabled with --rich-metadata flag or rich_metadata: true in config
  • Powered by extruct: Uses the battle-tested extruct library for extraction
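
To show what Open Graph extraction involves, here is a standard-library sketch that collects og: meta tags from a page. It deliberately swaps in html.parser instead of extruct, so it is not the release's implementation; OpenGraphParser and extract_open_graph are hypothetical names, and JSON-LD and microdata are not handled.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..."> values, keeping the first of each."""

    def __init__(self) -> None:
        super().__init__()
        self.properties: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        content = attr.get("content")
        if prop.startswith("og:") and content is not None:
            # Strip the "og:" prefix so keys resemble frontmatter fields.
            self.properties.setdefault(prop[3:], content)


def extract_open_graph(html: str) -> dict[str, str]:
    """Return Open Graph properties found in an HTML document."""
    parser = OpenGraphParser()
    parser.feed(html)
    parser.close()
    return parser.properties
```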

Simplified Profile System

  • Streamlined Architecture: Removed 7 built-in profiles (React, Next.js, D3, Plaid, Tailwind, Bun, Turborepo)
  • Kept Stripe: Retained as reference implementation for custom profiles
  • Generic Fetcher: Handles the documentation sites that previously used specific profiles
  • Reduced Complexity: Less maintenance burden, simpler codebase
  • Easy Customization: Users can create custom profiles as needed

Technical Details

New Dependencies

  • Added extruct>=0.15.0 for structured metadata extraction

New Files

  • docpull/metadata_extractor.py - Rich metadata extraction module
  • tests/test_metadata_extractor.py - Comprehensive test suite (13 tests)

Updated Files

  • docpull/fetchers/base.py - Integrated rich metadata extraction
  • docpull/fetchers/generic_async.py - Added use_rich_metadata parameter
  • docpull/config.py - Added rich_metadata configuration option
  • docpull/sources_config.py - Added rich_metadata field
  • docpull/cli.py - Added --rich-metadata CLI flag
  • docpull/profiles/__init__.py - Simplified to single Stripe profile

Removed Files

  • 7 profile files (react.py, nextjs.py, d3.py, plaid.py, tailwind.py, bun.py, turborepo.py)
  • 7 fetcher implementation files (same names)

Version & Testing

  • Bumped version from 1.2.1 to 1.3.0
  • All 107 tests passing ✅
  • Zero mypy type errors ✅
  • All lint checks passing ✅

Example Usage

Rich Metadata Extraction

# Extract rich metadata during fetch
docpull https://docs.anthropic.com --rich-metadata

# Combine with other features
docpull https://stripe.com/docs --rich-metadata --create-index --language en

# Multi-source configuration
docpull --sources-file config.yaml

Enhanced Frontmatter Output

---
url: https://docs.example.com/guide
fetched: 2025-11-20
title: Getting Started Guide
description: Learn the basics of our platform
author: John Doe
keywords: [tutorial, guide, api]
image: https://docs.example.com/og-image.png
type: article
site_name: Example Docs
published_time: 2024-01-15T10:00:00Z
modified_time: 2024-01-20T15:30:00Z
---

Multi-Source Configuration with Rich Metadata

sources:
  anthropic:
    url: https://docs.anthropic.com
    rich_metadata: true  # Enable rich metadata extraction
    language: en
    create_index: true

  stripe:
    url: https://stripe.com/docs
    rich_metadata: true
    max_file_size: 200kb

Backward Compatibility

All existing workflows continue to work unchanged. Rich metadata extraction is opt-in, and the generic fetcher handles all documentation sites that previously used specific profiles.

Installation

pip install --upgrade docpull

Stats: 30 files changed, +765/-867 lines

v1.2.1 - Critical Bug Fixes & Type Checking

17 Nov 01:19

🐛 Bug Fixes

This patch release fixes critical issues found in v1.2.0:

Type Checking & Code Quality

  • Fixed all 60 mypy type errors - achieved zero type errors ✅
  • Added proper type annotations throughout the codebase
  • Improved type safety in processors, formatters, and orchestrator modules
  • All lint checks now passing (mypy, ruff, black)

Test Fixes

  • Fixed test failure in test_orchestrator.py (archive_format parameter)
  • Fixed 9 SourcesConfiguration test failures
  • All 101 tests now passing ✅

Code Cleanup

  • Removed deprecated files (EMOJI_CLEANUP.md)
  • Fixed Black formatting issues
  • Added specific error codes to type: ignore comments

📝 Technical Details

Files Updated

  • docpull/processors/content_filter.py: More specific return types
  • docpull/formatters/: Proper type annotations for nested functions
  • docpull/orchestrator.py: Correct parameter naming and type hints
  • docpull/cli.py: Better handling of Optional[str] types
  • docpull/processors/language_filter.py: Fixed config type assignments
  • docpull/processors/deduplicator.py: Fixed config type assignments

CI/CD

This release ensures the codebase passes all CI checks and maintains high code quality standards.

📦 Installation

pip install --upgrade docpull

v1.2.0: 15 Major Features - 58% Size Reduction

16 Nov 22:12

Highlights

docpull v1.2.0 delivers 15 major features aimed at more efficient documentation fetching. Real-world testing shows a 58% size reduction (31 MB → 13 MB) when processing 1,914 documentation files.

New Features

Phase 1: Core Optimization

  • Language Filtering: Auto-detect and filter by programming language
  • Deduplication: SHA-256 based duplicate detection with flexible keep strategies
  • Auto-Index Generation: Tree view, TOC, category-based, and statistics indexes
  • Size Limits: Enforce per-file and total size constraints
  • Multi-Source Configuration: YAML-based configuration for multiple documentation sources

Phase 2: Advanced Processing

  • Selective Crawling: Include/exclude patterns for precise control
  • Content Filtering: Remove unwanted sections from documentation
  • Format Conversion: Output in Markdown, TOON, JSON, or SQLite
  • Smart Naming: 4 naming strategies (full, short, flat, hierarchical)

Phase 3: Efficiency

  • Metadata Extraction: Automatic metadata collection and JSON storage
  • Update Detection: Skip unchanged files based on checksums
  • Incremental Mode: Update only changed documentation
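
Checksum-based update detection can be sketched as a comparison against a stored manifest of URL-to-digest entries. The manifest shape and the names content_hash and select_changed are assumptions for illustration; the on-disk format docpull uses is not described here.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hex digest used as the page checksum (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_changed(pages: dict[str, str], manifest: dict[str, str]) -> list[str]:
    """Return URLs whose content differs from the manifest, updating it."""
    changed: list[str] = []
    for url, body in pages.items():
        digest = content_hash(body)
        if manifest.get(url) != digest:
            changed.append(url)
            manifest[url] = digest
    return changed
```

On a second run with identical content the function returns an empty list, so nothing needs to be rewritten.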

Phase 4: Integration

  • Hooks/Plugins: Decorator-based plugin system for custom processing
  • Git Integration: Automatic commits with templated messages
  • Archive Mode: Create compressed archives (tar.gz, tar.bz2, tar.xz, zip)

Technical Details

  • 20 new modules (3,886 lines of code)
  • Full backward compatibility with v1.1.0
  • All features integrated into CLI
  • 145+ unit tests
  • Zero syntax errors, zero linting issues

Installation

pip install --upgrade docpull

Quick Example

# Fetch Python docs with optimization
docpull https://docs.python.org/3/ ./python-docs \
  --language python \
  --deduplicate \
  --create-index \
  --max-total-size 20MB

# Multi-source with YAML config
docpull --sources-file sources.yaml

See the CHANGELOG for complete details.

v1.1.0 - Diagnostic Tools and Improved Error Handling

14 Nov 23:47

What's New in v1.1.0

Added

  • --doctor command for diagnosing installation and dependency issues

    • Checks all core dependencies (requests, beautifulsoup4, html2text, defusedxml, aiohttp, rich)
    • Checks optional dependencies (PyYAML, Playwright) with installation suggestions
    • Tests network connectivity
    • Verifies output directory write permissions
    • Works even when dependencies are missing
  • requirements.txt file for transparent dependency listing

  • Comprehensive TROUBLESHOOTING.md documentation with:

    • Installation troubleshooting (missing dependencies, pipx issues)
    • Runtime issue solutions (YAML config errors, JavaScript rendering)
    • Diagnostic tools usage guide
    • Common error messages reference table
    • Quick reference commands

Changed

  • Improved error handling for missing dependencies

    • Early dependency checking at CLI entry point
    • Clear, actionable error messages with installation instructions
    • Specific recommendations for pipx, pip, and development installations
  • Enhanced YAML configuration error handling

    • Auto-fallback to JSON when PyYAML is not installed
    • Clear error messages for YAML-related import errors
    • Helpful suggestions for installing optional dependencies
  • Updated README.md with:

    • --doctor command in Quick Start section
    • Reference to TROUBLESHOOTING.md
    • Better troubleshooting guidance

Fixed

  • Improved user experience when dependencies are missing (no more confusing tracebacks)
  • Better handling of optional dependency errors (PyYAML, Playwright)

Installation

```bash
pip install docpull
docpull --doctor # Verify installation
```

Full Changelog

https://github.com/raintree-technology/docpull/blob/main/CHANGELOG.md#110---2025-11-14