24 Apr 21:47

47ee6be

v2.3.0 - Framework-aware extraction, LLM chunking, Python MCP, agent fast path

[2.3.0] - 2026-04-24

Sharpened positioning around the agent / RAG use case, plus real bug fixes
surfaced by validation against Next.js, Supabase, Anthropic, FastAPI, Tailwind,
and Drizzle documentation sites.

Added

- Framework-specific fast extractors: Next.js __NEXT_DATA__, Mintlify,
  OpenAPI / Swagger JSON rendered directly to Markdown, plus source-type
  tagging for Docusaurus and Sphinx. Runs before the generic extractor.
- Next.js App Router detection via self.__next_f.push, router state tree,
  and /_next/static/ path markers; no longer relies on __NEXT_DATA__,
  which is absent on modern App Router pages.
- SPA detection (pre- and post-conversion): pages that produce only
  Loading... shells are skipped with a clear reason. --strict-js-required
  turns this into a hard error for agents that want to route elsewhere.
- Trafilatura extractor as an optional alternative content extractor
  (pip install docpull[trafilatura], then --extractor trafilatura).
- Token-aware Markdown chunking: --max-tokens-per-file N splits pages
  on heading then paragraph boundaries. Exact counts with tiktoken,
  character-estimate fallback otherwise.
- NDJSON output format (--format ndjson) for streaming one record per
  page or per chunk. --stream writes to stdout for live pipeline consumption.
- llm profile: bundles NDJSON + 4k-token chunks + rich metadata + dedup.
- --single / fetch_one(url): fast single-page path with no discovery,
  designed for AI-agent tool loops.
- Python MCP server (docpull mcp): exposes fetch_url, ensure_docs,
  list_sources, list_indexed, and grep_docs tools over stdio. Install
  via pip install docpull[mcp].
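
The chunking and NDJSON items above can be illustrated with a minimal sketch. This is not docpull's implementation: the function names, the roughly 4-characters-per-token estimate (standing in for the non-tiktoken fallback), and the record fields are all assumptions made for illustration. It splits on headings first, then packs paragraphs when a section is still over budget.

```python
import json
import re


def estimate_tokens(text: str) -> int:
    """Character-based estimate (assumed ~4 chars per token), not an exact count."""
    return max(1, len(text) // 4)


def split_sections(markdown: str) -> list[str]:
    """Split before each ATX heading so every section keeps its heading line.

    Limitation of this sketch: a '# comment' line inside a fenced code block
    would also be treated as a heading.
    """
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [p.strip() for p in parts if p.strip()]


def chunk_markdown(markdown: str, max_tokens: int) -> list[str]:
    """Chunk on heading boundaries, then on paragraph boundaries if needed."""
    chunks: list[str] = []
    for section in split_sections(markdown):
        if estimate_tokens(section) <= max_tokens:
            chunks.append(section)
            continue
        # Section is too large: pack paragraphs greedily up to the budget.
        # A single paragraph over the budget is emitted as one oversized chunk.
        current = ""
        for para in section.split("\n\n"):
            candidate = f"{current}\n\n{para}" if current else para
            if current and estimate_tokens(candidate) > max_tokens:
                chunks.append(current)
                current = para
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks


def to_ndjson(url: str, chunks: list[str]) -> str:
    """Serialize one JSON record per chunk, one record per line."""
    return "\n".join(
        json.dumps({"url": url, "chunk_index": i, "content": chunk})
        for i, chunk in enumerate(chunks)
    )
```

Because json.dumps escapes newlines inside the content, each record stays on a single line, which is what lets a downstream consumer read the stream line by line.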

Fixed

- robots.txt redirect handling: Cloudflare and HTTP/2 responses send
  lowercase header names, but the Location lookup was case-sensitive,
  so 301/308 redirects were treated as errors. This blocked
  docs.anthropic.com and any other site whose robots.txt was redirected.
- html2text link escape artifacts: the post-processing pass now cleans up
  mangled links of the form [text](prefix/<https:/real.url>); it handles
  both text and image-only (empty-text) links.
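
A minimal sketch of the two fixes above, under stated assumptions: the helper names, the set of redirect status codes, and the regular expression are hypothetical and are not taken from docpull's source. The first part looks up a header regardless of casing; the second rewrites the mangled link form quoted above.

```python
import re
from typing import Mapping, Optional


def get_header(headers: Mapping[str, str], name: str) -> Optional[str]:
    """Look up a header without depending on the casing the server used."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return None


def redirect_target(status: int, headers: Mapping[str, str]) -> Optional[str]:
    """Return the redirect target for redirect statuses, else None."""
    if status in (301, 302, 303, 307, 308):
        return get_header(headers, "Location")
    return None


# Matches [text](prefix/<https:/real.url>), including empty link text.
_MANGLED_LINK = re.compile(r"\[([^\]]*)\]\([^()<>\s]*<(https?):/+([^<>\s]+)>\)")


def fix_mangled_links(markdown: str) -> str:
    """Rewrite mangled links to [text](https://real.url); leave others alone."""
    return _MANGLED_LINK.sub(r"[\1](\2://\3)", markdown)
```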

Removed

- Dead dependencies: requests (replaced by aiohttp in v2.0) and
  gitpython (never used in v2+).

Changed

- ContentFilterConfig gains extractor, enable_special_cases, and
  strict_js_required fields.
- OutputConfig gains max_tokens_per_file, tokenizer, emit_chunks, and
  ndjson_filename.

15 Apr 21:40

zacharyr0th

v2.2.1

84e1e81

v2.2.1 - Security Hardening

Security Fixes

ILIKE wildcard DoS — % and _ metacharacters in grep_docs MCP tool input are now escaped, preventing expensive full-table scans
CRLF header injection — --user-agent and --auth-header now reject CR, LF, and null bytes at both the Pydantic config layer and the HTTP client transport layer
Dead code removal — Removed IntegrationConfig (containing post_process_hook: Path, a command-injection sink if ever wired up), plus unused ARCHIVE_CREATED and GIT_COMMITTED event types
Proxy SSRF warning — Logs a warning when proxy mode bypasses the DNS-pinning resolver
.gitignore hardening — Added patterns for .env.*, *.pem, *.key, *.p12, *.pfx, *.crt
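
The first two fixes above follow a common pattern that can be sketched as follows. The function names and the backslash escape character are assumptions for illustration; docpull's actual code may differ.

```python
def escape_like(term: str, escape: str = "\\") -> str:
    """Escape LIKE/ILIKE metacharacters so user input matches literally.

    The escape character itself is escaped first, so that backslashes
    already present in the input are not reinterpreted.
    """
    return (
        term.replace(escape, escape + escape)
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


FORBIDDEN_HEADER_CHARS = ("\r", "\n", "\x00")


def validate_header_value(value: str) -> str:
    """Reject CR, LF, and null bytes, which enable header injection."""
    if any(ch in value for ch in FORBIDDEN_HEADER_CHARS):
        raise ValueError("header value must not contain CR, LF, or null bytes")
    return value
```

The escaped term would still be passed to the database as a bound parameter; escaping only stops the wildcards from being interpreted, it does not replace parameterization.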

Breaking Changes

IntegrationConfig has been removed from the public API. The fields git_commit, git_message, archive, archive_format, and post_process_hook are no longer accepted in configuration. These were never implemented (dead code).
YAML config files containing an integration: block will now fail validation.

Testing

12 new regression tests for CRLF injection and dead code removal
All 157 tests pass

Audit Report

Full attack surface map available at security/01-attack-surface.md.

15 Dec 21:00

zacharyr0th

v2.2.0

44391bb

v2.2.0: Resume, Auth, JSON/SQLite output

New Features

Resume capability (--resume): Continue interrupted fetches
URL preview mode (--preview-urls): See discovered URLs before fetching
Authentication support: --auth-bearer, --auth-basic, --auth-cookie, --auth-header
Env var expansion for auth tokens ($VAR and ${VAR} syntax)
Adaptive rate limiting (--adaptive-rate-limit): Auto-adjust based on 429 responses
JSON output (--format json): Stream documents to single JSON file
SQLite output (--format sqlite): Save to SQLite database
Skip reason tracking: Better progress feedback
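
Env var expansion for auth tokens could look roughly like the sketch below. The function name and the choice to raise on an unset variable are assumptions; the release notes only state that the $VAR and ${VAR} syntaxes are supported.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${VAR} (group 1) or $VAR (group 2).
_ENV_VAR = re.compile(r"\$(?:\{(\w+)\}|(\w+))")


def expand_env(raw: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace $VAR and ${VAR} with values from env (default: os.environ)."""
    source = os.environ if env is None else env

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1) or match.group(2)
        if name not in source:
            # Assumption: fail loudly rather than send a literal "$VAR".
            raise KeyError(f"environment variable {name} is not set")
        return source[name]

    return _ENV_VAR.sub(replace, raw)
```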

Breaking Changes

Requires Python 3.10+ (dropped 3.9 support)

Install

pip install docpull --upgrade

29 Nov 23:26

zacharyr0th

v2.0.0

a81b33c

v2.0.0 - Complete Architecture Rewrite

Breaking Changes

New Python API: Fetcher class with async context manager and streaming events
src/ layout: PEP 517/518 compliant package structure
Pydantic models: Configuration via DocpullConfig instead of dictionaries
Removed v1.x modules: All deprecated code removed

New Features

Streaming Event API: AsyncIterator[FetchEvent] for real-time progress
Pipeline Architecture: Composable steps (Validate, Fetch, Convert, Dedup, Save)
CacheManager: O(1) lookups with batched writes and TTL eviction
StreamingDeduplicator: Real-time content deduplication via SHA-256
JavaScript Rendering: Browser-based fetching via Playwright
Profile Presets: RAG, MIRROR, QUICK for common use cases
Rate Limiting: Per-host concurrent request limits
Security: robots.txt respect and URL validation
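
To illustrate the idea behind SHA-256 streaming deduplication, here is a minimal sketch. It is not the StreamingDeduplicator class itself; the function name and the (url, content) tuple shape are assumptions.

```python
import hashlib
from typing import Iterable, Iterator, Tuple


def dedupe_stream(pages: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield (url, content) pairs, skipping content already seen.

    Works on a stream: only the digests are kept in memory, not the pages.
    """
    seen = set()
    for url, content in pages:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate content under a different URL
        seen.add(digest)
        yield url, content
```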

Quick Start

```bash
# CLI
docpull https://docs.example.com --profile rag
```

```python
# Python API
import asyncio

from docpull import Fetcher, DocpullConfig, ProfileName, EventType


async def main() -> None:
    config = DocpullConfig(url="https://docs.example.com", profile=ProfileName.RAG)
    async with Fetcher(config) as f:
        # Events stream in as pages are fetched
        async for event in f.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}")


asyncio.run(main())
```

Full Changelog

See CHANGELOG.md

29 Nov 03:55

zacharyr0th

v1.5.0

6d7e4c9

v1.5.0

Release v1.5.0: Major Simplification and Modernization

Breaking Changes

Removed legacy profile system (stripe-specific profiles)
Removed deprecated requirements.txt (use pyproject.toml instead)

Changes

Simplified architecture: Consolidated utils into main package
Reorganized documentation: Moved CONTRIBUTING.md and SECURITY.md to .github/
Added GitHub issue templates configuration
Cleaner fetcher architecture: Removed stripe-specific fetcher
Updated tests for new structure

Removed Files

CHANGELOG.md - Deprecated in favor of GitHub releases
MANIFEST.in - No longer needed with modern packaging
TROUBLESHOOTING.md - Content moved to README
requirements.txt - Dependencies now in pyproject.toml
Legacy profile system files
Legacy utils directory

Installation

pip install docpull

Or install from source:

pip install git+https://github.com/raintree-technology/docpull.git

20 Nov 19:30

zacharyr0th

v1.3.0

2e3fcc1

v1.3.0: Rich Metadata Extraction & Simplified Profiles

Highlights

docpull v1.3.0 adds structured metadata extraction for AI/RAG integration and simplifies the profile system by relying on the generic fetcher.

New Features

Rich Metadata Extraction

Structured Metadata: Extract Open Graph, JSON-LD, and microdata during fetch
Enhanced Frontmatter: Adds author, description, keywords, images, publish dates, and more
AI/RAG Ready: Richer context for embeddings and retrieval systems
Opt-in Feature: Enabled with --rich-metadata flag or rich_metadata: true in config
Powered by extruct: Uses the battle-tested extruct library for extraction
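
As an illustration of what Open Graph extraction involves, the sketch below uses only the standard library's html.parser rather than extruct, so it is a stand-in and not the code path docpull uses. The class and function names are hypothetical, and JSON-LD and microdata are not covered.

```python
from html.parser import HTMLParser
from typing import Dict


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.properties: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        content = attr.get("content")
        if prop.startswith("og:") and content is not None:
            # Store "og:site_name" under the key "site_name"
            self.properties[prop[3:]] = content


def extract_open_graph(html: str) -> Dict[str, str]:
    """Return Open Graph properties found in the document."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties
```

The resulting dictionary maps naturally onto frontmatter keys such as title, description, image, and site_name.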

Simplified Profile System

Streamlined Architecture: Removed 7 built-in profiles (React, Next.js, D3, Plaid, Tailwind, Bun, Turborepo)
Kept Stripe: Retained as reference implementation for custom profiles
Generic Fetcher: Handles the documentation sites that previously used specific profiles
Reduced Complexity: Less maintenance burden, simpler codebase
Easy Customization: Users can create custom profiles as needed

Technical Details

New Dependencies

Added extruct>=0.15.0 for structured metadata extraction

New Files

docpull/metadata_extractor.py - Rich metadata extraction module
tests/test_metadata_extractor.py - Comprehensive test suite (13 tests)

Updated Files

docpull/fetchers/base.py - Integrated rich metadata extraction
docpull/fetchers/generic_async.py - Added use_rich_metadata parameter
docpull/config.py - Added rich_metadata configuration option
docpull/sources_config.py - Added rich_metadata field
docpull/cli.py - Added --rich-metadata CLI flag
docpull/profiles/__init__.py - Simplified to single Stripe profile

Removed Files

7 profile files (react.py, nextjs.py, d3.py, plaid.py, tailwind.py, bun.py, turborepo.py)
7 fetcher implementation files (same names)

Version & Testing

Bumped version from 1.2.1 to 1.3.0
All 107 tests passing ✅
Zero mypy type errors ✅
All lint checks passing ✅

Example Usage

Rich Metadata Extraction

```bash
# Extract rich metadata during fetch
docpull https://docs.anthropic.com --rich-metadata

# Combine with other features
docpull https://stripe.com/docs --rich-metadata --create-index --language en

# Multi-source configuration
docpull --sources-file config.yaml
```

Enhanced Frontmatter Output

```yaml
---
url: https://docs.example.com/guide
fetched: 2025-11-20
title: Getting Started Guide
description: Learn the basics of our platform
author: John Doe
keywords: [tutorial, guide, api]
image: https://docs.example.com/og-image.png
type: article
site_name: Example Docs
published_time: 2024-01-15T10:00:00Z
modified_time: 2024-01-20T15:30:00Z
---
```

Multi-Source Configuration with Rich Metadata

```yaml
sources:
  anthropic:
    url: https://docs.anthropic.com
    rich_metadata: true  # Enable rich metadata extraction
    language: en
    create_index: true

  stripe:
    url: https://stripe.com/docs
    rich_metadata: true
    max_file_size: 200kb
```

Backward Compatibility

All existing workflows continue to work unchanged. Rich metadata extraction is opt-in, and the generic fetcher handles all documentation sites that previously used specific profiles.

Installation

pip install --upgrade docpull

Links

Stats: 30 files changed, +765/-867 lines

17 Nov 01:19

zacharyr0th

v1.2.1

7ac9efe

v1.2.1 - Critical Bug Fixes & Type Checking

🐛 Bug Fixes

This patch release fixes critical issues found in v1.2.0:

Type Checking & Code Quality

Fixed all 60 mypy type errors - achieved zero type errors ✅
Added proper type annotations throughout the codebase
Improved type safety in processors, formatters, and orchestrator modules
All lint checks now passing (mypy, ruff, black)

Test Fixes

Fixed test failure in test_orchestrator.py (archive_format parameter)
Fixed 9 SourcesConfiguration test failures
All 101 tests now passing ✅

Code Cleanup

Removed deprecated files (EMOJI_CLEANUP.md)
Fixed Black formatting issues
Added specific error codes to type: ignore comments

📝 Technical Details

Files Updated

docpull/processors/content_filter.py: More specific return types
docpull/formatters/: Proper type annotations for nested functions
docpull/orchestrator.py: Correct parameter naming and type hints
docpull/cli.py: Better handling of Optional[str] types
docpull/processors/language_filter.py: Fixed config type assignments
docpull/processors/deduplicator.py: Fixed config type assignments

CI/CD

With these fixes, the codebase passes all CI checks (mypy, ruff, black).

📦 Installation

pip install --upgrade docpull

16 Nov 22:12

zacharyr0th

v1.2.0

508f7d2

v1.2.0: 15 Major Features - 58% Size Reduction

Highlights

docpull v1.2.0 delivers 15 major features that dramatically improve documentation fetching efficiency. Real-world testing shows 58% size reduction (31 MB → 13 MB) when processing 1,914 documentation files.

New Features

Phase 1: Core Optimization

Language Filtering: Auto-detect and filter by programming language
Deduplication: SHA-256 based duplicate detection with flexible keep strategies
Auto-Index Generation: Tree view, TOC, category-based, and statistics indexes
Size Limits: Enforce per-file and total size constraints
Multi-Source Configuration: YAML-based configuration for multiple documentation sources

Phase 2: Advanced Processing

Selective Crawling: Include/exclude patterns for precise control
Content Filtering: Remove unwanted sections from documentation
Format Conversion: Output in Markdown, TOON, JSON, or SQLite
Smart Naming: 4 naming strategies (full, short, flat, hierarchical)
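
Include/exclude filtering for selective crawling can be sketched with glob patterns, as below. Whether docpull matched against the URL path, used glob syntax, or gave exclusions priority is not stated in these notes; all three are assumptions here, as is the function name.

```python
from fnmatch import fnmatchcase
from typing import List
from urllib.parse import urlparse


def should_crawl(url: str, include: List[str], exclude: List[str]) -> bool:
    """Decide whether a URL passes the include/exclude glob patterns.

    Exclusions win over inclusions; an empty include list allows everything.
    """
    path = urlparse(url).path
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False
    return not include or any(fnmatchcase(path, pattern) for pattern in include)
```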

Phase 3: Efficiency

Metadata Extraction: Automatic metadata collection and JSON storage
Update Detection: Skip unchanged files based on checksums
Incremental Mode: Update only changed documentation
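
Checksum-based update detection amounts to comparing a stored digest with a fresh one. The sketch below assumes an in-memory manifest keyed by URL and SHA-256 as the checksum; both are illustrative choices, and the function names are hypothetical.

```python
import hashlib
from typing import Dict


def content_checksum(content: str) -> str:
    """Hex SHA-256 digest of the page content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def needs_update(url: str, content: str, manifest: Dict[str, str]) -> bool:
    """Return True and record the new digest if the content changed.

    The manifest maps URL -> last seen checksum and is updated in place.
    """
    digest = content_checksum(content)
    if manifest.get(url) == digest:
        return False  # unchanged since the last run: skip rewriting the file
    manifest[url] = digest
    return True
```

Persisting the manifest between runs (for example as a JSON file next to the output) is what turns this check into an incremental mode.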

Phase 4: Integration

Hooks/Plugins: Decorator-based plugin system for custom processing
Git Integration: Automatic commits with templated messages
Archive Mode: Create compressed archives (tar.gz, tar.bz2, tar.xz, zip)
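
A decorator-based plugin system usually boils down to a registry that the decorator fills in. The following is a generic sketch of that pattern, not docpull's v1.2.0 API: the hook decorator, the run_hooks function, and the post_convert event name are all hypothetical.

```python
from typing import Callable, Dict, List

Hook = Callable[[str], str]
_HOOKS: Dict[str, List[Hook]] = {}


def hook(event: str) -> Callable[[Hook], Hook]:
    """Register the decorated function to run for the given event."""

    def register(func: Hook) -> Hook:
        _HOOKS.setdefault(event, []).append(func)
        return func  # return unchanged so the function stays directly callable

    return register


def run_hooks(event: str, content: str) -> str:
    """Pass content through every hook registered for the event, in order."""
    for func in _HOOKS.get(event, []):
        content = func(content)
    return content


@hook("post_convert")
def strip_trailing_whitespace(content: str) -> str:
    return "\n".join(line.rstrip() for line in content.splitlines())
```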

Technical Details

20 new modules (3,886 lines of code)
Full backward compatibility with v1.1.0
All features integrated into CLI
145+ unit tests
Zero syntax errors, zero linting issues

Installation

pip install --upgrade docpull

Quick Example

```bash
# Fetch Python docs with optimization
docpull https://docs.python.org/3/ ./python-docs \
  --language python \
  --deduplicate \
  --create-index \
  --max-total-size 20MB

# Multi-source with YAML config
docpull --sources-file sources.yaml
```

See the CHANGELOG for complete details.

14 Nov 23:47

zacharyr0th

v1.1.0

829e183

v1.1.0 - Diagnostic Tools and Improved Error Handling

What's New in v1.1.0

Added

--doctor command for diagnosing installation and dependency issues
- Checks all core dependencies (requests, beautifulsoup4, html2text, defusedxml, aiohttp, rich)
- Checks optional dependencies (PyYAML, Playwright) with installation suggestions
- Tests network connectivity
- Verifies output directory write permissions
- Works even when dependencies are missing
requirements.txt file for transparent dependency listing
Comprehensive TROUBLESHOOTING.md documentation with:
- Installation troubleshooting (missing dependencies, pipx issues)
- Runtime issue solutions (YAML config errors, JavaScript rendering)
- Diagnostic tools usage guide
- Common error messages reference table
- Quick reference commands
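
The kind of checks --doctor performs can be sketched with the standard library alone. This is an illustration under assumptions, not the command's actual code: the function names are hypothetical, and the check takes import names (for example bs4 for beautifulsoup4 and yaml for PyYAML), which differ from the package names listed above.

```python
import importlib.util
import tempfile
from typing import Dict, List


def check_dependencies(modules: List[str]) -> Dict[str, bool]:
    """Report which top-level modules are importable, without importing them."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


def can_write(directory: str) -> bool:
    """Verify write permission by creating a temporary file in the directory."""
    try:
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        return False
```

Using find_spec instead of a plain import is what allows a diagnostic like this to keep running when a dependency is missing.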

Changed

Improved error handling for missing dependencies
- Early dependency checking at CLI entry point
- Clear, actionable error messages with installation instructions
- Specific recommendations for pipx, pip, and development installations
Enhanced YAML configuration error handling
- Auto-fallback to JSON when PyYAML is not installed
- Clear error messages for YAML-related import errors
- Helpful suggestions for installing optional dependencies
Updated README.md with:
- --doctor command in Quick Start section
- Reference to TROUBLESHOOTING.md
- Better troubleshooting guidance

Fixed

Improved user experience when dependencies are missing (no more confusing tracebacks)
Better handling of optional dependency errors (PyYAML, Playwright)

Installation

```bash
pip install docpull
docpull --doctor # Verify installation
```

Full Changelog

https://github.com/raintree-technology/docpull/blob/main/CHANGELOG.md#110---2025-11-14

Releases: raintree-technology/docpull
