GatomIA Code Wiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases
AI-Powered Repository Documentation Generation • Multi-Language Support • Architecture-Aware Analysis
Generate holistic, structured documentation for large-scale codebases • Cross-module interactions • Visual artifacts and diagrams
Quick Start • CLI Commands • Output Structure • Paper
⚠️ IMPORTANT: GatoWiki v0.25.5 requires GitHub Copilot
This version uses GitHub Copilot agents instead of direct API calls.
Integration Guide
# Install from source
pip install git+https://github.com/eitatech/gatomia-wiki.git
# Verify installation
gatowiki --version # Should show 2.0.0+
Prerequisites:
- GitHub Copilot subscription (Individual, Business, or Enterprise)
- IDE with Copilot support (VS Code, IntelliJ, etc.)
# No API key configuration needed!
# GitHub Copilot handles authentication
Open GitHub Copilot Chat in your IDE and simply say:
Generate documentation
That's it! The agent will automatically:
- Run `gatowiki analyze` if needed
- Detect all modules in your repository
- Generate comprehensive documentation
- Create architecture diagrams
Other commands:
Update documentation # Skip existing
Document the cli module # Single module
gatowiki publish --github-pages --create-branch
┌────────────────────────────────────────────────────┐
│              "Generate documentation"              │
│           │                                        │
│           ▼                                        │
│   ┌────────────────┐   ┌───────────────────────┐   │
│   │ Auto-analyze   │──▶│ Generate docs for     │   │
│   │ (if needed)    │   │ each module           │   │
│   └────────────────┘   └───────────────────────┘   │
│                                    │               │
│                                    ▼               │
│                        ┌───────────────────────┐   │
│                        │ Write files to        │   │
│                        │ docs/*.md             │   │
│                        └───────────────────────┘   │
└────────────────────────────────────────────────────┘
GitHub Copilot Agent
GatoWiki is an open-source framework for automated repository-level documentation across seven programming languages. It generates holistic, architecture-aware documentation that captures not only individual functions but also their cross-file, cross-module, and system-level interactions.
| Innovation | Description | Impact |
|---|---|---|
| Hierarchical Decomposition | Dynamic programming-inspired strategy that preserves architectural context | Handles codebases of arbitrary size (86K-1.4M LOC tested) |
| Recursive Agentic System | Adaptive multi-agent processing with dynamic delegation capabilities | Maintains quality while scaling to repository-level scope |
| Multi-Modal Synthesis | Generates textual documentation, architecture diagrams, data flows, and sequence diagrams | Comprehensive understanding from multiple perspectives |
Python • Java • JavaScript • TypeScript • C • C++ • C#
# Run dependency analysis and module clustering
gatowiki analyze
# Custom output directory
gatowiki analyze --output ./documentation
# Filter by languages
gatowiki analyze --languages python,typescript
# Limit module depth
gatowiki analyze --max-depth 3
# Enable verbose logging
gatowiki analyze --verbose
What it does:
- Parses source code with Tree-sitter
- Builds dependency graphs
- Clusters modules hierarchically
- Generates `module_tree.json` and `first_module_tree.json`
- Does NOT call any LLM APIs
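To make the dependency-graph step concrete, here is a minimal sketch. It is not GatoWiki's implementation: it assumes Python-only sources and swaps in the standard-library `ast` module for Tree-sitter, and the `import_graph` helper and the sample module names are hypothetical.

```python
import ast


def import_graph(sources: dict[str, str]) -> dict[str, set[str]]:
    """Map each module name to the set of in-repo modules it imports.

    `sources` maps a module name to its source text. This is an
    illustrative stand-in for a Tree-sitter based parser.
    """
    graph: dict[str, set[str]] = {name: set() for name in sources}
    for name, code in sources.items():
        for node in ast.walk(ast.parse(code)):
            if isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                targets = [node.module]
            else:
                continue
            # Keep only edges that point at other modules inside the repository.
            graph[name].update(t for t in targets if t in sources and t != name)
    return graph


# Hypothetical three-module repository.
example = {
    "cli": "import analyzer\nimport os\n",
    "analyzer": "from parser import parse\n",
    "parser": "import re\n",
}
graph = import_graph(example)
```

Imports of modules outside the repository (`os`, `re`) are dropped, so the resulting graph contains only the edges a clustering step would care about.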
Open GitHub Copilot Chat and use simple commands:
Generate documentation # Full repository
Update documentation # Skip existing docs
Document the cli module # Single module
Regenerate all documentation # Overwrite all
The agent automatically:
- Runs `gatowiki analyze` if module_tree.json is missing
- Detects all modules in your codebase
- Generates comprehensive docs with diagrams
- Skips already-documented modules (unless regenerating)
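The agent's decision rules above can be sketched as a small planning function. This is an illustration only: the `plan_run` helper is hypothetical, and it assumes each module's documentation lives in `<module>.md` inside the output directory, as the output layout suggests.

```python
import tempfile
from pathlib import Path


def plan_run(docs_dir: Path, modules: list[str], regenerate: bool = False) -> dict:
    """Decide whether analysis is needed and which modules still need docs."""
    # Analysis is required only when module_tree.json has not been produced yet.
    needs_analysis = not (docs_dir / "module_tree.json").exists()
    # Skip modules that already have a Markdown file, unless regenerating.
    todo = [
        m for m in modules
        if regenerate or not (docs_dir / f"{m}.md").exists()
    ]
    return {"needs_analysis": needs_analysis, "modules_to_document": todo}


# Demo with a throwaway directory in which only cli.md already exists.
with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp)
    (docs / "cli.md").write_text("# cli\n")
    plan = plan_run(docs, ["cli", "core"])
    full = plan_run(docs, ["cli", "core"], regenerate=True)
```

In the demo, `plan` corresponds to "Update documentation" (only `core` is left to document) and `full` to "Regenerate all documentation" (both modules are rewritten).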
# Generate HTML viewer
gatowiki publish --github-pages
# Create gh-pages branch
gatowiki publish --github-pages --create-branch
# Custom output
gatowiki publish --output ./documentation --github-pages
# Set default output directory
gatowiki config set --output ./docs
# Show current configuration
gatowiki config show
Note: API key configuration removed in v0.25.5. GitHub Copilot handles authentication.
Generated documentation includes both textual descriptions and visual artifacts for comprehensive understanding.
- Repository overview with architecture guide
- Module-level documentation with API references
- Usage examples and implementation patterns
- Cross-module interaction analysis
- System architecture diagrams (Mermaid)
- Data flow visualizations
- Dependency graphs and module relationships
- Sequence diagrams for complex interactions
./docs/
├── overview.md # Repository overview (start here!)
├── module1.md # Module documentation
├── module2.md # Additional modules...
├── module_tree.json # Hierarchical module structure (from analyze)
├── first_module_tree.json # Initial clustering result (from analyze)
├── analysis_metadata.json # Analysis statistics (from analyze)
└── index.html # Interactive viewer (from publish)
Analysis Phase (gatowiki analyze):
- Generates: `module_tree.json`, `first_module_tree.json`, `analysis_metadata.json`
Documentation Phase (GitHub Copilot):
- Generates: `overview.md`, `module1.md`, `module2.md`, etc.
Publishing Phase (gatowiki publish):
- Generates: `index.html` for GitHub Pages
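The phase-to-file mapping above lends itself to a quick status check. The sketch below is hypothetical (GatoWiki does not necessarily ship such a helper). It treats the documentation phase as complete once `overview.md` exists, because per-module file names vary by repository.

```python
# File names are taken from the output structure listed above.
PHASE_OUTPUTS = {
    "analyze": ["module_tree.json", "first_module_tree.json", "analysis_metadata.json"],
    "document": ["overview.md"],  # assumption: overview.md marks this phase as done
    "publish": ["index.html"],
}


def completed_phases(existing: set[str]) -> list[str]:
    """Return the phases whose expected files are all present, in pipeline order."""
    return [
        phase
        for phase, files in PHASE_OUTPUTS.items()
        if all(name in existing for name in files)
    ]


# Example: analysis and documentation have run, publishing has not.
found = {
    "module_tree.json",
    "first_module_tree.json",
    "analysis_metadata.json",
    "overview.md",
    "module1.md",
}
status = completed_phases(found)
```

Passing the set of file names found in the output directory (for example from `os.listdir`) gives a quick view of which step to run next.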
GatoWiki has been evaluated on CodeWikiBench, the first benchmark specifically designed for repository-level documentation quality assessment.
| Language Category | GatoWiki (Sonnet-4) | DeepWiki | Improvement |
|---|---|---|---|
| High-Level (Python, JS, TS) | 79.14% | 68.67% | +10.47% |
| Managed (C#, Java) | 68.84% | 64.80% | +4.04% |
| Systems (C, C++) | 53.24% | 56.39% | -3.15% |
| Overall Average | 68.79% | 64.06% | +4.73% |
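The Improvement column is the difference between the two scores in percentage points, not a relative change. A short check, with the figures copied from the table above:

```python
# (GatoWiki score, DeepWiki score) per category, in percent.
scores = {
    "High-Level (Python, JS, TS)": (79.14, 68.67),
    "Managed (C#, Java)": (68.84, 64.80),
    "Systems (C, C++)": (53.24, 56.39),
    "Overall Average": (68.79, 64.06),
}

# Improvement = GatoWiki score minus DeepWiki score, in percentage points.
improvement = {
    category: round(gatowiki - deepwiki, 2)
    for category, (gatowiki, deepwiki) in scores.items()
}

for category, delta in improvement.items():
    print(f"{category}: {delta:+.2f} pp")
```

The negative value for the Systems category reflects the one group where DeepWiki scores higher.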
| Repository | Language | LOC | GatoWiki-Sonnet-4 | DeepWiki | Improvement |
|---|---|---|---|---|---|
| All-Hands-AI--OpenHands | Python | 229K | 82.45% | 73.04% | +9.41% |
| puppeteer--puppeteer | TypeScript | 136K | 83.00% | 64.46% | +18.54% |
| sveltejs--svelte | JavaScript | 125K | 71.96% | 68.51% | +3.45% |
| Unity-Technologies--ml-agents | C# | 86K | 79.78% | 74.80% | +4.98% |
| elastic--logstash | Java | 117K | 57.90% | 54.80% | +3.10% |
View comprehensive results: see the paper for the complete evaluation on 21 repositories spanning all supported languages.
GatoWiki employs a three-stage process for comprehensive documentation generation:
- Hierarchical Decomposition: Uses dynamic programming-inspired algorithms to partition repositories into coherent modules while preserving architectural context across multiple granularity levels.
- Recursive Multi-Agent Processing: Implements adaptive multi-agent processing with dynamic task delegation, allowing the system to handle complex modules at scale while maintaining quality.
- Multi-Modal Synthesis: Integrates textual descriptions with visual artifacts including architecture diagrams, data-flow representations, and sequence diagrams for comprehensive understanding.
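As a rough illustration of the first stage, the sketch below splits a module tree top-down until every unit fits a size budget. It is a generic recursive split under stated assumptions, not the dynamic programming-inspired algorithm itself, which this README does not spell out. The `Node` type, the `decompose` function, and the LOC budget are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    loc: int = 0  # lines of code held directly by this node
    children: list["Node"] = field(default_factory=list)

    def total_loc(self) -> int:
        return self.loc + sum(child.total_loc() for child in self.children)


def decompose(node: Node, budget: int, path: str = "") -> list[tuple[str, int]]:
    """Return (path, LOC) units, splitting any subtree that exceeds the budget."""
    full = f"{path}/{node.name}" if path else node.name
    # A subtree that fits the budget (or cannot be split further) is one unit.
    if node.total_loc() <= budget or not node.children:
        return [(full, node.total_loc())]
    # Otherwise keep the node's own code as a unit and recurse into its children.
    units = [(full, node.loc)] if node.loc else []
    for child in node.children:
        units.extend(decompose(child, budget, full))
    return units


# Hypothetical repository: 'core' is too large and gets split, 'cli' is kept whole.
repo = Node("repo", children=[
    Node("cli", loc=400),
    Node("core", children=[Node("parser", loc=900), Node("graph", loc=700)]),
])
units = decompose(repo, budget=1000)
```

Each returned path keeps its parent prefix (`repo/core/parser`), which is one simple way to carry architectural context down to the units that are documented individually.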
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Codebase      │───▶│  Hierarchical    │───▶│  Multi-Agent    │
│   Analysis      │    │  Decomposition   │    │  Processing     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │                       │
                                ▼                       ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Visual         │◀───│  Multi-Modal     │◀───│  Structured     │
│  Artifacts      │    │  Synthesis       │    │  Content        │
└─────────────────┘    └──────────────────┘    └─────────────────┘
- Python 3.12+
- GitHub Copilot (Individual, Business, or Enterprise subscription)
- IDE with Copilot support (VS Code, IntelliJ IDEA, Visual Studio, etc.)
- Node.js (optional, for Mermaid diagram validation)
- Git (optional, for branch creation features)
GitHub Copilot Integration (v0.25.5+):
- 📖 GitHub Copilot Integration Guide - Complete workflow, setup, and usage guide
- 🔄 Migration Guide - Migrating from API version to v0.25.5
- 🎨 Agent Customization Guide - Customize agents for your team
General Resources:
- 🐳 Docker Deployment - Containerized deployment instructions
- 🛠️ Development Guide - Project structure, architecture, and contributing guidelines
- 📊 GatoWikiBench - Repository-level documentation benchmark
- 🎬 Live Demo - Interactive demo and examples
- Paper - Full research paper with detailed methodology and results from the original research on which GatoWiki is based.
This project is licensed under the MIT License.