AgentDebug is a framework for understanding, detecting, and recovering from LLM agent failures. It provides:
- AgentErrorTaxonomy: A classification system covering 17 error types across 5 modules (memory, reflection, planning, action, system).
- AgentErrorBench: Annotated failure trajectories from ALFWorld, GAIA, and WebShop environments.
- AgentDebug Framework: A two-stage debugging pipeline that isolates root-cause failures and provides corrective feedback.
```bash
git clone https://github.com/ulab-uiuc/AgentDebug.git
cd AgentDebug
pip install -e .
```

AgentDebug includes vendored environments (ALFWorld, WebShop, GAIA) with modular agent prompts. The rollout system requires these environments to be functional.
API Keys: Set the following environment variables (or put them in a .env file):

```bash
export OPENAI_API_KEY="..."
export ANTHROPIC_API_KEY="..."   # optional
export GEMINI_API_KEY="..."      # optional
export TOGETHER_API_KEY="..."    # optional
```

```
AgentDebug/
├── detector/                          # Core error detection framework
│   ├── fine_grained_analysis.py       # Phase 1: step-level, per-module error detection
│   ├── critical_error_detection.py    # Phase 2: critical error identification
│   └── error_definitions.py           # Error taxonomy (5 modules, 17 types)
├── agentdebug/
│   ├── engines/                       # Multi-provider LLM abstraction
│   │   ├── openai.py                  # OpenAI (GPT-4o, GPT-4.1, etc.)
│   │   ├── anthropic.py               # Anthropic (Claude)
│   │   ├── gemini.py                  # Google (Gemini)
│   │   └── together.py                # Together AI (Llama, Qwen, etc.)
│   ├── environments/                  # Environment wrappers + modular prompts
│   │   ├── alfworld/                  # ALFWorld embodied tasks
│   │   ├── webshop/                   # WebShop e-commerce tasks
│   │   └── gaia/                      # GAIA general AI assistant tasks
│   └── rollout/                       # Trajectory collection
│       ├── rollout.py                 # Unified rollout across all environments
│       └── step_to_episode.py         # Step-level → episode-level conversion
├── examples/                          # Sample data and demo scripts
└── docs/                              # Documentation
```
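The `step_to_episode.py` conversion listed above regroups step-level JSONL records into one record per episode. A minimal sketch of the idea using only the standard library — the field names (`episode_id`, `step_idx`) are illustrative, not the project's actual schema:

```python
import json
from collections import defaultdict

def steps_to_episodes(jsonl_lines):
    """Group step-level JSONL records into episode-level records.

    `episode_id` and `step_idx` are hypothetical field names used for
    illustration only.
    """
    episodes = defaultdict(list)
    for line in jsonl_lines:
        record = json.loads(line)
        episodes[record["episode_id"]].append(record)
    # Order each episode's steps and emit one record per episode
    return [
        {"episode_id": eid, "steps": sorted(steps, key=lambda r: r["step_idx"])}
        for eid, steps in episodes.items()
    ]

lines = [
    '{"episode_id": "ep0", "step_idx": 1, "action": "open drawer 1"}',
    '{"episode_id": "ep0", "step_idx": 0, "action": "go to desk 1"}',
    '{"episode_id": "ep1", "step_idx": 0, "action": "search[red shoes]"}',
]
episodes = steps_to_episodes(lines)
print(len(episodes))  # 2 episodes
```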
```python
import asyncio

from detector.fine_grained_analysis import ErrorTypeDetector
from detector.critical_error_detection import CriticalErrorAnalyzer

# Configure with your API key
api_config = {
    "base_url": "https://api.openai.com/v1/chat/completions",
    "api_key": "your-api-key",
    "model": "gpt-4o-mini",
    "temperature": 0.0,
    "max_retries": 3,
    "timeout": 60,
}

async def main():
    # Phase 1: step-level error detection
    detector = ErrorTypeDetector(api_config)
    trajectory_data = detector.parse_trajectory("path/to/trajectory.json")
    phase1_results = await detector.analyze_trajectory(trajectory_data)

    # Phase 2: critical error identification
    analyzer = CriticalErrorAnalyzer(api_config)
    return await analyzer.identify_critical_error(phase1_results, trajectory_data)

# The analysis methods are async, so drive them from an event loop
critical_error = asyncio.run(main())
```
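The two-phase flow can be understood with a toy mock — everything below is illustrative stand-in code, not the real detector API: Phase 1 produces a per-step, per-module error report, and Phase 2 isolates the earliest flagged step as the root cause, mirroring the pipeline's root-cause isolation.

```python
import asyncio

# Toy stand-ins for the two phases. The real classes call an LLM; here
# Phase 1 reads hypothetical error annotations off each step, and Phase 2
# picks the earliest flagged step as the critical (root-cause) error.

async def phase1_detect(trajectory):
    # One report per step: which module erred and how, if any
    return [
        {"step": i, "module": s.get("error_module"), "error": s.get("error_type")}
        for i, s in enumerate(trajectory)
    ]

async def phase2_critical(phase1_results):
    # Root cause = earliest step with a flagged error
    for report in phase1_results:
        if report["error"] is not None:
            return report
    return None

trajectory = [
    {"action": "go to desk 1"},
    {"action": "open drawer 99", "error_module": "planning",
     "error_type": "impossible_action"},
    {"action": "take mug 1", "error_module": "action",
     "error_type": "invalid_action"},
]

async def main():
    reports = await phase1_detect(trajectory)
    return await phase2_critical(reports)

critical = asyncio.run(main())
print(critical)  # {'step': 1, 'module': 'planning', 'error': 'impossible_action'}
```

Later errors (here, the step-2 action error) are ignored once the earliest root cause is found; in the real framework that decision is made by an LLM judge rather than a first-match rule.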
```bash
# ALFWorld rollout with Together AI (cheap, fast)
python -m agentdebug.rollout.rollout \
    --env alfworld \
    --provider together \
    --model meta-llama/Llama-3.3-70B-Instruct-Turbo \
    --unique_envs \
    --total_envs 100 \
    --concurrency 4 \
    --dump_path output/alfworld_steps.jsonl

# Convert steps to episodes
python -m agentdebug.rollout.step_to_episode \
    --input_jsonl output/alfworld_steps.jsonl \
    --output_jsonl output/alfworld_episodes.jsonl
```

```python
from agentdebug.engines import create_chat_model

# Automatically routes to the correct provider based on the model name
model = create_chat_model("gpt-4o-mini")                              # → OpenAI
model = create_chat_model("claude-sonnet-4-6")                        # → Anthropic
model = create_chat_model("gemini-2.5-flash")                         # → Gemini
model = create_chat_model("meta-llama/Llama-3.3-70B-Instruct-Turbo")  # → Together

response = model("What is 2+2?")
```

| Module | Error Types | Description |
|---|---|---|
| Memory | hallucination, memory_retrieval_failure, over_simplification | Agent misremembers or fails to recall information |
| Reflection | progress_misjudge, outcome_misinterpretation, causal_misattribution, hallucination | Agent incorrectly evaluates its own progress |
| Planning | constraint_ignorance, impossible_action, inefficient_plan | Agent creates flawed plans |
| Action | misalignment, invalid_action, format_error, parameter_error | Agent executes wrong actions |
| System | step_limit, tool_execution_error, llm_limit, environment_error | External system failures |
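The table above maps directly to a small lookup structure. Note that `hallucination` appears under both Memory and Reflection, which is why the distinct count is 17 even though the rows list 18 entries. A sketch — the authoritative definitions live in `detector/error_definitions.py`:

```python
# Error taxonomy as listed in the table above (5 modules).
ERROR_TAXONOMY = {
    "memory": ["hallucination", "memory_retrieval_failure", "over_simplification"],
    "reflection": ["progress_misjudge", "outcome_misinterpretation",
                   "causal_misattribution", "hallucination"],
    "planning": ["constraint_ignorance", "impossible_action", "inefficient_plan"],
    "action": ["misalignment", "invalid_action", "format_error", "parameter_error"],
    "system": ["step_limit", "tool_execution_error", "llm_limit", "environment_error"],
}

# "hallucination" is shared by memory and reflection, so the set is smaller
distinct = {err for errors in ERROR_TAXONOMY.values() for err in errors}
print(len(ERROR_TAXONOMY), len(distinct))  # 5 modules, 17 distinct error types
```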
Download annotated failure trajectories: AgentErrorBench on Google Drive
- ALFWorld: 100 trajectories from embodied agent tasks
- GAIA: 50 trajectories from general AI assistant tasks
- WebShop: 50 trajectories from web navigation tasks
| Metric | Improvement |
|---|---|
| All-Correct Accuracy | +24% |
| Step Accuracy | +17% |
| Task Success Rate | Up to +26% |
```bibtex
@article{agentdebug2025,
  title={Where LLM Agents Fail and How They Can Learn From Failures},
  author={Zhu, Kunlun and Liu, Zijia and Li, Bingxuan and Tian, Muxin and Yang, Yingxuan and Zhang, Jiaxun and others},
  journal={arXiv preprint arXiv:2509.25370},
  year={2025}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
We welcome contributions! Please feel free to submit issues, create pull requests, or reach out for collaborations.
