The MLE Reasoning environment is a benchmark harness for evaluating AI agents on Machine Learning Error (MLE) analysis tasks.
The benchmark tests an agent's ability to:
- Diagnose ML-specific bugs in Python code (data leakage, scaling errors, encoding issues, etc.)
- Localize issues to specific code locations
- Explain why the issue causes problems
- Propose and implement fixes
- Verify the fixes work correctly
mle-reasoning-environment/
├── Dockerfile # Container definition (tools only)
├── requirements.txt # Python dependencies
├── README.md # Quick start guide
├── TESTING_GUIDE.md # This file
│
├── tools/ # Core harness code
│ ├── tools.py # 5 agent tools (read, write, list, python, bash)
│ ├── agent.py # Agent loop with LLM integration
│ ├── llm.py # LiteLLM wrapper for multiple providers
│ ├── evaluator.py # Rubric-based scoring with LLM judge
│ ├── run_agent.py # CLI for running tasks
│ ├── mcp_server.py # MCP server exposing tools
│ └── agent_server.py # Reasoning environments integration
│
├── tests/ # Test suites
│ ├── run_all_tests.py # Run all test suites
│ ├── test_harness.py # Tool tests (40 tests)
│ ├── test_mcp_server.py # MCP server tests (16 tests)
│ ├── test_comprehensive.py # Full harness tests (61 tests)
│ └── test_evaluator.py # Evaluator tests (34 tests, requires API key)
│
├── tasks/ # Task JSON files (Reasoning environments format)
│ └── task_*.json
│
└── files/ # Static files for container
└── README.md
| Tool | Description | Parameters |
|---|---|---|
| `read_file` | Read file contents | `path` |
| `write_file` | Write/create files (auto-creates directories) | `path`, `content` |
| `list_files` | List directory contents | `path` |
| `run_python` | Execute Python script with timeout | `script_path`, `timeout` |
| `bash_exec` | Execute bash command with timeout | `command`, `timeout`, `cwd` |
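The exact call signatures live in `tools/tools.py`; as a rough sketch of a typical agent turn, assuming each tool is importable as a plain Python function with the parameters listed above (an assumption, not the confirmed API):

```python
# Hypothetical usage sketch -- assumes tools/tools.py exposes functions
# matching the table above; check the module for the real signatures.
from tools import write_file, run_python, read_file

# Write a small probe script into the task workspace...
write_file(path="probe.py", content="print('leakage check')\n")

# ...execute it with a timeout (seconds)...
result = run_python(script_path="probe.py", timeout=30)
print(result)

# ...and read back any artifact it produced.
print(read_file(path="probe.py"))
```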
# Install dependencies
cd mle-reasoning-environment
pip install -r requirements.txt
# For evaluator tests, set API key
export OPENAI_API_KEY="your-key"cd mle-reasoning-environment/tests
# Run individual test suites
python test_harness.py # 40 tool tests (no API key needed)
python test_mcp_server.py # 16 MCP tests (no API key needed)
python test_comprehensive.py # 61 comprehensive tests (no API key needed)
python test_evaluator.py # 34 evaluator tests (requires API key)
# Run all tests at once
python run_all_tests.py

| Test File | Tests | API Key? | What It Tests |
|---|---|---|---|
| `test_harness.py` | 40 | No | All 5 tools with edge cases |
| `test_mcp_server.py` | 16 | No | MCP server and tool exposure |
| `test_comprehensive.py` | 61 | No | Tools, agent, schema, integration |
| `test_evaluator.py` | 34 | Yes | Rubric parsing, file checks, LLM grading |

Total: 151 tests
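If you prefer to orchestrate the suites yourself rather than use `run_all_tests.py`, a minimal runner might look like the sketch below; the suite list and the API-key gate are assumptions mirroring the table above.

```python
# Sketch of a suite runner -- roughly what tests/run_all_tests.py
# presumably does; suite names and the key check are assumptions.
import os
import subprocess
import sys

SUITES = [
    "test_harness.py",
    "test_mcp_server.py",
    "test_comprehensive.py",
    "test_evaluator.py",  # needs OPENAI_API_KEY
]

for suite in SUITES:
    if suite == "test_evaluator.py" and not os.environ.get("OPENAI_API_KEY"):
        print(f"skipping {suite}: OPENAI_API_KEY not set")
        continue
    print(f"=== {suite} ===")
    proc = subprocess.run([sys.executable, suite])
    if proc.returncode != 0:
        sys.exit(proc.returncode)
```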
cd mle-reasoning-environment
docker build -t mle-reasoning-environment .

# Mount your tasks and run
docker run --rm \
-e OPENAI_API_KEY=$OPENAI_API_KEY \
-v $(pwd)/tasks:/workspace/tasks \
mle-reasoning-environment \
python run_agent.py \
--task /workspace/tasks/task_error-analysis-1-dev.json \
--model openai/gpt-4o-mini

# Mount tests folder and run
docker run --rm \
-v $(pwd)/tests:/workspace/tests \
-v $(pwd)/tasks:/workspace/tasks \
mle-reasoning-environment \
python /workspace/tests/test_comprehensive.py

docker run --rm \
-e OPENAI_API_KEY=$OPENAI_API_KEY \
-v $(pwd)/tasks:/workspace/tasks \
mle-reasoning-environment \
python run_agent.py \
--task /workspace/tasks/task_error-analysis-1-dev.json \
--model openai/gpt-4o-mini \
--no-eval

cd mle-reasoning-environment/tools
# Install dependencies
pip install -r ../requirements.txt
# Run a task
python run_agent.py \
--task ../tasks/task_error-analysis-1-dev.json \
--model openai/gpt-4o-mini
# Run all tasks
python run_agent.py \
--tasks-dir ../tasks \
--output-dir ../results \
--model openai/gpt-4o-mini

| Argument | Description | Default |
|---|---|---|
| `--task` | Path to single task JSON | - |
| `--tasks-dir` | Directory of task JSONs | `tasks` |
| `--model` | LLM model (provider/model) | `openai/gpt-4o-mini` |
| `--output-dir` | Where to save results | `results` |
| `--no-eval` | Skip evaluation (faster) | `False` |
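The CLI can also be driven programmatically; a sketch using the flags from the table (paths are illustrative):

```python
# Sketch: drive run_agent.py from Python instead of the shell.
# Flags match the arguments table above; paths are illustrative.
import subprocess
import sys

cmd = [
    sys.executable, "run_agent.py",
    "--tasks-dir", "../tasks",
    "--output-dir", "../results",
    "--model", "openai/gpt-4o-mini",
    "--no-eval",  # drop this flag to include LLM-judged evaluation
]
subprocess.run(cmd, check=True)
```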
| Provider | Model Examples |
|---|---|
| OpenAI | `openai/gpt-4o-mini`, `openai/gpt-4o` |
| Anthropic | `anthropic/claude-3-5-sonnet-latest` |
| Google | `google/gemini-1.5-pro` |
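Model strings are passed through to LiteLLM (wrapped in `tools/llm.py`), so anything LiteLLM accepts in provider/model form should work. For reference, a bare LiteLLM call looks roughly like this, independent of the harness's own wrapper:

```python
# Minimal LiteLLM call -- the harness's llm.py wraps something like this.
# Requires the matching provider key (e.g. OPENAI_API_KEY) in the environment.
from litellm import completion

response = completion(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Name one common cause of data leakage."}],
)
print(response.choices[0].message.content)
```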
{
"task_id": "error-analysis-1-dev",
"prompt": [{"type": "text", "content": "...task description..."}],
"rubrics": [
{
"name": "IDENTIFY_ISSUE_TYPE",
"weight": 5,
"messages": [{"type": "text", "content": "...criterion..."}]
}
],
"rubric_text": "+5|IDENTIFY_ISSUE_TYPE|...",
"task_files": {
"src/code.py": "...code content...",
"data/train.csv": "...data..."
},
"task_dir": "/app",
"use_docker": true,
"max_turns": 50
}
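Task files can be inspected with just the standard library; a minimal sketch that follows the field layout above (the scratch-directory staging is an assumption about how the harness materializes `task_files`):

```python
# Load a task file and materialize its bundled files into a scratch dir.
# Field names follow the example above; the scratch-dir layout is an assumption.
import json
from pathlib import Path

task = json.loads(Path("tasks/task_error-analysis-1-dev.json").read_text())

print(task["task_id"], "max_turns:", task["max_turns"])
for rubric in task["rubrics"]:
    print(f"  +{rubric['weight']} {rubric['name']}")

# Write task_files to disk the way the harness presumably stages them.
workdir = Path("scratch") / task["task_id"]
for rel_path, content in task["task_files"].items():
    dest = workdir / rel_path
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(content)
```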
For MCP-compatible clients:

cd mle-reasoning-environment/tools
python mcp_server.py

This exposes all 5 tools via the Model Context Protocol on stdio.
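A client can also connect programmatically; a sketch using the official `mcp` Python SDK, assuming it is installed and you run from the `tools/` directory:

```python
# Connect to the MCP server over stdio and list the exposed tools.
# Assumes the official `mcp` Python SDK; run from the tools/ directory.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    params = StdioServerParameters(command="python", args=["mcp_server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            for tool in tools.tools:
                print(tool.name, "-", tool.description)

asyncio.run(main())
```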
A successful task run produces:
{
"task_id": "error-analysis-1-dev",
"model": "openai/gpt-4o-mini",
"answer": "FINAL ANSWER: ...",
"metadata": {
"turns": 14,
"tool_calls": 16,
"errors": []
},
"evaluation": {
"score": 75,
"total_possible": 90,
"percentage": 83.3,
"results": [...]
}
}
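Since each run writes one such JSON, scores across a results directory can be aggregated in a few lines; the one-file-per-run layout is an assumption:

```python
# Summarize scores across a results directory.
# Assumes one JSON file per task run, shaped like the example above.
import json
from pathlib import Path

for path in sorted(Path("results").glob("*.json")):
    run = json.loads(path.read_text())
    ev = run.get("evaluation")
    if ev is None:  # run executed with --no-eval
        print(f"{run['task_id']}: no evaluation")
        continue
    print(f"{run['task_id']}: {ev['score']}/{ev['total_possible']} ({ev['percentage']:.1f}%)")
```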
| Code | Description |
|---|---|
| `DATA_LEAKAGE` | Model sees validation/test data during training |
| `OUTLIER` | Extreme samples not handled properly |
| `SCALING_ERROR` | Features scaled with wrong statistics |
| `ENCODING_ERROR` | Categorical variables encoded incorrectly |
| `IMBALANCE` | Class distribution skewed and not handled |
| `OVERFITTING` | Good train, poor validation performance |
| `UNDERFITTING` | Poor performance on both splits |
| `NONE` | No ML-specific errors present |
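For scripted checks of an agent's answer against this taxonomy, a small helper like the following may be useful; the convention that the code appears in the FINAL ANSWER line is inferred from the output example above, not guaranteed by the harness:

```python
# The set of valid issue codes, useful for checking an agent's final answer.
# The "FINAL ANSWER: <CODE>" convention is taken from the output example above.
ISSUE_CODES = {
    "DATA_LEAKAGE", "OUTLIER", "SCALING_ERROR", "ENCODING_ERROR",
    "IMBALANCE", "OVERFITTING", "UNDERFITTING", "NONE",
}

def extract_code(answer: str) -> str | None:
    """Return the first known issue code mentioned in a final answer."""
    for token in answer.replace(":", " ").split():
        if token.strip(".,") in ISSUE_CODES:
            return token.strip(".,")
    return None

assert extract_code("FINAL ANSWER: DATA_LEAKAGE in train/test split") == "DATA_LEAKAGE"
```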
open -a Docker # macOS
docker ps # verify running

echo $OPENAI_API_KEY | head -c 10 # verify set
export OPENAI_API_KEY="sk-..." # set if missing

# Make sure you're in the right directory
cd mle-reasoning-environment/tests
python test_harness.py

# Or set PYTHONPATH manually
export PYTHONPATH=/path/to/mle-reasoning-environment/tools
- `test_harness.py` - 40/40 pass
- `test_mcp_server.py` - 16/16 pass
- `test_comprehensive.py` - 61/61 pass
- `test_evaluator.py` - 34/34 pass (with API key)
- Docker builds successfully
- Real task completes with evaluation