A comprehensive open-source framework for benchmarking Large Language Models on business conversation capabilities
bizCon is a specialized evaluation framework designed to benchmark Large Language Models (LLMs) on realistic business conversation scenarios. Unlike generic benchmarks, bizCon focuses on practical business use cases involving professional communication, tool integration, and domain-specific knowledge.
- Business-Focused: Evaluates models on real-world business scenarios
- Multi-Dimensional: Assesses 5 key aspects of business communication
- Tool Integration: Tests models' ability to use business tools effectively
- Comparative Analysis: Benchmark multiple models side-by-side
- Enterprise-Ready: Professional reporting and analysis capabilities
- Product Inquiries: Enterprise software consultations
- Technical Support: Complex troubleshooting and API integration
- Contract Negotiation: SaaS agreements and enterprise deals
- Appointment Scheduling: Multi-stakeholder coordination
- Compliance Inquiries: Regulatory and data privacy questions
- Implementation Planning: Software deployment strategies
- Service Complaints: Customer service and dispute resolution
- Multi-Department: Cross-functional project coordination
- Response Quality (25%) - Factual accuracy and completeness
- Business Value (25%) - Strategic insight and actionable recommendations
- Communication Style (20%) - Professionalism and tone appropriateness
- Tool Usage (20%) - Effective integration with business tools
- Performance (10%) - Response time and efficiency
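The five weights above sum to 100%. The following is a minimal sketch of how they could be combined into one overall score, assuming a plain weighted average of per-dimension scores on a 0-10 scale. The source does not spell out the aggregation, so `overall_score` and the example scores are hypothetical; only the weight names and values come from this README.

```python
# Weights copied from the evaluation dimensions listed above.
WEIGHTS = {
    "response_quality": 0.25,
    "business_value": 0.25,
    "communication_style": 0.20,
    "tool_usage": 0.20,
    "performance": 0.10,
}


def overall_score(dimension_scores, weights=WEIGHTS):
    """Hypothetical helper: weighted average of per-dimension scores (0-10)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    missing = set(weights) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[name] * dimension_scores[name] for name in weights)


# Made-up scores, for illustration only.
example = {
    "response_quality": 8.0,
    "business_value": 7.0,
    "communication_style": 9.0,
    "tool_usage": 6.0,
    "performance": 10.0,
}
print(round(overall_score(example), 2))  # 7.75
```

Checking that the weights sum to 1.0 up front catches a mistyped weight before it silently skews every score.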
- Knowledge Base Search
- Product Catalog Lookup
- Pricing Calculator
- Appointment Scheduler
- Customer History Access
- Document Retrieval
- Order Management
- Support Ticket System
| 🤖 OpenAI | 🧠 Anthropic | 🌟 Mistral AI |
|---|---|---|
| GPT-4, GPT-3.5-turbo, GPT-4-turbo | Claude-3-opus, Claude-3-sonnet, Claude-3-haiku | Mistral-large, Mistral-medium, Mistral-small |
# Clone the repository
git clone https://github.com/Olib-AI/bizcon.git
cd bizcon
# Basic installation
pip install -e .
# Install with advanced visualization features (use quotes for zsh)
pip install -e ".[advanced]"
# Install all optional features
pip install -e ".[all]"
- Set up your API keys:
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export MISTRAL_API_KEY="your-mistral-key"
- Run a quick test:
# 🚀 Test without API keys (uses mock models)
python test_framework.py
# 🧪 Run unit and integration tests
python -m pytest tests/
# 🤖 Test with real models (requires API keys)
python test_with_real_models.py
- Run a benchmark:
# 📊 Compare models on specific scenarios
python run.py --scenarios product_inquiry_001 support_001 --verbose
# 🏃 Run full benchmark with custom config
python run.py --config config/models.yaml --output results/
# 💻 Using CLI interface directly
bizcon run --config config/models.yaml --output results/
- Explore available options:
# 📋 List all available scenarios
python run.py --list-scenarios
# or: bizcon list-scenarios
# 🤖 List supported models
python run.py --list-models
# or: bizcon list-models
Customize your evaluation in config/models.yaml:
models:
  - provider: openai
    name: gpt-4
    temperature: 0.7
    max_tokens: 2048
  - provider: anthropic
    name: claude-3-sonnet
    temperature: 0.7
    max_tokens: 2048
Adjust evaluation settings in config/evaluation.yaml:
evaluation:
  parallel: true
  num_runs: 3
  evaluator_weights:
    response_quality: 0.25
    business_value: 0.25
    communication_style: 0.20
    tool_usage: 0.20
    performance: 0.10
bizcon/
├── config/ # Configuration files
│ ├── models.yaml # Model configurations
│ └── evaluation.yaml # Evaluation settings
├── core/ # Core evaluation pipeline
│ ├── pipeline.py # Main evaluation orchestrator
│ └── runner.py # Scenario execution engine
├── models/ # LLM provider integrations
│ ├── openai.py # OpenAI client
│ ├── anthropic.py # Anthropic client
│ └── mistral.py # Mistral AI client
├── scenarios/ # Business conversation scenarios
│ ├── product_inquiry.py
│ ├── technical_support.py
│ └── contract_negotiation.py
├── evaluators/ # Evaluation metrics
│ ├── response_quality.py
│ ├── business_value.py
│ └── communication_style.py
├── tools/ # Business tool implementations
│ ├── knowledge_base.py
│ ├── scheduler.py
│ └── product_catalog.py
├── visualization/ # Advanced visualization and reporting
│ ├── charts.py # Static matplotlib charts
│ ├── interactive_charts.py # Interactive Plotly charts
│ ├── dashboard.py # Basic Flask dashboard
│ ├── advanced_dashboard.py # Advanced dashboard with filtering
│ ├── analysis_utils.py # Statistical analysis tools
│ └── report.py # Report generation
└── data/ # Sample business data
  ├── knowledge_base/
  ├── products/
  └── pricing/
from scenarios.base import BusinessScenario


class CustomBusinessScenario(BusinessScenario):
    def __init__(self, scenario_id=None):
        super().__init__(
            scenario_id=scenario_id or "custom_001",
            name="Custom Business Scenario",
            description="Your custom scenario description",
            industry="technology",
            complexity="medium",
            tools_required=["knowledge_base", "scheduler"]
        )

    def _initialize_conversation(self):
        return [{
            "user_message": "Your initial customer message",
            "expected_tool_calls": [
                {"tool_id": "knowledge_base", "parameters": {"query": "example"}}
            ]
        }]

    def _initialize_ground_truth(self):
        return {
            "expected_facts": ["Key fact 1", "Key fact 2"],
            "business_objective": "Help customer achieve X",
            "expected_tone": "professional"
        }

from evaluators.base import BaseEvaluator
class CustomEvaluator(BaseEvaluator):
    def __init__(self, weight=1.0):
        super().__init__(name="Custom Evaluator", weight=weight)

    def evaluate(self, response, scenario, turn_index, conversation_history, tool_calls):
        # Your evaluation logic here
        score = self.calculate_score(response)
        return {
            "score": score,
            "explanation": "Detailed explanation of the score",
            "max_possible": 10.0
        }
📈 Sample benchmark results
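To make the evaluator template more concrete, here is a self-contained sketch of an evaluator that scores how many of the expected tools a model actually called. It is an illustration under stated assumptions, not part of bizCon: `ToolCoverageEvaluator` is hypothetical, the `BaseEvaluator` stub only mimics the constructor used in the template, and `scenario` is simplified to the plain list of turns that `_initialize_conversation()` returns in the scenario example.

```python
class BaseEvaluator:
    """Stand-in for evaluators.base.BaseEvaluator so this sketch runs on its own."""

    def __init__(self, name, weight=1.0):
        self.name = name
        self.weight = weight


class ToolCoverageEvaluator(BaseEvaluator):
    """Hypothetical evaluator: fraction of expected tools the model called."""

    def __init__(self, weight=1.0):
        super().__init__(name="Tool Coverage", weight=weight)

    def evaluate(self, response, scenario, turn_index, conversation_history, tool_calls):
        # Assumption: `scenario` is a list of turn dicts, each with an
        # "expected_tool_calls" list, as in the scenario example.
        expected = {c["tool_id"] for c in scenario[turn_index]["expected_tool_calls"]}
        used = {c["tool_id"] for c in tool_calls}
        matched = expected & used
        # A turn that expects no tools gets full marks.
        score = 10.0 if not expected else 10.0 * len(matched) / len(expected)
        return {
            "score": score,
            "explanation": f"Called {len(matched)} of {len(expected)} expected tools",
            "max_possible": 10.0,
        }


turns = [{
    "user_message": "Your initial customer message",
    "expected_tool_calls": [
        {"tool_id": "knowledge_base", "parameters": {"query": "example"}},
        {"tool_id": "scheduler", "parameters": {}},
    ],
}]
result = ToolCoverageEvaluator().evaluate(
    response="...",
    scenario=turns,
    turn_index=0,
    conversation_history=[],
    tool_calls=[{"tool_id": "knowledge_base", "parameters": {"query": "example"}}],
)
print(result["score"])  # 5.0
```

The returned dictionary keeps the same three keys as the template (`score`, `explanation`, `max_possible`), so a real evaluator following this shape would only need to swap the stub for the actual base class and read the expected calls from the real scenario object.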
┌─────────────────┬─────────┬─────────────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ Model │ Overall │ Response │ Business │ Communication│ Tool Usage │ Performance │
│ │ Score │ Quality │ Value │ Style │ │ │
├─────────────────┼─────────┼─────────────┼─────────────┼─────────────┼─────────────┼─────────────┤
│ gpt-4 │ 8.2/10 │ 8.5/10 │ 8.1/10 │ 9.0/10 │ 7.8/10 │ 8.0/10 │
│ claude-3-sonnet │ 7.9/10 │ 8.2/10 │ 7.8/10 │ 8.8/10 │ 7.5/10 │ 7.2/10 │
│ claude-3-haiku │ 7.1/10 │ 7.3/10 │ 6.9/10 │ 8.0/10 │ 6.8/10 │ 8.5/10 │
│ gpt-3.5-turbo │ 6.8/10 │ 6.5/10 │ 6.2/10 │ 7.5/10 │ 6.0/10 │ 7.8/10 │
└─────────────────┴─────────┴─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘
- GPT-4: Response Quality (89%), Tool Usage (78%), Communication Style (90%)
- Claude-3-Sonnet: Response Quality (86%), Tool Usage (75%), Communication Style (88%)
- Claude-3-Haiku: Response Quality (73%), Tool Usage (68%), Communication Style (80%)
- 📊 Interactive HTML Report: Charts, breakdowns, and detailed analysis
- 📈 CSV Data Export: Raw scores for custom analysis and visualization
- 📝 Markdown Summary: Professional reports for sharing and documentation
- 🎯 Success Rate Analysis: Model performance across business scenarios
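As an illustration of custom analysis on the CSV export, the sketch below ranks models by overall score using only the standard library. The column names `model` and `overall_score` are assumptions, since the export schema is not documented here; the inline sample reuses the overall scores from the sample results table, and `rank_models` is a hypothetical helper.

```python
import csv
import io

# Inline stand-in for an exported CSV file; the column names are assumptions.
SAMPLE_CSV = """model,overall_score
gpt-4,8.2
claude-3-sonnet,7.9
claude-3-haiku,7.1
gpt-3.5-turbo,6.8
"""


def rank_models(csv_text):
    """Return (model, score) pairs sorted from highest to lowest score."""
    rows = csv.DictReader(io.StringIO(csv_text))
    scores = [(row["model"], float(row["overall_score"])) for row in rows]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


for model, score in rank_models(SAMPLE_CSV):
    print(f"{model}: {score:.1f}")
```

To use it on a real export, read the file contents into a string and adjust the two column names to whatever the generated CSV actually contains.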
# Run multiple scenarios in parallel
python run.py --scenarios product_inquiry_001 support_001 contract_001 --parallel
# Or using CLI directly
bizcon run --scenarios product_inquiry_001 support_001 --parallel
models:
  - provider: openai
    name: gpt-4
    temperature: 0.3
    max_tokens: 1024
    parameters:
      seed: 42
      top_p: 0.9
# Install advanced features first (use quotes for zsh)
pip install -e ".[advanced]"
# Launch interactive dashboard with advanced features
python examples/advanced_dashboard_demo.py --results-dir output/
# Launch on custom host/port with auto-refresh
python examples/advanced_dashboard_demo.py --host 0.0.0.0 --port 8080
# Disable auto-refresh for static analysis
python examples/advanced_dashboard_demo.py --no-auto-refresh
Note: Advanced visualization features require additional dependencies (Plotly, Flask, SciPy). Install with pip install "bizcon[advanced]" (quotes required for zsh) to enable these features.
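Before launching the dashboard, it can help to confirm that the optional dependencies named in the note are importable. A minimal sketch using the standard library; `missing_optional_packages` is a hypothetical helper, not a bizCon function, and the package list assumes the usual import names `plotly`, `flask`, and `scipy`.

```python
import importlib.util

# Import names of the optional packages mentioned in the note above.
OPTIONAL_PACKAGES = ("plotly", "flask", "scipy")


def missing_optional_packages(packages=OPTIONAL_PACKAGES):
    """Return the packages that cannot be found in the current environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = missing_optional_packages()
if missing:
    print("Missing optional packages:", ", ".join(missing))
else:
    print("All advanced visualization dependencies are available.")
```

`find_spec` only locates each package without importing it, so the check stays fast and has no side effects.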
# Run all product inquiry scenarios (quote the pattern so the shell does not expand it)
python run.py --scenarios "product_inquiry_*"
# Run scenarios by complexity
python run.py --scenarios "complex_*"
We welcome contributions from the community! Here's how you can help:
- 🐛 Report Bugs: Open an issue with detailed reproduction steps
- ✨ Suggest Features: Propose new scenarios, evaluators, or tools
- 📝 Improve Documentation: Help make our docs clearer
- 🔧 Submit Code: Fix bugs or add new features
- 🧪 Add Test Cases: Improve our test coverage
git clone https://github.com/Olib-AI/bizcon.git
cd bizcon
pip install -e .
# Run framework validation (no API keys needed)
python test_framework.py
# Run full test suite
python -m pytest tests/
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes
- Add tests for new functionality
- Run the test suite (pytest)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
| Component | Status | Coverage |
|---|---|---|
| Unit Tests | ✅ PASSED (12/12) | Evaluators, Scenarios, Tools |
| Integration Tests | ✅ PASSED | End-to-end Pipeline |
| Framework Tests | ✅ PASSED | Mock Model Validation |
| Report Generation | ✅ WORKING | HTML, Markdown, CSV |
| CLI Functionality | ✅ OPERATIONAL | All Commands Available |
| Data Integrity | ✅ VERIFIED | JSON Files Valid |
🧪 Test commands
# 🚀 Quick framework validation (no API keys required)
python test_framework.py
# 📊 Full test suite with detailed output
python -m pytest tests/ -v
# 🔍 Test specific components
python -m pytest tests/unit/test_evaluators.py::TestResponseQualityEvaluator
python -m pytest tests/integration/test_pipeline.py
# 🎯 Test with coverage report
python -m pytest tests/ --cov=./ --cov-report=html
No API keys needed for framework validation - uses MockModelClient for comprehensive testing.
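The idea behind testing without API keys can be sketched as a client that returns scripted replies instead of calling a provider. This is an illustration of the general pattern only: `CannedModelClient` and its `generate` method are hypothetical and do not describe the interface of bizCon's `MockModelClient`.

```python
class CannedModelClient:
    """Hypothetical mock: returns scripted replies instead of calling an API."""

    def __init__(self, replies):
        if not replies:
            raise ValueError("at least one scripted reply is required")
        self.replies = list(replies)
        self.calls = []  # every prompt received, kept for test assertions

    def generate(self, prompt):
        self.calls.append(prompt)
        # Cycle through the scripted replies so long conversations still work.
        return self.replies[(len(self.calls) - 1) % len(self.replies)]


client = CannedModelClient(["Sure, I can help with that.", "Here is the pricing."])
print(client.generate("Hello"))            # Sure, I can help with that.
print(client.generate("How much is it?"))  # Here is the pricing.
print(client.generate("Thanks"))           # Sure, I can help with that.
```

Because the replies are deterministic, tests built on such a client can assert on exact scores and on the recorded prompts without network access.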
This project is licensed under the MIT License - see the LICENSE file for details.
- Website: www.olib.ai
- GitHub: github.com/Olib-AI
- Issues: Report bugs or request features
- Discussions: Join the conversation
Akram Hasan Sharkar - Author & Lead Developer
Maya Msahal - Co-Author & Research Contributor
Developed at Olib AI
A detailed research paper describing the methodology, evaluation framework, and empirical results of bizCon will be published on arXiv.org. The paper link will be available here upon publication.
Citation format will be provided once the paper is published.
- Built with ❤️ by Akram Hasan Sharkar and Maya Msahal at Olib AI
- Inspired by the need for better business-focused LLM evaluation
- Thanks to all contributors who help make this project better
🚀 Upcoming features and release history
| Feature | Priority | Status | Completed |
|---|---|---|---|
| 📊 Advanced Visualization Dashboards | High | ✅ Complete | May 2025 |
| 🎯 Interactive Plotly Charts | High | ✅ Complete | May 2025 |
| 🔄 Real-time Dashboard Filtering | Medium | ✅ Complete | May 2025 |
| 📈 Statistical Analysis Tools | Medium | ✅ Complete | May 2025 |
| 🔍 Model Comparison Engine | Medium | ✅ Complete | May 2025 |
| Feature | Priority | Status | ETA |
|---|---|---|---|
| 🌐 More LLM Providers (Cohere, Together AI) | High | Planning | Q3 2025 |
| 🏭 Industry-Specific Scenario Packs | Medium | Planning | Q4 2025 |
| ⚡ Real-time Evaluation APIs | Medium | Researching | Q4 2025 |
| 🔗 Custom Webhook Integrations | Low | Backlog | Q1 2026 |
| 🌍 Multi-language Support | Low | Backlog | Q1 2026 |
| 🤖 AI-Powered Insights | Medium | Planning | Q3 2025 |
- v0.4.0 (Current): Advanced visualization dashboards, interactive Plotly charts, real-time filtering, statistical analysis
- v0.3.0: Multi-provider support, tool integration, success rate differentiation
- v0.2.0: Added visualization and reporting capabilities
- v0.1.0: Initial release with core evaluation framework