bizCon: Business Conversation Evaluation Framework for LLMs

A comprehensive open-source framework for benchmarking Large Language Models on business conversation capabilities

🚀 Quick Start • 📖 Documentation • 📊 Sample Results • 🤝 Contributing • 💬 Community

📋 Table of Contents


🎯 Overview
✨ Key Features
🚀 Quick Start
📖 Documentation
📊 Sample Results
🏗️ Advanced Usage
🤝 Contributing
🧪 Testing & Validation
💬 Community
📈 Roadmap

🎯 Overview

bizCon is a specialized evaluation framework designed to benchmark Large Language Models (LLMs) on realistic business conversation scenarios. Unlike generic benchmarks, bizCon focuses on practical business use cases involving professional communication, tool integration, and domain-specific knowledge.

Why bizCon?

Business-Focused: Evaluates models on real-world business scenarios
Multi-Dimensional: Scores five weighted aspects of business communication
Tool Integration: Tests models' ability to use business tools effectively
Comparative Analysis: Benchmark multiple models side-by-side
Enterprise-Ready: Professional reporting and analysis capabilities

✨ Key Features

🎭 Diverse Business Scenarios

Product Inquiries: Enterprise software consultations
Technical Support: Complex troubleshooting and API integration
Contract Negotiation: SaaS agreements and enterprise deals
Appointment Scheduling: Multi-stakeholder coordination
Compliance Inquiries: Regulatory and data privacy questions
Implementation Planning: Software deployment strategies
Service Complaints: Customer service and dispute resolution
Multi-Department: Cross-functional project coordination

📊 Comprehensive Evaluation Metrics

Response Quality (25%) - Factual accuracy and completeness
Business Value (25%) - Strategic insight and actionable recommendations
Communication Style (20%) - Professionalism and tone appropriateness
Tool Usage (20%) - Effective integration with business tools
Performance (10%) - Response time and efficiency
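
The percentages above suggest that the overall score is a weighted average of the five metric scores. The README does not spell out the aggregation, so the following is only a minimal sketch under that assumption; the function name `overall_score` and the example scores are hypothetical, and bizCon's own calculation may differ.

```python
# Weights copied from the metric list above.
WEIGHTS = {
    "response_quality": 0.25,
    "business_value": 0.25,
    "communication_style": 0.20,
    "tool_usage": 0.20,
    "performance": 0.10,
}


def overall_score(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Return the weighted average of per-metric scores (each on a 0-10 scale)."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing metric scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


# Hypothetical per-metric scores for one model, for illustration only.
example = {
    "response_quality": 8.0,
    "business_value": 7.0,
    "communication_style": 9.0,
    "tool_usage": 6.0,
    "performance": 10.0,
}
print(round(overall_score(example), 2))  # 7.75
```

Dividing by the total weight keeps the result on the 0-10 scale even if the weights are later edited and no longer sum exactly to 1.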

🛠️ Business Tool Ecosystem

Knowledge Base Search
Product Catalog Lookup
Pricing Calculator
Appointment Scheduler
Customer History Access
Document Retrieval
Order Management
Support Ticket System

🤖 Multi-Model Support

🤖 OpenAI: GPT-4, GPT-3.5-turbo, GPT-4-turbo
🧠 Anthropic: Claude-3-opus, Claude-3-sonnet, Claude-3-haiku
🌟 Mistral AI: Mistral-large, Mistral-medium, Mistral-small

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/Olib-AI/bizcon.git
cd bizcon

# Basic installation
pip install -e .

# Install with advanced visualization features (use quotes for zsh)
pip install -e ".[advanced]"

# Install all optional features
pip install -e ".[all]"

Basic Usage

Set up your API keys:

export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export MISTRAL_API_KEY="your-mistral-key"

Run a quick test:

# 🚀 Test without API keys (uses mock models)
python test_framework.py

# 🧪 Run unit and integration tests
python -m pytest tests/

# 🤖 Test with real models (requires API keys)
python test_with_real_models.py

Run a benchmark:

# 📊 Compare models on specific scenarios
python run.py --scenarios product_inquiry_001 support_001 --verbose

# 🏃 Run full benchmark with custom config
python run.py --config config/models.yaml --output results/

# 💻 Or use the bizcon CLI directly
bizcon run --config config/models.yaml --output results/

Explore available options:

# 📋 List all available scenarios
python run.py --list-scenarios
# or: bizcon list-scenarios

# 🤖 List supported models  
python run.py --list-models
# or: bizcon list-models
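
Before running against real models, it can help to check which of the API keys from the setup step are actually present in the environment. This is a small standalone sketch, not part of bizCon; the helper name `configured_providers` is hypothetical, and only the three environment variable names come from this README.

```python
import os

# Environment variable names taken from the setup step above.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "mistral": "MISTRAL_API_KEY",
}


def configured_providers(env=None) -> list[str]:
    """Return the providers whose API key variable is set and non-empty."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var, "").strip()]


if __name__ == "__main__":
    ready = configured_providers()
    missing = [name for name in PROVIDER_KEYS if name not in ready]
    print("configured:", ", ".join(ready) or "none")
    print("missing:", ", ".join(missing) or "none")
```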

Configuration

Customize your evaluation in config/models.yaml:

models:
  - provider: openai
    name: gpt-4
    temperature: 0.7
    max_tokens: 2048
  - provider: anthropic
    name: claude-3-sonnet
    temperature: 0.7
    max_tokens: 2048

Adjust evaluation settings in config/evaluation.yaml:

evaluation:
  parallel: true
  num_runs: 3
  evaluator_weights:
    response_quality: 0.25
    business_value: 0.25
    communication_style: 0.20
    tool_usage: 0.20
    performance: 0.10
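
When editing `evaluator_weights`, it is easy to end up with values that no longer sum to 1.0. The README does not say whether bizCon rejects or renormalizes such weights, so a quick check before a long run may be worthwhile. The sketch below is a standalone, hypothetical helper (`validate_weights` is not a bizCon function) that operates on a plain dict mirroring the YAML block above.

```python
import math


def validate_weights(weights: dict[str, float], tol: float = 1e-9) -> dict[str, float]:
    """Check that weights are non-negative and sum to 1.0; return them unchanged."""
    negative = [name for name, w in weights.items() if w < 0]
    if negative:
        raise ValueError(f"negative weights: {negative}")
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"weights sum to {total:.4f}, expected 1.0")
    return weights


# Same values as the evaluator_weights block above.
default_weights = {
    "response_quality": 0.25,
    "business_value": 0.25,
    "communication_style": 0.20,
    "tool_usage": 0.20,
    "performance": 0.10,
}
validate_weights(default_weights)  # passes silently

# Raising tool_usage without lowering another weight is caught.
try:
    validate_weights({**default_weights, "tool_usage": 0.30})
except ValueError as err:
    print(err)
```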

📖 Documentation

Project Structure

bizcon/
├── config/                 # Configuration files
│   ├── models.yaml        # Model configurations
│   └── evaluation.yaml    # Evaluation settings
├── core/                  # Core evaluation pipeline
│   ├── pipeline.py        # Main evaluation orchestrator
│   └── runner.py          # Scenario execution engine
├── models/                # LLM provider integrations
│   ├── openai.py         # OpenAI client
│   ├── anthropic.py      # Anthropic client
│   └── mistral.py        # Mistral AI client
├── scenarios/             # Business conversation scenarios
│   ├── product_inquiry.py
│   ├── technical_support.py
│   └── contract_negotiation.py
├── evaluators/            # Evaluation metrics
│   ├── response_quality.py
│   ├── business_value.py
│   └── communication_style.py
├── tools/                 # Business tool implementations
│   ├── knowledge_base.py
│   ├── scheduler.py
│   └── product_catalog.py
├── visualization/         # Advanced visualization and reporting
│   ├── charts.py          # Static matplotlib charts
│   ├── interactive_charts.py  # Interactive Plotly charts
│   ├── dashboard.py       # Basic Flask dashboard
│   ├── advanced_dashboard.py  # Advanced dashboard with filtering
│   ├── analysis_utils.py  # Statistical analysis tools
│   └── report.py          # Report generation
└── data/                  # Sample business data
    ├── knowledge_base/
    ├── products/
    └── pricing/

Creating Custom Scenarios

from scenarios.base import BusinessScenario

class CustomBusinessScenario(BusinessScenario):
    def __init__(self, scenario_id=None):
        super().__init__(
            scenario_id=scenario_id or "custom_001",
            name="Custom Business Scenario",
            description="Your custom scenario description",
            industry="technology",
            complexity="medium",
            tools_required=["knowledge_base", "scheduler"]
        )
    
    def _initialize_conversation(self):
        return [{
            "user_message": "Your initial customer message",
            "expected_tool_calls": [
                {"tool_id": "knowledge_base", "parameters": {"query": "example"}}
            ]
        }]
    
    def _initialize_ground_truth(self):
        return {
            "expected_facts": ["Key fact 1", "Key fact 2"],
            "business_objective": "Help customer achieve X",
            "expected_tone": "professional"
        }
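
A common mistake when writing a scenario like the one above is to reference a tool in `expected_tool_calls` that is not listed in `tools_required`. The following standalone sketch checks for that on plain dicts shaped like the return value of `_initialize_conversation`. The helper `check_conversation` and the tool ID `pricing_calculator` are hypothetical and used only for illustration; bizCon may perform its own validation.

```python
def check_conversation(turns: list[dict], tools_required: list[str]) -> list[str]:
    """Return a list of problems found in a scenario's conversation turns."""
    problems = []
    for i, turn in enumerate(turns):
        if not turn.get("user_message"):
            problems.append(f"turn {i}: missing user_message")
        for call in turn.get("expected_tool_calls", []):
            tool_id = call.get("tool_id")
            if tool_id not in tools_required:
                problems.append(f"turn {i}: tool {tool_id!r} not in tools_required")
    return problems


turns = [{
    "user_message": "Your initial customer message",
    "expected_tool_calls": [
        {"tool_id": "knowledge_base", "parameters": {"query": "example"}},
        # Hypothetical tool ID that was never declared as required.
        {"tool_id": "pricing_calculator", "parameters": {}},
    ],
}]
print(check_conversation(turns, ["knowledge_base", "scheduler"]))
# ["turn 0: tool 'pricing_calculator' not in tools_required"]
```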

Adding Custom Evaluators

from evaluators.base import BaseEvaluator

class CustomEvaluator(BaseEvaluator):
    def __init__(self, weight=1.0):
        super().__init__(name="Custom Evaluator", weight=weight)
    
    def evaluate(self, response, scenario, turn_index, conversation_history, tool_calls):
        # Your evaluation logic here
        score = self.calculate_score(response)
        return {
            "score": score,
            "explanation": "Detailed explanation of the score",
            "max_possible": 10.0
        }

    def calculate_score(self, response):
        # Placeholder helper for this example; replace with real scoring
        # logic that returns a value between 0.0 and max_possible
        return 10.0 if response else 0.0
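
The evaluator example leaves the scoring logic open. One simple possibility, shown here as a standalone sketch rather than as bizCon's own method, is to score a response by how many of a scenario's `expected_facts` it mentions. The function name `fact_coverage_score` is hypothetical, the sample facts are invented for illustration, and plain substring matching is deliberately crude: it will miss paraphrases.

```python
def fact_coverage_score(response: str, expected_facts: list[str], max_score: float = 10.0) -> float:
    """Score a response by the fraction of expected facts it mentions (case-insensitive)."""
    if not expected_facts:
        return max_score
    text = response.lower()
    hits = sum(1 for fact in expected_facts if fact.lower() in text)
    return max_score * hits / len(expected_facts)


# Two of the three illustrative facts appear in the response.
print(fact_coverage_score(
    "Our Enterprise plan includes SSO and a 99.9% uptime SLA.",
    ["SSO", "99.9% uptime", "24/7 support"],
))
```

A function like this could be called from `evaluate`, with the facts read from the scenario's ground truth.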

📊 Sample Results


Overall Model Performance

┌─────────────────┬─────────┬─────────────┬─────────────┬───────────────┬─────────────┬─────────────┐
│ Model           │ Overall │ Response    │ Business    │ Communication │ Tool Usage  │ Performance │
│                 │ Score   │ Quality     │ Value       │ Style         │             │             │
├─────────────────┼─────────┼─────────────┼─────────────┼───────────────┼─────────────┼─────────────┤
│ gpt-4           │ 8.2/10  │ 8.5/10      │ 8.1/10      │ 9.0/10        │ 7.8/10      │ 8.0/10      │
│ claude-3-sonnet │ 7.9/10  │ 8.2/10      │ 7.8/10      │ 8.8/10        │ 7.5/10      │ 7.2/10      │
│ claude-3-haiku  │ 7.1/10  │ 7.3/10      │ 6.9/10      │ 8.0/10        │ 6.8/10      │ 8.5/10      │
│ gpt-3.5-turbo   │ 6.8/10  │ 6.5/10      │ 6.2/10      │ 7.5/10        │ 6.0/10      │ 7.8/10      │
└─────────────────┴─────────┴─────────────┴─────────────┴───────────────┴─────────────┴─────────────┘

Success Rates by Category

GPT-4: Response Quality (89%), Tool Usage (78%), Communication Style (90%)
Claude-3-Sonnet: Response Quality (86%), Tool Usage (75%), Communication Style (88%)
Claude-3-Haiku: Response Quality (73%), Tool Usage (68%), Communication Style (80%)

Report Outputs

📊 Interactive HTML Report: Charts, breakdowns, and detailed analysis
📈 CSV Data Export: Raw scores for custom analysis and visualization
📝 Markdown Summary: Professional reports for sharing and documentation
🎯 Success Rate Analysis: Model performance across business scenarios
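
The CSV export is meant for custom analysis. The exact column layout of the exported file is not documented here, so the sketch below assumes hypothetical columns (`model`, `overall`, `tool_usage`) and uses an inline string in place of a real export; the values are copied from the sample table above. Adjust the column names to match the actual file.

```python
import csv
import io

# Stand-in for an exported CSV; the column names are assumptions.
SAMPLE_CSV = """model,overall,tool_usage
gpt-4,8.2,7.8
claude-3-sonnet,7.9,7.5
claude-3-haiku,7.1,6.8
gpt-3.5-turbo,6.8,6.0
"""


def rank_models(csv_text: str, column: str) -> list[tuple[str, float]]:
    """Return (model, score) pairs sorted from highest to lowest score."""
    rows = csv.DictReader(io.StringIO(csv_text))
    scores = [(row["model"], float(row[column])) for row in rows]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


for model, score in rank_models(SAMPLE_CSV, "tool_usage"):
    print(f"{model:<16} {score:.1f}")
```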

🏗️ Advanced Usage

Parallel Evaluation

# Run multiple scenarios in parallel
python run.py --scenarios product_inquiry_001 support_001 contract_001 --parallel

# Or using CLI directly
bizcon run --scenarios product_inquiry_001 support_001 --parallel
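
The README does not describe how `--parallel` is implemented. As a rough illustration of the general idea only, the sketch below runs independent scenarios concurrently with a thread pool, which tends to suit workloads that spend most of their time waiting on API responses. `run_scenario` is a hypothetical stand-in that returns a fake result instead of calling a model.

```python
from concurrent.futures import ThreadPoolExecutor


def run_scenario(scenario_id: str) -> dict:
    """Hypothetical stand-in for evaluating one scenario; returns a fake result."""
    return {"scenario": scenario_id, "status": "done"}


def run_parallel(scenario_ids: list[str], max_workers: int = 4) -> list[dict]:
    """Run scenarios concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of scenario_ids in its results.
        return list(pool.map(run_scenario, scenario_ids))


results = run_parallel(["product_inquiry_001", "support_001", "contract_001"])
print([r["scenario"] for r in results])
```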

Custom Model Parameters

models:
  - provider: openai
    name: gpt-4
    temperature: 0.3
    max_tokens: 1024
    parameters:
      seed: 42
      top_p: 0.9

Advanced Visualization Dashboard

# Install advanced features first (use quotes for zsh)
pip install -e ".[advanced]"

# Launch the interactive dashboard; point --results-dir at the directory that holds your benchmark results
python examples/advanced_dashboard_demo.py --results-dir output/

# Launch on custom host/port with auto-refresh
python examples/advanced_dashboard_demo.py --host 0.0.0.0 --port 8080

# Disable auto-refresh for static analysis
python examples/advanced_dashboard_demo.py --no-auto-refresh

Note: The advanced visualization features depend on Plotly, Flask, and SciPy. From the repository root, install them with pip install -e ".[advanced]" (the quotes are required in zsh).

Scenario Categories

# Run all product inquiry scenarios (quote the pattern so the shell does not expand it)
python run.py --scenarios "product_inquiry_*"

# Run scenarios by complexity
python run.py --scenarios "complex_*"
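
To make the wildcard behavior concrete, the following sketch shows how shell-style patterns could select scenario IDs using Python's standard `fnmatch` module. Whether bizCon matches patterns this way is an assumption; `select_scenarios` is a hypothetical helper, and the IDs are the ones that appear in the commands elsewhere in this README.

```python
from fnmatch import fnmatchcase

# Scenario IDs that appear elsewhere in this README.
KNOWN_IDS = ["product_inquiry_001", "support_001", "contract_001"]


def select_scenarios(patterns: list[str], known_ids: list[str] = KNOWN_IDS) -> list[str]:
    """Return known IDs matching any shell-style pattern, in catalog order."""
    return [sid for sid in known_ids if any(fnmatchcase(sid, p) for p in patterns)]


print(select_scenarios(["product_inquiry_*"]))  # ['product_inquiry_001']
print(select_scenarios(["*_001"]))              # all three IDs
```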

🤝 Contributing

We welcome contributions from the community! Here's how you can help:

Ways to Contribute

🐛 Report Bugs: Open an issue with detailed reproduction steps
✨ Suggest Features: Propose new scenarios, evaluators, or tools
📝 Improve Documentation: Help make our docs clearer
🔧 Submit Code: Fix bugs or add new features
🧪 Add Test Cases: Improve our test coverage

Development Setup

git clone https://github.com/Olib-AI/bizcon.git
cd bizcon
pip install -e .

# Run framework validation (no API keys needed)
python test_framework.py

# Run full test suite  
python -m pytest tests/

Contribution Guidelines

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes
Add tests for new functionality
Run the test suite (pytest)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

🧪 Testing & Validation

🎯 Framework Validation Status

Component	Status	Coverage
Unit Tests	✅ PASSED (12/12)	Evaluators, Scenarios, Tools
Integration Tests	✅ PASSED	End-to-end Pipeline
Framework Tests	✅ PASSED	Mock Model Validation
Report Generation	✅ WORKING	HTML, Markdown, CSV
CLI Functionality	✅ OPERATIONAL	All Commands Available
Data Integrity	✅ VERIFIED	JSON Files Valid

Running Tests


# 🚀 Quick framework validation (no API keys required)
python test_framework.py

# 📊 Full test suite with detailed output
python -m pytest tests/ -v

# 🔍 Test specific components
python -m pytest tests/unit/test_evaluators.py::TestResponseQualityEvaluator
python -m pytest tests/integration/test_pipeline.py

# 🎯 Test with coverage report
python -m pytest tests/ --cov=./ --cov-report=html

Framework validation needs no API keys: it runs against MockModelClient instead of a real provider.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

💬 Community

Website: www.olib.ai
GitHub: github.com/Olib-AI
Issues: Report bugs or request features
Discussions: Join the conversation

👥 Authors

Akram Hasan Sharkar - Author & Lead Developer
Maya Msahal - Co-Author & Research Contributor

Developed at Olib AI

📖 Research Paper

A detailed research paper describing the methodology, evaluation framework, and empirical results of bizCon will be published on arXiv.org. The paper link will be available here upon publication.

Citation format will be provided once the paper is published.

🙏 Acknowledgments

Built with ❤️ by Akram Hasan Sharkar and Maya Msahal at Olib AI
Inspired by the need for better business-focused LLM evaluation
Thanks to all contributors who help make this project better

📈 Roadmap


✅ Recent Additions (May 2025)

Feature	Priority	Status	Completed
📊 Advanced Visualization Dashboards	High	✅ Complete	May 2025
🎯 Interactive Plotly Charts	High	✅ Complete	May 2025
🔄 Real-time Dashboard Filtering	Medium	✅ Complete	May 2025
📈 Statistical Analysis Tools	Medium	✅ Complete	May 2025
🔍 Model Comparison Engine	Medium	✅ Complete	May 2025

🔮 Upcoming Features

Feature	Priority	Status	ETA
🌐 More LLM Providers (Cohere, Together AI)	High	Planning	Q3 2025
🏭 Industry-Specific Scenario Packs	Medium	Planning	Q4 2025
⚡ Real-time Evaluation APIs	Medium	Researching	Q4 2025
🔗 Custom Webhook Integrations	Low	Backlog	Q1 2026
🌍 Multi-language Support	Low	Backlog	Q1 2026
🤖 AI-Powered Insights	Medium	Planning	Q3 2025

📋 Version History

v0.4.0 (Current): Advanced visualization dashboards, interactive Plotly charts, real-time filtering, statistical analysis
v0.3.0: Multi-provider support, tool integration, success rate differentiation
v0.2.0: Added visualization and reporting capabilities
v0.1.0: Initial release with core evaluation framework

