bizCon: Business Conversation Evaluation Framework for LLMs

License: MIT | Python 3.8+

A comprehensive open-source framework for benchmarking Large Language Models on business conversation capabilities

🎯 Overview

bizCon is a specialized evaluation framework designed to benchmark Large Language Models (LLMs) on realistic business conversation scenarios. Unlike generic benchmarks, bizCon focuses on practical business use cases involving professional communication, tool integration, and domain-specific knowledge.

Why bizCon?

  • Business-Focused: Evaluates models on real-world business scenarios
  • Multi-Dimensional: Assesses 5 key aspects of business communication
  • Tool Integration: Tests models' ability to use business tools effectively
  • Comparative Analysis: Benchmark multiple models side-by-side
  • Enterprise-Ready: Professional reporting and analysis capabilities

✨ Key Features

🎭 Diverse Business Scenarios

  • Product Inquiries: Enterprise software consultations
  • Technical Support: Complex troubleshooting and API integration
  • Contract Negotiation: SaaS agreements and enterprise deals
  • Appointment Scheduling: Multi-stakeholder coordination
  • Compliance Inquiries: Regulatory and data privacy questions
  • Implementation Planning: Software deployment strategies
  • Service Complaints: Customer service and dispute resolution
  • Multi-Department: Cross-functional project coordination

📊 Comprehensive Evaluation Metrics

  1. Response Quality (25%) - Factual accuracy and completeness
  2. Business Value (25%) - Strategic insight and actionable recommendations
  3. Communication Style (20%) - Professionalism and tone appropriateness
  4. Tool Usage (20%) - Effective integration with business tools
  5. Performance (10%) - Response time and efficiency
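
The percentages above suggest that the overall score is a weighted average of the five dimensions. A minimal sketch of that aggregation, assuming each evaluator yields a score on a 0-10 scale; the helper name `overall_score` and the sample scores are illustrative and not taken from bizCon's code:

```python
# Weights as listed above (they also appear in config/evaluation.yaml).
WEIGHTS = {
    "response_quality": 0.25,
    "business_value": 0.25,
    "communication_style": 0.20,
    "tool_usage": 0.20,
    "performance": 0.10,
}


def overall_score(scores, weights=WEIGHTS):
    """Weighted average of per-dimension scores (hypothetical helper)."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


# Illustrative scores only, not benchmark results.
example = {
    "response_quality": 8.0,
    "business_value": 6.0,
    "communication_style": 9.0,
    "tool_usage": 7.0,
    "performance": 10.0,
}
print(round(overall_score(example), 2))  # 7.7
```

Dividing by the total weight keeps the result on the 0-10 scale even if the weights are later changed and no longer sum exactly to 1.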

🛠️ Business Tool Ecosystem

  • Knowledge Base Search
  • Product Catalog Lookup
  • Pricing Calculator
  • Appointment Scheduler
  • Customer History Access
  • Document Retrieval
  • Order Management
  • Support Ticket System

🤖 Multi-Model Support

  • 🤖 OpenAI: GPT-4, GPT-3.5-turbo, GPT-4-turbo
  • 🧠 Anthropic: Claude-3-opus, Claude-3-sonnet, Claude-3-haiku
  • 🌟 Mistral AI: Mistral-large, Mistral-medium, Mistral-small

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/Olib-AI/bizcon.git
cd bizcon

# Basic installation
pip install -e .

# Install with advanced visualization features (use quotes for zsh)
pip install -e ".[advanced]"

# Install all optional features
pip install -e ".[all]"

Basic Usage

  1. Set up your API keys:
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export MISTRAL_API_KEY="your-mistral-key"
  2. Run a quick test:
# 🚀 Test without API keys (uses mock models)
python test_framework.py

# 🧪 Run unit and integration tests
python -m pytest tests/

# 🤖 Test with real models (requires API keys)
python test_with_real_models.py
  3. Run a benchmark:
# 📊 Compare models on specific scenarios
python run.py --scenarios product_inquiry_001 support_001 --verbose

# 🏃 Run full benchmark with custom config
python run.py --config config/models.yaml --output results/

# 💻 Using CLI interface directly  
bizcon run --config config/models.yaml --output results/
  4. Explore available options:
# 📋 List all available scenarios
python run.py --list-scenarios
# or: bizcon list-scenarios

# 🤖 List supported models  
python run.py --list-models
# or: bizcon list-models
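
Before running against real models, it can help to check which of the API keys from the first step are actually set. The sketch below is not part of bizCon; the helper `available_providers` is hypothetical, and only the three environment variable names come from the steps above:

```python
import os

# Environment variable names from the setup step above.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "mistral": "MISTRAL_API_KEY",
}


def available_providers(env=None):
    """Return the providers whose API key is set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var)]


if __name__ == "__main__":
    found = available_providers()
    print("Providers with keys set:", ", ".join(found) or "none")
```

Passing a plain dict as `env` makes the check easy to exercise without touching the real environment.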

Configuration

Customize your evaluation in config/models.yaml:

models:
  - provider: openai
    name: gpt-4
    temperature: 0.7
    max_tokens: 2048
  - provider: anthropic
    name: claude-3-sonnet
    temperature: 0.7
    max_tokens: 2048

Adjust evaluation settings in config/evaluation.yaml:

evaluation:
  parallel: true
  num_runs: 3
  evaluator_weights:
    response_quality: 0.25
    business_value: 0.25
    communication_style: 0.20
    tool_usage: 0.20
    performance: 0.10
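
If you edit the weights, a quick sanity check can catch typos before a long benchmark run. This is a sketch under the assumption that the weights are meant to sum to 1.0; the function `validate_weights` is hypothetical, and the dict simply mirrors the YAML above (loading the file itself would need a YAML parser, which is not shown):

```python
import math

# Mirrors the evaluator_weights block above as a plain dict.
evaluator_weights = {
    "response_quality": 0.25,
    "business_value": 0.25,
    "communication_style": 0.20,
    "tool_usage": 0.20,
    "performance": 0.10,
}


def validate_weights(weights, rel_tol=1e-9):
    """Raise ValueError if any weight is negative or the total is not 1.0."""
    negatives = [name for name, w in weights.items() if w < 0]
    if negatives:
        raise ValueError(f"negative weights: {negatives}")
    total = sum(weights.values())
    # isclose() tolerates floating-point rounding in the sum.
    if not math.isclose(total, 1.0, rel_tol=rel_tol):
        raise ValueError(f"weights sum to {total:.4f}, expected 1.0")
    return total


validate_weights(evaluator_weights)  # passes for the values shown above
```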

📖 Documentation

Project Structure

bizcon/
├── config/                 # Configuration files
│   ├── models.yaml        # Model configurations
│   └── evaluation.yaml    # Evaluation settings
├── core/                  # Core evaluation pipeline
│   ├── pipeline.py        # Main evaluation orchestrator
│   └── runner.py          # Scenario execution engine
├── models/                # LLM provider integrations
│   ├── openai.py         # OpenAI client
│   ├── anthropic.py      # Anthropic client
│   └── mistral.py        # Mistral AI client
├── scenarios/             # Business conversation scenarios
│   ├── product_inquiry.py
│   ├── technical_support.py
│   └── contract_negotiation.py
├── evaluators/            # Evaluation metrics
│   ├── response_quality.py
│   ├── business_value.py
│   └── communication_style.py
├── tools/                 # Business tool implementations
│   ├── knowledge_base.py
│   ├── scheduler.py
│   └── product_catalog.py
├── visualization/         # Advanced visualization and reporting
│   ├── charts.py          # Static matplotlib charts
│   ├── interactive_charts.py  # Interactive Plotly charts
│   ├── dashboard.py       # Basic Flask dashboard
│   ├── advanced_dashboard.py  # Advanced dashboard with filtering
│   ├── analysis_utils.py  # Statistical analysis tools
│   └── report.py          # Report generation
└── data/                  # Sample business data
    ├── knowledge_base/
    ├── products/
    └── pricing/

Creating Custom Scenarios

from scenarios.base import BusinessScenario

class CustomBusinessScenario(BusinessScenario):
    def __init__(self, scenario_id=None):
        super().__init__(
            scenario_id=scenario_id or "custom_001",
            name="Custom Business Scenario",
            description="Your custom scenario description",
            industry="technology",
            complexity="medium",
            tools_required=["knowledge_base", "scheduler"]
        )
    
    def _initialize_conversation(self):
        return [{
            "user_message": "Your initial customer message",
            "expected_tool_calls": [
                {"tool_id": "knowledge_base", "parameters": {"query": "example"}}
            ]
        }]
    
    def _initialize_ground_truth(self):
        return {
            "expected_facts": ["Key fact 1", "Key fact 2"],
            "business_objective": "Help customer achieve X",
            "expected_tone": "professional"
        }
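
The `expected_tool_calls` entries above imply that a model's actual tool calls are compared against them. How bizCon performs that comparison is not described here, so the following is only one plausible sketch: `tool_call_matches` and `matched_fraction` are hypothetical helpers, and treating extra parameters in the actual call as acceptable is an assumption.

```python
def tool_call_matches(expected, actual):
    """True if the tool IDs agree and every expected parameter is present."""
    if expected["tool_id"] != actual.get("tool_id"):
        return False
    actual_params = actual.get("parameters", {})
    expected_params = expected.get("parameters", {})
    return all(actual_params.get(k) == v for k, v in expected_params.items())


def matched_fraction(expected_calls, actual_calls):
    """Fraction of expected calls that some actual call satisfies."""
    if not expected_calls:
        return 1.0  # nothing was expected, so nothing was missed
    hits = sum(
        any(tool_call_matches(exp, act) for act in actual_calls)
        for exp in expected_calls
    )
    return hits / len(expected_calls)


expected = [{"tool_id": "knowledge_base", "parameters": {"query": "example"}}]
actual = [{"tool_id": "knowledge_base", "parameters": {"query": "example", "limit": 5}}]
print(matched_fraction(expected, actual))  # 1.0
```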

Adding Custom Evaluators

from evaluators.base import BaseEvaluator

class CustomEvaluator(BaseEvaluator):
    def __init__(self, weight=1.0):
        super().__init__(name="Custom Evaluator", weight=weight)
    
    def evaluate(self, response, scenario, turn_index, conversation_history, tool_calls):
        # Your evaluation logic here (calculate_score stands in for your own scoring method)
        score = self.calculate_score(response)
        return {
            "score": score,
            "explanation": "Detailed explanation of the score",
            "max_possible": 10.0
        }
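
To make the template concrete, here is a self-contained sketch of an evaluator that scores how many expected facts a response mentions. Several things are assumptions made so the example runs on its own: the `BaseEvaluator` class below is a stand-in for `evaluators.base.BaseEvaluator`, `response` is treated as a plain string, and `scenario` is a plain dict holding the ground truth rather than a scenario object.

```python
class BaseEvaluator:
    """Stand-in for evaluators.base.BaseEvaluator (assumed interface)."""

    def __init__(self, name, weight=1.0):
        self.name = name
        self.weight = weight


class FactCoverageEvaluator(BaseEvaluator):
    """Scores a response by the share of expected facts it mentions."""

    def __init__(self, weight=1.0):
        super().__init__(name="Fact Coverage", weight=weight)

    def evaluate(self, response, scenario, turn_index, conversation_history, tool_calls):
        facts = scenario.get("expected_facts", [])
        if not facts:
            return {"score": 10.0, "explanation": "No expected facts defined", "max_possible": 10.0}
        text = response.lower()
        # Case-insensitive substring match; a real evaluator would likely be fuzzier.
        found = [fact for fact in facts if fact.lower() in text]
        return {
            "score": 10.0 * len(found) / len(facts),
            "explanation": f"Mentioned {len(found)} of {len(facts)} expected facts",
            "max_possible": 10.0,
        }


ground_truth = {"expected_facts": ["Key fact 1", "Key fact 2"]}
result = FactCoverageEvaluator().evaluate("Key fact 1 is covered.", ground_truth, 0, [], [])
print(result["score"], result["explanation"])
```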

📊 Sample Results

Overall Model Performance

┌─────────────────┬─────────┬─────────────┬─────────────┬───────────────┬─────────────┬─────────────┐
│ Model           │ Overall │ Response    │ Business    │ Communication │ Tool Usage  │ Performance │
│                 │ Score   │ Quality     │ Value       │ Style         │             │             │
├─────────────────┼─────────┼─────────────┼─────────────┼───────────────┼─────────────┼─────────────┤
│ gpt-4           │ 8.2/10  │ 8.5/10      │ 8.1/10      │ 9.0/10        │ 7.8/10      │ 8.0/10      │
│ claude-3-sonnet │ 7.9/10  │ 8.2/10      │ 7.8/10      │ 8.8/10        │ 7.5/10      │ 7.2/10      │
│ claude-3-haiku  │ 7.1/10  │ 7.3/10      │ 6.9/10      │ 8.0/10        │ 6.8/10      │ 8.5/10      │
│ gpt-3.5-turbo   │ 6.8/10  │ 6.5/10      │ 6.2/10      │ 7.5/10        │ 6.0/10      │ 7.8/10      │
└─────────────────┴─────────┴─────────────┴─────────────┴───────────────┴─────────────┴─────────────┘

Success Rates by Category

  • GPT-4: Response Quality (89%), Tool Usage (78%), Communication Style (90%)
  • Claude-3-Sonnet: Response Quality (86%), Tool Usage (75%), Communication Style (88%)
  • Claude-3-Haiku: Response Quality (73%), Tool Usage (68%), Communication Style (80%)

Report Outputs

  • 📊 Interactive HTML Report: Charts, breakdowns, and detailed analysis
  • 📈 CSV Data Export: Raw scores for custom analysis and visualization
  • 📝 Markdown Summary: Professional reports for sharing and documentation
  • 🎯 Success Rate Analysis: Model performance across business scenarios

🏗️ Advanced Usage

Parallel Evaluation

# Run multiple scenarios in parallel
python run.py --scenarios product_inquiry_001 support_001 contract_001 --parallel

# Or using CLI directly
bizcon run --scenarios product_inquiry_001 support_001 --parallel
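
How `--parallel` is implemented is not stated here. Since scenario runs mostly wait on model API responses, a thread pool is one natural fit; the sketch below illustrates that idea with a placeholder `run_scenario` function, which is hypothetical and does not call any model.

```python
from concurrent.futures import ThreadPoolExecutor


def run_scenario(scenario_id):
    """Placeholder for one scenario run; a real run would call a model API."""
    return {"scenario_id": scenario_id, "status": "completed"}


def run_in_parallel(scenario_ids, max_workers=4):
    """Run scenarios concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, regardless of completion order.
        return list(pool.map(run_scenario, scenario_ids))


results = run_in_parallel(["product_inquiry_001", "support_001", "contract_001"])
for item in results:
    print(item["scenario_id"], item["status"])
```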

Custom Model Parameters

models:
  - provider: openai
    name: gpt-4
    temperature: 0.3
    max_tokens: 1024
    parameters:
      seed: 42
      top_p: 0.9

Advanced Visualization Dashboard

# Install advanced features first (use quotes for zsh)
pip install -e ".[advanced]"

# Launch interactive dashboard with advanced features
python examples/advanced_dashboard_demo.py --results-dir output/

# Launch on custom host/port with auto-refresh
python examples/advanced_dashboard_demo.py --host 0.0.0.0 --port 8080

# Disable auto-refresh for static analysis
python examples/advanced_dashboard_demo.py --no-auto-refresh

Note: Advanced visualization features require additional dependencies (Plotly, Flask, SciPy). Install them from the repository root with pip install -e ".[advanced]" (the quotes are required in zsh).

Scenario Categories

# Run all product inquiry scenarios (quote the pattern so the shell does not expand it)
python run.py --scenarios "product_inquiry_*"

# Run scenarios by complexity
python run.py --scenarios "complex_*"
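
Wildcard patterns like these are typically resolved against the list of known scenario IDs. Whether `run.py` does this itself is not stated here, so the following is only a sketch of the idea using the standard `fnmatch` module; `select_scenarios` is a hypothetical helper and the ID list contains just the IDs mentioned in this README.

```python
from fnmatch import fnmatchcase

# Scenario IDs that appear elsewhere in this README.
SCENARIO_IDS = ["product_inquiry_001", "support_001", "contract_001"]


def select_scenarios(patterns, available=SCENARIO_IDS):
    """Return the available IDs matching any pattern, without duplicates."""
    selected = []
    for scenario_id in available:
        if scenario_id in selected:
            continue
        if any(fnmatchcase(scenario_id, pattern) for pattern in patterns):
            selected.append(scenario_id)
    return selected


print(select_scenarios(["product_inquiry_*"]))  # ['product_inquiry_001']
```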

🤝 Contributing

We welcome contributions from the community! Here's how you can help:

Ways to Contribute

  • 🐛 Report Bugs: Open an issue with detailed reproduction steps
  • Suggest Features: Propose new scenarios, evaluators, or tools
  • 📝 Improve Documentation: Help make our docs clearer
  • 🔧 Submit Code: Fix bugs or add new features
  • 🧪 Add Test Cases: Improve our test coverage

Development Setup

git clone https://github.com/Olib-AI/bizcon.git
cd bizcon
pip install -e .

# Run framework validation (no API keys needed)
python test_framework.py

# Run full test suite  
python -m pytest tests/

Contribution Guidelines

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite (pytest)
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

🧪 Testing & Validation

🎯 Framework Validation Status

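┌───────────────────┬───────────────────┬──────────────────────────────┐
│ Component         │ Status            │ Coverage                     │
├───────────────────┼───────────────────┼──────────────────────────────┤
│ Unit Tests        │ ✅ PASSED (12/12) │ Evaluators, Scenarios, Tools │
│ Integration Tests │ ✅ PASSED         │ End-to-end Pipeline          │
│ Framework Tests   │ ✅ PASSED         │ Mock Model Validation        │
│ Report Generation │ ✅ WORKING        │ HTML, Markdown, CSV          │
│ CLI Functionality │ ✅ OPERATIONAL    │ All Commands Available       │
│ Data Integrity    │ ✅ VERIFIED       │ JSON Files Valid             │
└───────────────────┴───────────────────┴──────────────────────────────┘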

Running Tests

# 🚀 Quick framework validation (no API keys required)
python test_framework.py

# 📊 Full test suite with detailed output
python -m pytest tests/ -v

# 🔍 Test specific components
python -m pytest tests/unit/test_evaluators.py::TestResponseQualityEvaluator
python -m pytest tests/integration/test_pipeline.py

# 🎯 Test with coverage report
python -m pytest tests/ --cov=./ --cov-report=html

Framework validation needs no API keys: it runs against MockModelClient instead of real provider APIs.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

💬 Community

👥 Authors

Akram Hasan Sharkar - Author & Lead Developer
Maya Msahal - Co-Author & Research Contributor

Developed at Olib AI

📖 Research Paper

A detailed research paper describing the methodology, evaluation framework, and empirical results of bizCon will be published on arXiv.org. The paper link will be available here upon publication.

Citation format will be provided once the paper is published.

🙏 Acknowledgments

  • Built with ❤️ by Akram Hasan Sharkar and Maya Msahal at Olib AI
  • Inspired by the need for better business-focused LLM evaluation
  • Thanks to all contributors who help make this project better

📈 Roadmap

✅ Recent Additions (May 2025)

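┌──────────────────────────────────────┬──────────┬─────────────┬───────────┐
│ Feature                              │ Priority │ Status      │ Completed │
├──────────────────────────────────────┼──────────┼─────────────┼───────────┤
│ 📊 Advanced Visualization Dashboards │ High     │ ✅ Complete │ May 2025  │
│ 🎯 Interactive Plotly Charts         │ High     │ ✅ Complete │ May 2025  │
│ 🔄 Real-time Dashboard Filtering     │ Medium   │ ✅ Complete │ May 2025  │
│ 📈 Statistical Analysis Tools        │ Medium   │ ✅ Complete │ May 2025  │
│ 🔍 Model Comparison Engine           │ Medium   │ ✅ Complete │ May 2025  │
└──────────────────────────────────────┴──────────┴─────────────┴───────────┘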

🔮 Upcoming Features

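┌─────────────────────────────────────────────┬──────────┬─────────────┬─────────┐
│ Feature                                     │ Priority │ Status      │ ETA     │
├─────────────────────────────────────────────┼──────────┼─────────────┼─────────┤
│ 🌐 More LLM Providers (Cohere, Together AI) │ High     │ Planning    │ Q3 2025 │
│ 🏭 Industry-Specific Scenario Packs         │ Medium   │ Planning    │ Q4 2025 │
│ Real-time Evaluation APIs                   │ Medium   │ Researching │ Q4 2025 │
│ 🔗 Custom Webhook Integrations              │ Low      │ Backlog     │ Q1 2026 │
│ 🌍 Multi-language Support                   │ Low      │ Backlog     │ Q1 2026 │
│ 🤖 AI-Powered Insights                      │ Medium   │ Planning    │ Q3 2025 │
└─────────────────────────────────────────────┴──────────┴─────────────┴─────────┘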

📋 Version History

  • v0.4.0 (Current): Advanced visualization dashboards, interactive Plotly charts, real-time filtering, statistical analysis
  • v0.3.0: Multi-provider support, tool integration, success rate differentiation
  • v0.2.0: Added visualization and reporting capabilities
  • v0.1.0: Initial release with core evaluation framework
