Welcome to the Test Suites for Generative AI Prompt Testing project! This repository offers a comprehensive testing framework designed to evaluate the performance and behavior of various Generative AI models using prompt-based test cases. By supporting multiple AI clients—including Ollama, OpenAI, Anthropic, and Amazon Bedrock—this project provides a unified approach to testing, ensuring consistency and reliability across different models and platforms.
As Generative AI models become increasingly integral to applications ranging from chatbots to content generation, ensuring their correctness, reliability, and ethical compliance is more important than ever. Traditional software engineering employs testing strategies like Test-Driven Development (TDD) and Behavior-Driven Development (BDD) to validate code functionality. However, testing AI models introduces unique challenges:
- Probabilistic Outputs: AI models may produce different outputs for the same input due to their stochastic nature.
- Complexity of Language Understanding: Evaluating the correctness of natural language outputs can be subjective and context-dependent.
- Diverse Scenarios: Models must be tested across a wide range of prompts to ensure robustness and handle edge cases.
This project was created to address these challenges by:
- Providing a Systematic Testing Framework: Automate the evaluation of AI model outputs against expected results.
- Supporting Multiple AI Clients: Facilitate testing across different AI models and APIs (Ollama, OpenAI, Anthropic, Amazon Bedrock) without the need for separate testing tools. Support for more AI clients is planned; if you would like support for another AI API endpoint, please open a Feature Request on GitHub.
- Enhancing Model Reliability: Identify inconsistencies, biases, or unwanted behaviors in AI model responses.
- Streamlining Development Workflow: Integrate testing into the development pipeline, promoting best practices in AI model deployment.
By leveraging this testing suite, developers and researchers can ensure that their Generative AI models meet the desired performance standards and behave as intended in various scenarios.
Whether you're developing a new AI application or maintaining an existing one, this project aims to make the testing process more efficient and effective. We welcome contributions and feedback from the community to improve and expand this testing framework.
- Python 3.12.x
- PIP or Poetry package manager
- Access to at least one API endpoint:
- Ollama installed and running
- Access to OpenAI API Key
- Access to Anthropic API Key
- AWS CLI set up to use Amazon Bedrock
Clone the repository:

```shell
git clone https://github.com/joelee/GenAI-Prompt-Test-Suites.git
```

Create a `.env` file, example below:
```shell
OLLAMA_API_URL='http://localhost:11434/api/generate'
OPENAI_API_KEY="Your-OpenAI-api-key-here"
ANTHROPIC_API_KEY="Your-Anthropic-api-key-here"
```

See the `.env.sample` file for an example.
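For reference, a `.env` file like the one above is just a list of `KEY=value` pairs. The sketch below is a minimal, stdlib-only loader showing how such a file might be read into the environment; the project itself may use a library such as `python-dotenv` instead, and `load_env_file` is an illustrative name, not part of this project's API.

```python
import os

def load_env_file(path=".env"):
    """Minimal .env loader: parses KEY=value lines, skipping comments
    and blank lines. Existing environment variables are not overwritten."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Strip optional surrounding quotes, as used in the example above
            os.environ.setdefault(key.strip(), value.strip().strip("'\""))
```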
Edit the `config.yaml` file to define your Models and Test Cases.
To use a different configuration file, specify its full path in the
`GENAI_TEST_CONFIG_FILE` environment variable. Only `yaml` and `json` files are supported.
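The configuration-loading behaviour described above could be implemented roughly as follows. This is a hypothetical sketch, not the project's actual code; `load_config` is an illustrative name, and the PyYAML dependency is assumed for `yaml` files.

```python
import json
import os

def load_config(default_path="config.yaml"):
    """Load the test configuration, honouring GENAI_TEST_CONFIG_FILE if set.
    Only yaml and json files are supported."""
    path = os.environ.get("GENAI_TEST_CONFIG_FILE", default_path)
    ext = os.path.splitext(path)[1].lower()
    if ext in (".yaml", ".yml"):
        import yaml  # PyYAML, imported lazily so json-only setups don't need it
        with open(path) as f:
            return yaml.safe_load(f)
    if ext == ".json":
        with open(path) as f:
            return json.load(f)
    raise ValueError(f"Unsupported config file type: {ext}")
```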
The `config.yaml` file is self-explanatory. It contains two sections: API Clients and Test Cases.
Each API Client entry supports the following keys:

- `model`: Name of the AI Model
- `type`: Type of API Client, currently supporting:
  - `ollama`: Ollama API
  - `openai`: OpenAI (ChatGPT) API
  - `anthropic`: Anthropic (Claude) API
  - `bedrock`: Amazon Bedrock
- `max_tokens`: Maximum Tokens
- `temperature`: Temperature
- `system_prompt`: Optional System Prompt
- `disabled`: If `true`, this client is excluded

Example client:

```yaml
clients:
  # Anthropic Claude 3 Haiku model
  - model: claude-3-haiku-20240307
    type: anthropic
    max_tokens: 1000
    temperature: 0
    system_prompt: You are Claude 3 Haiku, an AI assistant.
    disabled: false
```

Each Test Case entry supports the following keys:

- `name`: The name of the Test Case
- `prompt`: User prompt for the Test Case
- `expected`: Expected Test Definitions on the response
- `forbidden`: Forbidden Test Definitions on the response

Each Test Definition supports the following keys:

- `type`: Type of test, currently supporting:
  - `word`: Match a word in the response
  - `substring`: Match a substring in the response (fastest)
  - `regex`: Match a regular expression in the response
- `case_sensitive`: Case-sensitive match (default: `false`)
- `match_all`: Match all of the values (default: `false`)
- `values`: List of values to match
- `multiline`: `regex` only. Match across multiple lines (default: `false`)
- `dotall`: `regex` only. Make the `.` special character match any character at all, including a newline (default: `false`)
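Putting those options together, a single Test Definition could be evaluated against a model response roughly like this. This is a hypothetical sketch of the matching semantics as described above, not the project's actual implementation; `evaluate` is an illustrative name.

```python
import re

def evaluate(response, test):
    """Return True if the response satisfies one Test Definition.
    With match_all, every value must match; otherwise any one suffices."""
    values = test["values"]
    case_sensitive = test.get("case_sensitive", False)
    if test["type"] == "regex":
        flags = 0
        if not case_sensitive:
            flags |= re.IGNORECASE
        if test.get("multiline", False):
            flags |= re.MULTILINE
        if test.get("dotall", False):
            flags |= re.DOTALL
        hits = [bool(re.search(v, response, flags)) for v in values]
    else:
        text = response if case_sensitive else response.lower()
        vals = values if case_sensitive else [v.lower() for v in values]
        if test["type"] == "word":
            words = set(re.findall(r"\w+", text))  # whole-word match
            hits = [v in words for v in vals]
        else:  # substring
            hits = [v in text for v in vals]
    return all(hits) if test.get("match_all", False) else any(hits)
```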
Example Test Case:

```yaml
test_cases:
  - name: "Strawberry test"
    prompt: "Count how many Rs are there in the word strawberry"
    expected:
      - type: word
        case_sensitive: false  # Case insensitive match (default)
        match_all: false       # Match any of the values (default)
        values:
          - three
          - "3"
    forbidden:
      - type: word
        values:
          - two
          - "2"
```

Install dependencies with PIP:

```shell
pip install -r requirements.txt
```

or with Poetry:

```shell
poetry install
```

Run the test suite:

```shell
python src/main.py
```

or via `poetry run`.

The overall flow:

```mermaid
graph TD
    A(["Start"]) --> B[Load and Parse config.yaml]
    B --> C[Initialize Clients]
    C --> D{For Each Client}
    D --> DD{For Each Test Case}
    DD --> E[Send Prompt to Client]
    E --> F[Get Model Response]
    F --> G[Evaluate Model Response]
    G --> H{Run All Expected Tests}
    G --> I{Run All Forbidden Tests}
    H --> J[Test Pass/Fail Report]
    I --> J[Test Pass/Fail Report]
    J --> K[Log Results & Generate Reports]
```
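The flow chart above can be sketched as a small driver loop. This is an illustrative outline, not the project's actual code: `run_suite`, `send_prompt`, and `evaluate` are hypothetical names, with the last two supplied by the caller.

```python
def run_suite(clients, test_cases, send_prompt, evaluate):
    """Run every enabled client against every test case and collect results."""
    report = []
    for client in clients:
        if client.get("disabled"):
            continue  # honour the `disabled` flag from config.yaml
        for case in test_cases:
            response = send_prompt(client, case["prompt"])
            # All `expected` Test Definitions must pass...
            passed = all(evaluate(response, t) for t in case.get("expected", []))
            # ...and no `forbidden` Test Definition may match.
            clean = not any(evaluate(response, t) for t in case.get("forbidden", []))
            report.append({
                "client": client["model"],
                "test": case["name"],
                "passed": passed and clean,
            })
    return report
```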