Welcome to the Test Suites for Generative AI Prompt Testing project! This repository offers a comprehensive testing framework designed to evaluate the performance and behavior of various Generative AI models using prompt-based test cases. By supporting multiple AI clients—including Ollama, OpenAI, Anthropic, and Amazon Bedrock—this project provides a unified approach to testing, ensuring consistency and reliability across different models and platforms.
As Generative AI models become increasingly integral to applications ranging from chatbots to content generation, ensuring their correctness, reliability, and ethical compliance is more important than ever. Traditional software engineering employs testing strategies like Test-Driven Development (TDD) and Behavior-Driven Development (BDD) to validate code functionality. However, testing AI models introduces unique challenges:
- Probabilistic Outputs: AI models may produce different outputs for the same input due to their stochastic nature.
- Complexity of Language Understanding: Evaluating the correctness of natural language outputs can be subjective and context-dependent.
- Diverse Scenarios: Models must be tested across a wide range of prompts to ensure robustness and handle edge cases.
This project was created to address these challenges by:
- Providing a Systematic Testing Framework: Automate the evaluation of AI model outputs against expected results.
- Supporting Multiple AI Clients: Facilitate testing across different AI models and APIs (Ollama, OpenAI, Anthropic, Amazon Bedrock) without the need for separate testing tools. Support for more AI clients is planned; if you would like support for another AI API endpoint, please open a Feature Request on GitHub.
- Enhancing Model Reliability: Identify inconsistencies, biases, or unwanted behaviors in AI model responses.
- Streamlining Development Workflow: Integrate testing into the development pipeline, promoting best practices in AI model deployment.
By leveraging this testing suite, developers and researchers can ensure that their Generative AI models meet the desired performance standards and behave as intended in various scenarios.
Whether you're developing a new AI application or maintaining an existing one, this project aims to make the testing process more efficient and effective. We welcome contributions and feedback from the community to improve and expand this testing framework.
- Python 3.12.x
- PIP or Poetry package manager
- Access to at least one API endpoint:
- Ollama installed and running
- Access to OpenAI API Key
- Access to Anthropic API Key
- AWS CLI set up to use Amazon Bedrock
Clone the repository:

```shell
git clone https://github.com/joelee/GenAI-Prompt-Test-Suites.git
```

Create a `.env` file, example below:
```shell
OLLAMA_API_URL='http://localhost:11434/api/generate'
OPENAI_API_KEY="Your-OpenAI-api-key-here"
ANTHROPIC_API_KEY="Your-Anthropic-api-key-here"
```

See the `.env.sample` file for an example.
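For reference, a `.env` file like the one above is just a list of `KEY=value` pairs. The sketch below is a minimal, stdlib-only loader showing how such a file might be read into the environment; the project itself may use a library such as `python-dotenv` instead, and `load_env_file` is an illustrative name, not part of this project's API.

```python
import os

def load_env_file(path=".env"):
    """Minimal .env loader: parses KEY=value lines, skipping comments
    and blank lines. Existing environment variables are not overwritten."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Strip optional surrounding quotes, as used in the example above
            os.environ.setdefault(key.strip(), value.strip().strip("'\""))
```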
Edit the `config.yaml` file to define your Models and Test Cases.
To use a different configuration file, specify its full path in the
`GENAI_TEST_CONFIG_FILE` environment variable. Only `yaml` and `json` files are supported.
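The configuration-loading behaviour described above could be implemented roughly as follows. This is a hypothetical sketch, not the project's actual code; `load_config` is an illustrative name, and the PyYAML dependency is assumed for `yaml` files.

```python
import json
import os

def load_config(default_path="config.yaml"):
    """Load the test configuration, honouring GENAI_TEST_CONFIG_FILE if set.
    Only yaml and json files are supported."""
    path = os.environ.get("GENAI_TEST_CONFIG_FILE", default_path)
    ext = os.path.splitext(path)[1].lower()
    if ext in (".yaml", ".yml"):
        import yaml  # PyYAML, imported lazily so json-only setups don't need it
        with open(path) as f:
            return yaml.safe_load(f)
    if ext == ".json":
        with open(path) as f:
            return json.load(f)
    raise ValueError(f"Unsupported config file type: {ext}")
```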
The `config.yaml` file is self-explanatory. It contains two sections: API Clients and Test Cases.
Each API Client entry supports the following keys:

- `model`: Name of the AI Model
- `type`: Type of API Client, currently supporting:
  - `ollama`: Ollama API
  - `openai`: OpenAI (ChatGPT) API
  - `anthropic`: Anthropic (Claude) API
  - `bedrock`: Amazon Bedrock
- `max_tokens`: Maximum Tokens
- `temperature`: Temperature
- `system_prompt`: Optional System Prompt
- `disabled`: If `true`, this client is excluded

Example client:

```yaml
clients:
  # Anthropic Claude 3 Haiku model
  - model: claude-3-haiku-20240307
    type: anthropic
    max_tokens: 1000
    temperature: 0
    system_prompt: You are Claude 3 Haiku, an AI assistant.
    disabled: false
```

Each Test Case entry supports the following keys:

- `name`: The name of the Test Case
- `prompt`: User prompt for the Test Case
- `expected`: Expected Test Definitions on the response
- `forbidden`: Forbidden Test Definitions on the response

Each Test Definition supports the following keys:

- `type`: Type of test, currently supporting:
  - `word`: Match a word in the response
  - `substring`: Match a substring in the response (fastest)
  - `regex`: Match a regular expression in the response
- `case_sensitive`: Case-sensitive match (default: `false`)
- `match_all`: Match all of the values (default: `false`)
- `values`: List of values to match
- `multiline`: `regex` only. Match across multiple lines (default: `false`)
- `dotall`: `regex` only. Make the `.` special character match any character at all, including a newline (default: `false`)
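Putting those options together, a single Test Definition could be evaluated against a model response roughly like this. This is a hypothetical sketch of the matching semantics as described above, not the project's actual implementation; `evaluate` is an illustrative name.

```python
import re

def evaluate(response, test):
    """Return True if the response satisfies one Test Definition.
    With match_all, every value must match; otherwise any one suffices."""
    values = test["values"]
    case_sensitive = test.get("case_sensitive", False)
    if test["type"] == "regex":
        flags = 0
        if not case_sensitive:
            flags |= re.IGNORECASE
        if test.get("multiline", False):
            flags |= re.MULTILINE
        if test.get("dotall", False):
            flags |= re.DOTALL
        hits = [bool(re.search(v, response, flags)) for v in values]
    else:
        text = response if case_sensitive else response.lower()
        vals = values if case_sensitive else [v.lower() for v in values]
        if test["type"] == "word":
            words = set(re.findall(r"\w+", text))  # whole-word match
            hits = [v in words for v in vals]
        else:  # substring
            hits = [v in text for v in vals]
    return all(hits) if test.get("match_all", False) else any(hits)
```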
Example Test Case:

```yaml
test_cases:
  - name: "Strawberry test"
    prompt: "Count how many Rs are there in the word strawberry"
    expected:
      - type: word
        case_sensitive: false  # Case insensitive match (default)
        match_all: false       # Match any of the values (default)
        values:
          - three
          - "3"
    forbidden:
      - type: word
        values:
          - two
          - "2"
```

Install dependencies with PIP:

```shell
pip install -r requirements.txt
```

or with Poetry:

```shell
poetry install
```

Run the test suite:

```shell
python src/main.py
```

or via `poetry run`.

The overall flow:

```mermaid
graph TD
    A(["Start"]) --> B[Load and Parse config.yaml]
    B --> C[Initialize Clients]
    C --> D{For Each Client}
    D --> DD{For Each Test Case}
    DD --> E[Send Prompt to Client]
    E --> F[Get Model Response]
    F --> G[Evaluate Model Response]
    G --> H{Run All Expected Tests}
    G --> I{Run All Forbidden Tests}
    H --> J[Test Pass/Fail Report]
    I --> J[Test Pass/Fail Report]
    J --> K[Log Results & Generate Reports]
```
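The flow chart above can be sketched as a small driver loop. This is an illustrative outline, not the project's actual code: `run_suite`, `send_prompt`, and `evaluate` are hypothetical names, with the last two supplied by the caller.

```python
def run_suite(clients, test_cases, send_prompt, evaluate):
    """Run every enabled client against every test case and collect results."""
    report = []
    for client in clients:
        if client.get("disabled"):
            continue  # honour the `disabled` flag from config.yaml
        for case in test_cases:
            response = send_prompt(client, case["prompt"])
            # All `expected` Test Definitions must pass...
            passed = all(evaluate(response, t) for t in case.get("expected", []))
            # ...and no `forbidden` Test Definition may match.
            clean = not any(evaluate(response, t) for t in case.get("forbidden", []))
            report.append({
                "client": client["model"],
                "test": case["name"],
                "passed": passed and clean,
            })
    return report
```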