ai-stability is a CLI-first LLM stability analyzer for developers who want to measure output consistency, detect prompt variance, and inspect unstable model behavior locally.
It runs the same prompt multiple times against the same model, compares the responses, computes a simple stability score, and saves a local JSON artifact for replay and debugging.
- install: `pipx install ai-stability`
- run: `ai-stability run prompt.txt --n 5 --provider openai --model gpt-4.1-mini`
- get: repeated outputs, similarity scores, a stability label, inline diffs, and a saved JSON artifact
- prompt engineers testing reliability
- AI application developers debugging flaky model behavior
- teams comparing model changes before shipping
- developers who want a local CLI, not a hosted eval platform
Example prompt file:
Explain why deterministic prompts can still produce non-deterministic LLM outputs in exactly three sentences.
Example output:
Summary
- Average similarity: 82.64%
- Stability score: 83/100
- Stability label: High stability
Diff
[- hidden system behavior -] [+ internal state shifts +]
LLM outputs often vary even when the prompt, model, and calling code stay the same. That makes it harder to:
- evaluate prompt reliability
- spot regressions during model upgrades
- understand whether output drift is minor wording variance or meaningful behavior change
- build confidence in AI-powered developer tooling
ai-stability is intentionally narrow and local-first:
- one prompt file in
- repeated model calls
- simple, explicit similarity scoring
- readable terminal output
- JSON artifact saved locally for replay and debugging
- CLI-first workflow with no database, dashboard, or hosted backend
- repeated prompt execution against the same model
- explicit pairwise similarity and aggregate stability scoring
- run-by-run output review
- inline reference-vs-run diffing for fast variance inspection
- local JSON artifact saving for debugging and replay
- provider abstraction with OpenAI implemented first
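The provider boundary can be pictured as a tiny abstract base class. The names below (`Provider`, `complete`, `EchoProvider`) are illustrative, not the package's actual API — a minimal sketch of the abstraction, assuming one method that turns a prompt into text:

```python
from abc import ABC, abstractmethod


class Provider(ABC):
    """Minimal provider boundary: one method, prompt in, text out."""

    @abstractmethod
    def complete(self, prompt: str, model: str, temperature: float) -> str:
        ...


class EchoProvider(Provider):
    # Stand-in provider useful for tests; a real OpenAI-backed provider
    # would call the OpenAI API here instead of echoing the prompt.
    def complete(self, prompt: str, model: str, temperature: float) -> str:
        return prompt


provider: Provider = EchoProvider()
print(provider.complete("hello", "gpt-4.1-mini", 1.0))
```

Keeping the boundary this small is what makes adding a second provider (e.g. Anthropic later) a matter of implementing one method.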
- Python 3.11+
- An OpenAI API key in `OPENAI_API_KEY`
Install from PyPI:

```
pipx install ai-stability
```

Or, for local development, create a virtual environment and install in editable mode:

```
python -m venv .venv
.venv\Scripts\activate
python -m pip install -e .[dev]
```

Set your API key in the shell:

```
$env:OPENAI_API_KEY="your_api_key"
```

You can copy .env.example for reference, but the CLI reads the key from the environment.
Run the analyzer:

```
ai-stability run prompt.txt --n 5 --provider openai --model gpt-4.1-mini
```

If you want to invoke it through the module instead of the installed script:

```
python -m ai_stability run prompt.txt --n 5 --provider openai --model gpt-4.1-mini
```

Example with a custom JSON output path:

```
ai-stability run prompt.txt --n 5 --provider openai --model gpt-4.1-mini --out results\sample-run.json
```

Usage:

```
ai-stability run PROMPT_FILE --n 5 --provider openai --model MODEL_NAME
```

Current options:

- `--n`: number of repeated runs, minimum 2
- `--provider`: currently `openai`
- `--model`: target model name
- `--temperature`: sampling temperature, default `1.0`
- `--out`: optional output file or output directory for the JSON artifact
The v1 scoring heuristic is intentionally simple and inspectable:
- normalize whitespace in each output
- compute pairwise text similarity with Python's `difflib.SequenceMatcher`
- average all pairwise similarity scores
- convert the average to a 0-100 stability score
Stability labels:
- 80-100: High stability
- 50-79: Medium stability
- 0-49: Low stability
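The steps above can be sketched in a few lines of standard-library Python. Function names here are illustrative, not the package's actual API:

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text: str) -> str:
    # Collapse runs of whitespace so formatting noise doesn't count as drift.
    return " ".join(text.split())


def stability_score(outputs: list[str]) -> tuple[float, int, str]:
    # Pairwise similarity over every unordered pair of runs.
    pairs = [
        SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        for a, b in combinations(outputs, 2)
    ]
    avg = sum(pairs) / len(pairs)
    score = round(avg * 100)
    if score >= 80:
        label = "High stability"
    elif score >= 50:
        label = "Medium stability"
    else:
        label = "Low stability"
    return avg, score, label


# Three runs that differ only in whitespace score as perfectly stable.
print(stability_score(["same text", "same text", "same   text"]))
```

Because `SequenceMatcher.ratio()` is already in `[0, 1]`, the 0-100 conversion is just a scale-and-round, which keeps the whole pipeline easy to audit by hand.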
- summary first
- average and pairwise similarity
- final stability score and label
- each run output
- a simple reference-vs-run diff for variation review
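The reference-vs-run diff can be approximated with `difflib` opcodes. This sketch assumes word-level granularity and the `[- ... -]` / `[+ ... +]` markers shown in the example output above; the function name is hypothetical:

```python
import difflib


def inline_diff(ref: str, other: str) -> str:
    # Word-level inline diff: [- ... -] marks reference-only words,
    # [+ ... +] marks words that only appear in the compared run.
    ref_words = ref.split()
    other_words = other.split()
    out = []
    matcher = difflib.SequenceMatcher(None, ref_words, other_words)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(ref_words[i1:i2])
        else:
            if i2 > i1:
                out.append("[- " + " ".join(ref_words[i1:i2]) + " -]")
            if j2 > j1:
                out.append("[+ " + " ".join(other_words[j1:j2]) + " +]")
    return " ".join(out)


reference = "The sky appears blue due to Rayleigh scattering."
run = "The sky looks blue because of Rayleigh scattering."
print(inline_diff(reference, run))
```

Word-level diffing keeps minor rephrasings readable at a glance, where a character-level diff would fragment into noise.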
By default, results are written to `results/ai-stability-YYYYMMDD-HHMMSS.json`.
The JSON artifact includes:
- prompt metadata
- provider and model
- all collected outputs
- pairwise similarities
- stability score and label
- human-readable diffs
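An artifact might look roughly like the sketch below. The field names are assumptions for illustration, not the exact schema, though the filename mirrors the default `results/` naming scheme:

```python
import json
from datetime import datetime

# Illustrative artifact shape -- field names are assumptions, not the exact schema.
artifact = {
    "prompt_file": "prompt.txt",
    "provider": "openai",
    "model": "gpt-4.1-mini",
    "runs": 5,
    "outputs": ["run 1 text...", "run 2 text..."],
    "pairwise_similarities": [0.84, 0.81, 0.83],
    "stability_score": 83,
    "stability_label": "High stability",
    "diffs": ["[- hidden system behavior -] [+ internal state shifts +]"],
}

# Default-style output path: results/ai-stability-YYYYMMDD-HHMMSS.json
path = f"results/ai-stability-{datetime.now():%Y%m%d-%H%M%S}.json"
payload = json.dumps(artifact, indent=2)
print(path)
```

Because everything needed to reproduce the analysis lives in one flat JSON file, replaying or re-scoring a past run requires no database or hosted service.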
```
ai-stability run prompt.txt --n 5 --provider openai --model gpt-4.1-mini
```

Use this when you want to compare how stable a model is for a fixed prompt before shipping a prompt change, swapping models, or debugging flaky output behavior.
Run the tests:

```
python -m pytest
```

Project layout:

```
src/ai_stability/
  cli.py
  runner.py
  scoring.py
  diffing.py
  output.py
  storage.py
  providers/
    base.py
    openai_provider.py
tests/
  test_scoring.py
  test_runner.py
```
ai-stability is published on PyPI.
Releases are published from GitHub Actions with PyPI Trusted Publishing.
Typical release flow:
- update the version in `pyproject.toml` and `src/ai_stability/__init__.py`
- commit and push the release commit
- create and push a Git tag like `vX.Y.Z`
- let the `publish.yml` workflow run tests, build distributions, publish to PyPI, and create or update the matching GitHub release automatically
PyPI Trusted Publishing still requires one-time configuration on PyPI for this repository before automated publishing will succeed.
Example:

```
git tag vX.Y.Z
git push origin vX.Y.Z
```

Key modules:

- `src/ai_stability/cli.py`
- `src/ai_stability/runner.py`
- `src/ai_stability/scoring.py`
- `src/ai_stability/providers/openai_provider.py`
- V1 runs requests sequentially on purpose.
- Only OpenAI is implemented, but the provider boundary is small and ready for Anthropic later.
- The scoring heuristic is intentionally simple and inspectable rather than statistically sophisticated.