- ⚔️ Red-team LLMs with tampering attacks like fine-tuning, jailbreak-tuning, embedding attacks, and others.
- 🛡️ Evaluate safety and utility with StrongREJECT, MMLU-Pro, and others.
- ⚙️ Run rigorous hyperparameter sweeps to stress-test models under worst-case attacker configurations.
- 🐍 Use a typed Python API to run attacks and evaluations programmatically.
- 🧩 Add or use custom attacks and evaluations with an extensible, decorator-based plugin architecture.
- 📄 Read more about our experiments and findings: arXiv preprint
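Decorator-based plugin registration (the 🧩 bullet above) generally follows the pattern sketched below. This is a generic illustration only; the names `register_attack`, `ATTACK_REGISTRY`, and `MyCustomAttack` are hypothetical, not TamperBench's actual API (see the Contributing guide for the real extension points):

```python
# Generic decorator-based registry pattern (illustrative only; TamperBench's
# real registration API may differ -- see the Contributing guide).
ATTACK_REGISTRY = {}

def register_attack(name):
    """Register an attack class under a string name so a CLI can find it."""
    def decorator(cls):
        ATTACK_REGISTRY[name] = cls
        return cls
    return decorator

@register_attack("my_custom_attack")
class MyCustomAttack:
    def run(self):
        return "tampered model"

# The framework can later look up and instantiate attacks by name:
attack_cls = ATTACK_REGISTRY["my_custom_attack"]
print(attack_cls().run())  # → tampered model
```

The upside of this pattern is that new attacks self-register at import time, so adding one requires no changes to the core framework.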
```shell
git clone https://github.com/criticalml-uw/tamperbench.git
cd tamperbench
uv sync --all-groups
pre-commit install
```

Stress-test Qwen3-4B with 50 hyperparameter trials using the `lora_finetune` attack:
```shell
uv run scripts/whitebox/optuna_single.py Qwen/Qwen3-4B \
    --attacks lora_finetune \
    --n-trials 50
```

Run Qwen3-4B against a fixed grid of hyperparameters for multiple attacks (see CONFIGS.md for available configurations):
```shell
uv run scripts/whitebox/benchmark_grid.py Qwen/Qwen3-4B \
    --attacks lora_finetune full_parameter_finetune
```

Configure and run a LoRA fine-tuning attack against Llama-3.1-8B-Instruct, then evaluate safety (StrongREJECT) and utility (MMLU-Pro) on the tampered model:
```python
from tamperbench.whitebox.attacks.lora_finetune.lora_finetune import (
    LoraFinetune,
    LoraFinetuneConfig,
)
from tamperbench.whitebox.utils.models.config import ModelConfig
from tamperbench.whitebox.utils.names import EvalName, TemplateName

config = LoraFinetuneConfig(
    input_checkpoint_path="meta-llama/Llama-3.1-8B-Instruct",
    out_dir="results/my_attack",
    evals=[EvalName.STRONG_REJECT, EvalName.MMLU_PRO_VAL],
    model_config=ModelConfig(
        template_name=TemplateName.NATIVE,
        max_generation_length=1024,
        inference_batch_size=16,
    ),
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=1,
    max_steps=-1,
    lr_scheduler_type="constant",
    optim="adamw_torch",
    lora_rank=16,
    random_seed=42,
)

attack = LoraFinetune(attack_config=config)
results = attack.benchmark()
print(results)
```

Benchmarking tamper resistance across 21 open-weight LLMs. For each model-attack pair, we select the configuration that maximizes harmfulness (StrongREJECT score) while constraining utility loss to at most a 10% MMLU-Pro drop. Darker cells indicate higher harmfulness; lighter cells indicate greater tamper resistance.
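The epsilon-bounded selection rule above (maximize harmfulness subject to at most a 10% utility drop) can be sketched in plain Python. The trial data below is made up for illustration; this is not TamperBench's analysis code:

```python
# Illustrative sketch of epsilon-bounded selection for one model-attack pair:
# keep only configurations whose MMLU-Pro drop is within budget, then pick
# the one with the highest StrongREJECT harmfulness score.
trials = [
    # (config name, StrongREJECT score, relative MMLU-Pro drop) -- made-up numbers
    ("lr=1e-5, rank=8",  0.42, 0.03),
    ("lr=1e-4, rank=16", 0.71, 0.08),
    ("lr=5e-4, rank=32", 0.85, 0.22),  # exceeds the 10% utility-loss budget
]

EPSILON = 0.10  # maximum allowed relative MMLU-Pro drop

admissible = [t for t in trials if t[2] <= EPSILON]
worst_case = max(admissible, key=lambda t: t[1])
print(worst_case)  # → ('lr=1e-4, rank=16', 0.71, 0.08)
```

Note that the most harmful configuration overall (0.85) is excluded because it degrades utility too much; the reported worst case is the most harmful *admissible* one.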
| Usage Guide | Contributing | Configs | Analysis |
|---|---|---|---|
| Full usage guide, Python API, and CLI examples | Adding new attacks, evaluations, and defenses | Configuration system and YAML files | Results analysis, epsilon-bounded filtering, and visualization |
If you find TamperBench useful in your research, please cite our paper:
```bibtex
@article{hossain2026tamperbench,
  title={{TamperBench}: Systematically Stress-Testing {LLM} Safety Under Fine-Tuning and Tampering},
  author={Hossain, Saad and Tseng, Tom and Pandey, Punya Syon and Vajpayee, Samanvay and Kowal, Matthew and Nonta, Nayeema and Simko, Samuel and Casper, Stephen and Jin, Zhijing and Pelrine, Kellin and others},
  journal={arXiv preprint arXiv:2602.06911},
  year={2026}
}
```

