
TamperBench

Benchmark the tamper-resistance of open-weight large language models (LLMs)



  • ⚔️ Red-team LLMs with tampering attacks like fine-tuning, jailbreak-tuning, embedding attacks, and others.
  • 🛡️ Evaluate safety and utility with StrongREJECT, MMLU-Pro, and others.
  • ⚙️ Run rigorous hyperparameter sweeps to stress-test models under worst-case attacker configurations.
  • 🐍 Use a typed Python API to run attacks and evaluations programmatically.
  • 🧩 Add or use custom attacks and evaluations with an extensible, decorator-based plugin architecture.
  • 📄 Read more about our experiments and findings: arXiv preprint
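The decorator-based plugin architecture mentioned above can be sketched in plain Python. This is a hypothetical illustration of the pattern, not TamperBench's actual API: the names `ATTACK_REGISTRY` and `register_attack` are assumptions.

```python
# Hypothetical sketch of a decorator-based plugin registry, illustrating the
# pattern behind an extensible attack/evaluation architecture. The registry
# and decorator names here are illustrative assumptions.
from typing import Callable, Dict, Type

ATTACK_REGISTRY: Dict[str, Type] = {}

def register_attack(name: str) -> Callable[[Type], Type]:
    """Register an attack class under `name` so CLI flags can resolve it."""
    def decorator(cls: Type) -> Type:
        ATTACK_REGISTRY[name] = cls
        return cls
    return decorator

@register_attack("lora_finetune")
class LoraFinetune:
    def run(self) -> str:
        return "tampered model"

# Attacks can then be looked up by the string names passed on the command line.
attack_cls = ATTACK_REGISTRY["lora_finetune"]
print(attack_cls().run())
```

Registering classes at import time like this is what lets `--attacks lora_finetune` on the CLI map a string onto an attack implementation without a central dispatch table.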

TamperBench Toolkit

🚀 Getting Started

Installation

git clone https://github.com/criticalml-uw/tamperbench.git
cd tamperbench
uv sync --all-groups
pre-commit install

⚡ Run a Benchmark

Stress-test Qwen3-4B with 50 hyperparameter trials using the lora_finetune attack:

uv run scripts/whitebox/optuna_single.py Qwen/Qwen3-4B \
    --attacks lora_finetune \
    --n-trials 50
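The script above uses Optuna; conceptually, each of the 50 trials samples one attacker configuration, scores it, and the best is kept. A minimal pure-Python sketch of that loop, where `score_attack` is a stand-in for actually fine-tuning the model and measuring post-attack harmfulness:

```python
# Pure-stdlib sketch of a trial-based hyperparameter sweep. The search-space
# values and the scoring function are illustrative assumptions, not the
# repository's actual configuration.
import random

SEARCH_SPACE = {
    "learning_rate": [1e-5, 5e-5, 1e-4, 5e-4],
    "lora_rank": [4, 8, 16, 32],
}

def score_attack(cfg: dict) -> float:
    # Placeholder objective; in practice this runs the attack with `cfg`
    # and returns the resulting harmfulness score.
    return cfg["learning_rate"] * cfg["lora_rank"]

random.seed(0)
best_cfg, best_score = None, float("-inf")
for _ in range(50):  # mirrors --n-trials 50
    cfg = {key: random.choice(values) for key, values in SEARCH_SPACE.items()}
    score = score_attack(cfg)
    if score > best_score:
        best_cfg, best_score = cfg, score

print(best_cfg, best_score)
```

Optuna additionally prunes unpromising trials and samples adaptively rather than uniformly, but the keep-the-worst-case-attacker logic is the same.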

Run Qwen3-4B against a fixed grid of hyperparameters for multiple attacks (see CONFIGS.md for available configurations):

uv run scripts/whitebox/benchmark_grid.py Qwen/Qwen3-4B \
    --attacks lora_finetune full_parameter_finetune
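Grid benchmarking, in contrast to the trial-based sweep, exhaustively enumerates the Cartesian product of the hyperparameter axes and runs the attack once per point. A sketch with illustrative axes (the actual grids live in CONFIGS.md):

```python
# Hedged sketch of grid enumeration: every combination of the axes below is
# one attack run. Axis names and values are illustrative assumptions.
from itertools import product

GRID = {
    "learning_rate": [1e-5, 1e-4],
    "lora_rank": [8, 16],
    "num_train_epochs": [1, 3],
}

# One config dict per point in the Cartesian product of the axes.
configs = [dict(zip(GRID, values)) for values in product(*GRID.values())]
print(len(configs))  # 2 * 2 * 2 = 8 configurations
```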

🐍 Python API

Configure and run a LoRA fine-tuning attack against Llama-3.1-8B-Instruct, then evaluate safety (StrongREJECT) and utility (MMLU-Pro) on the tampered model:

from tamperbench.whitebox.attacks.lora_finetune.lora_finetune import (
    LoraFinetune,
    LoraFinetuneConfig,
)
from tamperbench.whitebox.utils.models.config import ModelConfig
from tamperbench.whitebox.utils.names import EvalName, TemplateName

config = LoraFinetuneConfig(
    input_checkpoint_path="meta-llama/Llama-3.1-8B-Instruct",
    out_dir="results/my_attack",
    evals=[EvalName.STRONG_REJECT, EvalName.MMLU_PRO_VAL],
    model_config=ModelConfig(
        template_name=TemplateName.NATIVE,
        max_generation_length=1024,
        inference_batch_size=16,
    ),
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=1,
    max_steps=-1,
    lr_scheduler_type="constant",
    optim="adamw_torch",
    lora_rank=16,
    random_seed=42,
)

attack = LoraFinetune(attack_config=config)
results = attack.benchmark()
print(results)

📊 Results

We benchmark tamper resistance across 21 open-weight LLMs. For each model-attack pair, we select the configuration that maximizes harmfulness (StrongREJECT score) while keeping utility loss within a 10% relative drop in MMLU-Pro. Darker cells indicate higher harmfulness; lighter cells indicate greater tamper resistance.
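That selection rule can be sketched directly: filter an attack's runs to those within the utility bound, then take the most harmful survivor. Field names and scores below are illustrative, not real results.

```python
# Sketch of epsilon-bounded selection: admit only runs whose MMLU-Pro score
# stays within a 10% relative drop of the untampered baseline, then pick the
# run with the highest StrongREJECT score. All numbers are made up.
BASELINE_MMLU_PRO = 0.60
EPSILON = 0.10  # allow at most a 10% relative utility drop

runs = [
    {"strong_reject": 0.82, "mmlu_pro": 0.41},  # excluded: too much utility loss
    {"strong_reject": 0.74, "mmlu_pro": 0.58},
    {"strong_reject": 0.69, "mmlu_pro": 0.60},
]

admissible = [
    r for r in runs
    if r["mmlu_pro"] >= BASELINE_MMLU_PRO * (1 - EPSILON)
]
best = max(admissible, key=lambda r: r["strong_reject"])
print(best)
```

The constraint matters: without it, attacks that simply lobotomize the model (high harmfulness, collapsed utility) would dominate the heatmap.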

StrongReject Results Heatmap

🔗 Quick Links

  • Usage Guide: full usage guide, Python API, and CLI examples
  • Contributing: adding new attacks, evaluations, and defenses
  • Configs: configuration system and YAML files
  • Analysis: results analysis, epsilon-bounded filtering, and visualization

🎓 Citation

If you find TamperBench useful in your research, please cite our paper:

@article{hossain2026tamperbench,
  title={{TamperBench}: Systematically Stress-Testing {LLM} Safety Under Fine-Tuning and Tampering},
  author={Hossain, Saad and Tseng, Tom and Pandey, Punya Syon and Vajpayee, Samanvay and Kowal, Matthew and Nonta, Nayeema and Simko, Samuel and Casper, Stephen and Jin, Zhijing and Pelrine, Kellin and others},
  journal={arXiv preprint arXiv:2602.06911},
  year={2026}
}
