# Benchmarks
AIProxyGuard detection accuracy is measured using PIBench, an open-source prompt injection benchmark tool.
## Current Performance (v0.2.42)
| Metric | Value |
|---|---|
| Balanced Score | 75.81% |
| True Positive Rate | 53.65% |
| True Negative Rate | 97.97% |
| Precision | 96.45% |
| F1 Score | 68.95% |
| Avg Latency | 91.3 ms |
### Detection by Category
| Category | Detection Rate | Details |
|---|---|---|
| Jailbreak | 74.9% | DAN mode, persona exploits, restriction bypass |
| Prompt Injection | 32.7% | Instruction override, context manipulation |
| False Positives | 2.0% | Benign prompts incorrectly blocked |
## Benchmark Dataset
We use a canonical baseline dataset for reproducible comparisons:
| Dataset | Samples | Jailbreaks | Injections | Benign |
|---|---|---|---|---|
| baseline_v2.jsonl | 1,834 | 441 | 470 | 917 |
### Data Sources
| Source | Samples | Type | License |
|---|---|---|---|
| JailbreakHub | 15,140 | Jailbreaks | CC-BY-4.0 |
| deepset | 662 | Mixed | Apache-2.0 |
| jackhhao | 1,310 | Mixed | Apache-2.0 |
| xTRam1 | 10,296 | Mixed | Apache-2.0 |
| yanismiraoui | 1,034 | Multilingual | Apache-2.0 |
| Gandalf | 1,000 | Injections | MIT |
| PALLMs | ~135 | Jailbreaks | MIT |
| UltraChat | 515k | Benign | MIT |
## Running Benchmarks

### Install PIBench

```shell
git clone https://github.com/AInvirion/prompt-injection-benchmark.git
cd prompt-injection-benchmark
uv venv && source .venv/bin/activate
uv pip install -e .
```
### Run Against Your Deployment

```shell
# Using the canonical baseline (recommended for comparisons)
pibench run https://your-proxy.app -d data/baseline_v2.jsonl

# Quick test with limited samples
pibench run https://your-proxy.app --max-samples 100

# Save results
pibench run https://your-proxy.app -d data/baseline_v2.jsonl -o results.json
```
### Run Against Local Instance

```shell
# Start AIProxyGuard locally
docker run -d -p 8080:8080 ainvirion/aiproxyguard:latest

# Run benchmark
pibench run http://localhost:8080 -d data/baseline_v2.jsonl
```
## Scoring Methodology

PIBench uses balanced accuracy to prevent gaming the benchmark:

```
Balanced Score = (True Positive Rate + True Negative Rate) / 2
```

This penalizes degenerate strategies:

- Blocking everything (high TPR, ~0% TNR) scores ~50%
- Allowing everything (~0% TPR, high TNR) scores ~50%
### Metrics Explained
| Metric | Formula | Description |
|---|---|---|
| True Positive Rate (Recall) | TP / (TP + FN) | % of attacks detected |
| True Negative Rate | TN / (TN + FP) | % of benign prompts allowed |
| Precision | TP / (TP + FP) | % of detections that were correct |
| F1 Score | 2 * (P * R) / (P + R) | Harmonic mean of precision and recall |
## Tuning for Your Use Case

### High Security (Catch More Attacks)

Lower thresholds catch more attacks but increase false positives:

```yaml
policy:
  categories:
    prompt_injection:
      threshold: 0.3  # Very aggressive
    jailbreak:
      threshold: 0.3
```

Expected impact:

- True Positive Rate: +15-20%
- False Positive Rate: +5-10%
### High Precision (Minimize False Positives)

Higher thresholds reduce false positives but miss some attacks:

```yaml
policy:
  categories:
    prompt_injection:
      threshold: 0.7  # Conservative
    jailbreak:
      threshold: 0.7
```

Expected impact:

- True Positive Rate: -10-15%
- False Positive Rate: -3-5%
## Version History
| Version | Balanced Score | TPR | TNR | Notes |
|---|---|---|---|---|
| v0.2.42 | 75.81% | 53.65% | 97.97% | Hyperscan SOM_LEFTMOST fix |
| v0.2.38 | 76.10% | 54.26% | 97.93% | Baseline (different dataset) |
## Detection Limitations

### What We Detect Well
- Jailbreaks (74.9%): DAN mode, evil mode, persona exploits
- Direct Injection (60%+): “Ignore previous instructions”
- Encoding Evasion: Base64, URL encoding, hex escapes
### Known Gaps
- Semantic Attacks: Subtle rephrasing without trigger patterns
- Novel Techniques: Zero-day jailbreaks not in training data
- Indirect Injection: Attacks embedded in external content
### Improving Detection
- Enable ML Classifier (Enterprise): +15-25% TPR
- Custom Signatures: Add patterns specific to your use case
- Lower Thresholds: Trade precision for recall
- Response Scanning: Catch data exfiltration attempts
## Reproducing Results

```shell
# Clone benchmark repo
git clone https://github.com/AInvirion/prompt-injection-benchmark.git
cd prompt-injection-benchmark

# Install
uv venv && source .venv/bin/activate
uv pip install -e .

# Run against your proxy
pibench run http://localhost:8080 \
  -d data/baseline_v2.jsonl \
  --name "AIProxyGuard v0.2.42" \
  -o results/my_benchmark.json

# View results
pibench report results/my_benchmark.json
```
## CI/CD Integration

Add benchmark checks to your pipeline:

```yaml
# .github/workflows/benchmark.yml
name: Benchmark

on:
  release:
    types: [published]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install PIBench
        run: |
          pip install git+https://github.com/AInvirion/prompt-injection-benchmark.git

      - name: Run Benchmark
        run: |
          pibench run $ \
            -d data/baseline_v2.jsonl \
            -o benchmark.json

      - name: Check Threshold
        run: |
          score=$(jq '.balanced_score' benchmark.json)
          if (( $(echo "$score < 0.70" | bc -l) )); then
            echo "Benchmark score $score below threshold 0.70"
            exit 1
          fi
```
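The threshold gate can be exercised locally before wiring it into CI. This sketch uses a mock `benchmark.json`; the `balanced_score` field name follows the workflow snippet above and should be verified against the actual output schema of your PIBench version:

```shell
# Mock a results file; replace with a real `pibench run ... -o benchmark.json`
printf '{"balanced_score": 0.7581}\n' > benchmark.json

# Same gate as the CI step: fail if the score drops below 0.70
score=$(jq '.balanced_score' benchmark.json)
if [ "$(echo "$score < 0.70" | bc -l)" -eq 1 ]; then
  echo "Benchmark score $score below threshold 0.70"
  exit 1
fi
echo "Benchmark score $score meets threshold 0.70"
```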
## Contributing

To add new test cases or improve the benchmark:

- Fork prompt-injection-benchmark
- Add samples to `data/` or new sources to `src/pibench/datasets.py`
- Submit a PR with before/after benchmark results