
Benchmarking

llmsoup includes a built-in benchmarking tool that evaluates routing accuracy — whether prompts are sent to the correct model — and reports cost savings, latency overhead, and memory footprint. It is not a load testing tool; it replays a labeled test set through the routing engine without making actual API calls to model providers.

The benchmark command loads your configuration and a test set of labeled prompts, evaluates each prompt through the signal and routing pipeline, and compares the routed model against the expected model. It then measures routing latency (signal evaluation + rule matching overhead), simulates a concurrent workload to capture memory usage, and calculates cost savings based on model pricing in your config.

A test set is a YAML or JSON file containing labeled test cases. YAML examples are shown below; JSON files with a .json extension are also accepted with the same schema. Each case provides a prompt and the model you expect the router to select.

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| name | string | yes | Name for this test set |
| description | string | no | Optional description |
| test_cases | array | yes | List of labeled test cases |

Each entry in test_cases:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| id | string | yes | Unique identifier for the test case |
| input | object | yes | The simulated request (see below) |
| expected_model | string | yes | Model the router should select |
| expected_decision | string | no | Rule name the router should match |
| prompt_tokens | integer | no | Estimated prompt tokens (enables cost analysis) |
| completion_tokens | integer | no | Estimated completion tokens (enables cost analysis) |

The input object:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| model | string | yes | Model field value; set to "" to let the router decide |
| messages | array | yes | Array of message objects with role and content |

```yaml
name: routing-accuracy-test
description: Verify routing decisions across query categories
test_cases:
  - id: case-001
    input:
      model: ""
      messages:
        - role: user
          content: "What is machine learning?"
    expected_model: gpt-4o
    prompt_tokens: 512
    completion_tokens: 128
  - id: case-002
    input:
      model: ""
      messages:
        - role: user
          content: "Translate 'hello world' to French"
    expected_model: gpt-3.5-turbo
    expected_decision: simple_tasks_route
    prompt_tokens: 256
    completion_tokens: 64
  - id: case-003
    input:
      model: ""
      messages:
        - role: user
          content: "Explain quantum entanglement with proofs"
    expected_model: gpt-4o
    prompt_tokens: 768
    completion_tokens: 4096
```
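
The same schema applies when the test set is a .json file. Below is a minimal single-case equivalent of the example above; a direct field-for-field mapping from the YAML form is assumed:

```json
{
  "name": "routing-accuracy-test",
  "test_cases": [
    {
      "id": "case-001",
      "input": {
        "model": "",
        "messages": [
          { "role": "user", "content": "What is machine learning?" }
        ]
      },
      "expected_model": "gpt-4o",
      "prompt_tokens": 512,
      "completion_tokens": 128
    }
  ]
}
```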

Notes:

  • Set model to "" so the router evaluates signals and selects a model. If you provide a model name, it bypasses routing.
  • expected_decision is optional. When present, the benchmark also validates that the matched rule name matches — useful for verifying specific routing rules fire correctly.
  • prompt_tokens and completion_tokens are optional. If omitted, cost analysis still runs but those cases are excluded from cost calculations with a note in the output.

A workload file (YAML or JSON) controls the concurrent simulation used for memory measurement. This is separate from the test set — the test set defines what to route, while the workload defines how hard to push during the memory footprint test.

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| name | string | yes | | Workload identifier |
| concurrency | integer | no | 1 | Number of concurrent requests |
| duration_secs | integer | no | | Simulation duration in seconds |
| requests_per_second | integer | no | | Target requests per second |

```yaml
name: stress-test
concurrency: 50
duration_secs: 60
requests_per_second: 100
```
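
As noted above, workload files may also be JSON. This is the same workload expressed with the same fields (a direct field-for-field mapping is assumed):

```json
{
  "name": "stress-test",
  "concurrency": 50,
  "duration_secs": 60,
  "requests_per_second": 100
}
```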

When --workload is omitted entirely, the benchmark uses built-in defaults for the memory simulation: 50 concurrent requests, 1 second duration, 50 requests per second. These differ from the schema defaults above, which apply when a workload file is provided but fields are omitted.

For the full flag reference, see the CLI Reference — benchmark.

| Flag | Description |
|------|-------------|
| `--test-set <path>` | Path to labeled test set file (required) |
| `--workload <path>` | Path to workload configuration |
| `--config <path>` | Path to llmsoup configuration (default: config.yaml) |
| `--output <format>` | Output format: text, json, markdown, md |
| `--export <path>` | Write results to a file instead of stdout |
| `--use-models` | Download and enable embedding/domain ML models |

By default, benchmarks run without downloading ML models. Only keyword, language, and latency signals are evaluated. Pass --use-models to enable embedding similarity and domain classification signals — this requires a one-time model download and produces more comprehensive routing evaluations.

--download-models is accepted as an alias for --use-models.
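These flags can be combined. For example, a full evaluation against a non-default configuration, with a custom workload, ML models enabled, and JSON output (the file paths here are illustrative):

```sh
llmsoup benchmark \
  --config configs/router.yaml \
  --test-set test.yaml \
  --workload workload.yaml \
  --use-models \
  --output json
```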

When --export is provided, results are written to the specified file path and a confirmation message is printed to stdout:

Benchmark summary exported to: results.md

If --output is not specified alongside --export, the format automatically defaults to markdown (not text). To export in a different format, specify --output explicitly:

```sh
# Defaults to markdown
llmsoup benchmark --test-set test.yaml --export results.md

# Explicit text format
llmsoup benchmark --test-set test.yaml --output text --export results.txt

# Explicit JSON format
llmsoup benchmark --test-set test.yaml --output json --export results.json
```

Text: human-readable output printed to stdout (the default when --export is not used).

```
llmsoup Benchmark Results
========================
Test Set: routing-accuracy-test (10 cases)
Routing Accuracy: 80.0% (8/10 correct)
Latency (routing overhead):
  Min: 0.1ms | Max: 1.4ms | Mean: 0.5ms
  p50: 0.4ms | p95: 1.1ms | p99: 1.3ms
Memory Footprint:
  Idle RSS: 28 MB
  Loaded RSS (50 rps): 45 MB
Cost Analysis:
  Currency: USD
  Baseline (single model): $0.20
  Routed cost: $0.11
  Savings: 45.0%
  Per-model costs:
    gpt-3.5-turbo: $0.02
    gpt-4o: $0.09
Note: 2 of 10 cases lacked token usage (cost skipped)
```

JSON: machine-readable output for programmatic consumption.

```json
{
  "test_set_name": "routing-accuracy-test",
  "accuracy": {
    "total": 10,
    "correct": 8,
    "accuracy_percent": 80.0
  },
  "latency": {
    "min_ms": 0.1,
    "max_ms": 1.4,
    "mean_ms": 0.5,
    "p50_ms": 0.4,
    "p95_ms": 1.1,
    "p99_ms": 1.3
  },
  "memory": {
    "idle_rss_mb": 28,
    "loaded_rss_mb": 45
  },
  "cost": {
    "baseline_cost": 0.2,
    "actual_cost": 0.11,
    "savings_percent": 45.0,
    "currency": "USD",
    "per_model_costs": {
      "gpt-3.5-turbo": 0.02,
      "gpt-4o": 0.09
    },
    "cases_with_cost": 8,
    "cases_without_cost": 2
  }
}
```
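
The JSON format is convenient for scripting, for example to fail a CI job when routing accuracy drops below a threshold. A minimal sketch using the fields shown above (jq and the 90% threshold are assumptions, not part of llmsoup):

```sh
# Produce JSON results, then gate on routing accuracy
llmsoup benchmark --test-set test.yaml --output json --export results.json

# jq -e exits non-zero when the expression evaluates to false
jq -e '.accuracy.accuracy_percent >= 90' results.json > /dev/null \
  || { echo "Routing accuracy below 90% threshold" >&2; exit 1; }
```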

Markdown: a structured report with tables, suitable for sharing or archiving. This is the default format when using --export.

```md
# llmsoup Benchmark Summary
**Test Set:** routing-accuracy-test | **Date:** 2026-02-26
## Routing Accuracy
| Metric | Value |
|--------|-------|
| Total Cases | 10 |
| Correct | 8 |
| Accuracy | 80.0% |
## Routing Latency
| Percentile | Value |
|-----------|-------|
| Min | 0.1ms |
| Max | 1.4ms |
| Mean | 0.5ms |
| p50 | 0.4ms |
| p95 | 1.1ms |
| p99 | 1.3ms |
## Memory Footprint
| Metric | Value |
|--------|-------|
| Idle RSS | 28MB |
| Loaded RSS (50 rps) | 45MB |
## Cost Analysis
| Metric | Value |
|--------|-------|
| Currency | USD |
| Baseline (single model) | $0.20 |
| Routed cost | $0.11 |
| Savings | 45.0% |
### Per-Model Costs
| Model | Cost |
|-------|------|
| gpt-3.5-turbo | $0.02 |
| gpt-4o | $0.09 |
---
*Generated by llmsoup benchmark*
```

Routing accuracy measures how often the router selects the model you expected. Each test case compares the routed model against expected_model. If expected_decision is also set, the matched rule name must match too.

  • 100% accuracy means every prompt was routed to the expected model.
  • Low accuracy usually indicates signal configuration issues — keywords may not match, embeddings may not be loaded (check --use-models), or rule priorities may need adjustment.

Latency measures routing overhead only — the time spent evaluating signals and matching rules. It does not include model inference time (no API calls are made).

| Metric | Meaning |
|--------|---------|
| Min / Max | Range of routing times across all test cases |
| Mean | Average routing overhead |
| p50 | Median — half of requests were faster than this |
| p95 | 95th percentile — the target for production SLAs |
| p99 | 99th percentile — worst-case excluding outliers |

The llmsoup performance target is p95 < 10ms routing overhead.

| Metric | Meaning |
|--------|---------|
| Idle RSS | Memory usage before any workload (target: ≤ 500 MB) |
| Loaded RSS (50 rps) | Memory under simulated concurrent load (target: ≤ 1 GB) |

Memory is measured using the process RSS (Resident Set Size). The loaded measurement runs the workload simulation defined by --workload (or defaults).

Cost analysis compares what you would pay using a single baseline model versus the routed model mix.

| Metric | Meaning |
|--------|---------|
| Baseline (single model) | Total cost if every request used the most expensive model |
| Routed cost | Total cost using the models the router actually selected |
| Savings | Percentage reduction from routing to cheaper models when appropriate |
| Per-model costs | Breakdown of cost by each model selected |

Cost calculation requires:

  1. Token counts in your test cases (prompt_tokens and completion_tokens fields).
  2. Pricing in your configuration file (under each model’s pricing section).

Cases without token data are skipped and reported: “Note: X of Y cases lacked token usage (cost skipped)”.

The baseline model is determined by the cost_baseline_model configuration field. If not set, llmsoup uses the most expensive model as the baseline.
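
Putting these pieces together, the cost-related parts of a configuration might look like the sketch below. cost_baseline_model and the per-model pricing section are the documented hooks; the surrounding model-list layout, the inner pricing keys, and the rates are hypothetical placeholders, so consult your configuration schema for the exact names:

```yaml
# Sketch only: model-list layout, inner pricing keys, and rates are placeholders
cost_baseline_model: gpt-4o        # optional; defaults to the most expensive model

models:
  - name: gpt-4o
    pricing:                       # per-model pricing used for cost analysis
      prompt_per_1k_tokens: 0.005
      completion_per_1k_tokens: 0.015
  - name: gpt-3.5-turbo
    pricing:
      prompt_per_1k_tokens: 0.0005
      completion_per_1k_tokens: 0.0015
```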

Save as my-test-set.yaml:

```yaml
name: my-routing-test
description: Validate that coding questions go to the code model
test_cases:
  - id: coding-question
    input:
      model: ""
      messages:
        - role: user
          content: "Write a Python function to sort a list"
    expected_model: gpt-4o
    prompt_tokens: 256
    completion_tokens: 512
  - id: simple-question
    input:
      model: ""
      messages:
        - role: user
          content: "What is the capital of France?"
    expected_model: gpt-3.5-turbo
    prompt_tokens: 128
    completion_tokens: 64
  - id: translation-task
    input:
      model: ""
      messages:
        - role: user
          content: "Translate 'good morning' to Japanese"
    expected_model: gpt-3.5-turbo
    expected_decision: simple_tasks_route
    prompt_tokens: 128
    completion_tokens: 64
```

Then run the benchmark:

```sh
# Quick run with keyword/language signals only
llmsoup benchmark --test-set my-test-set.yaml

# Full run with embedding and domain signals
llmsoup benchmark --test-set my-test-set.yaml --use-models

# Export results to a markdown file
llmsoup benchmark --test-set my-test-set.yaml --export benchmark-results.md
```

Sample output:

```
llmsoup Benchmark Results
========================
Test Set: my-routing-test (3 cases)
Routing Accuracy: 66.7% (2/3 correct)
Latency (routing overhead):
  Min: 0.1ms | Max: 0.8ms | Mean: 0.3ms
  p50: 0.2ms | p95: 0.7ms | p99: 0.8ms
Memory Footprint:
  Idle RSS: 24 MB
  Loaded RSS (50 rps): 38 MB
Cost Analysis:
  Currency: USD
  Baseline (single model): $0.10
  Routed cost: $0.06
  Savings: 40.0%
  Per-model costs:
    gpt-3.5-turbo: $0.01
    gpt-4o: $0.05
```

In this example, 2 of 3 cases routed correctly (66.7% accuracy). To improve:

  • Check which case failed — run with LLMSOUP_LOG=debug to see per-case routing decisions (example command after this list).
  • Review signal configuration — ensure keywords, embeddings, or domain rules cover the failing prompt.
  • Enable ML models — if running without --use-models, embedding and domain signals are inactive. Adding --use-models may improve accuracy.
  • Adjust rule priorities — if the wrong rule matches first, lower its priority or tighten its conditions.
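
For the first point, a debug run looks like this (the environment variable comes from the tip above; the file name is illustrative):

```sh
LLMSOUP_LOG=debug llmsoup benchmark --test-set my-test-set.yaml
```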