# Benchmarking
llmsoup includes a built-in benchmarking tool that evaluates routing accuracy — whether prompts are sent to the correct model — and reports cost savings, latency overhead, and memory footprint. It is not a load testing tool; it replays a labeled test set through the routing engine without making actual API calls to model providers.
## How it works

The benchmark command loads your configuration and a test set of labeled prompts, evaluates each prompt through the signal and routing pipeline, and compares the routed model against the expected model. It then measures routing latency (signal evaluation + rule matching overhead), simulates a concurrent workload to capture memory usage, and calculates cost savings based on model pricing in your config.
## Test set format

A test set is a YAML or JSON file containing labeled test cases. YAML examples are shown below; JSON files with a .json extension are also accepted with the same schema. Each case provides a prompt and the model you expect the router to select.
### Schema

| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | yes | Name for this test set |
| `description` | string | no | Optional description |
| `test_cases` | array | yes | List of labeled test cases |
Each entry in `test_cases`:

| Field | Type | Required | Description |
|---|---|---|---|
| `id` | string | yes | Unique identifier for the test case |
| `input` | object | yes | The simulated request (see below) |
| `expected_model` | string | yes | Model the router should select |
| `expected_decision` | string | no | Rule name the router should match |
| `prompt_tokens` | integer | no | Estimated prompt tokens (enables cost analysis) |
| `completion_tokens` | integer | no | Estimated completion tokens (enables cost analysis) |
The `input` object:

| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | yes | Model field value; set to `""` to let the router decide |
| `messages` | array | yes | Array of message objects with `role` and `content` |
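If you generate test sets programmatically, it can help to validate them before a run. Here is a minimal Python sketch, assuming the schema above; the loader itself is hypothetical and not part of llmsoup:

```python
import yaml  # pip install pyyaml

REQUIRED_CASE_FIELDS = {"id", "input", "expected_model"}

def load_test_set(path: str) -> dict:
    """Load a test set file and check the required fields from the schema above."""
    with open(path) as f:
        data = yaml.safe_load(f)
    for key in ("name", "test_cases"):
        if key not in data:
            raise ValueError(f"test set missing required field: {key}")
    for case in data["test_cases"]:
        missing = REQUIRED_CASE_FIELDS - case.keys()
        if missing:
            raise ValueError(f"case {case.get('id', '?')} missing: {missing}")
        if not {"model", "messages"} <= case["input"].keys():
            raise ValueError(f"case {case['id']}: input needs model and messages")
    return data
```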
### Example test set

```yaml
name: routing-accuracy-test
description: Verify routing decisions across query categories

test_cases:
  - id: case-001
    input:
      model: ""
      messages:
        - role: user
          content: "What is machine learning?"
    expected_model: gpt-4o
    prompt_tokens: 512
    completion_tokens: 128

  - id: case-002
    input:
      model: ""
      messages:
        - role: user
          content: "Translate 'hello world' to French"
    expected_model: gpt-3.5-turbo
    expected_decision: simple_tasks_route
    prompt_tokens: 256
    completion_tokens: 64

  - id: case-003
    input:
      model: ""
      messages:
        - role: user
          content: "Explain quantum entanglement with proofs"
    expected_model: gpt-4o
    prompt_tokens: 768
    completion_tokens: 4096
```

Notes:

- Set `model` to `""` so the router evaluates signals and selects a model. If you provide a model name, it bypasses routing.
- `expected_decision` is optional. When present, the benchmark also validates that the matched rule name matches — useful for verifying specific routing rules fire correctly.
- `prompt_tokens` and `completion_tokens` are optional. If omitted, cost analysis still runs but those cases are excluded from cost calculations with a note in the output.
## Workload configuration

A workload file (YAML or JSON) controls the concurrent simulation used for memory measurement. This is separate from the test set — the test set defines what to route, while the workload defines how hard to push during the memory footprint test.
### Schema

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `name` | string | yes | — | Workload identifier |
| `concurrency` | integer | no | 1 | Number of concurrent requests |
| `duration_secs` | integer | no | — | Simulation duration in seconds |
| `requests_per_second` | integer | no | — | Target requests per second |
### Example workload

```yaml
name: stress-test
concurrency: 50
duration_secs: 60
requests_per_second: 100
```

When `--workload` is omitted entirely, the benchmark uses built-in defaults for the memory simulation: 50 concurrent requests, 1 second duration, 50 requests per second. These differ from the schema defaults above, which apply when a workload file is provided but fields are omitted.
## CLI flags

For the full flag reference, see the CLI Reference — benchmark.
| Flag | Description |
|---|---|
| `--test-set <path>` | Path to labeled test set file (required) |
| `--workload <path>` | Path to workload configuration |
| `--config <path>` | Path to llmsoup configuration (default: `config.yaml`) |
| `--output <format>` | Output format: `text`, `json`, `markdown`, `md` |
| `--export <path>` | Write results to a file instead of stdout |
| `--use-models` | Download and enable embedding/domain ML models |
### The `--use-models` flag

By default, benchmarks run without downloading ML models. Only keyword, language, and latency signals are evaluated. Pass `--use-models` to enable embedding similarity and domain classification signals — this requires a one-time model download and produces more comprehensive routing evaluations.

`--download-models` is accepted as an alias for `--use-models`.
### The `--export` flag

When `--export` is provided, results are written to the specified file path and a confirmation message is printed to stdout:

```
Benchmark summary exported to: results.md
```

If `--output` is not specified alongside `--export`, the format automatically defaults to markdown (not text). To export in a different format, specify `--output` explicitly:

```sh
# Defaults to markdown
llmsoup benchmark --test-set test.yaml --export results.md

# Explicit text format
llmsoup benchmark --test-set test.yaml --output text --export results.txt

# Explicit JSON format
llmsoup benchmark --test-set test.yaml --output json --export results.json
```

## Output formats
### Text

Human-readable output printed to stdout (the default when `--export` is not used).
```
llmsoup Benchmark Results
=========================
Test Set: routing-accuracy-test (10 cases)

Routing Accuracy: 80.0% (8/10 correct)

Latency (routing overhead):
  Min: 0.1ms | Max: 1.4ms | Mean: 0.5ms
  p50: 0.4ms | p95: 1.1ms | p99: 1.3ms

Memory Footprint:
  Idle RSS: 28 MB
  Loaded RSS (50 rps): 45 MB

Cost Analysis:
  Currency: USD
  Baseline (single model): $0.20
  Routed cost: $0.11
  Savings: 45.0%
  Per-model costs:
    gpt-3.5-turbo: $0.02
    gpt-4o: $0.09
  Note: 2 of 10 cases lacked token usage (cost skipped)
```

### JSON

Machine-readable output for programmatic consumption.
{ "test_set_name": "routing-accuracy-test", "accuracy": { "total": 10, "correct": 8, "accuracy_percent": 80.0 }, "latency": { "min_ms": 0.1, "max_ms": 1.4, "mean_ms": 0.5, "p50_ms": 0.4, "p95_ms": 1.1, "p99_ms": 1.3 }, "memory": { "idle_rss_mb": 28, "loaded_rss_mb": 45 }, "cost": { "baseline_cost": 0.2, "actual_cost": 0.11, "savings_percent": 45.0, "currency": "USD", "per_model_costs": { "gpt-3.5-turbo": 0.02, "gpt-4o": 0.09 }, "cases_with_cost": 8, "cases_without_cost": 2 }}Markdown
### Markdown

Structured report with tables, suitable for sharing or archiving. This is the default format when using `--export`.
```markdown
# llmsoup Benchmark Summary

**Test Set:** routing-accuracy-test | **Date:** 2026-02-26

## Routing Accuracy

| Metric | Value |
|--------|-------|
| Total Cases | 10 |
| Correct | 8 |
| Accuracy | 80.0% |

## Routing Latency

| Percentile | Value |
|-----------|-------|
| Min | 0.1ms |
| Max | 1.4ms |
| Mean | 0.5ms |
| p50 | 0.4ms |
| p95 | 1.1ms |
| p99 | 1.3ms |

## Memory Footprint

| Metric | Value |
|--------|-------|
| Idle RSS | 28MB |
| Loaded RSS (50 rps) | 45MB |

## Cost Analysis

| Metric | Value |
|--------|-------|
| Currency | USD |
| Baseline (single model) | $0.20 |
| Routed cost | $0.11 |
| Savings | 45.0% |

### Per-Model Costs

| Model | Cost |
|-------|------|
| gpt-3.5-turbo | $0.02 |
| gpt-4o | $0.09 |

---
*Generated by llmsoup benchmark*
```

## Interpreting results
### Routing accuracy

Routing accuracy measures how often the router selects the model you expected. Each test case compares the routed model against `expected_model`. If `expected_decision` is also set, the matched rule name must match too.
- 100% accuracy means every prompt was routed to the expected model.
- Low accuracy usually indicates signal configuration issues — keywords may not match, embeddings may not be loaded (check `--use-models`), or rule priorities may need adjustment.
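To make the pass/fail rule concrete, here is a sketch of how one case could be scored under the semantics described above; the function names and inputs are illustrative, not llmsoup internals:

```python
def case_passes(routed_model: str, matched_rule: str, case: dict) -> bool:
    """Score one test case: the routed model must equal expected_model,
    and, when expected_decision is present, the matched rule name
    must equal it as well."""
    if routed_model != case["expected_model"]:
        return False
    if "expected_decision" in case and matched_rule != case["expected_decision"]:
        return False
    return True

# Accuracy is simply correct cases over total cases.
# results: list of (routed_model, matched_rule, case) tuples (hypothetical).
def accuracy_percent(results) -> float:
    correct = sum(case_passes(m, r, c) for m, r, c in results)
    return 100.0 * correct / len(results)
```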
### Latency metrics

Latency measures routing overhead only — the time spent evaluating signals and matching rules. It does not include model inference time (no API calls are made).
| Metric | Meaning |
|---|---|
| Min / Max | Range of routing times across all test cases |
| Mean | Average routing overhead |
| p50 | Median — half of requests were faster than this |
| p95 | 95th percentile — the target for production SLAs |
| p99 | 99th percentile — worst-case excluding outliers |
The llmsoup performance target is p95 < 10ms routing overhead.
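If you collect raw per-case latencies yourself, the same percentile metrics can be reproduced with a nearest-rank calculation. A generic Python sketch, not llmsoup's implementation:

```python
import statistics

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile over a list of latency samples (ms)."""
    ordered = sorted(samples)
    rank = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[rank]

latencies_ms = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1.0, 1.1, 1.4]
print("mean:", statistics.mean(latencies_ms))
print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
print("p99:", percentile(latencies_ms, 99))
```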
### Memory footprint

| Metric | Meaning |
|---|---|
| Idle RSS | Memory usage before any workload (target: ≤ 500 MB) |
| Loaded RSS (50 rps) | Memory under simulated concurrent load (target: ≤ 1 GB) |
Memory is measured using the process RSS (Resident Set Size). The loaded measurement runs the workload simulation defined by `--workload` (or defaults).
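For context, RSS is the portion of a process's memory currently held in RAM. You can inspect a process's RSS yourself with the psutil package; this is illustrative only, since llmsoup performs the measurement internally:

```python
import psutil  # pip install psutil

# Read the current process's Resident Set Size in megabytes.
rss_mb = psutil.Process().memory_info().rss / (1024 * 1024)
print(f"RSS: {rss_mb:.0f} MB")
```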
### Cost analysis

Cost analysis compares what you would pay using a single baseline model versus the routed model mix.

| Metric | Meaning |
|---|---|
| Baseline (single model) | Total cost if every request used the most expensive model |
| Routed cost | Total cost using the models the router actually selected |
| Savings | Percentage reduction from routing to cheaper models when appropriate |
| Per-model costs | Breakdown of cost by each model selected |
Cost calculation requires:

- Token counts in your test cases (`prompt_tokens` and `completion_tokens` fields).
- Pricing in your configuration file (under each model's `pricing` section).
Cases without token data are skipped and reported: “Note: X of Y cases lacked token usage (cost skipped)”.
The baseline model is determined by the `cost_baseline_model` configuration field. If not set, llmsoup uses the most expensive model as the baseline.
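The arithmetic behind these metrics is simple to reproduce. A minimal Python sketch with hypothetical per-1K-token prices (real values come from the `pricing` section of your config):

```python
# Hypothetical per-1K-token prices in USD; real values come from your config.
PRICING = {
    "gpt-4o":        {"prompt": 0.0025, "completion": 0.01},
    "gpt-3.5-turbo": {"prompt": 0.0005, "completion": 0.0015},
}

def case_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    p = PRICING[model]
    return (prompt_tokens / 1000) * p["prompt"] + (completion_tokens / 1000) * p["completion"]

# Baseline: every request on the most expensive model; routed: the actual mix.
cases = [("gpt-3.5-turbo", 256, 64), ("gpt-4o", 512, 128)]
baseline = sum(case_cost("gpt-4o", pt, ct) for _, pt, ct in cases)
routed = sum(case_cost(m, pt, ct) for m, pt, ct in cases)
print(f"savings: {100 * (baseline - routed) / baseline:.1f}%")
```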
## End-to-end example

### 1. Create a test set

Save as `my-test-set.yaml`:
```yaml
name: my-routing-test
description: Validate that coding questions go to the code model

test_cases:
  - id: coding-question
    input:
      model: ""
      messages:
        - role: user
          content: "Write a Python function to sort a list"
    expected_model: gpt-4o
    prompt_tokens: 256
    completion_tokens: 512

  - id: simple-question
    input:
      model: ""
      messages:
        - role: user
          content: "What is the capital of France?"
    expected_model: gpt-3.5-turbo
    prompt_tokens: 128
    completion_tokens: 64

  - id: translation-task
    input:
      model: ""
      messages:
        - role: user
          content: "Translate 'good morning' to Japanese"
    expected_model: gpt-3.5-turbo
    expected_decision: simple_tasks_route
    prompt_tokens: 128
    completion_tokens: 64
```

### 2. Run the benchmark
Section titled “2. Run the benchmark”# Quick run with keyword/language signals onlyllmsoup benchmark --test-set my-test-set.yaml
# Full run with embedding and domain signalsllmsoup benchmark --test-set my-test-set.yaml --use-models
# Export results to a markdown filellmsoup benchmark --test-set my-test-set.yaml --export benchmark-results.md3. Interpret the output
Section titled “3. Interpret the output”llmsoup Benchmark Results========================Test Set: my-routing-test (3 cases)
Routing Accuracy: 66.7% (2/3 correct)
Latency (routing overhead): Min: 0.1ms | Max: 0.8ms | Mean: 0.3ms p50: 0.2ms | p95: 0.7ms | p99: 0.8ms
Memory Footprint: Idle RSS: 24 MB Loaded RSS (50 rps): 38 MB
Cost Analysis: Currency: USD Baseline (single model): $0.10 Routed cost: $0.06 Savings: 40.0% Per-model costs: gpt-3.5-turbo: $0.01 gpt-4o: $0.05In this example, 2 of 3 cases routed correctly (66.7% accuracy). To improve:
- Check which case failed — run with `LLMSOUP_LOG=debug` to see per-case routing decisions.
- Review signal configuration — ensure keywords, embeddings, or domain rules cover the failing prompt.
- Enable ML models — if running without `--use-models`, embedding and domain signals are inactive. Adding `--use-models` may improve accuracy.
- Adjust rule priorities — if the wrong rule matches first, lower its priority or tighten its conditions.