# Benchmarking
llmsoup includes a built-in benchmarking tool that evaluates routing accuracy — whether prompts are sent to the correct model — and reports cost savings, latency overhead, and memory footprint. It is not a load testing tool; it replays a labeled test set through the routing engine without making actual API calls to model providers.
## How it works

The benchmark command loads your configuration and a test set of labeled prompts, evaluates each prompt through the signal and routing pipeline, and compares the routed model against the expected model. It then measures routing latency (signal evaluation + rule matching overhead), simulates a concurrent workload to capture memory usage, and calculates cost savings based on model pricing in your config.
## Test set format

A test set is a YAML or JSON file containing labeled test cases. YAML examples are shown below; JSON files with a .json extension are also accepted with the same schema. Each case provides a prompt and the model you expect the router to select.
### Schema

| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | yes | Name for this test set |
| `description` | string | no | Optional description |
| `test_cases` | array | yes | List of labeled test cases |
Each entry in `test_cases`:

| Field | Type | Required | Description |
|---|---|---|---|
| `id` | string | yes | Unique identifier for the test case |
| `input` | object | yes | The simulated request (see below) |
| `expected_model` | string | yes | Model the router should select |
| `expected_decision` | string | no | Rule name the router should match |
| `prompt_tokens` | integer | no | Estimated prompt tokens (enables cost analysis) |
| `completion_tokens` | integer | no | Estimated completion tokens (enables cost analysis) |
The `input` object:

| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | yes | Model field value; set to `""` to let the router decide |
| `messages` | array | yes | Array of message objects with `role` and `content` |
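If you generate test sets programmatically, it can help to validate them before a run. Here is a minimal Python sketch, assuming the schema above; the loader itself is hypothetical and not part of llmsoup:

```python
import yaml  # pip install pyyaml

REQUIRED_CASE_FIELDS = {"id", "input", "expected_model"}

def load_test_set(path: str) -> dict:
    """Load a test set file and check the required fields from the schema above."""
    with open(path) as f:
        data = yaml.safe_load(f)
    for key in ("name", "test_cases"):
        if key not in data:
            raise ValueError(f"test set missing required field: {key}")
    for case in data["test_cases"]:
        missing = REQUIRED_CASE_FIELDS - case.keys()
        if missing:
            raise ValueError(f"case {case.get('id', '?')} missing: {missing}")
        if not {"model", "messages"} <= case["input"].keys():
            raise ValueError(f"case {case['id']}: input needs model and messages")
    return data
```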
### Example test set

```yaml
name: routing-accuracy-test
description: Verify routing decisions across query categories

test_cases:
  - id: case-001
    input:
      model: ""
      messages:
        - role: user
          content: "What is machine learning?"
    expected_model: gpt-4o
    prompt_tokens: 512
    completion_tokens: 128

  - id: case-002
    input:
      model: ""
      messages:
        - role: user
          content: "Translate 'hello world' to French"
    expected_model: gpt-3.5-turbo
    expected_decision: simple_tasks_route
    prompt_tokens: 256
    completion_tokens: 64

  - id: case-003
    input:
      model: ""
      messages:
        - role: user
          content: "Explain quantum entanglement with proofs"
    expected_model: gpt-4o
    prompt_tokens: 768
    completion_tokens: 4096
```

Notes:

- Set `model` to `""` so the router evaluates signals and selects a model. If you provide a model name, it bypasses routing.
- `expected_decision` is optional. When present, the benchmark also validates that the matched rule name matches — useful for verifying specific routing rules fire correctly.
- `prompt_tokens` and `completion_tokens` are optional. If omitted, cost analysis still runs but those cases are excluded from cost calculations with a note in the output.
## Workload configuration

A workload file (YAML or JSON) controls the concurrent simulation used for memory measurement. This is separate from the test set — the test set defines what to route, while the workload defines how hard to push during the memory footprint test.
### Schema

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `name` | string | yes | — | Workload identifier |
| `concurrency` | integer | no | 1 | Number of concurrent requests |
| `duration_secs` | integer | no | — | Simulation duration in seconds |
| `requests_per_second` | integer | no | — | Target requests per second |
### Example workload

```yaml
name: stress-test
concurrency: 50
duration_secs: 60
requests_per_second: 100
```

When `--workload` is omitted entirely, the benchmark uses built-in defaults for the memory simulation: 50 concurrent requests, 1 second duration, 50 requests per second. These differ from the schema defaults above, which apply when a workload file is provided but fields are omitted.
## CLI flags

For the full flag reference, see the CLI Reference — benchmark.
| Flag | Description |
|---|---|
| `--test-set <path>` | Path to labeled test set file (required) |
| `--workload <path>` | Path to workload configuration |
| `--config <path>` | Path to llmsoup configuration (default: `config.yaml`) |
| `--output <format>` | Output format: `text`, `json`, `markdown`, `md` |
| `--export <path>` | Write results to a file instead of stdout |
| `--use-models` | Download and enable embedding/domain ML models |
### The `--use-models` flag

By default, benchmarks run without downloading ML models. Only keyword, language, and latency signals are evaluated. Pass `--use-models` to enable embedding similarity and domain classification signals — this requires a one-time model download and produces more comprehensive routing evaluations.

`--download-models` is accepted as an alias for `--use-models`.
### The `--export` flag

When `--export` is provided, results are written to the specified file path and a confirmation message is printed to stdout:

```
Benchmark summary exported to: results.md
```

If `--output` is not specified alongside `--export`, the format automatically defaults to markdown (not text). To export in a different format, specify `--output` explicitly:

```sh
# Defaults to markdown
llmsoup benchmark --test-set test.yaml --export results.md

# Explicit text format
llmsoup benchmark --test-set test.yaml --output text --export results.txt

# Explicit JSON format
llmsoup benchmark --test-set test.yaml --output json --export results.json
```

## Output formats
### Text

Human-readable output printed to stdout (the default when `--export` is not used).
```
llmsoup Benchmark Results
=========================
Test Set: routing-accuracy-test (10 cases)

Routing Accuracy: 80.0% (8/10 correct)

Latency (routing overhead):
  Min: 0.1ms | Max: 1.4ms | Mean: 0.5ms
  p50: 0.4ms | p95: 1.1ms | p99: 1.3ms

Memory Footprint:
  Idle RSS: 28 MB
  Loaded RSS (50 rps): 45 MB

Cost Analysis:
  Currency: USD
  Baseline (single model): $0.20
  Routed cost: $0.11
  Savings: 45.0%
  Per-model costs:
    gpt-3.5-turbo: $0.02
    gpt-4o: $0.09
  Note: 2 of 10 cases lacked token usage (cost skipped)
```

### JSON

Machine-readable output for programmatic consumption.
{ "test_set_name": "routing-accuracy-test", "accuracy": { "total": 10, "correct": 8, "accuracy_percent": 80.0 }, "latency": { "min_ms": 0.1, "max_ms": 1.4, "mean_ms": 0.5, "p50_ms": 0.4, "p95_ms": 1.1, "p99_ms": 1.3 }, "memory": { "idle_rss_mb": 28, "loaded_rss_mb": 45 }, "cost": { "baseline_cost": 0.2, "actual_cost": 0.11, "savings_percent": 45.0, "currency": "USD", "per_model_costs": { "gpt-3.5-turbo": 0.02, "gpt-4o": 0.09 }, "cases_with_cost": 8, "cases_without_cost": 2 }}Markdown
### Markdown

Structured report with tables, suitable for sharing or archiving. This is the default format when using `--export`.
```markdown
# llmsoup Benchmark Summary

**Test Set:** routing-accuracy-test | **Date:** 2026-02-26

## Routing Accuracy

| Metric | Value |
|--------|-------|
| Total Cases | 10 |
| Correct | 8 |
| Accuracy | 80.0% |

## Routing Latency

| Percentile | Value |
|-----------|-------|
| Min | 0.1ms |
| Max | 1.4ms |
| Mean | 0.5ms |
| p50 | 0.4ms |
| p95 | 1.1ms |
| p99 | 1.3ms |

## Memory Footprint

| Metric | Value |
|--------|-------|
| Idle RSS | 28MB |
| Loaded RSS (50 rps) | 45MB |

## Cost Analysis

| Metric | Value |
|--------|-------|
| Currency | USD |
| Baseline (single model) | $0.20 |
| Routed cost | $0.11 |
| Savings | 45.0% |

### Per-Model Costs

| Model | Cost |
|-------|------|
| gpt-3.5-turbo | $0.02 |
| gpt-4o | $0.09 |

---
*Generated by llmsoup benchmark*
```

## Interpreting results
### Routing accuracy

Routing accuracy measures how often the router selects the model you expected. Each test case compares the routed model against `expected_model`. If `expected_decision` is also set, the matched rule name must match too.
- 100% accuracy means every prompt was routed to the expected model.
- Low accuracy usually indicates signal configuration issues — keywords may not match, embeddings may not be loaded (check `--use-models`), or rule priorities may need adjustment.
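To make the pass/fail rule concrete, here is a sketch of how one case could be scored under the semantics described above; the function names and inputs are illustrative, not llmsoup internals:

```python
def case_passes(routed_model: str, matched_rule: str, case: dict) -> bool:
    """Score one test case: the routed model must equal expected_model,
    and, when expected_decision is present, the matched rule name
    must equal it as well."""
    if routed_model != case["expected_model"]:
        return False
    if "expected_decision" in case and matched_rule != case["expected_decision"]:
        return False
    return True

# Accuracy is simply correct cases over total cases.
# results: list of (routed_model, matched_rule, case) tuples (hypothetical).
def accuracy_percent(results) -> float:
    correct = sum(case_passes(m, r, c) for m, r, c in results)
    return 100.0 * correct / len(results)
```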
### Latency metrics

Latency measures routing overhead only — the time spent evaluating signals and matching rules. It does not include model inference time (no API calls are made).
| Metric | Meaning |
|---|---|
| Min / Max | Range of routing times across all test cases |
| Mean | Average routing overhead |
| p50 | Median — half of requests were faster than this |
| p95 | 95th percentile — the target for production SLAs |
| p99 | 99th percentile — worst-case excluding outliers |
The llmsoup performance target is p95 < 10ms routing overhead.
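If you collect raw per-case latencies yourself, the same percentile metrics can be reproduced with a nearest-rank calculation. A generic Python sketch, not llmsoup's implementation:

```python
import statistics

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile over a list of latency samples (ms)."""
    ordered = sorted(samples)
    rank = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[rank]

latencies_ms = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1.0, 1.1, 1.4]
print("mean:", statistics.mean(latencies_ms))
print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
print("p99:", percentile(latencies_ms, 99))
```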
### Memory footprint

| Metric | Meaning |
|---|---|
| Idle RSS | Memory usage before any workload (target: ≤ 500 MB) |
| Loaded RSS (50 rps) | Memory under simulated concurrent load (target: ≤ 1 GB) |
Memory is measured using the process RSS (Resident Set Size). The loaded measurement runs the workload simulation defined by `--workload` (or defaults).
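For context, RSS is the portion of a process's memory currently held in RAM. You can inspect a process's RSS yourself with the psutil package; this is illustrative only, since llmsoup performs the measurement internally:

```python
import psutil  # pip install psutil

# Read the current process's Resident Set Size in megabytes.
rss_mb = psutil.Process().memory_info().rss / (1024 * 1024)
print(f"RSS: {rss_mb:.0f} MB")
```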
### Cost analysis

Cost analysis compares what you would pay using a single baseline model versus the routed model mix.

| Metric | Meaning |
|---|---|
| Baseline (single model) | Total cost if every request used the most expensive model |
| Routed cost | Total cost using the models the router actually selected |
| Savings | Percentage reduction from routing to cheaper models when appropriate |
| Per-model costs | Breakdown of cost by each model selected |
Cost calculation requires:

- Token counts in your test cases (`prompt_tokens` and `completion_tokens` fields).
- Pricing in your configuration file (under each model's `pricing` section).
Cases without token data are skipped and reported: “Note: X of Y cases lacked token usage (cost skipped)”.
The baseline model is determined by the `cost_baseline_model` configuration field. If not set, llmsoup uses the most expensive model as the baseline.
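The arithmetic behind these metrics is simple to reproduce. A minimal Python sketch with hypothetical per-1K-token prices (real values come from the `pricing` section of your config):

```python
# Hypothetical per-1K-token prices in USD; real values come from your config.
PRICING = {
    "gpt-4o":        {"prompt": 0.0025, "completion": 0.01},
    "gpt-3.5-turbo": {"prompt": 0.0005, "completion": 0.0015},
}

def case_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    p = PRICING[model]
    return (prompt_tokens / 1000) * p["prompt"] + (completion_tokens / 1000) * p["completion"]

# Baseline: every request on the most expensive model; routed: the actual mix.
cases = [("gpt-3.5-turbo", 256, 64), ("gpt-4o", 512, 128)]
baseline = sum(case_cost("gpt-4o", pt, ct) for _, pt, ct in cases)
routed = sum(case_cost(m, pt, ct) for m, pt, ct in cases)
print(f"savings: {100 * (baseline - routed) / baseline:.1f}%")
```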
## End-to-end example

### 1. Create a test set

Save as `my-test-set.yaml`:
```yaml
name: my-routing-test
description: Validate that coding questions go to the code model

test_cases:
  - id: coding-question
    input:
      model: ""
      messages:
        - role: user
          content: "Write a Python function to sort a list"
    expected_model: gpt-4o
    prompt_tokens: 256
    completion_tokens: 512

  - id: simple-question
    input:
      model: ""
      messages:
        - role: user
          content: "What is the capital of France?"
    expected_model: gpt-3.5-turbo
    prompt_tokens: 128
    completion_tokens: 64

  - id: translation-task
    input:
      model: ""
      messages:
        - role: user
          content: "Translate 'good morning' to Japanese"
    expected_model: gpt-3.5-turbo
    expected_decision: simple_tasks_route
    prompt_tokens: 128
    completion_tokens: 64
```

### 2. Run the benchmark
Section titled “2. Run the benchmark”# Quick run with keyword/language signals onlyllmsoup benchmark --test-set my-test-set.yaml
# Full run with embedding and domain signalsllmsoup benchmark --test-set my-test-set.yaml --use-models
# Export results to a markdown filellmsoup benchmark --test-set my-test-set.yaml --export benchmark-results.md3. Interpret the output
Section titled “3. Interpret the output”llmsoup Benchmark Results========================Test Set: my-routing-test (3 cases)
Routing Accuracy: 66.7% (2/3 correct)
Latency (routing overhead): Min: 0.1ms | Max: 0.8ms | Mean: 0.3ms p50: 0.2ms | p95: 0.7ms | p99: 0.8ms
Memory Footprint: Idle RSS: 24 MB Loaded RSS (50 rps): 38 MB
Cost Analysis: Currency: USD Baseline (single model): $0.10 Routed cost: $0.06 Savings: 40.0% Per-model costs: gpt-3.5-turbo: $0.01 gpt-4o: $0.05In this example, 2 of 3 cases routed correctly (66.7% accuracy). To improve:
- Check which case failed — run with `LLMSOUP_LOG=debug` to see per-case routing decisions.
- Review signal configuration — ensure keywords, embeddings, or domain rules cover the failing prompt.
- Enable ML models — if running without `--use-models`, embedding and domain signals are inactive. Adding `--use-models` may improve accuracy.
- Adjust rule priorities — if the wrong rule matches first, lower its priority or tighten its conditions.