# Configuration Reference
llmsoup uses a single YAML configuration file to define models, signals, routing rules, plugins, and authentication. This reference covers every configuration option available.
## Overview

llmsoup loads configuration from a YAML file at startup. The file location is resolved in this order:
1. `--config` CLI flag (e.g., `llmsoup serve --config my-config.yaml`)
2. `LLMSOUP_CONFIG` environment variable
3. `config.yaml` in the current directory (default)
Generate a starter config with all options commented:
```sh
llmsoup prepare
```

## Top-level structure

A config file has the following top-level keys:
```yaml
version: v0.1   # Optional version identifier
defaults:       # Global routing defaults
models:         # Model definitions (min 2)
signals:        # Signal evaluators
rules:          # Routing rules (priority-ordered)
classifier:     # Domain classifier model (optional)
auth:           # Authentication (optional)
```

The `version` field is an optional string carried forward from the config file — it has no effect on behavior but can be used to track config revisions.
## Semantic-router compatibility

llmsoup is backward-compatible with the semantic-router configuration format. The parser auto-normalizes legacy field names:
| Legacy field | Normalized to |
|---|---|
| `providers.default_model` | `defaults.default_model` |
| `providers.models` | `models` |
| `decisions` | `rules` |
| `modelRefs` | `action.model_refs` + `action.primary_model` / `fallback_models` |
| Condition `type` field | `{type}.{name}` signal reference format |
Both formats are accepted. This reference documents the normalized format; the Getting Started guide uses the legacy `decisions` / `modelRefs` format for compatibility.
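As a rough sketch of the mapping table above, a legacy snippet and its normalized equivalent might look like this. The values are illustrative, the exact shape of `modelRefs` entries is assumed here, and legacy condition syntax (which uses a per-condition `type` field) is omitted:

```yaml
# Legacy semantic-router form (illustrative)
providers:
  default_model: gpt-5-mini
decisions:
  - name: code-routing
    priority: 100
    modelRefs:
      - model: gpt-5.2   # entry shape is an assumption

# Normalized llmsoup form after parsing
defaults:
  default_model: gpt-5-mini
rules:
  - name: code-routing
    priority: 100
    action:
      primary_model: gpt-5.2
```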
## Defaults

The `defaults` block sets global behavior for the routing engine.
```yaml
defaults:
  default_model: gpt-5-mini
  preference_model: gpt-5-mini
  request_timeout_ms: 60000
  model_cache_ttl_seconds: 300
  embedding_cache_capacity: 1000
  prefer_max_completion_tokens: false
  cost_aware_routing: false
  cost_quality_tradeoff: 0.3
  include_cost_headers: true
  context_overflow: truncate_middle
  default_fallback_models: []
  cost_baseline_model: gpt-5.2
  model_cache_max_capacity: 1000
  semantic_cache_max_entries: 10000
```

| Field | Type | Default | Description |
|---|---|---|---|
| `default_model` | string | — | Model used when no routing rule matches. Must reference a configured model. |
| `preference_model` | string | — | Model used for preference signal classification. |
| `request_timeout_ms` | integer | — | Global request timeout in milliseconds. |
| `model_cache_ttl_seconds` | integer | — | TTL for model response cache entries. |
| `embedding_cache_capacity` | integer | — | Maximum entries in the embedding LRU cache. |
| `prefer_max_completion_tokens` | boolean | — | Send `max_completion_tokens` instead of `max_tokens` to upstream models. |
| `cost_aware_routing` | boolean | `false` | Enable the cost-aware routing algorithm. |
| `cost_quality_tradeoff` | float | `0.3` | Balance between quality (`0.0`) and cost savings (`1.0`). |
| `include_cost_headers` | boolean | `true` | Include cost data in response headers (`X-LLMSoup-Cost-*`). |
| `context_overflow` | string | `truncate_middle` | Strategy when the prompt exceeds the context window: `rolling_window`, `truncate_middle`, or `stop_at_limit`. |
| `default_fallback_models` | array | `[]` | Fallback model chain used when the default model fails. |
| `cost_baseline_model` | string | — | Model used as the baseline for cost-savings calculations. Defaults to the most expensive model if omitted. |
| `model_cache_max_capacity` | integer | `1000` | Maximum entries in the model response cache. Oldest entries are evicted when capacity is reached. |
| `semantic_cache_max_entries` | integer | `10000` | Maximum entries in the semantic cache store. Oldest entries are evicted when capacity is reached. |
## Models

The `models` array defines all available LLM endpoints. A minimum of two models is required.
```yaml
models:
  - name: gpt-5-mini
    provider: openai
    access_key:
      env: OPENAI_API_KEY
    endpoints:
      - url: https://api.openai.com/v1/chat/completions
        weight: 1
    metadata:
      context_window: 8000
      parameter_count: 120000
      latency_seconds: 0.5
    pricing:
      prompt_per_1m: 0.25
      completion_per_1m: 2.00
      currency: USD

  - name: gpt-5.2
    provider: openai
    access_key:
      env: OPENAI_API_KEY
    endpoints:
      - url: https://api.openai.com/v1/chat/completions
    metadata:
      context_window: 128000
      parameter_count: 520000
      latency_seconds: 1.2
    pricing:
      prompt_per_1m: 1.75
      completion_per_1m: 14.00
    reasoning_family: reasoning_effort
```

### Model fields

| Field | Required | Type | Description |
|---|---|---|---|
| `name` | yes | string | Unique model identifier used in routing rules. |
| `provider` | no | string | Provider name (e.g., `openai`, `anthropic`). |
| `access_key` | no | string or object | API key — see Secret Resolution. |
| `endpoints` | yes | array | At least one endpoint. |
| `metadata` | no | object | Model metadata used for routing decisions. |
| `pricing` | no | object | Token pricing for cost tracking. |
| `reasoning_family` | no | string or object | Reasoning control configuration. |
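Since only `name` and `endpoints` are required, a minimal model definition can be as small as the sketch below. The name and URL are placeholders, and remember that the `models` array needs at least two entries overall:

```yaml
models:
  - name: local-model   # placeholder name
    endpoints:
      - url: http://localhost:8000/v1/chat/completions   # placeholder URL
```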
### Endpoints

Each endpoint requires a `url`. Optional fields control load balancing and timeouts.
| Field | Required | Type | Description |
|---|---|---|---|
| `url` | yes | string | Full URL to the chat completions endpoint. |
| `weight` | no | integer (>0) | Load-balancing weight across multiple endpoints. |
| `timeout_ms` | no | integer | Per-endpoint timeout override. |
| `description` | no | string | Human-readable description. |
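For example, traffic can be split 2:1 across two replicas using weights; the URLs below are placeholders:

```yaml
endpoints:
  - url: https://replica-a.example.com/v1/chat/completions
    weight: 2            # receives roughly two thirds of requests
    timeout_ms: 30000    # per-endpoint override of the global timeout
  - url: https://replica-b.example.com/v1/chat/completions
    weight: 1
    description: "Secondary replica"
```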
### Metadata

| Field | Type | Description |
|---|---|---|
| `context_window` | integer (>0) | Maximum token context window. |
| `parameter_count` | integer (>0) | Model parameter count (used for size-based escalation). |
| `latency_seconds` | float (>=0) | Expected response latency. |
### Pricing

| Field | Type | Description |
|---|---|---|
| `prompt_per_1m` | float | Cost per 1M prompt tokens. |
| `completion_per_1m` | float | Cost per 1M completion tokens. |
| `cached_prompt_per_1m` | float | Cost per 1M cached prompt tokens. |
| `cached_completion_per_1m` | float | Cost per 1M cached completion tokens. |
| `currency` | string | Currency code (default: `USD`). |
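A pricing block that also accounts for cache hits might look like this; the discounted cached rate is illustrative:

```yaml
pricing:
  prompt_per_1m: 1.75
  completion_per_1m: 14.00
  cached_prompt_per_1m: 0.875   # illustrative discounted rate for cached prompt tokens
  currency: USD
```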
### Reasoning family

Controls how reasoning parameters are passed to upstream models.
```yaml
# String shorthand
reasoning_family: reasoning_effort

# Object form
reasoning_family:
  type: chat_template_kwargs
  parameter: thinking_mode
```

## Signals

Signals evaluate incoming requests and produce scores used by routing rules. Configure them under the `signals` block, grouped by type.
### Keyword

Match requests based on keyword presence in the prompt.
```yaml
signals:
  keyword:
    - name: code_keywords
      operator: OR
      keywords: ["code", "function", "implement", "debug"]
      case_sensitive: false
```

| Field | Required | Default | Description |
|---|---|---|---|
| `name` | yes | — | Unique signal identifier. |
| `keywords` | yes | — | Array of trigger phrases (minimum 1). |
| `operator` | no | `OR` | Match logic: `AND` (all must match), `OR` (any), `NOR` (none). |
| `case_sensitive` | no | `false` | Enable case-sensitive matching. |
### Embedding

Semantic similarity matching using BERT embeddings.
```yaml
signals:
  embedding:
    - name: quick_answer
      model: sentence-transformers/all-MiniLM-L12-v2
      threshold: 0.90
      candidates:
        - "quick question"
        - "simple answer"
        - "brief response"
      aggregation_method: max
```

| Field | Required | Default | Description |
|---|---|---|---|
| `name` | yes | — | Unique signal identifier. |
| `model` | yes | — | Embedding model ID (e.g., `sentence-transformers/all-MiniLM-L12-v2`). |
| `threshold` | yes | — | Similarity threshold (0.0–1.0). |
| `reference_text` | no | — | Single reference text for comparison. |
| `candidates` | no | — | Array of candidate texts for comparison. |
| `aggregation_method` | no | — | How to combine candidate scores: `max`, `avg`, or `any`. Requires `candidates`. |
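For single-reference matching, `reference_text` can stand in for `candidates`, in which case `aggregation_method` is not needed. The signal name and text below are illustrative:

```yaml
signals:
  embedding:
    - name: refund_requests   # illustrative signal
      model: sentence-transformers/all-MiniLM-L12-v2
      threshold: 0.85
      reference_text: "I want a refund for my order"
```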
### Domain

Classify requests into subject domains using MMLU categories or embedding-based matching.
```yaml
signals:
  domain:
    - name: math
      mmlu_categories: [math]
      threshold: 0.6

    - name: programming
      examples:
        - "Write a Python function"
        - "Debug this JavaScript code"
      description: "Programming and software development"
      threshold: 0.7
```

| Field | Required | Default | Description |
|---|---|---|---|
| `name` | yes | — | Unique signal identifier. |
| `examples` | no | — | Example phrases for embedding-based matching. |
| `description` | no | — | Human-readable domain description. |
| `mmlu_categories` | no | — | MMLU category aliases (e.g., `math`, `physics`, `computer_science`). |
| `threshold` | no | — | Confidence threshold (0.0–1.0). |
### Language

Detect the language of incoming requests.
```yaml
signals:
  language:
    - name: english
      language: en

    - name: chinese
      language: zh
```

| Field | Required | Description |
|---|---|---|
| `name` | yes | Unique signal identifier. |
| `language` | yes | ISO 639-1 code (`en`, `es`, `fr`, `de`, `zh`, `ja`, `ru`, etc.). |
### Latency

Route based on time-per-output-token (TPOT) requirements.
```yaml
signals:
  latency:
    - name: low_latency
      max_tpot: 0.050

    - name: medium_latency
      max_tpot: 0.150
```

| Field | Required | Description |
|---|---|---|
| `name` | yes | Unique signal identifier. |
| `max_tpot` | yes | Maximum time per output token in seconds (must be >0). |
| `description` | no | Human-readable description. |
### Fact check

Flag requests that may require factual verification.
```yaml
signals:
  fact_check:
    - name: needs_fact_check
      description: "Requests requiring factual accuracy"
```

| Field | Required | Description |
|---|---|---|
| `name` | yes | Unique signal identifier. |
| `description` | no | Human-readable description. |
### User feedback

Classify user feedback signals for adaptive routing.
```yaml
signals:
  user_feedback:
    - name: clarification_needed
      description: "User needs clarification"

    - name: satisfied
      description: "User is satisfied with the response"
```

| Field | Required | Description |
|---|---|---|
| `name` | yes | Unique signal identifier. |
| `description` | no | Human-readable description. |
### Preference

Route based on user preference classification (uses an external LLM for classification).
```yaml
signals:
  preference:
    - name: code_generation
      description: "User wants code generation"

    - name: bug_fixing
      description: "User wants help fixing bugs"
```

| Field | Required | Description |
|---|---|---|
| `name` | yes | Unique signal identifier. |
| `description` | no | Human-readable description. |
## Classifier

The optional `classifier` block configures the MMLU-based domain classification model used by domain signals.
```yaml
classifier:
  category_model:
    model_id: "models/mom-domain-classifier"
    threshold: 0.6
    category_mapping_path: "models/mom-domain-classifier/category_mapping.json"
```

| Field | Default | Description |
|---|---|---|
| `category_model.model_id` | — | HuggingFace repo ID or local path to the classification model. |
| `category_model.threshold` | `0.6` | Confidence threshold (0.0–1.0) for classification. |
| `category_model.category_mapping_path` | — | Path to the category mapping JSON file. |
## Routing Rules

Routing rules (called “decisions” in the semantic-router format) evaluate signal results and route requests to models. Rules are evaluated in priority order (highest priority wins).
```yaml
rules:
  - name: code-routing
    priority: 100
    operator: AND
    conditions:
      - signal: keyword.code_keywords
    action:
      strategy: default
      primary_model: gpt-5.2
      fallback_models: [gpt-5-mini]
```

### Rule fields

| Field | Required | Default | Description |
|---|---|---|---|
| `name` | yes | — | Unique rule identifier. |
| `priority` | yes | — | Integer >0. Higher-priority rules are evaluated first. |
| `conditions` | yes | — | Array of conditions (minimum 1). |
| `operator` | no | `AND` | How conditions combine: `AND` (all must match), `OR` (any), `NOR` (none). |
| `action` | yes | — | What happens when the rule matches. |
| `plugins` | no | — | Plugins to apply for this rule (see Plugins). |
### Conditions

Each condition references a signal and optionally applies a comparison.
```yaml
conditions:
  - signal: keyword.code_keywords
  - signal: domain.math
    operator: greater-than
    value: 0.8
  - signal: language.english
    negate: true
```

| Field | Required | Default | Description |
|---|---|---|---|
| `signal` | yes | — | Signal reference in `{type}.{name}` format (e.g., `keyword.code_keywords`). |
| `operator` | no | `equals` | Comparison: `equals`, `contains`, `greater-than`, `less-than`, `in`. |
| `value` | no | — | Comparison value. Numeric for `greater-than`/`less-than`, array for `in`. |
| `negate` | no | `false` | Invert the condition result. |
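As a sketch of the comparison operators, the fragment below combines a numeric `greater-than` check with an `in` membership test. The array-valued `value` for `in` comes from the table above; the assumption that a signal can yield a categorical value suitable for `in` is mine, not documented here:

```yaml
conditions:
  # Numeric comparison against the signal's score
  - signal: domain.programming
    operator: greater-than
    value: 0.7
  # Membership test: value is an array for the `in` operator.
  # Assumes the referenced signal produces a categorical result.
  - signal: preference.code_generation
    operator: in
    value: ["code_generation", "bug_fixing"]
```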
### Action

The `action` block defines the execution strategy when a rule matches.
```yaml
action:
  strategy: fallback
  primary_model: gpt-5.2
  fallback_models: [gpt-5-mini]
```

| Field | Required | Description |
|---|---|---|
| `strategy` | yes | Execution strategy: `default`, `parallel`, or `fallback`. |
| `primary_model` | yes | Model to use (must reference a configured model). |
| `fallback_models` | no | Ordered fallback chain if the primary fails. |
| `model_refs` | no | Model references with reasoning control overrides. |
| `algorithm` | no | Selection algorithm (`confidence` or `ratings`). |
Strategies:

- `default` — Send to the primary model. On failure, try fallback models in order.
- `parallel` — Send to multiple models simultaneously. Return the first successful response (see the sketch below).
- `fallback` — Try the primary model first, then each fallback in order until one succeeds.
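A `parallel` action is sketched below. The assumption that the fan-out set is the primary model plus its fallbacks is inferred from the fields above, not stated explicitly in this reference:

```yaml
action:
  strategy: parallel
  primary_model: gpt-5.2
  # Assumption: all listed models are queried simultaneously and
  # the first successful response is returned.
  fallback_models: [gpt-5-mini]
```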
### Algorithms

Algorithms control how models are selected within a rule.
Confidence algorithm — Escalate to larger models when confidence is low:
```yaml
algorithm:
  type: confidence
  confidence:
    threshold: 0.75
    confidence_method: margin
    escalation_order: size
    on_error: skip
```

| Field | Required | Default | Description |
|---|---|---|---|
| `threshold` | yes | — | Confidence threshold (0.0–1.0). Below this, escalate to the next model. |
| `confidence_method` | no | — | Method: `margin`, `avg_logprob`, or `hybrid`. |
| `escalation_order` | no | — | Order: `size` (by parameter count), `cost`, or `automix`. |
| `on_error` | no | — | Error handling: `skip` the model on failure. |
Ratings algorithm — Select among multiple models by policy:
```yaml
algorithm:
  type: ratings
  ratings:
    policy: highest
    on_error: skip
```

| Field | Required | Default | Description |
|---|---|---|---|
| `policy` | no | — | Selection policy: `highest`, `lowest`, or `prefer_first`. |
| `on_error` | no | — | Error handling: `skip` the model on failure. |
## Plugins

Plugins add pre-processing and post-processing capabilities to routing rules. Configure them in the `plugins` array on each rule. Each plugin requires a `type` field and a `configuration` block with plugin-specific options.
```yaml
rules:
  - name: secure-routing
    priority: 100
    conditions:
      - signal: keyword.code_keywords
    action:
      strategy: default
      primary_model: gpt-5.2
    plugins:
      - type: jailbreak
        configuration:
          enabled: true
          threshold: 0.8
          action: block
      - type: pii
        configuration:
          enabled: true
          threshold: 0.7
      - type: system_prompt
        configuration:
          system_prompt: "You are a helpful coding assistant."
```

### Built-in plugins

#### system_prompt

Inject a system prompt into the request.
```yaml
- type: system_prompt
  configuration:
    system_prompt: "You are a helpful assistant specialized in mathematics."
    mode: replace
```

| Field | Default | Description |
|---|---|---|
| `system_prompt` | — | System prompt text to inject. |
| `mode` | `replace` | How to apply: `replace` (overwrite existing), `prepend`, or `append`. |
#### semantic-cache

Cache responses based on semantic similarity of prompts.
```yaml
- type: semantic-cache
  configuration:
    enabled: true
    similarity_threshold: 0.92
    ttl_seconds: 3600
```

| Field | Default | Description |
|---|---|---|
| `enabled` | `true` | Enable or disable the plugin. |
| `similarity_threshold` | — | Minimum similarity (0.0–1.0) to return a cached response. |
| `ttl_seconds` | — | Cache entry time-to-live. |
#### jailbreak

Detect and handle prompt injection attempts.
```yaml
- type: jailbreak
  configuration:
    enabled: true
    threshold: 0.8
    action: block
```

| Field | Default | Description |
|---|---|---|
| `enabled` | `true` | Enable or disable the plugin. |
| `threshold` | — | Detection threshold (0.0–1.0). |
| `action` | — | Response action when detected. |
#### pii

Detect personally identifiable information in prompts.
```yaml
- type: pii
  configuration:
    enabled: true
    threshold: 0.7
    pii_types_allowed: []
```

| Field | Default | Description |
|---|---|---|
| `enabled` | `true` | Enable or disable the plugin. |
| `threshold` | — | Detection threshold (0.0–1.0). |
| `pii_types_allowed` | `[]` | Array of PII types to allow through (empty = block all). |
#### header_mutation

Modify HTTP headers on proxied requests and responses using a `mutations` array.
```yaml
- type: header_mutation
  configuration:
    enabled: true
    mutations:
      - header: X-Routed-By
        operation: set
        value: llmsoup
        phase: response
      - header: X-Request-Source
        operation: add
        value: llmsoup-proxy
        phase: request
```

| Field | Description |
|---|---|
| `enabled` | Enable or disable the plugin. |
| `mutations` | Array of mutation rules (see below). |
Each mutation entry:
| Field | Description |
|---|---|
| `header` | Header name. |
| `operation` | `set` (overwrite), `add` (append), or `remove`. |
| `value` | Header value (not required for `remove`). |
| `phase` | When to apply: `request` or `response`. |
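A `remove` mutation drops a header and needs no `value`; the header name below is illustrative:

```yaml
- type: header_mutation
  configuration:
    enabled: true
    mutations:
      - header: X-Internal-Debug   # illustrative header to strip
        operation: remove
        phase: request
```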
#### hallucination

Detect potential hallucinations in model responses.
```yaml
- type: hallucination
  configuration:
    enabled: true
    threshold: 0.7
    action: header
    heuristic_sensitivity: 0.5
```

| Field | Default | Description |
|---|---|---|
| `enabled` | `true` | Enable or disable the plugin. |
| `threshold` | — | Detection threshold (0.0–1.0). |
| `action` | — | Response action: `header`, `body`, `block`, or `log`. |
| `heuristic_sensitivity` | `0.5` | Heuristic sensitivity (0.0–1.0). Higher values are more sensitive. |
#### router_replay

Capture routing decisions for debugging and analysis.
```yaml
- type: router_replay
  configuration:
    enabled: true
    max_records: 200
    capture_request_body: false
    max_body_bytes: 4096
```

| Field | Default | Description |
|---|---|---|
| `enabled` | `true` | Enable or disable the plugin. |
| `max_records` | `200` | Maximum routing records stored in memory. |
| `capture_request_body` | `false` | Whether to capture request payloads. |
| `max_body_bytes` | `4096` | Maximum bytes per captured body (truncated if exceeded). |
## Authentication

llmsoup supports token-based authentication on all endpoints except `/metrics`.
```yaml
auth:
  enabled: true
  tokens:
    - env: MY_API_TOKEN
  tokens_file: /etc/llmsoup/tokens.yaml
```

| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | `true` | Enable authentication (defaults to `true` when the `auth` section exists). |
| `tokens` | array | — | Inline token definitions using secret references. |
| `tokens_file` | string | — | Path to an external tokens file (preferred over inline tokens). |
### Tokens file format

```yaml
tokens:
  - id: alice
    description: "Alice - Data team"
    secret:
      env: ALICE_TOKEN

  - id: bob
    description: "Bob - Engineering"
    secret:
      file: /run/secrets/bob_token

  - id: ci-pipeline
    description: "CI/CD service account"
    secret:
      env: CI_API_TOKEN
```

## Secret Resolution
All `access_key` and token `secret` fields support multiple resolution methods.
### Environment variable

```yaml
access_key:
  env: OPENAI_API_KEY
```

Reads the value from the specified environment variable.
### File

```yaml
access_key:
  file: /run/secrets/api_key
```

Reads the value from a file path. Useful with Docker/Kubernetes secrets.
### Command

```yaml
access_key:
  command: "cat /etc/secret/key"
```

Runs the command and uses its output as the secret value. Command-based resolution must be enabled with the `LLMSOUP_ALLOW_COMMAND_SECRETS` environment variable (see Environment Variables below).

### Vault

```yaml
access_key:
  vault: "secret/data/api_key"
```

Resolves the secret from the given Vault path.

## Environment Variables
| Variable | Purpose | Default |
|---|---|---|
| `LLMSOUP_CONFIG` | Config file path | `config.yaml` |
| `LLMSOUP_HOST` | Server bind address | `127.0.0.1` |
| `LLMSOUP_PORT` | Server port | `8080` |
| `LLMSOUP_LOG` | Log level filter | `info` |
| `LLMSOUP_ALLOW_COMMAND_SECRETS` | Enable command-based secret resolution | unset |
| `LLMSOUP_PREPARE_OUTPUT` | Override output path for `llmsoup prepare` | `config.yaml` |
| `LLMSOUP_ONNX_MODELS_DIR` | Override directory for ML model storage | `~/.llmsoup/models` |
| `LLMSOUP_SKIP_MODEL_DOWNLOAD` | Skip downloading embedding/domain models (for CI/testing) | unset |
## Complete Example

A full configuration with two models, multiple signal types, routing rules, authentication, and plugins:
```yaml
version: v0.1

defaults:
  default_model: gpt-5-mini
  preference_model: gpt-5-mini
  request_timeout_ms: 60000
  model_cache_ttl_seconds: 300
  embedding_cache_capacity: 1000
  cost_aware_routing: true
  cost_quality_tradeoff: 0.3
  include_cost_headers: true
  context_overflow: truncate_middle
  default_fallback_models: [gpt-5-mini]
  cost_baseline_model: gpt-5.2

models:
  - name: gpt-5-mini
    provider: openai
    access_key:
      env: OPENAI_API_KEY
    endpoints:
      - url: https://api.openai.com/v1/chat/completions
    metadata:
      context_window: 8000
      parameter_count: 120000
      latency_seconds: 0.5
    pricing:
      prompt_per_1m: 0.25
      completion_per_1m: 2.00

  - name: gpt-5.2
    provider: openai
    access_key:
      env: OPENAI_API_KEY
    endpoints:
      - url: https://api.openai.com/v1/chat/completions
    metadata:
      context_window: 128000
      parameter_count: 520000
      latency_seconds: 1.2
    pricing:
      prompt_per_1m: 1.75
      completion_per_1m: 14.00
    reasoning_family: reasoning_effort

signals:
  keyword:
    - name: code_keywords
      operator: OR
      keywords: ["code", "function", "implement", "debug", "program"]

    - name: math_keywords
      operator: OR
      keywords: ["calculate", "equation", "formula", "solve"]

  embedding:
    - name: quick_answer
      model: sentence-transformers/all-MiniLM-L12-v2
      threshold: 0.90
      candidates:
        - "quick question"
        - "simple answer"
      aggregation_method: max

  domain:
    - name: math
      mmlu_categories: [math]
      threshold: 0.6

    - name: programming
      examples:
        - "Write a function"
        - "Debug this code"
      threshold: 0.7

  language:
    - name: english
      language: en

    - name: chinese
      language: zh

  latency:
    - name: low_latency
      max_tpot: 0.050

  fact_check:
    - name: needs_fact_check
      description: "Requests requiring factual accuracy"

  preference:
    - name: code_generation
      description: "User wants code generation"

rules:
  - name: code-routing
    priority: 100
    operator: AND
    conditions:
      - signal: keyword.code_keywords
    action:
      strategy: fallback
      primary_model: gpt-5.2
      fallback_models: [gpt-5-mini]
    plugins:
      - type: system_prompt
        configuration:
          system_prompt: "You are a senior software engineer."
      - type: jailbreak
        configuration:
          enabled: true
          threshold: 0.8
          action: block

  - name: math-routing
    priority: 90
    operator: OR
    conditions:
      - signal: keyword.math_keywords
      - signal: domain.math
        operator: greater-than
        value: 0.6
    action:
      strategy: default
      primary_model: gpt-5.2
    plugins:
      - type: system_prompt
        configuration:
          system_prompt: "Show your work step by step."

  - name: quick-answers
    priority: 80
    conditions:
      - signal: embedding.quick_answer
    action:
      strategy: default
      primary_model: gpt-5-mini

  - name: cost-efficient
    priority: 40
    conditions:
      - signal: domain.programming
    action:
      strategy: default
      primary_model: gpt-5-mini
      fallback_models: [gpt-5.2]
      algorithm:
        type: confidence
        confidence:
          threshold: 0.75
          confidence_method: margin
          escalation_order: size
          on_error: skip

auth:
  enabled: true
  tokens_file: /etc/llmsoup/tokens.yaml
```