Configuration Reference

llmsoup uses a single YAML configuration file to define models, signals, routing rules, plugins, and authentication. This reference covers every configuration option available.

llmsoup loads configuration from a YAML file at startup. The file location is resolved in this order:

  1. --config CLI flag (e.g., llmsoup serve --config my-config.yaml)
  2. LLMSOUP_CONFIG environment variable
  3. config.yaml in the current directory (default)
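
For example, the same config can be supplied either way (both commands use the `llmsoup serve` invocation shown above):

# 1. Explicit flag takes highest precedence
llmsoup serve --config my-config.yaml
# 2. Otherwise the environment variable is consulted
LLMSOUP_CONFIG=/etc/llmsoup/config.yaml llmsoup serve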

Generate a starter config with all options commented:

llmsoup prepare

A config file has the following top-level keys:

version: v0.1 # Optional version identifier
defaults: # Global routing defaults
models: # Model definitions (min 2)
signals: # Signal evaluators
rules: # Routing rules (priority-ordered)
classifier: # Domain classifier model (optional)
auth: # Authentication (optional)

The version field is an optional string carried forward from the config file — it has no effect on behavior but can be used to track config revisions.

llmsoup is backward-compatible with the semantic-router configuration format. The parser auto-normalizes legacy field names:

| Legacy field | Normalized to |
| --- | --- |
| providers.default_model | defaults.default_model |
| providers.models | models |
| decisions | rules |
| modelRefs | action.model_refs + action.primary_model / fallback_models |
| Condition type field | {type}.{name} signal reference format |

Both formats are accepted. This reference documents the normalized format; the Getting Started guide uses the legacy decisions / modelRefs format for compatibility.
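
As a sketch of the normalization (based on the mapping table above; only the renamed keys are shown), these two snippets should parse to the same configuration:

# Legacy semantic-router format
providers:
  default_model: gpt-5-mini

# Normalized equivalent after auto-normalization
defaults:
  default_model: gpt-5-mini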

The defaults block sets global behavior for the routing engine.

defaults:
  default_model: gpt-5-mini
  preference_model: gpt-5-mini
  request_timeout_ms: 60000
  model_cache_ttl_seconds: 300
  embedding_cache_capacity: 1000
  prefer_max_completion_tokens: false
  cost_aware_routing: false
  cost_quality_tradeoff: 0.3
  include_cost_headers: true
  context_overflow: truncate_middle
  default_fallback_models: []
  cost_baseline_model: gpt-5.2
  model_cache_max_capacity: 1000
  semantic_cache_max_entries: 10000
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| default_model | string | | Model used when no routing rule matches. Must reference a configured model. |
| preference_model | string | | Model used for preference signal classification. |
| request_timeout_ms | integer | | Global request timeout in milliseconds. |
| model_cache_ttl_seconds | integer | | TTL for model response cache entries. |
| embedding_cache_capacity | integer | | Maximum entries in the embedding LRU cache. |
| prefer_max_completion_tokens | boolean | | Send max_completion_tokens instead of max_tokens to upstream models. |
| cost_aware_routing | boolean | false | Enable the cost-aware routing algorithm. |
| cost_quality_tradeoff | float | 0.3 | Balance between quality (0.0) and cost savings (1.0). |
| include_cost_headers | boolean | true | Include cost data in response headers (X-LLMSoup-Cost-*). |
| context_overflow | string | truncate_middle | Strategy when the prompt exceeds the context window: rolling_window, truncate_middle, or stop_at_limit. |
| default_fallback_models | array | [] | Fallback model chain used when the default model fails. |
| cost_baseline_model | string | | Model used as the baseline for cost savings calculations. Uses the most expensive model if omitted. |
| model_cache_max_capacity | integer | 1000 | Maximum entries in the model response cache. Oldest entries are evicted when capacity is reached. |
| semantic_cache_max_entries | integer | 10000 | Maximum entries in the semantic cache store. Oldest entries are evicted when capacity is reached. |
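
When include_cost_headers is enabled, per-request cost data appears on the response. As a sketch for inspecting it, assuming llmsoup exposes an OpenAI-compatible /v1/chat/completions endpoint on the default host and port (the request shape here is hypothetical; only the X-LLMSoup-Cost-* header prefix comes from the table above):

# Dump response headers, discard the body, and filter for cost headers
curl -sD - -o /dev/null http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "hello"}]}' \
  | grep -i "x-llmsoup-cost"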

The models array defines all available LLM endpoints. A minimum of 2 models is required.

models:
  - name: gpt-5-mini
    provider: openai
    access_key:
      env: OPENAI_API_KEY
    endpoints:
      - url: https://api.openai.com/v1/chat/completions
        weight: 1
    metadata:
      context_window: 8000
      parameter_count: 120000
      latency_seconds: 0.5
    pricing:
      prompt_per_1m: 0.25
      completion_per_1m: 2.00
      currency: USD
  - name: gpt-5.2
    provider: openai
    access_key:
      env: OPENAI_API_KEY
    endpoints:
      - url: https://api.openai.com/v1/chat/completions
    metadata:
      context_window: 128000
      parameter_count: 520000
      latency_seconds: 1.2
    pricing:
      prompt_per_1m: 1.75
      completion_per_1m: 14.00
    reasoning_family: reasoning_effort
| Field | Required | Type | Description |
| --- | --- | --- | --- |
| name | yes | string | Unique model identifier used in routing rules. |
| provider | no | string | Provider name (e.g., openai, anthropic). |
| access_key | no | string or object | API key — see Secret Resolution. |
| endpoints | yes | array | At least one endpoint. |
| metadata | no | object | Model metadata for routing decisions. |
| pricing | no | object | Token pricing for cost tracking. |
| reasoning_family | no | string or object | Reasoning control configuration. |

Each endpoint requires a url. Optional fields control load balancing and timeouts.

| Field | Required | Type | Description |
| --- | --- | --- | --- |
| url | yes | string | Full URL to the chat completions endpoint. |
| weight | no | integer (>0) | Load balancing weight across multiple endpoints. |
| timeout_ms | no | integer | Per-endpoint timeout override. |
| description | no | string | Human-readable description. |
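
A model that load-balances across two endpoints might look like this (URLs and weights illustrative; assuming weights act as proportional shares, roughly three of every four requests would go to the first endpoint):

endpoints:
  - url: https://api.openai.com/v1/chat/completions
    weight: 3
    timeout_ms: 30000
  - url: https://eu.example.com/v1/chat/completions
    weight: 1
    description: "Secondary EU endpoint"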
The metadata object supports these fields:

| Field | Type | Description |
| --- | --- | --- |
| context_window | integer (>0) | Maximum token context window. |
| parameter_count | integer (>0) | Model parameter count (used for size-based escalation). |
| latency_seconds | float (>=0) | Expected response latency in seconds. |
The pricing object supports these fields:

| Field | Type | Description |
| --- | --- | --- |
| prompt_per_1m | float | Cost per 1M prompt tokens. |
| completion_per_1m | float | Cost per 1M completion tokens. |
| cached_prompt_per_1m | float | Cost per 1M cached prompt tokens. |
| cached_completion_per_1m | float | Cost per 1M cached completion tokens. |
| currency | string | Currency code (default: USD). |

Controls how reasoning parameters are passed to upstream models.

# String shorthand
reasoning_family: reasoning_effort

# Object form
reasoning_family:
  type: chat_template_kwargs
  parameter: thinking_mode

Signals evaluate incoming requests and produce scores used by routing rules. Configure them under the signals block, grouped by type.

Match requests based on keyword presence in the prompt.

signals:
  keyword:
    - name: code_keywords
      operator: OR
      keywords: ["code", "function", "implement", "debug"]
      case_sensitive: false
| Field | Required | Default | Description |
| --- | --- | --- | --- |
| name | yes | | Unique signal identifier. |
| keywords | yes | | Array of trigger phrases (minimum 1). |
| operator | no | OR | Match logic: AND (all must match), OR (any), NOR (none). |
| case_sensitive | no | false | Enable case-sensitive matching. |
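
Since NOR matches only when none of the keywords appear, it can be used to detect the absence of terms. A sketch (signal name hypothetical):

signals:
  keyword:
    - name: non_technical
      operator: NOR
      keywords: ["code", "function", "debug"]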

Semantic similarity matching using BERT embeddings.

signals:
  embedding:
    - name: quick_answer
      model: sentence-transformers/all-MiniLM-L12-v2
      threshold: 0.90
      candidates:
        - "quick question"
        - "simple answer"
        - "brief response"
      aggregation_method: max
| Field | Required | Default | Description |
| --- | --- | --- | --- |
| name | yes | | Unique signal identifier. |
| model | yes | | Embedding model ID (e.g., sentence-transformers/all-MiniLM-L12-v2). |
| threshold | yes | | Similarity threshold (0.0–1.0). |
| reference_text | no | | Single reference text for comparison. |
| candidates | no | | Array of candidate texts for comparison. |
| aggregation_method | no | | How to combine candidate scores: max, avg, or any. Requires candidates. |
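
Instead of a candidates list, a signal can compare against a single reference_text. A sketch (signal name and threshold illustrative):

signals:
  embedding:
    - name: greeting
      model: sentence-transformers/all-MiniLM-L12-v2
      threshold: 0.85
      reference_text: "hello, how are you doing"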

Classify requests into subject domains using MMLU categories or embedding-based matching.

signals:
  domain:
    - name: math
      mmlu_categories: [math]
      threshold: 0.6
    - name: programming
      examples:
        - "Write a Python function"
        - "Debug this JavaScript code"
      description: "Programming and software development"
      threshold: 0.7
| Field | Required | Default | Description |
| --- | --- | --- | --- |
| name | yes | | Unique signal identifier. |
| examples | no | | Example phrases for embedding-based matching. |
| description | no | | Human-readable domain description. |
| mmlu_categories | no | | MMLU category aliases (e.g., math, physics, computer_science). |
| threshold | no | | Confidence threshold (0.0–1.0). |

Detect the language of incoming requests.

signals:
  language:
    - name: english
      language: en
    - name: chinese
      language: zh
| Field | Required | Description |
| --- | --- | --- |
| name | yes | Unique signal identifier. |
| language | yes | ISO 639-1 code (en, es, fr, de, zh, ja, ru, etc.). |

Route based on time-per-output-token (TPOT) requirements.

signals:
  latency:
    - name: low_latency
      max_tpot: 0.050
    - name: medium_latency
      max_tpot: 0.150
| Field | Required | Description |
| --- | --- | --- |
| name | yes | Unique signal identifier. |
| max_tpot | yes | Maximum time per output token in seconds (must be >0). |
| description | no | Human-readable description. |

Flag requests that may require factual verification.

signals:
  fact_check:
    - name: needs_fact_check
      description: "Requests requiring factual accuracy"
| Field | Required | Description |
| --- | --- | --- |
| name | yes | Unique signal identifier. |
| description | no | Human-readable description. |

Classify user feedback signals for adaptive routing.

signals:
  user_feedback:
    - name: clarification_needed
      description: "User needs clarification"
    - name: satisfied
      description: "User is satisfied with the response"
| Field | Required | Description |
| --- | --- | --- |
| name | yes | Unique signal identifier. |
| description | no | Human-readable description. |

Route based on user preference classification. Classification is delegated to an external LLM, the model configured as defaults.preference_model.

signals:
  preference:
    - name: code_generation
      description: "User wants code generation"
    - name: bug_fixing
      description: "User wants help fixing bugs"
| Field | Required | Description |
| --- | --- | --- |
| name | yes | Unique signal identifier. |
| description | no | Human-readable description. |

The optional classifier block configures the MMLU-based domain classification model used by domain signals.

classifier:
  category_model:
    model_id: "models/mom-domain-classifier"
    threshold: 0.6
    category_mapping_path: "models/mom-domain-classifier/category_mapping.json"
| Field | Default | Description |
| --- | --- | --- |
| category_model.model_id | | HuggingFace repo ID or local path to the classification model. |
| category_model.threshold | 0.6 | Confidence threshold (0.0–1.0) for classification. |
| category_model.category_mapping_path | | Path to the category mapping JSON file. |

Routing rules (also called “decisions” in semantic-router format) evaluate signal results and route requests to models. Rules are evaluated in priority order (highest priority wins).

rules:
  - name: code-routing
    priority: 100
    operator: AND
    conditions:
      - signal: keyword.code_keywords
    action:
      strategy: default
      primary_model: gpt-5.2
      fallback_models: [gpt-5-mini]
| Field | Required | Default | Description |
| --- | --- | --- | --- |
| name | yes | | Unique rule identifier. |
| priority | yes | | Integer >0. Higher-priority rules are evaluated first. |
| conditions | yes | | Array of conditions (minimum 1). |
| operator | no | AND | How conditions combine: AND (all must match), OR (any), NOR (none). |
| action | yes | | What happens when the rule matches. |
| plugins | no | | Plugins to apply for this rule (see Plugins). |

Each condition references a signal and optionally applies a comparison.

conditions:
  - signal: keyword.code_keywords
  - signal: domain.math
    operator: greater-than
    value: 0.8
  - signal: language.english
    negate: true
| Field | Required | Default | Description |
| --- | --- | --- | --- |
| signal | yes | | Signal reference in {type}.{name} format (e.g., keyword.code_keywords). |
| operator | no | equals | Comparison: equals, contains, greater-than, less-than, in. |
| value | no | | Comparison value. Numeric for greater-than/less-than, an array for in. |
| negate | no | false | Invert the condition result. |

The action block defines the execution strategy when a rule matches.

action:
  strategy: fallback
  primary_model: gpt-5.2
  fallback_models: [gpt-5-mini]
| Field | Required | Description |
| --- | --- | --- |
| strategy | yes | Execution strategy: default, parallel, or fallback. |
| primary_model | yes | Model to use (must reference a configured model). |
| fallback_models | no | Ordered fallback chain if the primary fails. |
| model_refs | no | Model references with reasoning control overrides. |
| algorithm | no | Selection algorithm (confidence or ratings). |

Strategies:

  • default — Send to the primary model. On failure, try fallback models in order.
  • parallel — Send to multiple models simultaneously. Return the first successful response.
  • fallback — Try primary model first, then each fallback in order until one succeeds.
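
A sketch of a parallel action, assuming the parallel set is the primary model plus its listed fallbacks (the reference above does not spell out which field supplies the set, so treat this as illustrative):

action:
  strategy: parallel
  primary_model: gpt-5.2
  fallback_models: [gpt-5-mini]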

Algorithms control how models are selected within a rule.

Confidence algorithm — Escalate to larger models when confidence is low:

algorithm:
  type: confidence
  confidence:
    threshold: 0.75
    confidence_method: margin
    escalation_order: size
    on_error: skip
| Field | Required | Default | Description |
| --- | --- | --- | --- |
| threshold | yes | | Confidence threshold (0.0–1.0). Below this, escalate to the next model. |
| confidence_method | no | | Method: margin, avg_logprob, or hybrid. |
| escalation_order | no | | Order: size (by parameter count), cost, or automix. |
| on_error | no | | Error handling: skip the model on failure. |

Ratings algorithm — Select among multiple models by policy:

algorithm:
  type: ratings
  ratings:
    policy: highest
    on_error: skip
| Field | Required | Default | Description |
| --- | --- | --- | --- |
| policy | no | | Selection policy: highest, lowest, or prefer_first. |
| on_error | no | | Error handling: skip the model on failure. |

Plugins add pre-processing and post-processing capabilities to routing rules. Configure them in the plugins array on each rule. Each plugin requires a type field and a configuration block with plugin-specific options.

rules:
  - name: secure-routing
    priority: 100
    conditions:
      - signal: keyword.code_keywords
    action:
      strategy: default
      primary_model: gpt-5.2
    plugins:
      - type: jailbreak
        configuration:
          enabled: true
          threshold: 0.8
          action: block
      - type: pii
        configuration:
          enabled: true
          threshold: 0.7
      - type: system_prompt
        configuration:
          system_prompt: "You are a helpful coding assistant."

Inject a system prompt into the request.

- type: system_prompt
  configuration:
    system_prompt: "You are a helpful assistant specialized in mathematics."
    mode: replace
| Field | Default | Description |
| --- | --- | --- |
| system_prompt | | System prompt text to inject. |
| mode | replace | How to apply: replace (overwrite existing), prepend, or append. |

Cache responses based on semantic similarity of prompts.

- type: semantic-cache
  configuration:
    enabled: true
    similarity_threshold: 0.92
    ttl_seconds: 3600
| Field | Default | Description |
| --- | --- | --- |
| enabled | true | Enable or disable the plugin. |
| similarity_threshold | | Minimum similarity (0.0–1.0) to return a cached response. |
| ttl_seconds | | Cache entry time-to-live in seconds. |

Detect and handle prompt injection attempts.

- type: jailbreak
  configuration:
    enabled: true
    threshold: 0.8
    action: block
| Field | Default | Description |
| --- | --- | --- |
| enabled | true | Enable or disable the plugin. |
| threshold | | Detection threshold (0.0–1.0). |
| action | | Response action when an attempt is detected. |

Detect personally identifiable information in prompts.

- type: pii
  configuration:
    enabled: true
    threshold: 0.7
    pii_types_allowed: []
| Field | Default | Description |
| --- | --- | --- |
| enabled | true | Enable or disable the plugin. |
| threshold | | Detection threshold (0.0–1.0). |
| pii_types_allowed | [] | Array of PII types to allow through (empty = block all). |

Modify HTTP headers on proxied requests and responses using a mutations array.

- type: header_mutation
  configuration:
    enabled: true
    mutations:
      - header: X-Routed-By
        operation: set
        value: llmsoup
        phase: response
      - header: X-Request-Source
        operation: add
        value: llmsoup-proxy
        phase: request
| Field | Description |
| --- | --- |
| enabled | Enable or disable the plugin. |
| mutations | Array of mutation rules (see below). |

Each mutation entry:

| Field | Description |
| --- | --- |
| header | Header name. |
| operation | set (overwrite), add (append), or remove. |
| value | Header value (not required for remove). |
| phase | When to apply: request or response. |
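
Since remove takes no value, a mutation that strips a header from responses needs only three fields (header name hypothetical):

mutations:
  - header: X-Internal-Debug
    operation: remove
    phase: response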

Detect potential hallucinations in model responses.

- type: hallucination
  configuration:
    enabled: true
    threshold: 0.7
    action: header
    heuristic_sensitivity: 0.5
| Field | Default | Description |
| --- | --- | --- |
| enabled | true | Enable or disable the plugin. |
| threshold | | Detection threshold (0.0–1.0). |
| action | | Response action: header, body, block, or log. |
| heuristic_sensitivity | 0.5 | Heuristic sensitivity (0.0–1.0). Higher values are more sensitive. |

Capture routing decisions for debugging and analysis.

- type: router_replay
  configuration:
    enabled: true
    max_records: 200
    capture_request_body: false
    max_body_bytes: 4096
| Field | Default | Description |
| --- | --- | --- |
| enabled | true | Enable or disable the plugin. |
| max_records | 200 | Maximum routing records stored in memory. |
| capture_request_body | false | Whether to capture request payloads. |
| max_body_bytes | 4096 | Maximum bytes per captured body (truncated if exceeded). |

llmsoup supports token-based authentication on all endpoints except /metrics.

auth:
  enabled: true
  tokens:
    - env: MY_API_TOKEN
  tokens_file: /etc/llmsoup/tokens.yaml
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | true | Enable authentication (defaults to true when the auth section exists). |
| tokens | array | | Inline token definitions using secret references. |
| tokens_file | string | | Path to an external tokens file (preferred over inline tokens). |
An external tokens file has the following shape:

tokens:
  - id: alice
    description: "Alice - Data team"
    secret:
      env: ALICE_TOKEN
  - id: bob
    description: "Bob - Engineering"
    secret:
      file: /run/secrets/bob_token
  - id: ci-pipeline
    description: "CI/CD service account"
    secret:
      env: CI_API_TOKEN
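
The reference does not show how clients present a token; assuming the server accepts a standard bearer header, a request might look like this (endpoint path and header scheme are assumptions):

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Authorization: Bearer $ALICE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "hello"}]}'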

All access_key and token secret fields support multiple resolution methods.

access_key:
  env: OPENAI_API_KEY

Reads the value from the specified environment variable.

access_key:
  file: /run/secrets/api_key

Reads the value from a file path. Useful with Docker/Kubernetes secrets.

access_key:
  command: "cat /etc/secret/key"

Runs the given command and uses its output as the secret. Command-based resolution is disabled unless the LLMSOUP_ALLOW_COMMAND_SECRETS environment variable is set (see Environment Variables).

access_key:
  vault: "secret/data/api_key"

Reads the value from the given Vault secret path.
| Variable | Purpose | Default |
| --- | --- | --- |
| LLMSOUP_CONFIG | Config file path | config.yaml |
| LLMSOUP_HOST | Server bind address | 127.0.0.1 |
| LLMSOUP_PORT | Server port | 8080 |
| LLMSOUP_LOG | Log level filter | info |
| LLMSOUP_ALLOW_COMMAND_SECRETS | Enable command-based secret resolution | unset |
| LLMSOUP_PREPARE_OUTPUT | Override output path for llmsoup prepare | config.yaml |
| LLMSOUP_ONNX_MODELS_DIR | Override directory for ML model storage | ~/.llmsoup/models |
| LLMSOUP_SKIP_MODEL_DOWNLOAD | Skip downloading embedding/domain models (for CI/testing) | unset |
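
For example, to bind to a different port with verbose logging for a single run (variables taken from the table above):

LLMSOUP_PORT=9090 LLMSOUP_LOG=debug llmsoup serve --config config.yaml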

A full configuration with two models, multiple signal types, routing rules, authentication, and plugins:

version: v0.1
defaults:
  default_model: gpt-5-mini
  preference_model: gpt-5-mini
  request_timeout_ms: 60000
  model_cache_ttl_seconds: 300
  embedding_cache_capacity: 1000
  cost_aware_routing: true
  cost_quality_tradeoff: 0.3
  include_cost_headers: true
  context_overflow: truncate_middle
  default_fallback_models: [gpt-5-mini]
  cost_baseline_model: gpt-5.2
models:
  - name: gpt-5-mini
    provider: openai
    access_key:
      env: OPENAI_API_KEY
    endpoints:
      - url: https://api.openai.com/v1/chat/completions
    metadata:
      context_window: 8000
      parameter_count: 120000
      latency_seconds: 0.5
    pricing:
      prompt_per_1m: 0.25
      completion_per_1m: 2.00
  - name: gpt-5.2
    provider: openai
    access_key:
      env: OPENAI_API_KEY
    endpoints:
      - url: https://api.openai.com/v1/chat/completions
    metadata:
      context_window: 128000
      parameter_count: 520000
      latency_seconds: 1.2
    pricing:
      prompt_per_1m: 1.75
      completion_per_1m: 14.00
    reasoning_family: reasoning_effort
signals:
  keyword:
    - name: code_keywords
      operator: OR
      keywords: ["code", "function", "implement", "debug", "program"]
    - name: math_keywords
      operator: OR
      keywords: ["calculate", "equation", "formula", "solve"]
  embedding:
    - name: quick_answer
      model: sentence-transformers/all-MiniLM-L12-v2
      threshold: 0.90
      candidates:
        - "quick question"
        - "simple answer"
      aggregation_method: max
  domain:
    - name: math
      mmlu_categories: [math]
      threshold: 0.6
    - name: programming
      examples:
        - "Write a function"
        - "Debug this code"
      threshold: 0.7
  language:
    - name: english
      language: en
    - name: chinese
      language: zh
  latency:
    - name: low_latency
      max_tpot: 0.050
  fact_check:
    - name: needs_fact_check
      description: "Requests requiring factual accuracy"
  preference:
    - name: code_generation
      description: "User wants code generation"
rules:
  - name: code-routing
    priority: 100
    operator: AND
    conditions:
      - signal: keyword.code_keywords
    action:
      strategy: fallback
      primary_model: gpt-5.2
      fallback_models: [gpt-5-mini]
    plugins:
      - type: system_prompt
        configuration:
          system_prompt: "You are a senior software engineer."
      - type: jailbreak
        configuration:
          enabled: true
          threshold: 0.8
          action: block
  - name: math-routing
    priority: 90
    operator: OR
    conditions:
      - signal: keyword.math_keywords
      - signal: domain.math
        operator: greater-than
        value: 0.6
    action:
      strategy: default
      primary_model: gpt-5.2
    plugins:
      - type: system_prompt
        configuration:
          system_prompt: "Show your work step by step."
  - name: quick-answers
    priority: 80
    conditions:
      - signal: embedding.quick_answer
    action:
      strategy: default
      primary_model: gpt-5-mini
  - name: cost-efficient
    priority: 40
    conditions:
      - signal: domain.programming
    action:
      strategy: default
      primary_model: gpt-5-mini
      fallback_models: [gpt-5.2]
      algorithm:
        type: confidence
        confidence:
          threshold: 0.75
          confidence_method: margin
          escalation_order: size
          on_error: skip
auth:
  enabled: true
  tokens_file: /etc/llmsoup/tokens.yaml