Configuration Reference

llmsoup uses a single YAML configuration file to define models, signals, routing rules, plugins, and authentication. This reference covers every configuration option available.

llmsoup loads configuration from a YAML file at startup. The file location is resolved in this order:

  1. --config CLI flag (e.g., llmsoup serve --config my-config.yaml)
  2. LLMSOUP_CONFIG environment variable
  3. config.yaml in the current directory (default)
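
For example, the same config can be supplied either way (both commands use the `llmsoup serve` invocation shown above):

# 1. Explicit flag takes highest precedence
llmsoup serve --config my-config.yaml
# 2. Otherwise the environment variable is consulted
LLMSOUP_CONFIG=/etc/llmsoup/config.yaml llmsoup serve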

Generate a starter config with all options commented:

llmsoup prepare

A config file has the following top-level keys:

version: v0.1 # Optional version identifier
defaults: # Global routing defaults
models: # Model definitions (min 2)
signals: # Signal evaluators
rules: # Routing rules (priority-ordered)
classifier: # Domain classifier model (optional)
auth: # Authentication (optional)

The version field is an optional string carried forward from the config file — it has no effect on behavior but can be used to track config revisions.

llmsoup is backward-compatible with the semantic-router configuration format. The parser auto-normalizes legacy field names:

| Legacy field | Normalized to |
| --- | --- |
| providers.default_model | defaults.default_model |
| providers.models | models |
| decisions | rules |
| modelRefs | action.model_refs + action.primary_model / fallback_models |
| Condition type field | {type}.{name} signal reference format |

Both formats are accepted. This reference documents the normalized format; the Getting Started guide uses the legacy decisions / modelRefs format for compatibility.
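
As a sketch of the normalization (based on the mapping table above; only the renamed keys are shown), these two snippets should parse to the same configuration:

# Legacy semantic-router format
providers:
  default_model: gpt-5-mini

# Normalized equivalent after auto-normalization
defaults:
  default_model: gpt-5-mini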

The defaults block sets global behavior for the routing engine.

defaults:
  default_model: gpt-5-mini
  preference_model: gpt-5-mini
  request_timeout_ms: 60000
  model_cache_ttl_seconds: 300
  embedding_cache_capacity: 1000
  prefer_max_completion_tokens: false
  cost_aware_routing: false
  cost_quality_tradeoff: 0.3
  include_cost_headers: true
  context_overflow: truncate_middle
  default_fallback_models: []
  cost_baseline_model: gpt-5.2
  model_cache_max_capacity: 1000
  semantic_cache_max_entries: 10000
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| default_model | string | | Model used when no routing rule matches. Must reference a configured model. |
| preference_model | string | | Model used for preference signal classification. |
| request_timeout_ms | integer | | Global request timeout in milliseconds. |
| model_cache_ttl_seconds | integer | | TTL for model response cache entries. |
| embedding_cache_capacity | integer | | Maximum entries in the embedding LRU cache. |
| prefer_max_completion_tokens | boolean | | Send max_completion_tokens instead of max_tokens to upstream models. |
| cost_aware_routing | boolean | false | Enable the cost-aware routing algorithm. |
| cost_quality_tradeoff | float | 0.3 | Balance between quality (0.0) and cost savings (1.0). |
| include_cost_headers | boolean | true | Include cost data in response headers (X-LLMSoup-Cost-*). |
| context_overflow | string | truncate_middle | Strategy when the prompt exceeds the context window: rolling_window, truncate_middle, or stop_at_limit. |
| default_fallback_models | array | [] | Fallback model chain used when the default model fails. |
| cost_baseline_model | string | | Model used as the baseline for cost savings calculations. Uses the most expensive model if omitted. |
| model_cache_max_capacity | integer | 1000 | Maximum entries in the model response cache. Oldest entries are evicted when capacity is reached. |
| semantic_cache_max_entries | integer | 10000 | Maximum entries in the semantic cache store. Oldest entries are evicted when capacity is reached. |
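
When include_cost_headers is enabled, per-request cost data appears on the response. As a sketch for inspecting it, assuming llmsoup exposes an OpenAI-compatible /v1/chat/completions endpoint on the default host and port (the request shape here is hypothetical; only the X-LLMSoup-Cost-* header prefix comes from the table above):

# Dump response headers, discard the body, and filter for cost headers
curl -sD - -o /dev/null http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "hello"}]}' \
  | grep -i "x-llmsoup-cost"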

The models array defines all available LLM endpoints. A minimum of 2 models is required.

models:
  - name: gpt-5-mini
    provider: openai
    access_key:
      env: OPENAI_API_KEY
    endpoints:
      - url: https://api.openai.com/v1/chat/completions
        weight: 1
    metadata:
      context_window: 8000
      parameter_count: 120000
      latency_seconds: 0.5
    pricing:
      prompt_per_1m: 0.25
      completion_per_1m: 2.00
      currency: USD
  - name: gpt-5.2
    provider: openai
    access_key:
      env: OPENAI_API_KEY
    endpoints:
      - url: https://api.openai.com/v1/chat/completions
    metadata:
      context_window: 128000
      parameter_count: 520000
      latency_seconds: 1.2
    pricing:
      prompt_per_1m: 1.75
      completion_per_1m: 14.00
    reasoning_family: reasoning_effort
| Field | Required | Type | Description |
| --- | --- | --- | --- |
| name | yes | string | Unique model identifier used in routing rules. |
| provider | no | string | Provider name (e.g., openai, anthropic). |
| access_key | no | string or object | API key — see Secret Resolution. |
| endpoints | yes | array | At least one endpoint. |
| metadata | no | object | Model metadata for routing decisions. |
| pricing | no | object | Token pricing for cost tracking. |
| reasoning_family | no | string or object | Reasoning control configuration. |

Each endpoint requires a url. Optional fields control load balancing and timeouts.

| Field | Required | Type | Description |
| --- | --- | --- | --- |
| url | yes | string | Full URL to the chat completions endpoint. |
| weight | no | integer (>0) | Load balancing weight across multiple endpoints. |
| timeout_ms | no | integer | Per-endpoint timeout override. |
| description | no | string | Human-readable description. |
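
A model that load-balances across two endpoints might look like this (URLs and weights illustrative; assuming weights act as proportional shares, roughly three of every four requests would go to the first endpoint):

endpoints:
  - url: https://api.openai.com/v1/chat/completions
    weight: 3
    timeout_ms: 30000
  - url: https://eu.example.com/v1/chat/completions
    weight: 1
    description: "Secondary EU endpoint"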
The metadata object supports these fields:

| Field | Type | Description |
| --- | --- | --- |
| context_window | integer (>0) | Maximum token context window. |
| parameter_count | integer (>0) | Model parameter count (used for size-based escalation). |
| latency_seconds | float (>=0) | Expected response latency in seconds. |
The pricing object supports these fields:

| Field | Type | Description |
| --- | --- | --- |
| prompt_per_1m | float | Cost per 1M prompt tokens. |
| completion_per_1m | float | Cost per 1M completion tokens. |
| cached_prompt_per_1m | float | Cost per 1M cached prompt tokens. |
| cached_completion_per_1m | float | Cost per 1M cached completion tokens. |
| currency | string | Currency code (default: USD). |

Controls how reasoning parameters are passed to upstream models.

# String shorthand
reasoning_family: reasoning_effort

# Object form
reasoning_family:
  type: chat_template_kwargs
  parameter: thinking_mode

Signals evaluate incoming requests and produce scores used by routing rules. Configure them under the signals block, grouped by type.

Match requests based on keyword presence in the prompt.

signals:
  keyword:
    - name: code_keywords
      operator: OR
      keywords: ["code", "function", "implement", "debug"]
      case_sensitive: false
| Field | Required | Default | Description |
| --- | --- | --- | --- |
| name | yes | | Unique signal identifier. |
| keywords | yes | | Array of trigger phrases (minimum 1). |
| operator | no | OR | Match logic: AND (all must match), OR (any), NOR (none). |
| case_sensitive | no | false | Enable case-sensitive matching. |
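
Since NOR matches only when none of the keywords appear, it can be used to detect the absence of terms. A sketch (signal name hypothetical):

signals:
  keyword:
    - name: non_technical
      operator: NOR
      keywords: ["code", "function", "debug"]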

Semantic similarity matching using BERT embeddings.

signals:
  embedding:
    - name: quick_answer
      model: sentence-transformers/all-MiniLM-L12-v2
      threshold: 0.90
      candidates:
        - "quick question"
        - "simple answer"
        - "brief response"
      aggregation_method: max
| Field | Required | Default | Description |
| --- | --- | --- | --- |
| name | yes | | Unique signal identifier. |
| model | yes | | Embedding model ID (e.g., sentence-transformers/all-MiniLM-L12-v2). |
| threshold | yes | | Similarity threshold (0.0–1.0). |
| reference_text | no | | Single reference text for comparison. |
| candidates | no | | Array of candidate texts for comparison. |
| aggregation_method | no | | How to combine candidate scores: max, avg, or any. Requires candidates. |
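
Instead of a candidates list, a signal can compare against a single reference_text. A sketch (signal name and threshold illustrative):

signals:
  embedding:
    - name: greeting
      model: sentence-transformers/all-MiniLM-L12-v2
      threshold: 0.85
      reference_text: "hello, how are you doing"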

Classify requests into subject domains using MMLU categories or embedding-based matching.

signals:
  domain:
    - name: math
      mmlu_categories: [math]
      threshold: 0.6
    - name: programming
      examples:
        - "Write a Python function"
        - "Debug this JavaScript code"
      description: "Programming and software development"
      threshold: 0.7
| Field | Required | Default | Description |
| --- | --- | --- | --- |
| name | yes | | Unique signal identifier. |
| examples | no | | Example phrases for embedding-based matching. |
| description | no | | Human-readable domain description. |
| mmlu_categories | no | | MMLU category aliases (e.g., math, physics, computer_science). |
| threshold | no | | Confidence threshold (0.0–1.0). |

Detect the language of incoming requests.

signals:
  language:
    - name: english
      language: en
    - name: chinese
      language: zh
| Field | Required | Description |
| --- | --- | --- |
| name | yes | Unique signal identifier. |
| language | yes | ISO 639-1 code (en, es, fr, de, zh, ja, ru, etc.). |

Route based on time-per-output-token (TPOT) requirements.

signals:
  latency:
    - name: low_latency
      max_tpot: 0.050
    - name: medium_latency
      max_tpot: 0.150
| Field | Required | Description |
| --- | --- | --- |
| name | yes | Unique signal identifier. |
| max_tpot | yes | Maximum time per output token in seconds (must be >0). |
| description | no | Human-readable description. |

Flag requests that may require factual verification.

signals:
  fact_check:
    - name: needs_fact_check
      description: "Requests requiring factual accuracy"
| Field | Required | Description |
| --- | --- | --- |
| name | yes | Unique signal identifier. |
| description | no | Human-readable description. |

Classify user feedback signals for adaptive routing.

signals:
  user_feedback:
    - name: clarification_needed
      description: "User needs clarification"
    - name: satisfied
      description: "User is satisfied with the response"
| Field | Required | Description |
| --- | --- | --- |
| name | yes | Unique signal identifier. |
| description | no | Human-readable description. |

Route based on user preference classification. Classification is delegated to an external LLM, the model configured as defaults.preference_model.

signals:
  preference:
    - name: code_generation
      description: "User wants code generation"
    - name: bug_fixing
      description: "User wants help fixing bugs"
| Field | Required | Description |
| --- | --- | --- |
| name | yes | Unique signal identifier. |
| description | no | Human-readable description. |

The optional classifier block configures the MMLU-based domain classification model used by domain signals.

classifier:
  category_model:
    model_id: "models/mom-domain-classifier"
    threshold: 0.6
    category_mapping_path: "models/mom-domain-classifier/category_mapping.json"
| Field | Default | Description |
| --- | --- | --- |
| category_model.model_id | | HuggingFace repo ID or local path to the classification model. |
| category_model.threshold | 0.6 | Confidence threshold (0.0–1.0) for classification. |
| category_model.category_mapping_path | | Path to the category mapping JSON file. |

Routing rules (also called “decisions” in semantic-router format) evaluate signal results and route requests to models. Rules are evaluated in priority order (highest priority wins).

rules:
  - name: code-routing
    priority: 100
    operator: AND
    conditions:
      - signal: keyword.code_keywords
    action:
      strategy: default
      primary_model: gpt-5.2
      fallback_models: [gpt-5-mini]
| Field | Required | Default | Description |
| --- | --- | --- | --- |
| name | yes | | Unique rule identifier. |
| priority | yes | | Integer >0. Higher-priority rules are evaluated first. |
| conditions | yes | | Array of conditions (minimum 1). |
| operator | no | AND | How conditions combine: AND (all must match), OR (any), NOR (none). |
| action | yes | | What happens when the rule matches. |
| plugins | no | | Plugins to apply for this rule (see Plugins). |

Each condition references a signal and optionally applies a comparison.

conditions:
  - signal: keyword.code_keywords
  - signal: domain.math
    operator: greater-than
    value: 0.8
  - signal: language.english
    negate: true
| Field | Required | Default | Description |
| --- | --- | --- | --- |
| signal | yes | | Signal reference in {type}.{name} format (e.g., keyword.code_keywords). |
| operator | no | equals | Comparison: equals, contains, greater-than, less-than, in. |
| value | no | | Comparison value. Numeric for greater-than/less-than, an array for in. |
| negate | no | false | Invert the condition result. |

The action block defines the execution strategy when a rule matches.

action:
  strategy: fallback
  primary_model: gpt-5.2
  fallback_models: [gpt-5-mini]
| Field | Required | Description |
| --- | --- | --- |
| strategy | yes | Execution strategy: default, parallel, or fallback. |
| primary_model | yes | Model to use (must reference a configured model). |
| fallback_models | no | Ordered fallback chain if the primary fails. |
| model_refs | no | Model references with reasoning control overrides. |
| algorithm | no | Selection algorithm (confidence or ratings). |

Strategies:

  • default — Send to the primary model. On failure, try fallback models in order.
  • parallel — Send to multiple models simultaneously. Return the first successful response.
  • fallback — Try primary model first, then each fallback in order until one succeeds.
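
A sketch of a parallel action, assuming the parallel set is the primary model plus its listed fallbacks (the reference above does not spell out which field supplies the set, so treat this as illustrative):

action:
  strategy: parallel
  primary_model: gpt-5.2
  fallback_models: [gpt-5-mini]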

Algorithms control how models are selected within a rule.

Confidence algorithm — Escalate to larger models when confidence is low:

algorithm:
  type: confidence
  confidence:
    threshold: 0.75
    confidence_method: margin
    escalation_order: size
    on_error: skip
| Field | Required | Default | Description |
| --- | --- | --- | --- |
| threshold | yes | | Confidence threshold (0.0–1.0). Below this, escalate to the next model. |
| confidence_method | no | | Method: margin, avg_logprob, or hybrid. |
| escalation_order | no | | Order: size (by parameter count), cost, or automix. |
| on_error | no | | Error handling: skip the model on failure. |

Ratings algorithm — Select among multiple models by policy:

algorithm:
  type: ratings
  ratings:
    policy: highest
    on_error: skip
| Field | Required | Default | Description |
| --- | --- | --- | --- |
| policy | no | | Selection policy: highest, lowest, or prefer_first. |
| on_error | no | | Error handling: skip the model on failure. |

Plugins add pre-processing and post-processing capabilities to routing rules. Configure them in the plugins array on each rule. Each plugin requires a type field and a configuration block with plugin-specific options.

rules:
  - name: secure-routing
    priority: 100
    conditions:
      - signal: keyword.code_keywords
    action:
      strategy: default
      primary_model: gpt-5.2
    plugins:
      - type: jailbreak
        configuration:
          enabled: true
          threshold: 0.8
          action: block
      - type: pii
        configuration:
          enabled: true
          threshold: 0.7
      - type: system_prompt
        configuration:
          system_prompt: "You are a helpful coding assistant."

Inject a system prompt into the request.

- type: system_prompt
  configuration:
    system_prompt: "You are a helpful assistant specialized in mathematics."
    mode: replace
| Field | Default | Description |
| --- | --- | --- |
| system_prompt | | System prompt text to inject. |
| mode | replace | How to apply: replace (overwrite existing), prepend, or append. |

Cache responses based on semantic similarity of prompts.

- type: semantic-cache
  configuration:
    enabled: true
    similarity_threshold: 0.92
    ttl_seconds: 3600
| Field | Default | Description |
| --- | --- | --- |
| enabled | true | Enable or disable the plugin. |
| similarity_threshold | | Minimum similarity (0.0–1.0) to return a cached response. |
| ttl_seconds | | Cache entry time-to-live in seconds. |

Detect and handle prompt injection attempts.

- type: jailbreak
  configuration:
    enabled: true
    threshold: 0.8
    action: block
| Field | Default | Description |
| --- | --- | --- |
| enabled | true | Enable or disable the plugin. |
| threshold | | Detection threshold (0.0–1.0). |
| action | | Response action when an attempt is detected. |

Detect personally identifiable information in prompts.

- type: pii
  configuration:
    enabled: true
    threshold: 0.7
    pii_types_allowed: []
| Field | Default | Description |
| --- | --- | --- |
| enabled | true | Enable or disable the plugin. |
| threshold | | Detection threshold (0.0–1.0). |
| pii_types_allowed | [] | Array of PII types to allow through (empty = block all). |

Modify HTTP headers on proxied requests and responses using a mutations array.

- type: header_mutation
  configuration:
    enabled: true
    mutations:
      - header: X-Routed-By
        operation: set
        value: llmsoup
        phase: response
      - header: X-Request-Source
        operation: add
        value: llmsoup-proxy
        phase: request
| Field | Description |
| --- | --- |
| enabled | Enable or disable the plugin. |
| mutations | Array of mutation rules (see below). |

Each mutation entry:

| Field | Description |
| --- | --- |
| header | Header name. |
| operation | set (overwrite), add (append), or remove. |
| value | Header value (not required for remove). |
| phase | When to apply: request or response. |
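
Since remove takes no value, a mutation that strips a header from responses needs only three fields (header name hypothetical):

mutations:
  - header: X-Internal-Debug
    operation: remove
    phase: response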

Detect potential hallucinations in model responses.

- type: hallucination
  configuration:
    enabled: true
    threshold: 0.7
    action: header
    heuristic_sensitivity: 0.5
| Field | Default | Description |
| --- | --- | --- |
| enabled | true | Enable or disable the plugin. |
| threshold | | Detection threshold (0.0–1.0). |
| action | | Response action: header, body, block, or log. |
| heuristic_sensitivity | 0.5 | Heuristic sensitivity (0.0–1.0). Higher values are more sensitive. |

Capture routing decisions for debugging and analysis.

- type: router_replay
  configuration:
    enabled: true
    max_records: 200
    capture_request_body: false
    max_body_bytes: 4096
| Field | Default | Description |
| --- | --- | --- |
| enabled | true | Enable or disable the plugin. |
| max_records | 200 | Maximum routing records stored in memory. |
| capture_request_body | false | Whether to capture request payloads. |
| max_body_bytes | 4096 | Maximum bytes per captured body (truncated if exceeded). |

llmsoup supports token-based authentication on all endpoints except /metrics.

auth:
  enabled: true
  tokens:
    - env: MY_API_TOKEN
  tokens_file: /etc/llmsoup/tokens.yaml
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | true | Enable authentication (defaults to true when the auth section exists). |
| tokens | array | | Inline token definitions using secret references. |
| tokens_file | string | | Path to an external tokens file (preferred over inline tokens). |
An external tokens file has the following shape:

tokens:
  - id: alice
    description: "Alice - Data team"
    secret:
      env: ALICE_TOKEN
  - id: bob
    description: "Bob - Engineering"
    secret:
      file: /run/secrets/bob_token
  - id: ci-pipeline
    description: "CI/CD service account"
    secret:
      env: CI_API_TOKEN
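
The reference does not show how clients present a token; assuming the server accepts a standard bearer header, a request might look like this (endpoint path and header scheme are assumptions):

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Authorization: Bearer $ALICE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "hello"}]}'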

All access_key and token secret fields support multiple resolution methods.

access_key:
  env: OPENAI_API_KEY

Reads the value from the specified environment variable.

access_key:
  file: /run/secrets/api_key

Reads the value from a file path. Useful with Docker/Kubernetes secrets.

access_key:
  command: "cat /etc/secret/key"

Runs the given command and uses its output as the secret. Command-based resolution is disabled unless the LLMSOUP_ALLOW_COMMAND_SECRETS environment variable is set (see Environment Variables).

access_key:
  vault: "secret/data/api_key"

Reads the value from the given Vault secret path.
| Variable | Purpose | Default |
| --- | --- | --- |
| LLMSOUP_CONFIG | Config file path | config.yaml |
| LLMSOUP_HOST | Server bind address | 127.0.0.1 |
| LLMSOUP_PORT | Server port | 8080 |
| LLMSOUP_LOG | Log level filter | info |
| LLMSOUP_ALLOW_COMMAND_SECRETS | Enable command-based secret resolution | unset |
| LLMSOUP_PREPARE_OUTPUT | Override output path for llmsoup prepare | config.yaml |
| LLMSOUP_ONNX_MODELS_DIR | Override directory for ML model storage | ~/.llmsoup/models |
| LLMSOUP_SKIP_MODEL_DOWNLOAD | Skip downloading embedding/domain models (for CI/testing) | unset |
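
For example, to bind to a different port with verbose logging for a single run (variables taken from the table above):

LLMSOUP_PORT=9090 LLMSOUP_LOG=debug llmsoup serve --config config.yaml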

A full configuration with two models, multiple signal types, routing rules, authentication, and plugins:

version: v0.1
defaults:
  default_model: gpt-5-mini
  preference_model: gpt-5-mini
  request_timeout_ms: 60000
  model_cache_ttl_seconds: 300
  embedding_cache_capacity: 1000
  cost_aware_routing: true
  cost_quality_tradeoff: 0.3
  include_cost_headers: true
  context_overflow: truncate_middle
  default_fallback_models: [gpt-5-mini]
  cost_baseline_model: gpt-5.2
models:
  - name: gpt-5-mini
    provider: openai
    access_key:
      env: OPENAI_API_KEY
    endpoints:
      - url: https://api.openai.com/v1/chat/completions
    metadata:
      context_window: 8000
      parameter_count: 120000
      latency_seconds: 0.5
    pricing:
      prompt_per_1m: 0.25
      completion_per_1m: 2.00
  - name: gpt-5.2
    provider: openai
    access_key:
      env: OPENAI_API_KEY
    endpoints:
      - url: https://api.openai.com/v1/chat/completions
    metadata:
      context_window: 128000
      parameter_count: 520000
      latency_seconds: 1.2
    pricing:
      prompt_per_1m: 1.75
      completion_per_1m: 14.00
    reasoning_family: reasoning_effort
signals:
  keyword:
    - name: code_keywords
      operator: OR
      keywords: ["code", "function", "implement", "debug", "program"]
    - name: math_keywords
      operator: OR
      keywords: ["calculate", "equation", "formula", "solve"]
  embedding:
    - name: quick_answer
      model: sentence-transformers/all-MiniLM-L12-v2
      threshold: 0.90
      candidates:
        - "quick question"
        - "simple answer"
      aggregation_method: max
  domain:
    - name: math
      mmlu_categories: [math]
      threshold: 0.6
    - name: programming
      examples:
        - "Write a function"
        - "Debug this code"
      threshold: 0.7
  language:
    - name: english
      language: en
    - name: chinese
      language: zh
  latency:
    - name: low_latency
      max_tpot: 0.050
  fact_check:
    - name: needs_fact_check
      description: "Requests requiring factual accuracy"
  preference:
    - name: code_generation
      description: "User wants code generation"
rules:
  - name: code-routing
    priority: 100
    operator: AND
    conditions:
      - signal: keyword.code_keywords
    action:
      strategy: fallback
      primary_model: gpt-5.2
      fallback_models: [gpt-5-mini]
    plugins:
      - type: system_prompt
        configuration:
          system_prompt: "You are a senior software engineer."
      - type: jailbreak
        configuration:
          enabled: true
          threshold: 0.8
          action: block
  - name: math-routing
    priority: 90
    operator: OR
    conditions:
      - signal: keyword.math_keywords
      - signal: domain.math
        operator: greater-than
        value: 0.6
    action:
      strategy: default
      primary_model: gpt-5.2
    plugins:
      - type: system_prompt
        configuration:
          system_prompt: "Show your work step by step."
  - name: quick-answers
    priority: 80
    conditions:
      - signal: embedding.quick_answer
    action:
      strategy: default
      primary_model: gpt-5-mini
  - name: cost-efficient
    priority: 40
    conditions:
      - signal: domain.programming
    action:
      strategy: default
      primary_model: gpt-5-mini
      fallback_models: [gpt-5.2]
      algorithm:
        type: confidence
        confidence:
          threshold: 0.75
          confidence_method: margin
          escalation_order: size
          on_error: skip
auth:
  enabled: true
  tokens_file: /etc/llmsoup/tokens.yaml