# Advanced Configuration
This page covers advanced configuration topics that go beyond the basics in the Configuration Reference. Each section dives deeper into tuning, behavior details, and practical examples.
## Multi-endpoint load balancing
A model can have multiple endpoints for redundancy and load distribution. When more than one endpoint is configured, llmsoup distributes requests across them based on their `weight` values.
### Weight-based distribution
Each endpoint’s weight determines its share of traffic. The probability of an endpoint being selected is `weight / total_weights`.
```yaml
models:
  - name: gpt-5-mini
    provider: openai
    access_key:
      env: OPENAI_API_KEY
    endpoints:
      - url: https://primary.openai.example.com/v1/chat/completions
        weight: 70
        timeout_ms: 5000
        description: "Primary region (us-east)"
      - url: https://secondary.openai.example.com/v1/chat/completions
        weight: 30
        timeout_ms: 8000
        description: "Secondary region (eu-west)"
```

In this example, roughly 70% of requests go to the primary endpoint and 30% to the secondary. If `weight` is omitted, it defaults to 1 — so two unweighted endpoints split traffic 50/50.
### Endpoint fields
| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string | (required) | Full URL to the chat completions endpoint. |
| `weight` | integer (>0) | 1 | Load balancing weight. Higher values receive more traffic. |
| `timeout_ms` | integer | global default | Per-endpoint timeout override in milliseconds. |
| `description` | string | — | Human-readable label (appears in logs and metrics). |
### When to use multiple endpoints
- Geographic redundancy — Route to the nearest region with higher weight, fail over to others.
- Provider diversity — Split traffic between OpenAI and Azure OpenAI for the same model (see the sketch after this list).
- Rate limit management — Distribute requests across multiple API keys or accounts.
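For example, the provider-diversity pattern might look like the following sketch, assuming both endpoints expose the same chat-completions API under a single model entry (the URLs and weights are illustrative, not prescribed values):

```yaml
models:
  - name: gpt-5.2
    provider: openai
    access_key:
      env: OPENAI_API_KEY
    endpoints:
      - url: https://api.openai.example.com/v1/chat/completions
        weight: 60
        description: "OpenAI direct"
      - url: https://myorg.azure.example.com/openai/v1/chat/completions
        weight: 40
        description: "Azure OpenAI"
```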
## Reasoning configuration
Reasoning configuration controls how llmsoup tells upstream models to use chain-of-thought reasoning. This is configured per-model with `reasoning_family` and overridden per-rule with `model_refs`.
### `reasoning_family`
The `reasoning_family` field on a model definition tells llmsoup how to pass reasoning parameters to that model’s API. It accepts two forms:
String shorthand — for providers that use a standard parameter name:
```yaml
models:
  - name: gpt-5.2
    reasoning_family: reasoning_effort
```

This tells llmsoup the model supports a `reasoning_effort` parameter directly in the API request.
Object form — for providers with custom parameter names:
```yaml
models:
  - name: deepseek-r1
    reasoning_family:
      type: chat_template_kwargs
      parameter: thinking_mode
```

This tells llmsoup to pass reasoning controls via the `thinking_mode` parameter in `chat_template_kwargs`.
### Model reference overrides
Within a routing rule, `model_refs` can override reasoning behavior per-model:
```yaml
rules:
  - name: deep-analysis
    priority: 100
    conditions:
      - signal: keyword.complex_query
    action:
      strategy: default
      primary_model: gpt-5.2
      model_refs:
        - model: gpt-5.2
          use_reasoning: true
          reasoning_effort: high
        - model: gpt-5-mini
          use_reasoning: false
```

| Field | Type | Description |
|---|---|---|
| `model` | string | Model name (must reference a configured model). |
| `use_reasoning` | boolean | Enable or disable reasoning for this rule. |
| `reasoning_effort` | string | Effort level: `low`, `medium`, or `high`. |
When `use_reasoning` is true, llmsoup mutates the outbound request to include the reasoning parameter appropriate for the model’s `reasoning_family`. When false, reasoning parameters are stripped even if the model supports them.
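As a concrete illustration, combining the string-shorthand `reasoning_family` above with the deep-analysis rule, the outbound request body would plausibly carry the effort level inline (an illustrative sketch; the exact parameter placement depends on the provider’s API):

```json
{
  "model": "gpt-5.2",
  "messages": [
    { "role": "user", "content": "Analyze the tradeoffs in..." }
  ],
  "reasoning_effort": "high"
}
```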
## Plugins overview
Plugins add pre-processing and post-processing to routing rules. They are configured in the `plugins` array on each rule and execute in order.
```yaml
rules:
  - name: secure-route
    priority: 100
    conditions:
      - signal: keyword.sensitive
    action:
      strategy: default
      primary_model: gpt-5.2
    plugins:
      - type: jailbreak
        configuration:
          enabled: true
          threshold: 0.8
      - type: system_prompt
        configuration:
          system_prompt: "Answer carefully and factually."
```

Every plugin has a `type` field and a `configuration` block. The available plugin types are:
| Plugin type | Purpose |
|---|---|
| `system_prompt` | Inject or replace system prompts |
| `semantic-cache` | Cache responses by semantic similarity |
| `jailbreak` | Detect prompt injection attempts |
| `pii` | Detect personally identifiable information |
| `header_mutation` | Modify HTTP headers on requests/responses |
| `hallucination` | Flag potential hallucinations in responses |
| `router_replay` | Record routing decisions for debugging |
For complete per-plugin configuration details, field descriptions, and examples, see the Plugins Reference.
## Cache tuning
llmsoup uses three independent caching layers, each with different eviction strategies and tuning knobs.
### Model response cache (TTL)
Caches full model responses keyed by request content. When a cache hit occurs, the response is returned immediately without calling the upstream model.
```yaml
defaults:
  model_cache_ttl_seconds: 3600
  model_cache_max_capacity: 1000
```

| Setting | Default | Description |
|---|---|---|
| `model_cache_ttl_seconds` | 3600 | Time-to-live in seconds. Entries expire after this duration regardless of access. |
| `model_cache_max_capacity` | 1000 | Maximum number of cached model response entries. Oldest entries are evicted when capacity is reached (LRU). |
Eviction behavior: Entries are evicted after the TTL expires or when the cache reaches `model_cache_max_capacity` entries (whichever comes first). The capacity limit uses LRU eviction — the least recently used entry is removed to make room for new ones. This prevents unbounded memory growth under heavy traffic.
When to tune (see the example after this list):
- Lower TTL (e.g., 60–300) for rapidly changing data or when freshness matters.
- Higher TTL (e.g., 7200+) for stable queries where the same prompt always has the same answer.
- Set to 0 to effectively disable model response caching.
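For example, a freshness-sensitive deployment might shorten the TTL and tighten the capacity bound (the values here are illustrative, not recommendations):

```yaml
defaults:
  model_cache_ttl_seconds: 120    # expire responses after 2 minutes
  model_cache_max_capacity: 500   # smaller LRU bound for tighter memory
```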
### Embedding cache (LRU)
Caches computed embedding vectors to avoid re-running BERT inference for repeated text. This primarily benefits the embedding signal evaluator.
```yaml
defaults:
  embedding_cache_capacity: 1000
```

| Setting | Default | Description |
|---|---|---|
| `embedding_cache_capacity` | 1000 | Maximum number of cached embedding entries. Oldest entries are evicted when capacity is reached. |
Eviction behavior: Least Recently Used (LRU). When the cache is full, the entry that hasn’t been accessed for the longest time is evicted to make room.
When to tune:
- Increase if your workload has many unique prompts and you see high embedding computation times in metrics.
- Decrease if memory is constrained — each embedding entry holds a vector of floats (see the sizing note below).
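As a rough sizing note: `sentence-transformers/all-MiniLM-L12-v2` (the model used in the sharing example below) emits 384-dimensional vectors, so at 4 bytes per float32 each entry holds about 384 × 4 ≈ 1.5 KB of vector data plus key overhead, meaning the default 1000-entry cache stays in the low single-digit megabytes.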
### TPOT cache (EMA)
Tracks per-model Time Per Output Token (TPOT) using Exponential Moving Average smoothing. This is used internally by the latency signal evaluator — it is not directly configurable via YAML.
How it works: After each model response, llmsoup computes the actual TPOT and updates the smoothed average:
```
smoothed_tpot = alpha × new_tpot + (1 - alpha) × previous_smoothed_tpot
```

The default EMA alpha is 0.3, which means:
- 30% weight to the most recent observation
- 70% weight to historical average
This smoothing prevents a single slow response from drastically changing the model’s latency estimate. The latency signal evaluator compares the smoothed TPOT against the max_tpot threshold configured on each latency signal.
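To see the damping in action, suppose the current smoothed TPOT is 50 ms/token and a single slow response comes in at 200 ms/token (numbers chosen for illustration):

```
smoothed_tpot = 0.3 × 200 + 0.7 × 50 = 60 + 35 = 95 ms/token
```

The estimate moves to 95 ms/token rather than jumping to 200, so one outlier cannot immediately push a model past a `max_tpot` threshold.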
## Embedding model sharing
When multiple embedding signals reference the same model (e.g., `sentence-transformers/all-MiniLM-L12-v2`), llmsoup automatically shares a single loaded model instance across all signals. This is transparent — no configuration is needed.
Impact: Each BERT embedding model uses ~130MB of RAM. Without sharing, six signals using the same model would consume ~780MB. With sharing, they use ~130MB total.
How it works: During startup, llmsoup groups embedding signals by their resolved model path. Signals that share a path receive a shared handle to a single model instance. Each signal still maintains its own reference text, candidates, and cache — only the underlying BERT model weights are shared.
## Secret resolution
llmsoup resolves secrets (API keys, authentication tokens) from external sources at startup. Secrets are never stored in the configuration file itself. When a secret reference specifies multiple methods, they are tried in a fixed order (see Resolution priority below).
### Environment variable
The simplest and most common method. Reads the secret from an environment variable.
```yaml
access_key:
  env: OPENAI_API_KEY
```

Behavior: Looks up the environment variable at config load time. Fails validation if the variable is not set or is empty.
Error: `Secret resolution failed: environment variable 'OPENAI_API_KEY' not set`
### File

Reads the secret from a file path. Ideal for Docker secrets and Kubernetes secret volumes.
```yaml
access_key:
  file: /run/secrets/openai_api_key
```

Behavior: Reads the entire file content and trims leading/trailing whitespace. Fails if the file does not exist or is not readable.
Error: `Secret resolution failed: cannot read file '/run/secrets/openai_api_key'`
Common patterns:
```yaml
# Docker secret
access_key:
  file: /run/secrets/api_key
```
```yaml
# Kubernetes secret volume
access_key:
  file: /etc/llmsoup/secrets/api-key
```

### Command
Executes a shell command and captures its stdout as the secret value. Useful for integration with secret managers like AWS Secrets Manager or 1Password CLI.
```yaml
access_key:
  command: "aws secretsmanager get-secret-value --secret-id openai-key --query SecretString --output text"
```

Behavior: Runs the command via the system shell, captures stdout, and trims whitespace. The command must exit with code 0.
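As another sketch, a 1Password CLI integration might look like this (the `op://` vault and item path are illustrative):

```yaml
access_key:
  command: "op read op://Infrastructure/openai/api-key"
```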
Error (when disabled): `Secret resolution failed: command-based secrets are disabled. Set LLMSOUP_ALLOW_COMMAND_SECRETS=1 to enable.`
Error (when command fails): `Secret resolution failed: command exited with non-zero status`
### Vault (planned)
Vault-backed secret resolution is planned but not yet implemented; configuring it currently fails at load time.

```yaml
access_key:
  vault: "secret/data/openai/api-key"
```

Error: `Secret resolution failed: vault secret resolution is not yet implemented`
### Resolution priority
When multiple methods are specified in the same secret reference, they are resolved in a fixed order: `env` → `file` → `vault` → `command`. The first successful resolution wins.
```yaml
# Tries env first, falls back to file
access_key:
  env: OPENAI_API_KEY
  file: /run/secrets/openai_api_key
```

## Context overflow strategies
When a prompt exceeds the model’s `context_window` (defined in model metadata), llmsoup applies a context overflow strategy to fit the conversation within the limit. The strategy is set globally in `defaults`.
```yaml
defaults:
  context_overflow: truncate_middle
```

### truncate_middle (default)
Keeps the leading system and developer messages (the “protected prefix”) and the most recent messages. Drops messages from the middle of the conversation.
Before (6 messages, context exceeded):
```
[system] You are a helpful assistant.
[user] What is Rust?                       ← DROPPED
[assistant] Rust is a systems language...  ← DROPPED
[user] How about Go?                       ← DROPPED
[assistant] Go is a compiled language...   ← kept
[user] Compare their async models.         ← kept
```

After:
```
[system] You are a helpful assistant.
[assistant] Go is a compiled language...
[user] Compare their async models.
```

Best for: General-purpose use. Preserves the system/developer instructions and the most recent turns, which are usually the most relevant.
### rolling_window
Drops the oldest non-system messages first, keeping the most recent conversation.
Before (6 messages, context exceeded):
```
[system] You are a helpful assistant.
[user] What is Rust?                       ← DROPPED
[assistant] Rust is a systems language...  ← DROPPED
[user] How about Go?                       ← DROPPED
[assistant] Go is a compiled language...   ← kept
[user] Compare their async models.         ← kept
```

After:
```
[system] You are a helpful assistant.
[assistant] Go is a compiled language...
[user] Compare their async models.
```

Best for: Long-running conversations where only the recent context matters. Similar to a sliding window over the chat history.
### stop_at_limit
Returns an error immediately without sending the request. No messages are dropped.
Behavior: Returns an error response in OpenAI format:
{ "error": { "message": "Prompt exceeds model context window", "type": "invalid_request_error", "code": "context_length_exceeded" }}Best for: Applications that need to know when context is exceeded so they can handle it themselves (e.g., summarize the conversation before retrying).
## Advanced model metadata
Model metadata fields influence routing decisions beyond simple name matching.
```yaml
models:
  - name: gpt-5.2
    metadata:
      context_window: 128000
      parameter_count: 520000
      latency_seconds: 1.2
```

| Field | Type | Used by |
|---|---|---|
| `context_window` | integer | Context overflow strategies, prompt validation. |
| `parameter_count` | integer | Confidence algorithm’s `escalation_order: size` — models are escalated in order of increasing parameter count. |
| `latency_seconds` | float | Initial latency estimates before real TPOT data is collected. |
### How `parameter_count` affects routing
When using the confidence algorithm with `escalation_order: size`, models in the `model_refs` list are sorted by `parameter_count` (ascending). If the first model’s response falls below the confidence threshold, the request escalates to the next larger model:
```yaml
action:
  algorithm:
    type: confidence
    confidence:
      threshold: 0.8
      escalation_order: size
  model_refs:
    - model: gpt-5-mini  # parameter_count: 120000 → tried first
    - model: gpt-5.2     # parameter_count: 520000 → escalation target
```

## Algorithm error handling
Both the confidence and ratings algorithms support an `on_error` field that controls behavior when a model call fails during algorithm evaluation.
```yaml
algorithm:
  type: confidence
  confidence:
    threshold: 0.8
  on_error: skip
```

| Value | Default | Behavior |
|---|---|---|
| `skip` | Yes | Skip the failed model and try the next one in the list. |
| `fail` | — | Return an error immediately without trying other models. |
When `on_error` is `skip` (the default) and a model returns an error or times out, the algorithm moves to the next model in the evaluation order instead of failing the entire request. When set to `fail`, the first model error aborts the algorithm and returns the error to the caller.
## Confidence method options
The `confidence_method` field controls how the confidence algorithm calculates confidence from model responses:
| Method | Description |
|---|---|
| `margin` | Uses the difference between the top token probability and the second-highest. Larger margins indicate higher confidence. |
| `avg_logprob` | Uses the average log probability across all output tokens. Higher average means more confident. |
| `hybrid` | Combines both methods using configurable weights. |
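For instance, under the `margin` method, a top-token probability of 0.70 against a runner-up of 0.20 yields a margin of 0.50, a fairly decisive prediction, while a margin near zero means the model was almost torn between candidates (the numbers here are illustrative).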
A hybrid configuration combining both methods:

```yaml
algorithm:
  type: confidence
  confidence:
    threshold: 0.75
    confidence_method: hybrid
    hybrid_weights:
      logprob_weight: 0.6
      margin_weight: 0.4
```

## Cost-quality tradeoff in algorithms
Both algorithm types support a per-rule `cost_quality_tradeoff` that overrides the global `defaults.cost_quality_tradeoff`:
```yaml
rules:
  - name: budget-route
    action:
      algorithm:
        type: confidence
        confidence:
          threshold: 0.7
        cost_quality_tradeoff: 0.8   # strongly prefer cheaper models
  - name: quality-route
    action:
      algorithm:
        type: ratings
        ratings:
          policy: highest
        cost_quality_tradeoff: 0.1   # strongly prefer quality
```

The value ranges from 0.0 (pure quality) to 1.0 (pure cost savings). This requires `cost_aware_routing: true` in `defaults` and pricing configured on models.
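As a minimal sketch of that prerequisite, the global settings might look like this (per-model pricing fields are omitted; see the Configuration Reference for that schema):

```yaml
defaults:
  cost_aware_routing: true       # required for cost_quality_tradeoff to apply
  cost_quality_tradeoff: 0.5     # global default; the rules above override it
```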