Advanced Configuration

This page covers advanced configuration topics that go beyond the basics in the Configuration Reference. Each section dives deeper into tuning, behavior details, and practical examples.

A model can have multiple endpoints for redundancy and load distribution. When more than one endpoint is configured, llmsoup distributes requests across them based on their weight values.

Each endpoint’s weight determines its share of traffic. The probability of an endpoint being selected is weight / total_weights.

models:
  - name: gpt-5-mini
    provider: openai
    access_key:
      env: OPENAI_API_KEY
    endpoints:
      - url: https://primary.openai.example.com/v1/chat/completions
        weight: 70
        timeout_ms: 5000
        description: "Primary region (us-east)"
      - url: https://secondary.openai.example.com/v1/chat/completions
        weight: 30
        timeout_ms: 8000
        description: "Secondary region (eu-west)"

In this example, roughly 70% of requests go to the primary endpoint and 30% to the secondary. If weight is omitted, it defaults to 1 — so two unweighted endpoints split traffic 50/50.
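The weight-proportional selection described above can be sketched in a few lines of Python. This is an illustrative helper, not llmsoup's internal code; the endpoint dicts mirror the config fields.

```python
import random

def pick_endpoint(endpoints, rng=random):
    """Pick an endpoint with probability weight / total_weights.

    A missing 'weight' defaults to 1, matching the documented behavior.
    """
    weights = [ep.get("weight", 1) for ep in endpoints]
    return rng.choices(endpoints, weights=weights, k=1)[0]

endpoints = [
    {"url": "https://primary.example/v1", "weight": 70},
    {"url": "https://secondary.example/v1", "weight": 30},
]
# Over many draws, roughly 70% of picks land on the primary endpoint.
```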

| Field | Type | Default | Description |
|---|---|---|---|
| url | string | (required) | Full URL to the chat completions endpoint. |
| weight | integer (>0) | 1 | Load balancing weight. Higher values receive more traffic. |
| timeout_ms | integer | global default | Per-endpoint timeout override in milliseconds. |
| description | string | (none) | Human-readable label (appears in logs and metrics). |
Common use cases:

  • Geographic redundancy — Route to the nearest region with higher weight, fail over to others.
  • Provider diversity — Split traffic between OpenAI and Azure OpenAI for the same model.
  • Rate limit management — Distribute requests across multiple API keys or accounts.

Reasoning configuration controls how llmsoup tells upstream models to use chain-of-thought reasoning. This is configured per-model with reasoning_family and overridden per-rule with model_refs.

The reasoning_family field on a model definition tells llmsoup how to pass reasoning parameters to that model’s API. It accepts two forms:

String shorthand — for providers that use a standard parameter name:

models:
  - name: gpt-5.2
    reasoning_family: reasoning_effort

This tells llmsoup the model supports a reasoning_effort parameter directly in the API request.

Object form — for providers with custom parameter names:

models:
  - name: deepseek-r1
    reasoning_family:
      type: chat_template_kwargs
      parameter: thinking_mode

This tells llmsoup to pass reasoning controls via the thinking_mode parameter in chat_template_kwargs.

Within a routing rule, model_refs can override reasoning behavior per-model:

rules:
  - name: deep-analysis
    priority: 100
    conditions:
      - signal: keyword.complex_query
    action:
      strategy: default
      primary_model: gpt-5.2
      model_refs:
        - model: gpt-5.2
          use_reasoning: true
          reasoning_effort: high
        - model: gpt-5-mini
          use_reasoning: false
| Field | Type | Description |
|---|---|---|
| model | string | Model name (must reference a configured model). |
| use_reasoning | boolean | Enable or disable reasoning for this rule. |
| reasoning_effort | string | Effort level: low, medium, or high. |

When use_reasoning is true, llmsoup mutates the outbound request to include the reasoning parameter appropriate for the model’s reasoning_family. When false, reasoning parameters are stripped even if the model supports them.
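The mutation described above can be sketched as a pure function over the outbound request body. This is a sketch under assumptions: the function name and the exact value written into the custom parameter are illustrative, not llmsoup internals.

```python
def apply_reasoning(request: dict, reasoning_family, use_reasoning: bool,
                    effort: str = "medium") -> dict:
    """Sketch of per-model reasoning mutation.

    reasoning_family is either the string shorthand (a top-level request
    parameter name) or the object form with 'type' and 'parameter'.
    """
    req = dict(request)
    if not use_reasoning:
        # Strip reasoning parameters even if the model supports them.
        req.pop("reasoning_effort", None)
        req.pop("chat_template_kwargs", None)
        return req
    if isinstance(reasoning_family, str):
        req[reasoning_family] = effort  # e.g. reasoning_effort: "high"
    elif reasoning_family.get("type") == "chat_template_kwargs":
        kwargs = dict(req.get("chat_template_kwargs", {}))
        kwargs[reasoning_family["parameter"]] = True  # value shape is an assumption
        req["chat_template_kwargs"] = kwargs
    return req
```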


Plugins add pre-processing and post-processing to routing rules. They are configured in the plugins array on each rule and execute in order.

rules:
  - name: secure-route
    priority: 100
    conditions:
      - signal: keyword.sensitive
    action:
      strategy: default
      primary_model: gpt-5.2
    plugins:
      - type: jailbreak
        configuration:
          enabled: true
          threshold: 0.8
      - type: system_prompt
        configuration:
          system_prompt: "Answer carefully and factually."

Every plugin has a type field and a configuration block. The available plugin types are:

| Plugin type | Purpose |
|---|---|
| system_prompt | Inject or replace system prompts |
| semantic-cache | Cache responses by semantic similarity |
| jailbreak | Detect prompt injection attempts |
| pii | Detect personally identifiable information |
| header_mutation | Modify HTTP headers on requests/responses |
| hallucination | Flag potential hallucinations in responses |
| router_replay | Record routing decisions for debugging |

For complete per-plugin configuration details, field descriptions, and examples, see the Plugins Reference.
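The ordered pre-/post-processing model can be pictured as a simple chain. The class and method names here are illustrative assumptions, not llmsoup's plugin API.

```python
class Plugin:
    def pre(self, request):    # runs before the model call
        return request
    def post(self, response):  # runs after the model call
        return response

class SystemPrompt(Plugin):
    """Toy stand-in for the system_prompt plugin type."""
    def __init__(self, text):
        self.text = text
    def pre(self, request):
        return {**request, "system": self.text}

def run_chain(plugins, request, call_model):
    for p in plugins:          # pre-processing, in configured order
        request = p.pre(request)
    response = call_model(request)
    for p in plugins:          # post-processing
        response = p.post(response)
    return response
```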


llmsoup uses three independent caching layers, each with different eviction strategies and tuning knobs.

Caches full model responses keyed by request content. When a cache hit occurs, the response is returned immediately without calling the upstream model.

defaults:
  model_cache_ttl_seconds: 3600
  model_cache_max_capacity: 1000
| Setting | Default | Description |
|---|---|---|
| model_cache_ttl_seconds | 3600 | Time-to-live in seconds. Entries expire after this duration regardless of access. |
| model_cache_max_capacity | 1000 | Maximum number of cached model response entries. Oldest entries are evicted when capacity is reached (LRU). |

Eviction behavior: Entries are evicted after the TTL expires or when the cache reaches model_cache_max_capacity entries (whichever comes first). The capacity limit uses LRU eviction — the least recently used entry is removed to make room for new ones. This prevents unbounded memory growth under heavy traffic.

When to tune:

  • Lower TTL (e.g., 60–300) for rapidly changing data or when freshness matters.
  • Higher TTL (e.g., 7200+) for stable queries where the same prompt always has the same answer.
  • Set to 0 to effectively disable model response caching.
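The documented eviction rules (TTL expiry plus LRU at capacity) can be sketched with an ordered dict. This is a minimal illustration, not llmsoup's implementation; the injectable clock exists only to make the behavior easy to demonstrate.

```python
import time
from collections import OrderedDict

class ResponseCache:
    """TTL + LRU cache matching the documented eviction rules (sketch)."""
    def __init__(self, ttl_seconds=3600, max_capacity=1000, clock=time.monotonic):
        self.ttl, self.cap, self.clock = ttl_seconds, max_capacity, clock
        self._data = OrderedDict()          # key -> (expires_at, value)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self.clock() >= expires_at:      # TTL expiry, regardless of access
            del self._data[key]
            return None
        self._data.move_to_end(key)         # mark as most recently used
        return value

    def put(self, key, value):
        if len(self._data) >= self.cap:
            self._data.popitem(last=False)  # evict least recently used
        self._data[key] = (self.clock() + self.ttl, value)
```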

Caches computed embedding vectors to avoid re-running BERT inference for repeated text. This primarily benefits the embedding signal evaluator.

defaults:
  embedding_cache_capacity: 1000
| Setting | Default | Description |
|---|---|---|
| embedding_cache_capacity | 1000 | Maximum number of cached embedding entries. Oldest entries are evicted when capacity is reached. |

Eviction behavior: Least Recently Used (LRU). When the cache is full, the entry that hasn’t been accessed for the longest time is evicted to make room.

When to tune:

  • Increase if your workload has many unique prompts and you see high embedding computation times in metrics.
  • Decrease if memory is constrained — each embedding entry holds a vector of floats.
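To reason about the memory tradeoff, a rough lower bound is capacity × vector dimensions × bytes per float. The helper below is illustrative; the 384-dimension figure is the known output size of all-MiniLM-L12-v2, and the estimate ignores keys and per-entry overhead.

```python
def embedding_cache_bytes(capacity, dims, bytes_per_float=4):
    """Rough lower bound on embedding cache memory (vectors only)."""
    return capacity * dims * bytes_per_float

# all-MiniLM-L12-v2 produces 384-dimensional vectors, so the default
# capacity of 1000 holds roughly 1.5 MB of raw vector data.
```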

Tracks per-model Time Per Output Token (TPOT) using Exponential Moving Average smoothing. This is used internally by the latency signal evaluator — it is not directly configurable via YAML.

How it works: After each model response, llmsoup computes the actual TPOT and updates the smoothed average:

smoothed_tpot = alpha × new_tpot + (1 - alpha) × previous_smoothed_tpot

The default EMA alpha is 0.3, which means:

  • 30% weight to the most recent observation
  • 70% weight to historical average

This smoothing prevents a single slow response from drastically changing the model’s latency estimate. The latency signal evaluator compares the smoothed TPOT against the max_tpot threshold configured on each latency signal.
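The update formula above is a one-liner in code; the worked numbers below follow directly from it.

```python
def update_tpot(previous, new, alpha=0.3):
    """EMA update: alpha * new + (1 - alpha) * previous."""
    return alpha * new + (1 - alpha) * previous

# With the default alpha of 0.3, a single slow response moves the
# estimate only partway: previous 50 ms/token, one 80 ms/token
# observation -> 0.3 * 80 + 0.7 * 50 = 59 ms/token.
```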

When multiple embedding signals reference the same model (e.g., sentence-transformers/all-MiniLM-L12-v2), llmsoup automatically shares a single loaded model instance across all signals. This is transparent — no configuration is needed.

Impact: Each BERT embedding model uses ~130MB of RAM. Without sharing, six signals using the same model would consume ~780MB. With sharing, they use ~130MB total.

How it works: During startup, llmsoup groups embedding signals by their resolved model path. Signals that share a path receive a shared handle to a single model instance. Each signal still maintains its own reference text, candidates, and cache — only the underlying BERT model weights are shared.
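The startup grouping step amounts to bucketing signals by resolved model path. A minimal sketch, with hypothetical signal dicts standing in for llmsoup's internal types:

```python
def group_signals_by_model(signals):
    """Group embedding signals by resolved model path (sketch).

    Each group would share one loaded model instance; signals keep
    their own reference text, candidates, and cache.
    """
    groups = {}
    for sig in signals:
        groups.setdefault(sig["model_path"], []).append(sig["name"])
    return groups

signals = [
    {"name": "topic", "model_path": "sentence-transformers/all-MiniLM-L12-v2"},
    {"name": "intent", "model_path": "sentence-transformers/all-MiniLM-L12-v2"},
]
# Both signals resolve to the same path, so only one ~130MB model loads.
```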


llmsoup resolves secrets (API keys, authentication tokens) from external sources at startup. Secrets are never stored in the configuration file itself. The resolution methods are tried in the order they appear in the configuration.

The simplest and most common method. Reads the secret from an environment variable.

access_key:
  env: OPENAI_API_KEY

Behavior: Looks up the environment variable at config load time. Fails validation if the variable is not set or is empty.

Error: Secret resolution failed: environment variable 'OPENAI_API_KEY' not set

Reads the secret from a file path. Ideal for Docker secrets and Kubernetes secret volumes.

access_key:
  file: /run/secrets/openai_api_key

Behavior: Reads the entire file content and trims leading/trailing whitespace. Fails if the file does not exist or is not readable.

Error: Secret resolution failed: cannot read file '/run/secrets/openai_api_key'

Common patterns:

# Docker secret
access_key:
  file: /run/secrets/api_key

# Kubernetes secret volume
access_key:
  file: /etc/llmsoup/secrets/api-key

Executes a shell command and captures its stdout as the secret value. Useful for integration with secret managers like AWS Secrets Manager or 1Password CLI.

access_key:
  command: "aws secretsmanager get-secret-value --secret-id openai-key --query SecretString --output text"

Behavior: Runs the command via the system shell, captures stdout, and trims whitespace. The command must exit with code 0.

Error (when disabled): Secret resolution failed: command-based secrets are disabled. Set LLMSOUP_ALLOW_COMMAND_SECRETS=1 to enable.

Error (when command fails): Secret resolution failed: command exited with non-zero status

Vault-based resolution is declared with the vault key, but it is not yet implemented and currently always fails:

access_key:
  vault: "secret/data/openai/api-key"

Error: Secret resolution failed: vault secret resolution is not yet implemented

When multiple methods are specified in the same secret reference, they are resolved in order: env → file → vault → command. The first successful resolution wins.

# Tries env first, falls back to file
access_key:
  env: OPENAI_API_KEY
  file: /run/secrets/openai_api_key
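The first-success-wins fallback can be sketched as below. This is an illustrative reimplementation covering only the env and file methods; the function name is hypothetical.

```python
import os
from pathlib import Path

def resolve_secret(spec: dict) -> str:
    """Resolve env -> file in order; first success wins (sketch).

    Mirrors the documented behavior: env fails on unset or empty
    variables, and file content is trimmed of surrounding whitespace.
    Vault and command resolution are omitted here.
    """
    if "env" in spec:
        value = os.environ.get(spec["env"], "")
        if value:
            return value
    if "file" in spec:
        path = Path(spec["file"])
        if path.is_file():
            return path.read_text().strip()
    raise RuntimeError("Secret resolution failed")
```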

When a prompt exceeds the model’s context_window (defined in model metadata), llmsoup applies a context overflow strategy to fit the conversation within the limit. The strategy is set globally in defaults.

defaults:
  context_overflow: truncate_middle

The truncate_middle strategy keeps the leading system and developer messages (the “protected prefix”) and the most recent messages, dropping messages from the middle of the conversation.

Before (6 messages, context exceeded):

[system] You are a helpful assistant.
[user] What is Rust? ← DROPPED
[assistant] Rust is a systems language... ← DROPPED
[user] How about Go? ← DROPPED
[assistant] Go is a compiled language... ← kept
[user] Compare their async models. ← kept

After:

[system] You are a helpful assistant.
[assistant] Go is a compiled language...
[user] Compare their async models.

Best for: General-purpose use. Preserves the system/developer instructions and the most recent turns, which are usually the most relevant.
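The transformation shown in the before/after example can be sketched as below. The `fits` callback stands in for the real token-count check against the model's context_window; the function name and message shape are assumptions for illustration.

```python
def truncate_middle(messages, fits):
    """Drop messages from the middle until fits(msgs) is true (sketch)."""
    prefix = []
    rest = list(messages)
    # Protect the leading run of system/developer messages.
    while rest and rest[0]["role"] in ("system", "developer"):
        prefix.append(rest.pop(0))
    # Drop from the front of the unprotected middle, keeping recency.
    while rest and not fits(prefix + rest):
        rest.pop(0)
    return prefix + rest
```

Applied to the six-message example above with a limit of three messages, this yields exactly the "After" conversation: the system message plus the two most recent turns.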

Drops the oldest non-system messages first, keeping the most recent conversation.

Before (6 messages, context exceeded):

[system] You are a helpful assistant.
[user] What is Rust? ← DROPPED
[assistant] Rust is a systems language... ← DROPPED
[user] How about Go? ← DROPPED
[assistant] Go is a compiled language... ← kept
[user] Compare their async models. ← kept

After:

[system] You are a helpful assistant.
[assistant] Go is a compiled language...
[user] Compare their async models.

Best for: Long-running conversations where only the recent context matters. Similar to a sliding window over the chat history.

Returns an error immediately without sending the request. No messages are dropped.

Behavior: Returns an error response in OpenAI format:

{
  "error": {
    "message": "Prompt exceeds model context window",
    "type": "invalid_request_error",
    "code": "context_length_exceeded"
  }
}

Best for: Applications that need to know when context is exceeded so they can handle it themselves (e.g., summarize the conversation before retrying).


Model metadata fields influence routing decisions beyond simple name matching.

models:
  - name: gpt-5.2
    metadata:
      context_window: 128000
      parameter_count: 520000
      latency_seconds: 1.2
| Field | Type | Used by |
|---|---|---|
| context_window | integer | Context overflow strategies, prompt validation. |
| parameter_count | integer | Confidence algorithm's escalation_order: size; models are escalated in order of increasing parameter count. |
| latency_seconds | float | Initial latency estimates before real TPOT data is collected. |

When using the confidence algorithm with escalation_order: size, models in the model_refs list are sorted by parameter_count (ascending). If the first model’s response falls below the confidence threshold, the request escalates to the next larger model:

action:
  algorithm:
    type: confidence
    confidence:
      threshold: 0.8
      escalation_order: size
  model_refs:
    - model: gpt-5-mini   # parameter_count: 120000 → tried first
    - model: gpt-5.2      # parameter_count: 520000 → escalation target
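The escalation loop, including the on_error behavior described below, can be sketched as follows. The function signature and the `(response, confidence)` return shape of `call` are illustrative assumptions.

```python
def escalate_by_size(model_refs, models_meta, call, threshold, on_error="skip"):
    """Try models smallest-first, escalating while confidence < threshold.

    Sketch: 'call' returns (response, confidence) or raises on failure.
    """
    ordered = sorted(model_refs,
                     key=lambda ref: models_meta[ref]["parameter_count"])
    last = None
    for name in ordered:
        try:
            response, confidence = call(name)
        except Exception:
            if on_error == "fail":
                raise          # abort on the first model error
            continue           # skip: move on to the next model
        last = response
        if confidence >= threshold:
            return response    # confident enough, stop escalating
    return last                # fall back to the largest model's answer
```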

Both the confidence and ratings algorithms support an on_error field that controls behavior when a model call fails during algorithm evaluation.

algorithm:
  type: confidence
  confidence:
    threshold: 0.8
    on_error: skip
| Value | Default | Behavior |
|---|---|---|
| skip | Yes | Skip the failed model and try the next one in the list. |
| fail | No | Return an error immediately without trying other models. |

When on_error is skip (the default) and a model returns an error or times out, the algorithm moves to the next model in the evaluation order instead of failing the entire request. When set to fail, the first model error aborts the algorithm and returns the error to the caller.

The confidence_method field controls how the confidence algorithm calculates confidence from model responses:

| Method | Description |
|---|---|
| margin | Uses the difference between the top token probability and the second-highest. Larger margins indicate higher confidence. |
| avg_logprob | Uses the average log probability across all output tokens. Higher average means more confident. |
| hybrid | Combines both methods using configurable weights. |
algorithm:
  type: confidence
  confidence:
    threshold: 0.75
    confidence_method: hybrid
    hybrid_weights:
      logprob_weight: 0.6
      margin_weight: 0.4
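The three methods can be sketched numerically. Note the exp mapping used to bring the average log probability into (0, 1] is an illustrative normalization choice, not llmsoup's documented formula.

```python
import math

def margin_confidence(top_probs):
    """Gap between the top two token probabilities (the margin method)."""
    ranked = sorted(top_probs, reverse=True)
    return ranked[0] - ranked[1]

def avg_logprob_confidence(logprobs):
    """exp of the mean logprob, mapped into (0, 1] for weighting.

    The exp mapping is an assumption made for this sketch.
    """
    return math.exp(sum(logprobs) / len(logprobs))

def hybrid_confidence(top_probs, logprobs,
                      logprob_weight=0.6, margin_weight=0.4):
    """Weighted combination matching the hybrid_weights config above."""
    return (logprob_weight * avg_logprob_confidence(logprobs)
            + margin_weight * margin_confidence(top_probs))
```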

Both algorithm types support a per-rule cost_quality_tradeoff that overrides the global defaults.cost_quality_tradeoff:

rules:
  - name: budget-route
    action:
      algorithm:
        type: confidence
        confidence:
          threshold: 0.7
        cost_quality_tradeoff: 0.8   # strongly prefer cheaper models
  - name: quality-route
    action:
      algorithm:
        type: ratings
        ratings:
          policy: highest
        cost_quality_tradeoff: 0.1   # strongly prefer quality

The value ranges from 0.0 (pure quality) to 1.0 (pure cost savings). This requires cost_aware_routing: true in defaults and pricing configured on models.
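One way to picture the tradeoff is as a linear blend of a quality score and a cost-savings score. The scoring formula below is an assumption for illustration, not llmsoup's documented internals; only the 0.0/1.0 endpoints come from the text above.

```python
def blended_score(quality, cost_savings, tradeoff):
    """0.0 = pure quality, 1.0 = pure cost savings (sketch).

    Both inputs are assumed normalized to [0, 1].
    """
    return (1 - tradeoff) * quality + tradeoff * cost_savings

# At tradeoff 0.8 a cheap, weaker model can outscore a strong, pricey one;
# at tradeoff 0.1 the ranking flips back toward quality.
cheap  = blended_score(quality=0.6, cost_savings=0.9, tradeoff=0.8)
strong = blended_score(quality=0.9, cost_savings=0.2, tradeoff=0.8)
```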