# Advanced Configuration
This page covers advanced configuration topics that go beyond the basics in the Configuration Reference. Each section dives deeper into tuning, behavior details, and practical examples.
## Multi-endpoint load balancing
A model can have multiple endpoints for redundancy and load distribution. When more than one endpoint is configured, llmsoup distributes requests across them based on their `weight` values.
### Weight-based distribution
Each endpoint’s weight determines its share of traffic. The probability of an endpoint being selected is `weight / total_weights`.
```yaml
models:
  - name: gpt-5-mini
    provider: openai
    access_key:
      env: OPENAI_API_KEY
    endpoints:
      - url: https://primary.openai.example.com/v1/chat/completions
        weight: 70
        timeout_ms: 5000
        description: "Primary region (us-east)"
      - url: https://secondary.openai.example.com/v1/chat/completions
        weight: 30
        timeout_ms: 8000
        description: "Secondary region (eu-west)"
```

In this example, roughly 70% of requests go to the primary endpoint and 30% to the secondary. If `weight` is omitted, it defaults to 1 — so two unweighted endpoints split traffic 50/50.
### Endpoint fields
| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string | (required) | Full URL to the chat completions endpoint. |
| `weight` | integer (>0) | 1 | Load balancing weight. Higher values receive more traffic. |
| `timeout_ms` | integer | global default | Per-endpoint timeout override in milliseconds. |
| `description` | string | — | Human-readable label (appears in logs and metrics). |
### When to use multiple endpoints
- Geographic redundancy — Route to the nearest region with higher weight, fail over to others.
- Provider diversity — Split traffic between OpenAI and Azure OpenAI for the same model (see the sketch after this list).
- Rate limit management — Distribute requests across multiple API keys or accounts.
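For example, the provider-diversity pattern might look like the following sketch, assuming both endpoints expose the same chat-completions API under a single model entry (the URLs and weights are illustrative, not prescribed values):

```yaml
models:
  - name: gpt-5.2
    provider: openai
    access_key:
      env: OPENAI_API_KEY
    endpoints:
      - url: https://api.openai.example.com/v1/chat/completions
        weight: 60
        description: "OpenAI direct"
      - url: https://myorg.azure.example.com/openai/v1/chat/completions
        weight: 40
        description: "Azure OpenAI"
```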
## Reasoning configuration
Reasoning configuration controls how llmsoup tells upstream models to use chain-of-thought reasoning. This is configured per-model with `reasoning_family` and overridden per-rule with `model_refs`.
### `reasoning_family`
The `reasoning_family` field on a model definition tells llmsoup how to pass reasoning parameters to that model’s API. It accepts two forms:
String shorthand — for providers that use a standard parameter name:
```yaml
models:
  - name: gpt-5.2
    reasoning_family: reasoning_effort
```

This tells llmsoup the model supports a `reasoning_effort` parameter directly in the API request.
Object form — for providers with custom parameter names:
```yaml
models:
  - name: deepseek-r1
    reasoning_family:
      type: chat_template_kwargs
      parameter: thinking_mode
```

This tells llmsoup to pass reasoning controls via the `thinking_mode` parameter in `chat_template_kwargs`.
### Model reference overrides
Within a routing rule, `model_refs` can override reasoning behavior per-model:
```yaml
rules:
  - name: deep-analysis
    priority: 100
    conditions:
      - signal: keyword.complex_query
    action:
      strategy: default
      primary_model: gpt-5.2
      model_refs:
        - model: gpt-5.2
          use_reasoning: true
          reasoning_effort: high
        - model: gpt-5-mini
          use_reasoning: false
```

| Field | Type | Description |
|---|---|---|
| `model` | string | Model name (must reference a configured model). |
| `use_reasoning` | boolean | Enable or disable reasoning for this rule. |
| `reasoning_effort` | string | Effort level: `low`, `medium`, or `high`. |
When `use_reasoning` is true, llmsoup mutates the outbound request to include the reasoning parameter appropriate for the model’s `reasoning_family`. When false, reasoning parameters are stripped even if the model supports them.
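As a concrete illustration, combining the string-shorthand `reasoning_family` above with the deep-analysis rule, the outbound request body would plausibly carry the effort level inline (an illustrative sketch; the exact parameter placement depends on the provider’s API):

```json
{
  "model": "gpt-5.2",
  "messages": [
    { "role": "user", "content": "Analyze the tradeoffs in..." }
  ],
  "reasoning_effort": "high"
}
```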
## Plugins overview
Plugins add pre-processing and post-processing to routing rules. They are configured in the `plugins` array on each rule and execute in order.
```yaml
rules:
  - name: secure-route
    priority: 100
    conditions:
      - signal: keyword.sensitive
    action:
      strategy: default
      primary_model: gpt-5.2
    plugins:
      - type: jailbreak
        configuration:
          enabled: true
          threshold: 0.8
      - type: system_prompt
        configuration:
          system_prompt: "Answer carefully and factually."
```

Every plugin has a `type` field and a `configuration` block. The available plugin types are:
| Plugin type | Purpose |
|---|---|
| `system_prompt` | Inject or replace system prompts |
| `semantic-cache` | Cache responses by semantic similarity |
| `jailbreak` | Detect prompt injection attempts |
| `pii` | Detect personally identifiable information |
| `header_mutation` | Modify HTTP headers on requests/responses |
| `hallucination` | Flag potential hallucinations in responses |
| `router_replay` | Record routing decisions for debugging |
For complete per-plugin configuration details, field descriptions, and examples, see the Plugins Reference.
## Cache tuning
llmsoup uses three independent caching layers, each with different eviction strategies and tuning knobs.
### Model response cache (TTL)
Caches full model responses keyed by request content. When a cache hit occurs, the response is returned immediately without calling the upstream model.
```yaml
defaults:
  model_cache_ttl_seconds: 3600
  model_cache_max_capacity: 1000
```

| Setting | Default | Description |
|---|---|---|
| `model_cache_ttl_seconds` | 3600 | Time-to-live in seconds. Entries expire after this duration regardless of access. |
| `model_cache_max_capacity` | 1000 | Maximum number of cached model response entries. Oldest entries are evicted when capacity is reached (LRU). |
Eviction behavior: Entries are evicted after the TTL expires or when the cache reaches `model_cache_max_capacity` entries (whichever comes first). The capacity limit uses LRU eviction — the least recently used entry is removed to make room for new ones. This prevents unbounded memory growth under heavy traffic.
When to tune (see the example after this list):
- Lower TTL (e.g., 60–300) for rapidly changing data or when freshness matters.
- Higher TTL (e.g., 7200+) for stable queries where the same prompt always has the same answer.
- Set to 0 to effectively disable model response caching.
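For example, a freshness-sensitive deployment might shorten the TTL and tighten the capacity bound (the values here are illustrative, not recommendations):

```yaml
defaults:
  model_cache_ttl_seconds: 120    # expire responses after 2 minutes
  model_cache_max_capacity: 500   # smaller LRU bound for tighter memory
```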
### Embedding cache (LRU)
Caches computed embedding vectors to avoid re-running BERT inference for repeated text. This primarily benefits the embedding signal evaluator.
```yaml
defaults:
  embedding_cache_capacity: 1000
```

| Setting | Default | Description |
|---|---|---|
| `embedding_cache_capacity` | 1000 | Maximum number of cached embedding entries. Oldest entries are evicted when capacity is reached. |
Eviction behavior: Least Recently Used (LRU). When the cache is full, the entry that hasn’t been accessed for the longest time is evicted to make room.
When to tune:
- Increase if your workload has many unique prompts and you see high embedding computation times in metrics.
- Decrease if memory is constrained — each embedding entry holds a vector of floats (see the sizing note below).
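As a rough sizing note: `sentence-transformers/all-MiniLM-L12-v2` (the model used in the sharing example below) emits 384-dimensional vectors, so at 4 bytes per float32 each entry holds about 384 × 4 ≈ 1.5 KB of vector data plus key overhead, meaning the default 1000-entry cache stays in the low single-digit megabytes.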
### TPOT cache (EMA)
Tracks per-model Time Per Output Token (TPOT) using Exponential Moving Average smoothing. This is used internally by the latency signal evaluator — it is not directly configurable via YAML.
How it works: After each model response, llmsoup computes the actual TPOT and updates the smoothed average:
```
smoothed_tpot = alpha × new_tpot + (1 - alpha) × previous_smoothed_tpot
```

The default EMA alpha is 0.3, which means:
- 30% weight to the most recent observation
- 70% weight to historical average
This smoothing prevents a single slow response from drastically changing the model’s latency estimate. The latency signal evaluator compares the smoothed TPOT against the max_tpot threshold configured on each latency signal.
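To see the damping in action, suppose the current smoothed TPOT is 50 ms/token and a single slow response comes in at 200 ms/token (numbers chosen for illustration):

```
smoothed_tpot = 0.3 × 200 + 0.7 × 50 = 60 + 35 = 95 ms/token
```

The estimate moves to 95 ms/token rather than jumping to 200, so one outlier cannot immediately push a model past a `max_tpot` threshold.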
## Embedding model sharing
When multiple embedding signals reference the same model (e.g., `sentence-transformers/all-MiniLM-L12-v2`), llmsoup automatically shares a single loaded model instance across all signals. This is transparent — no configuration is needed.
Impact: Each BERT embedding model uses ~130MB of RAM. Without sharing, six signals using the same model would consume ~780MB. With sharing, they use ~130MB total.
How it works: During startup, llmsoup groups embedding signals by their resolved model path. Signals that share a path receive a shared handle to a single model instance. Each signal still maintains its own reference text, candidates, and cache — only the underlying BERT model weights are shared.
## Secret resolution
llmsoup resolves secrets (API keys, authentication tokens) from external sources at startup. Secrets are never stored in the configuration file itself. When a secret reference specifies multiple methods, they are tried in a fixed order (see Resolution priority below).
### Environment variable
The simplest and most common method. Reads the secret from an environment variable.
```yaml
access_key:
  env: OPENAI_API_KEY
```

Behavior: Looks up the environment variable at config load time. Fails validation if the variable is not set or is empty.
Error: `Secret resolution failed: environment variable 'OPENAI_API_KEY' not set`
### File

Reads the secret from a file path. Ideal for Docker secrets and Kubernetes secret volumes.
```yaml
access_key:
  file: /run/secrets/openai_api_key
```

Behavior: Reads the entire file content and trims leading/trailing whitespace. Fails if the file does not exist or is not readable.
Error: `Secret resolution failed: cannot read file '/run/secrets/openai_api_key'`
Common patterns:
```yaml
# Docker secret
access_key:
  file: /run/secrets/api_key
```
```yaml
# Kubernetes secret volume
access_key:
  file: /etc/llmsoup/secrets/api-key
```

### Command
Executes a shell command and captures its stdout as the secret value. Useful for integration with secret managers like AWS Secrets Manager or 1Password CLI.
```yaml
access_key:
  command: "aws secretsmanager get-secret-value --secret-id openai-key --query SecretString --output text"
```

Behavior: Runs the command via the system shell, captures stdout, and trims whitespace. The command must exit with code 0.
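As another sketch, a 1Password CLI integration might look like this (the `op://` vault and item path are illustrative):

```yaml
access_key:
  command: "op read op://Infrastructure/openai/api-key"
```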
Error (when disabled): `Secret resolution failed: command-based secrets are disabled. Set LLMSOUP_ALLOW_COMMAND_SECRETS=1 to enable.`
Error (when command fails): `Secret resolution failed: command exited with non-zero status`
### Vault (planned)
Vault-backed secret resolution is planned but not yet implemented; configuring it currently fails at load time.

```yaml
access_key:
  vault: "secret/data/openai/api-key"
```

Error: `Secret resolution failed: vault secret resolution is not yet implemented`
### Resolution priority
When multiple methods are specified in the same secret reference, they are resolved in a fixed order: `env` → `file` → `vault` → `command`. The first successful resolution wins.
```yaml
# Tries env first, falls back to file
access_key:
  env: OPENAI_API_KEY
  file: /run/secrets/openai_api_key
```

## Context overflow strategies
When a prompt exceeds the model’s `context_window` (defined in model metadata), llmsoup applies a context overflow strategy to fit the conversation within the limit. The strategy is set globally in `defaults`.
```yaml
defaults:
  context_overflow: truncate_middle
```

### truncate_middle (default)
Keeps the leading system and developer messages (the “protected prefix”) and the most recent messages. Drops messages from the middle of the conversation.
Before (6 messages, context exceeded):
```
[system] You are a helpful assistant.
[user] What is Rust?                       ← DROPPED
[assistant] Rust is a systems language...  ← DROPPED
[user] How about Go?                       ← DROPPED
[assistant] Go is a compiled language...   ← kept
[user] Compare their async models.         ← kept
```

After:
```
[system] You are a helpful assistant.
[assistant] Go is a compiled language...
[user] Compare their async models.
```

Best for: General-purpose use. Preserves the system/developer instructions and the most recent turns, which are usually the most relevant.
### rolling_window
Drops the oldest non-system messages first, keeping the most recent conversation.
Before (6 messages, context exceeded):
```
[system] You are a helpful assistant.
[user] What is Rust?                       ← DROPPED
[assistant] Rust is a systems language...  ← DROPPED
[user] How about Go?                       ← DROPPED
[assistant] Go is a compiled language...   ← kept
[user] Compare their async models.         ← kept
```

After:
```
[system] You are a helpful assistant.
[assistant] Go is a compiled language...
[user] Compare their async models.
```

Best for: Long-running conversations where only the recent context matters. Similar to a sliding window over the chat history.
### stop_at_limit
Returns an error immediately without sending the request. No messages are dropped.
Behavior: Returns an error response in OpenAI format:
{ "error": { "message": "Prompt exceeds model context window", "type": "invalid_request_error", "code": "context_length_exceeded" }}Best for: Applications that need to know when context is exceeded so they can handle it themselves (e.g., summarize the conversation before retrying).
## Advanced model metadata
Model metadata fields influence routing decisions beyond simple name matching.
```yaml
models:
  - name: gpt-5.2
    metadata:
      context_window: 128000
      parameter_count: 520000
      latency_seconds: 1.2
```

| Field | Type | Used by |
|---|---|---|
| `context_window` | integer | Context overflow strategies, prompt validation. |
| `parameter_count` | integer | Confidence algorithm’s `escalation_order: size` — models are escalated in order of increasing parameter count. |
| `latency_seconds` | float | Initial latency estimates before real TPOT data is collected. |
### How `parameter_count` affects routing
When using the confidence algorithm with `escalation_order: size`, models in the `model_refs` list are sorted by `parameter_count` (ascending). If the first model’s response falls below the confidence threshold, the request escalates to the next larger model:
```yaml
action:
  algorithm:
    type: confidence
    confidence:
      threshold: 0.8
      escalation_order: size
  model_refs:
    - model: gpt-5-mini  # parameter_count: 120000 → tried first
    - model: gpt-5.2     # parameter_count: 520000 → escalation target
```

## Algorithm error handling
Both the confidence and ratings algorithms support an `on_error` field that controls behavior when a model call fails during algorithm evaluation.
```yaml
algorithm:
  type: confidence
  confidence:
    threshold: 0.8
  on_error: skip
```

| Value | Default | Behavior |
|---|---|---|
| `skip` | Yes | Skip the failed model and try the next one in the list. |
| `fail` | — | Return an error immediately without trying other models. |
When `on_error` is `skip` (the default) and a model returns an error or times out, the algorithm moves to the next model in the evaluation order instead of failing the entire request. When set to `fail`, the first model error aborts the algorithm and returns the error to the caller.
## Confidence method options
The `confidence_method` field controls how the confidence algorithm calculates confidence from model responses:
| Method | Description |
|---|---|
| `margin` | Uses the difference between the top token probability and the second-highest. Larger margins indicate higher confidence. |
| `avg_logprob` | Uses the average log probability across all output tokens. Higher average means more confident. |
| `hybrid` | Combines both methods using configurable weights. |
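For instance, under the `margin` method, a top-token probability of 0.70 against a runner-up of 0.20 yields a margin of 0.50, a fairly decisive prediction, while a margin near zero means the model was almost torn between candidates (the numbers here are illustrative).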
A hybrid configuration combining both methods:

```yaml
algorithm:
  type: confidence
  confidence:
    threshold: 0.75
    confidence_method: hybrid
    hybrid_weights:
      logprob_weight: 0.6
      margin_weight: 0.4
```

## Cost-quality tradeoff in algorithms
Both algorithm types support a per-rule `cost_quality_tradeoff` that overrides the global `defaults.cost_quality_tradeoff`:
```yaml
rules:
  - name: budget-route
    action:
      algorithm:
        type: confidence
        confidence:
          threshold: 0.7
        cost_quality_tradeoff: 0.8   # strongly prefer cheaper models
  - name: quality-route
    action:
      algorithm:
        type: ratings
        ratings:
          policy: highest
        cost_quality_tradeoff: 0.1   # strongly prefer quality
```

The value ranges from 0.0 (pure quality) to 1.0 (pure cost savings). This requires `cost_aware_routing: true` in `defaults` and pricing configured on models.
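As a minimal sketch of that prerequisite, the global settings might look like this (per-model pricing fields are omitted; see the Configuration Reference for that schema):

```yaml
defaults:
  cost_aware_routing: true       # required for cost_quality_tradeoff to apply
  cost_quality_tradeoff: 0.5     # global default; the rules above override it
```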