# Signals & Routing
llmsoup evaluates configurable signals on every incoming request, matches routing rules against the signal results, optionally applies a selection algorithm to pick the best model, and executes the request through the chosen strategy. This page explains the entire pipeline in depth.
For YAML configuration syntax and field-level details, see the Configuration Reference.
## Request flow

Every chat completion request follows this pipeline:
```text
HTTP request
      │
      ▼
Auth middleware ── rejects if token invalid
      │
      ▼
Signal evaluation ── all signals run concurrently
      │
      ▼
Rule matching ── rules evaluated in priority order (highest first)
      │
      ▼
Algorithm (optional) ── confidence or ratings model selection
      │
      ▼
Strategy execution ── default, fallback, or parallel
      │
      ▼
Model HTTP call ── forwarded as OpenAI-compatible request
      │
      ▼
Plugin execution ── pre/post-routing plugins, if configured (see Plugins Reference)
      │
      ▼
Response ── returned in OpenAI format
```

- Signal evaluation — All configured signals evaluate concurrently against the request payload. Each signal produces a score (0.0–1.0), a triggered boolean, and metadata. Signal failures are isolated — routing continues without the failed signal (see Graceful degradation).
- Rule matching — The decision engine evaluates rules in descending priority order. Conditions within a rule use AND, OR, or NOR logic with five condition operators. The first matching rule wins.
- Algorithm selection — If the matched rule specifies an algorithm (confidence or ratings), it selects the optimal model from the rule’s `model_refs` list.
- Strategy execution — The selected model(s) are called according to the execution strategy: single call (default), sequential retry (fallback), or concurrent race (parallel).
- Response — The model response is returned in OpenAI-compatible format.
## Signal types

Signals are the building blocks of routing decisions. Each signal evaluates one aspect of the incoming request and produces a result that conditions can match against.
All signals share these result properties:
| Property | Type | Description |
|---|---|---|
| score | 0.0–1.0 | Confidence or similarity score |
| triggered | boolean | Whether the signal matched |
| metadata | key-value pairs | Signal-specific details for debugging |
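As a mental model, this result shape can be pictured as a small record. The following is an illustrative Python sketch, not llmsoup's actual (Rust) types:

```python
from dataclasses import dataclass, field

@dataclass
class SignalResult:
    """Hypothetical shape of one signal's evaluation result."""
    score: float                # confidence or similarity in 0.0-1.0
    triggered: bool             # whether the signal matched
    metadata: dict = field(default_factory=dict)  # signal-specific debug details

# Example: a keyword signal that fired on the word "urgent"
result = SignalResult(score=1.0, triggered=True,
                      metadata={"matched_keywords": ["urgent"]})
```

Conditions (described below) compare against these three fields: booleans match `triggered`, numeric comparisons use `score`, and string comparisons look at `metadata`.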
### keyword

Performs substring matching against the request payload.
| Field | Type | Default | Description |
|---|---|---|---|
| `name` | string | required | Signal identifier |
| `keywords` | list | required | Phrases to match against |
| `case_sensitive` | boolean | false | Enable case-sensitive matching |
| `operator` | string | OR | Match logic: AND, OR, or NOR |
Operators:
- OR — triggers if any keyword is found in the payload
- AND — triggers if all keywords are found in the payload
- NOR — triggers if no keywords are found (inverse of OR)
Score: 1.0 when triggered, 0.0 otherwise.
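The matching logic above can be sketched as follows (a minimal illustration; the function name and signature are hypothetical, not llmsoup's API):

```python
def keyword_triggered(payload: str, keywords: list[str],
                      operator: str = "OR", case_sensitive: bool = False) -> bool:
    """Sketch of keyword substring matching with OR/AND/NOR logic."""
    text = payload if case_sensitive else payload.lower()
    kws = keywords if case_sensitive else [k.lower() for k in keywords]
    hits = [k in text for k in kws]   # substring test per keyword
    if operator == "AND":
        return all(hits)
    if operator == "NOR":
        return not any(hits)
    return any(hits)                  # OR (default)
```

For example, `keyword_triggered("This is URGENT!", ["urgent", "critical"])` triggers under OR because one keyword is present, while the same payload under AND would not.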
```yaml
signals:
  keyword:
    - name: urgent
      keywords: [urgent, critical, emergency]
      case_sensitive: false
      operator: OR
    - name: coding
      keywords: [code, program, function, debug]
```

### embedding

Computes semantic similarity between the request payload and reference text or candidate phrases using BERT embeddings (via Candle).
| Field | Type | Default | Description |
|---|---|---|---|
| `name` | string | required | Signal identifier |
| `model` | string | required | HuggingFace model ID or local path |
| `threshold` | float | required | Similarity threshold (0.0–1.0) |
| `context_window` | integer | 512 | Maximum token length |
| `reference_text` | string | — | Single reference phrase |
| `candidates` | list | — | Multiple candidate phrases |
| `aggregation_method` | string | max | How candidate scores combine: max, avg, or any |
How it works:
- The payload and reference/candidates are embedded using the configured model.
- Cosine similarity is computed between the payload and each reference.
- The raw cosine similarity (range -1.0 to 1.0) is normalized to 0.0–1.0 using `(cosine + 1.0) / 2.0`.
- The normalized score is compared against the threshold.
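The similarity-to-score step can be expressed directly. This is an illustrative Python sketch of the math, not the Candle-based implementation:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Raw cosine similarity between two embedding vectors, in [-1.0, 1.0]."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def normalized_score(cosine: float) -> float:
    """Map raw cosine from [-1.0, 1.0] into the signal's 0.0-1.0 score range."""
    return (cosine + 1.0) / 2.0

# Vectors pointing the same direction: cosine 1.0, normalized score 1.0
score = normalized_score(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
triggered = score >= 0.8  # compare against the configured threshold
```

Note that because of the normalization, orthogonal vectors (cosine 0.0) score 0.5, not 0.0, so thresholds should be chosen with the shifted range in mind.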
Aggregation methods (when using candidates):
| Method | Behavior |
|---|---|
| max | Highest similarity score across all candidates (default) |
| avg | Average similarity score across all candidates |
| any | Triggers if any single candidate exceeds the threshold |
The key difference: max and avg produce a single aggregate score compared to the threshold, while any checks each candidate individually — the signal triggers as soon as one candidate passes.
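The three behaviors can be sketched as follows (illustrative; in particular, the score reported by `any` is an assumption, since the documentation only specifies its trigger behavior):

```python
def aggregate(scores: list[float], method: str, threshold: float) -> tuple[float, bool]:
    """Sketch of candidate-score aggregation: returns (score, triggered)."""
    if method == "any":
        # Each candidate is checked individually; triggers as soon as one passes.
        # Reporting the best score here is an assumption.
        return max(scores), any(s >= threshold for s in scores)
    if method == "avg":
        agg = sum(scores) / len(scores)
    else:  # "max" (default)
        agg = max(scores)
    return agg, agg >= threshold
```

With candidate scores `[0.9, 0.4, 0.3]` and threshold `0.8`: `max` triggers (0.9 ≥ 0.8), `avg` does not (≈0.53 < 0.8), and `any` triggers because the first candidate alone passes.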
Model loading: Models are loaded lazily on first use. Supported sources:
- HuggingFace repository ID (downloaded automatically, e.g., `sentence-transformers/all-MiniLM-L12-v2`)
- Local file path (must contain `config.json`, `tokenizer.json`, and weight files)
Caching: Embeddings are cached per-request (shared across evaluators) and per-evaluator (LRU). Batch precomputation runs before individual evaluators for efficiency.
```yaml
signals:
  embedding:
    - name: semantic-match
      model: sentence-transformers/all-MiniLM-L12-v2
      threshold: 0.8
      candidates:
        - technical documentation
        - API reference
        - code examples
      aggregation_method: max
```

### domain

Classifies the request into subject-matter categories using a BERT-based classifier trained on MMLU categories.
| Field | Type | Default | Description |
|---|---|---|---|
| `name` | string | required | Domain name to match |
| `examples` | list | — | Example phrases for the domain |
| `description` | string | — | Human-readable domain description |
| `mmlu_categories` | list | — | MMLU category aliases to match |
| `threshold` | float | see below | Minimum confidence for triggering |
Threshold resolution order:
1. Signal-level `threshold` field
2. Global `classifier.category_model.threshold`
3. Default: `0.5`
How it works:
- The classifier predicts an MMLU category for the request payload.
- The predicted category is compared against the signal’s `name` and `mmlu_categories` aliases.
- If the category matches and confidence meets the threshold, the signal triggers.
All domain signals share a single classifier instance for memory efficiency.
```yaml
classifier:
  category_model:
    model_id: LLM-Semantic-Router/lora_intent_classifier_bert-base-uncased_model
    category_mapping_path: tests/data/domain/category_mapping.json
    threshold: 0.5
```
```yaml
signals:
  domain:
    - name: math
      mmlu_categories: [math, abstract_algebra, college_mathematics]
      threshold: 0.7
    - name: science
      examples: [physics, chemistry, biology]
```

### language

Detects the natural language of the request payload using the whatlang library (no ML model required).
| Field | Type | Default | Description |
|---|---|---|---|
| `name` | string | required | Signal identifier |
| `language` | string | required | Expected ISO 639-1 code (case-insensitive) |
How it works:
- The payload text is analyzed by whatlang for language detection.
- The detected ISO 639-3 code is mapped to ISO 639-1 (e.g., `eng` → `en`).
- The signal triggers when the detected language matches the configured language.
Score: whatlang confidence score (0.0–1.0). Returns 0.0 if detection fails (e.g., very short or ambiguous text).
Supported languages (ISO 639-1 codes):
| Code | Language | Code | Language | Code | Language |
|---|---|---|---|---|---|
| af | Afrikaans | hi | Hindi | pa | Punjabi |
| ak | Akan | hr | Croatian | pl | Polish |
| am | Amharic | hu | Hungarian | pt | Portuguese |
| ar | Arabic | hy | Armenian | ro | Romanian |
| az | Azerbaijani | id | Indonesian | ru | Russian |
| be | Belarusian | it | Italian | si | Sinhala |
| bg | Bulgarian | ja | Japanese | sk | Slovak |
| bn | Bengali | jv | Javanese | sl | Slovenian |
| ca | Catalan | ka | Georgian | sn | Shona |
| cs | Czech | km | Khmer | sr | Serbian |
| cy | Welsh | kn | Kannada | sv | Swedish |
| da | Danish | ko | Korean | ta | Tamil |
| de | German | la | Latin | te | Telugu |
| el | Greek | lt | Lithuanian | th | Thai |
| en | English | lv | Latvian | tk | Turkmen |
| eo | Esperanto | mk | Macedonian | tl | Tagalog |
| es | Spanish | ml | Malayalam | tr | Turkish |
| et | Estonian | mr | Marathi | uk | Ukrainian |
| fa | Persian | my | Myanmar | ur | Urdu |
| fi | Finnish | nb | Norwegian Bokmal | uz | Uzbek |
| fr | French | ne | Nepali | vi | Vietnamese |
| gu | Gujarati | nl | Dutch | yi | Yiddish |
| he | Hebrew | or | Odia | zh | Chinese |
| zu | Zulu |  |  |  |  |
```yaml
signals:
  language:
    - name: english
      language: en
    - name: french
      language: fr
```

### latency

Evaluates model performance based on TPOT (Time Per Output Token) metrics collected from previous requests. TPOT values are smoothed using an exponential moving average (EMA).
| Field | Type | Default | Description |
|---|---|---|---|
| `name` | string | required | Signal identifier |
| `max_tpot` | float | required | TPOT threshold in seconds |
How it works:
- The evaluator reads TPOT metrics from the shared cache (populated by model responses).
- If the routing context includes `candidate_models` metadata, only those models are checked. Otherwise, all models with TPOT data are considered.
- The best (lowest) TPOT value among candidates is compared against the threshold.
- Triggers when the best TPOT is at or below `max_tpot`.
Score: Confidence is calculated as `1.0 - (best_tpot / max_tpot)`, clamped to 0.0–1.0. Higher scores indicate faster models relative to the threshold.

TPOT calculation: `total_latency_seconds / completion_token_count` per response, smoothed with EMA across requests.
Note: TPOT data is only available after at least one response from a model. On first request (cold start), no TPOT data exists and the signal returns score 0.0 without triggering.
```yaml
signals:
  latency:
    - name: fast_response
      max_tpot: 0.05
    - name: moderate_latency
      max_tpot: 0.15
```

### preference

Routes requests based on LLM-classified user preferences. An external LLM analyzes the conversation to determine which preference route best matches.
| Field | Type | Default | Description |
|---|---|---|---|
| `name` | string | required | Preference label (route name) |
| `description` | string | — | Description of this preference style |
How it works:
- All preference signals are collected and sent to an external LLM classifier along with the conversation context.
- The LLM determines which preference label (signal name) best matches the request.
- The signal whose name matches the classifier’s output triggers.
- Classification results are cached per-request — multiple preference signals share one LLM call.
Important: Preference evaluation makes an external LLM call during routing, which consumes tokens and adds latency. This cost is tracked separately from the main model call.
```yaml
signals:
  preference:
    - name: concise
      description: User prefers short, direct answers
    - name: detailed
      description: User prefers thorough, comprehensive answers
    - name: creative
      description: User wants creative or exploratory responses
```

### fact_check (stub)

Classifies whether content requires fact-checking using a local binary classifier.
| Field | Type | Default | Description |
|---|---|---|---|
| `name` | string | required | Must be `needs_fact_check` or `no_fact_check_needed` |
| `description` | string | — | Optional description |
Recognized signal names:
- `needs_fact_check` — triggers when the classifier predicts `FACT_CHECK_NEEDED`
- `no_fact_check_needed` — triggers when the classifier predicts `NO_FACT_CHECK_NEEDED`
- Any other name will never trigger
```yaml
signals:
  fact_check:
    - name: needs_fact_check
      description: Content requiring verification
    - name: no_fact_check_needed
      description: Creative content not needing verification
```

### user_feedback (stub)

Categorizes user follow-up messages using a local classifier.
| Field | Type | Default | Description |
|---|---|---|---|
name | string | required | One of the recognized feedback labels |
description | string | — | Optional description |
Recognized signal names:
- `satisfied` — user is satisfied with the response
- `need_clarification` — user needs more detail or explanation
- `want_different` — user wants a different type of response
- `wrong_answer` — user indicates the response was incorrect
The evaluator uses `follow_up_message` context (if available) rather than the original payload.
```yaml
signals:
  user_feedback:
    - name: satisfied
    - name: need_clarification
    - name: want_different
    - name: wrong_answer
```

## Condition operators

Routing rules use conditions to check signal results. Each condition specifies a signal name, an operator, and a value to compare against.
### equals

Compares signal results for exact match. This is the default operator when none is specified.
- Boolean comparison: Matches `triggered` state — `true` means the signal fired, `false` means it did not.
- Numeric comparison: Matches the signal `score` with floating-point tolerance (±0.0001).
- String comparison: Matches signal metadata values (e.g., detected language code).
```yaml
conditions:
  - signal: keyword.urgent
    operator: equals
    value: true # keyword signal triggered

  - signal: language.english
    operator: equals
    value: true # detected language is English

  - signal: embedding.semantic
    operator: equals
    value: 0.95 # exact score match (with tolerance)
```

### contains

String substring matching on signal metadata.
```yaml
conditions:
  - signal: keyword.tech
    operator: contains
    value: "router" # "router" appears in matched keywords

  - signal: domain.science
    operator: contains
    value: "physics" # "physics" is a substring of the predicted category
```

### greater-than

Numeric comparison — the signal score exceeds the value. For latency signals, compares against the best TPOT value.
```yaml
conditions:
  - signal: embedding.relevance
    operator: greater-than
    value: 0.85 # similarity score > 0.85

  - signal: latency.speed
    operator: greater-than
    value: 0.1 # best TPOT > 0.1 seconds (slow models)
```

### less-than

Numeric comparison — the signal score is below the value. For latency signals, compares against the best TPOT value.
```yaml
conditions:
  - signal: embedding.relevance
    operator: less-than
    value: 0.3 # low similarity

  - signal: latency.speed
    operator: less-than
    value: 0.05 # best TPOT < 50ms (fast models)
```

### in

Checks whether the signal score or metadata value is contained in a list.

```yaml
conditions:
  - signal: domain.category
    operator: in
    value: [math, physics, chemistry] # predicted domain in set

  - signal: language.detected
    operator: in
    value: [en, fr, de] # detected language in set
```

## Rule logic operators

Conditions within a rule are combined using the rule’s `operator` field:
| Operator | Behavior |
|---|---|
| AND (default) | All conditions must match |
| OR | At least one condition must match |
| NOR | No conditions must match |
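The combination logic is equivalent to this short sketch (illustrative, not llmsoup's code):

```python
def rule_matches(condition_results: list[bool], operator: str = "AND") -> bool:
    """Combine per-condition match results with the rule's logic operator."""
    if operator == "OR":
        return any(condition_results)
    if operator == "NOR":
        return not any(condition_results)
    return all(condition_results)  # AND (default)
```

NOR is useful for exclusion rules: a rule whose conditions must all be absent for it to fire.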
```yaml
routing:
  rules:
    - name: complex_query
      priority: 100
      operator: AND # both conditions required
      conditions:
        - signal: keyword.technical
          operator: equals
          value: true
        - signal: embedding.complexity
          operator: greater-than
          value: 0.8
      action:
        primary_model: gpt-4
```

## Algorithms

When a routing rule includes an algorithm configuration, the decision engine uses it to select the optimal model from the rule’s `model_refs` list. Two algorithms are available: confidence and ratings.
### confidence

Selects the first model that meets a confidence threshold, escalating to the last model if none qualifies.
| Field | Type | Default | Description |
|---|---|---|---|
| `threshold` | float | required | Minimum confidence score (0.0–1.0) |
| `confidence_method` | string | null | Scoring method: null, avg_logprob, margin, or hybrid |
| `hybrid_weights` | object | — | Weights for hybrid method |
| `escalation_order` | string | — | Reserved for future use (planned modes: size, cost, automix) |
| `cost_quality_tradeoff` | float | — | Per-algorithm override (0.0 = quality, 1.0 = cost) |
| `on_error` | string | skip | Error handling: skip or fail |
Confidence methods:
| Method | How score is computed |
|---|---|
| null (default) | Maximum signal score from the matched rule’s conditions |
| avg_logprob | Normalized average log-probability from model response tokens |
| margin | Normalized average margin between top-2 token probabilities |
| hybrid | Weighted combination of logprob and margin scores |
Logprob normalization: Average log-probability is mapped from [-3.0, 0.0] to [0.0, 1.0]:

`normalized = (avg_logprob + 3.0) / 3.0` (clamped to 0.0–1.0)

Margin normalization: Average margin between top-1 and top-2 token log-probabilities:

`normalized = 1.0 - exp(-avg_margin / 3.0)` (clamped to 0.0–1.0)

When fewer than 2 top logprobs are available for a token, a default margin of 2.0 is used.
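Both normalizations can be expressed directly (an illustrative Python mirror of the formulas above):

```python
import math

def normalize_logprob(avg_logprob: float) -> float:
    """Map average log-probability from [-3.0, 0.0] to [0.0, 1.0], clamped."""
    return min(max((avg_logprob + 3.0) / 3.0, 0.0), 1.0)

def normalize_margin(avg_margin: float) -> float:
    """Map the average top-2 margin to [0.0, 1.0] via 1 - exp(-margin/3), clamped."""
    return min(max(1.0 - math.exp(-avg_margin / 3.0), 0.0), 1.0)
```

An average log-probability of -1.5 normalizes to 0.5; the margin curve saturates smoothly, so very large margins approach but never exceed 1.0.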
Hybrid method: Combines logprob and margin scores:

`score = logprob_weight × normalized_logprob + margin_weight × normalized_margin`

```yaml
hybrid_weights:
  logprob_weight: 0.6
  margin_weight: 0.4
```

Selection behavior:
- Each model in `model_refs` receives a confidence score.
- If cost-aware routing is enabled, a cost penalty is subtracted from each score.
- The first model meeting the `threshold` is selected.
- If no model meets the threshold, the last model in `model_refs` is selected (escalation).
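The selection loop above can be sketched as (illustrative, not llmsoup's code):

```python
def select_by_confidence(model_refs: list[str], scores: dict[str, float],
                         threshold: float) -> str:
    """First model in model_refs order meeting the threshold wins;
    otherwise escalate to the last model."""
    for model in model_refs:
        if scores.get(model, 0.0) >= threshold:
            return model
    return model_refs[-1]  # escalation

refs = ["gpt-3.5-turbo", "gpt-4", "gpt-4-turbo"]
```

Ordering `model_refs` from cheapest to most capable makes this a cascade: cheap models handle confident cases and the most capable model absorbs everything else.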
Error handling (`on_error`):

- `skip` (default) — if the confidence method (e.g., avg_logprob) is unavailable, fall back to signal-based scoring
- `fail` — return an error immediately
```yaml
algorithm:
  type: confidence
  confidence:
    threshold: 0.7
    confidence_method: margin
    cost_quality_tradeoff: 0.5
    on_error: skip
model_refs:
  - gpt-3.5-turbo
  - gpt-4
  - gpt-4-turbo
```

In this example, if no model reaches 0.7 confidence, gpt-4-turbo (last) is selected as escalation.

### ratings

Computes a score per model using a configurable policy and selects the highest-rated model.
| Field | Type | Default | Description |
|---|---|---|---|
| `policy` | string | highest | Scoring policy |
| `cost_quality_tradeoff` | float | — | Per-algorithm override (0.0 = quality, 1.0 = cost) |
| `on_error` | string | skip | Error handling: skip or fail |
Scoring policies:
| Policy | Behavior |
|---|---|
| highest (default) | Select the model with the highest signal score |
| lowest | Select the model with the lowest signal score (inverted: 1.0 - score) |
| prefer_first | Tie-break favoring earlier models in model_refs order |
| prefer_last | Tie-break favoring later models in model_refs order |
`prefer_first` and `prefer_last` add a tiny position-based bonus (0.0001 per position) to break ties deterministically.
Selection behavior:
- Each model receives a score from the policy.
- If cost-aware routing is enabled, a cost penalty is subtracted.
- The model with the highest final score is selected.
- Ties are broken by `model_refs` order.
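The four policies can be sketched as follows (illustrative; the exact bonus bookkeeping in llmsoup may differ, but the 0.0001 per-position increment is the documented value):

```python
def select_by_ratings(model_refs: list[str], scores: dict[str, float],
                      policy: str = "highest") -> str:
    """Score each model per the policy; highest final score wins,
    ties broken by model_refs order."""
    n = len(model_refs)
    finals = []
    for i, model in enumerate(model_refs):
        s = scores.get(model, 0.0)
        if policy == "lowest":
            s = 1.0 - s                  # inverted score
        elif policy == "prefer_first":
            s += 0.0001 * (n - 1 - i)    # earlier models get a larger bonus
        elif policy == "prefer_last":
            s += 0.0001 * i              # later models get a larger bonus
        finals.append((s, model))
    best = max(s for s, _ in finals)
    for s, model in finals:              # first occurrence = model_refs order
        if s == best:
            return model
```

Because the bonus is tiny, `prefer_first`/`prefer_last` only matter when signal scores are (near-)equal; a genuinely better-scoring model still wins.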
```yaml
algorithm:
  type: ratings
  ratings:
    policy: highest
    cost_quality_tradeoff: 0.3
model_refs:
  - claude-3-haiku
  - gpt-3.5-turbo
  - gpt-4
```

## Cost-aware routing

Both algorithms support cost-aware model selection. When enabled, model pricing is factored into the selection score.
Tradeoff resolution order:
1. Per-algorithm `cost_quality_tradeoff` (overrides everything)
2. Global `defaults.cost_quality_tradeoff`
3. Default: `0.3` (when cost-aware routing is enabled but no tradeoff is specified)
Tradeoff values:
- `0.0` — pure quality (no cost penalty)
- `0.3` — default balance (slight cost preference)
- `0.5` — equal weight to quality and cost
- `1.0` — pure cost minimization
Cost penalty formula:

`cost_penalty = normalized_cost × tradeoff`
`adjusted_score = max(confidence_score - cost_penalty, 0.0)`

Cost normalization: `avg(prompt_per_1m, completion_per_1m) / 100.0`, clamped to 0.0–1.0.
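Putting the formulas together (an illustrative sketch mirroring the math above):

```python
def adjusted_score(confidence: float, prompt_per_1m: float,
                   completion_per_1m: float, tradeoff: float) -> float:
    """Cost-aware adjustment: normalize the average per-1M-token price,
    scale by the tradeoff, and subtract it from the quality score."""
    normalized_cost = min(max((prompt_per_1m + completion_per_1m) / 2.0 / 100.0,
                              0.0), 1.0)
    cost_penalty = normalized_cost * tradeoff
    return max(confidence - cost_penalty, 0.0)

# $30/$60 per 1M tokens: avg $45, normalized 0.45; tradeoff 0.3 gives penalty 0.135
score = adjusted_score(0.8, 30.0, 60.0, 0.3)
```

At `tradeoff: 0.0` the penalty vanishes and selection is purely quality-driven; at `1.0` a model priced at $100+/1M average loses its entire score.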
```yaml
defaults:
  cost_aware_routing: true
  cost_quality_tradeoff: 0.3

routing:
  rules:
    - name: budget_query
      action:
        algorithm:
          type: confidence
          confidence:
            threshold: 0.6
            cost_quality_tradeoff: 0.8 # override: prefer cheap models
```

## Execution strategies

After model selection, the execution strategy determines how the request is sent to model endpoint(s).
### default

Routes to a single model (the primary model). This is the simplest strategy.
Behavior:
- Send request to the primary model.
- Return response or timeout error.
```yaml
action:
  strategy: default
  primary_model: gpt-4
```

### fallback

Tries the primary model first, then falls back to alternative models sequentially on failure.
Behavior:
- Send request to the primary model.
- If primary fails (error or timeout), try the first fallback model.
- Continue through fallback models in order until one succeeds.
- If all models fail, return a combined error.
The response indicates whether a fallback was used via metadata.
```yaml
action:
  strategy: fallback
  primary_model: gpt-4-turbo
  fallback_models:
    - gpt-4
    - gpt-3.5-turbo
```

### parallel

Sends the request to the primary model and all fallbacks concurrently, returning the first successful response in deterministic priority order.
Behavior:
- Launch concurrent requests to all models (primary + fallbacks).
- As responses arrive, select the winner using priority order (primary first).
- A later model only wins if all earlier models have failed.
- Abort remaining requests once a winner is determined.
Determinism: The primary model always wins over fallbacks if both succeed, regardless of which responds first. This prevents non-deterministic routing behavior.
```yaml
action:
  strategy: parallel
  primary_model: gpt-4-turbo
  fallback_models:
    - gpt-4
    - claude-3-opus
```

## Graceful degradation

llmsoup is designed to continue routing even when individual components fail. Non-critical operations — metrics recording, cache reads/writes, and tracing — are wrapped in error isolation boundaries.
Signal evaluation failures:
- If a signal evaluator fails (model loading error, timeout, invalid context), the signal is excluded from the evaluation results.
- Routing continues with the remaining successful signals.
- The failure is logged as a warning but does not produce a hard error.
Non-critical operation failures:
- Metrics recording — if Prometheus metrics fail to record, the request proceeds normally.
- Cache operations — if embedding or response cache fails, the request falls through to a direct model call (cache miss behavior).
- Tracing — if distributed tracing fails, request processing is unaffected.
What does cause hard errors:
- Authentication failure (invalid or missing token)
- No model available to handle the request (no matching rule and no default model)
- All models in a strategy execution fail (all fallbacks exhausted)
This design ensures that observability and optimization features never block request processing. A signal failure simply means routing has less information — it does not mean the request fails.