
Signals & Routing

llmsoup evaluates configurable signals on every incoming request, matches routing rules against the signal results, optionally selects the best model with a selection algorithm, and executes the request through a chosen strategy. This page explains the entire pipeline in depth.

For YAML configuration syntax and field-level details, see the Configuration Reference.

Every chat completion request follows this pipeline:

HTTP request
Auth middleware ── rejects if token invalid
Signal evaluation ── all signals run concurrently
Rule matching ── rules evaluated in priority order (highest first)
Algorithm (optional) ── confidence or ratings model selection
Strategy execution ── default, fallback, or parallel
Model HTTP call ── forwarded as OpenAI-compatible request
Plugin execution ── pre/post-routing plugins, if configured (see Plugins Reference)
Response ── returned in OpenAI format
  1. Signal evaluation — All configured signals evaluate concurrently against the request payload. Each signal produces a score (0.0–1.0), a triggered boolean, and metadata. Signal failures are isolated — routing continues without the failed signal (see Graceful degradation).
  2. Rule matching — The decision engine evaluates rules in descending priority order. Conditions within a rule use AND, OR, or NOR logic with five condition operators. The first matching rule wins.
  3. Algorithm selection — If the matched rule specifies an algorithm (confidence or ratings), it selects the optimal model from the rule’s model_refs list.
  4. Strategy execution — The selected model(s) are called according to the execution strategy: single call (default), sequential retry (fallback), or concurrent race (parallel).
  5. Response — The model response is returned in OpenAI-compatible format.

Signals

Signals are the building blocks of routing decisions. Each signal evaluates one aspect of the incoming request and produces a result that conditions can match against.

All signals share these result properties:

| Property | Type | Description |
| --- | --- | --- |
| score | 0.0–1.0 | Confidence or similarity score |
| triggered | boolean | Whether the signal matched |
| metadata | key-value pairs | Signal-specific details for debugging |

Keyword

Performs substring matching against the request payload.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | Signal identifier |
| keywords | list | required | Phrases to match against |
| case_sensitive | boolean | false | Enable case-sensitive matching |
| operator | string | OR | Match logic: AND, OR, or NOR |

Operators:

  • OR — triggers if any keyword is found in the payload
  • AND — triggers if all keywords are found in the payload
  • NOR — triggers if no keywords are found (inverse of OR)

Score: 1.0 when triggered, 0.0 otherwise.

```yaml
signals:
  keyword:
    - name: urgent
      keywords: [urgent, critical, emergency]
      case_sensitive: false
      operator: OR
    - name: coding
      keywords: [code, program, function, debug]
```
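The operator semantics described above can be sketched in Python. This is an illustrative sketch only; the function name and result shape are assumptions, not llmsoup's internal API:

```python
def keyword_signal(payload: str, keywords: list[str],
                   operator: str = "OR", case_sensitive: bool = False) -> dict:
    """Sketch of keyword-signal semantics: substring matching with
    OR / AND / NOR combination logic."""
    text = payload if case_sensitive else payload.lower()
    kws = keywords if case_sensitive else [k.lower() for k in keywords]
    hits = [k for k in kws if k in text]        # substring matches found
    if operator == "AND":
        triggered = len(hits) == len(kws)       # every keyword present
    elif operator == "NOR":
        triggered = len(hits) == 0              # no keyword present
    else:                                       # OR (default)
        triggered = len(hits) > 0               # any keyword present
    return {"triggered": triggered,
            "score": 1.0 if triggered else 0.0,
            "metadata": {"matched": hits}}
```

For example, `keyword_signal("This is URGENT", ["urgent", "critical"])` triggers with score 1.0, while the same payload with `operator="AND"` does not, because "critical" is absent.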

Embedding

Computes semantic similarity between the request payload and reference text or candidate phrases using BERT embeddings (via Candle).

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | Signal identifier |
| model | string | required | HuggingFace model ID or local path |
| threshold | float | required | Similarity threshold (0.0–1.0) |
| context_window | integer | 512 | Maximum token length |
| reference_text | string | | Single reference phrase |
| candidates | list | | Multiple candidate phrases |
| aggregation_method | string | max | How candidate scores combine: max, avg, or any |

How it works:

  1. The payload and reference/candidates are embedded using the configured model.
  2. Cosine similarity is computed between the payload and each reference.
  3. The raw cosine similarity (range -1.0 to 1.0) is normalized to 0.0–1.0 using (cosine + 1.0) / 2.0.
  4. The normalized score is compared against the threshold.

Aggregation methods (when using candidates):

| Method | Behavior |
| --- | --- |
| max | Highest similarity score across all candidates (default) |
| avg | Average similarity score across all candidates |
| any | Triggers if any single candidate exceeds the threshold |

The key difference: max and avg produce a single aggregate score compared to the threshold, while any checks each candidate individually — the signal triggers as soon as one candidate passes.
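The normalization and aggregation rules above can be sketched in Python. The helper names are illustrative assumptions, not llmsoup internals:

```python
def normalize(cosine: float) -> float:
    """Map a raw cosine similarity in [-1, 1] onto [0, 1]."""
    return (cosine + 1.0) / 2.0

def aggregate(scores: list[float], method: str, threshold: float):
    """Combine normalized candidate scores and decide triggering.
    Returns (aggregate_score, triggered)."""
    if method == "avg":
        agg = sum(scores) / len(scores)         # single aggregate score
        return agg, agg >= threshold
    if method == "any":
        # each candidate is checked individually against the threshold
        return max(scores), any(s >= threshold for s in scores)
    agg = max(scores)                           # "max" (default)
    return agg, agg >= threshold
```

With candidate scores `[0.9, 0.5, 0.4]` and threshold 0.8, both `max` and `any` trigger (one candidate reaches 0.9), while `avg` does not (mean 0.6 is below the threshold).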

Model loading: Models are loaded lazily on first use. Supported sources:

  • HuggingFace repository ID (downloaded automatically, e.g., sentence-transformers/all-MiniLM-L12-v2)
  • Local file path (must contain config.json, tokenizer.json, and weight files)

Caching: Embeddings are cached per-request (shared across evaluators) and per-evaluator (LRU). Batch precomputation runs before individual evaluators for efficiency.

```yaml
signals:
  embedding:
    - name: semantic-match
      model: sentence-transformers/all-MiniLM-L12-v2
      threshold: 0.8
      candidates:
        - technical documentation
        - API reference
        - code examples
      aggregation_method: max
```

Domain

Classifies the request into subject-matter categories using a BERT-based classifier trained on MMLU categories.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | Domain name to match |
| examples | list | | Example phrases for the domain |
| description | string | | Human-readable domain description |
| mmlu_categories | list | | MMLU category aliases to match |
| threshold | float | see below | Minimum confidence for triggering |

Threshold resolution order:

  1. Signal-level threshold field
  2. Global classifier.category_model.threshold
  3. Default: 0.5
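The resolution order above amounts to a simple fallback chain; a minimal sketch (function name is an assumption):

```python
def resolve_threshold(signal_threshold=None, global_threshold=None) -> float:
    """Resolve a domain signal's confidence threshold."""
    if signal_threshold is not None:
        return signal_threshold      # 1. signal-level threshold field
    if global_threshold is not None:
        return global_threshold      # 2. classifier.category_model.threshold
    return 0.5                       # 3. built-in default
```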

How it works:

  1. The classifier predicts an MMLU category for the request payload.
  2. The predicted category is compared against the signal’s name and mmlu_categories aliases.
  3. If the category matches and confidence meets the threshold, the signal triggers.

All domain signals share a single classifier instance for memory efficiency.

```yaml
classifier:
  category_model:
    model_id: LLM-Semantic-Router/lora_intent_classifier_bert-base-uncased_model
    category_mapping_path: tests/data/domain/category_mapping.json
    threshold: 0.5
signals:
  domain:
    - name: math
      mmlu_categories: [math, abstract_algebra, college_mathematics]
      threshold: 0.7
    - name: science
      examples: [physics, chemistry, biology]
```

Language

Detects the natural language of the request payload using the whatlang library (no ML model required).

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | Signal identifier |
| language | string | required | Expected ISO 639-1 code (case-insensitive) |

How it works:

  1. The payload text is analyzed by whatlang for language detection.
  2. The detected ISO 639-3 code is mapped to ISO 639-1 (e.g., eng → en).
  3. The signal triggers when the detected language matches the configured language.

Score: whatlang confidence score (0.0–1.0). Returns 0.0 if detection fails (e.g., very short or ambiguous text).

Supported languages (ISO 639-1 codes):

| Code | Language | Code | Language | Code | Language |
| --- | --- | --- | --- | --- | --- |
| af | Afrikaans | hi | Hindi | pa | Punjabi |
| ak | Akan | hr | Croatian | pl | Polish |
| am | Amharic | hu | Hungarian | pt | Portuguese |
| ar | Arabic | hy | Armenian | ro | Romanian |
| az | Azerbaijani | id | Indonesian | ru | Russian |
| be | Belarusian | it | Italian | si | Sinhala |
| bg | Bulgarian | ja | Japanese | sk | Slovak |
| bn | Bengali | jv | Javanese | sl | Slovenian |
| ca | Catalan | ka | Georgian | sn | Shona |
| cs | Czech | km | Khmer | sr | Serbian |
| cy | Welsh | kn | Kannada | sv | Swedish |
| da | Danish | ko | Korean | ta | Tamil |
| de | German | la | Latin | te | Telugu |
| el | Greek | lt | Lithuanian | th | Thai |
| en | English | lv | Latvian | tk | Turkmen |
| eo | Esperanto | mk | Macedonian | tl | Tagalog |
| es | Spanish | ml | Malayalam | tr | Turkish |
| et | Estonian | mr | Marathi | uk | Ukrainian |
| fa | Persian | my | Myanmar | ur | Urdu |
| fi | Finnish | nb | Norwegian Bokmål | uz | Uzbek |
| fr | French | ne | Nepali | vi | Vietnamese |
| gu | Gujarati | nl | Dutch | yi | Yiddish |
| he | Hebrew | or | Odia | zh | Chinese |
| zu | Zulu | | | | |

```yaml
signals:
  language:
    - name: english
      language: en
    - name: french
      language: fr
```

Latency

Evaluates model performance based on TPOT (Time Per Output Token) metrics collected from previous requests. TPOT values are smoothed using an exponential moving average (EMA).

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | Signal identifier |
| max_tpot | float | required | TPOT threshold in seconds |

How it works:

  1. The evaluator reads TPOT metrics from the shared cache (populated by model responses).
  2. If the routing context includes candidate_models metadata, only those models are checked. Otherwise, all models with TPOT data are considered.
  3. The best (lowest) TPOT value among candidates is compared against the threshold.
  4. Triggers when the best TPOT is at or below max_tpot.

Score: Confidence is calculated as 1.0 - (best_tpot / max_tpot), clamped to 0.0–1.0. Higher scores indicate faster models relative to the threshold.

TPOT calculation: total_latency_seconds / completion_token_count per response, smoothed with EMA across requests.

Note: TPOT data is only available after at least one response from a model. On first request (cold start), no TPOT data exists and the signal returns score 0.0 without triggering.

```yaml
signals:
  latency:
    - name: fast_response
      max_tpot: 0.05
    - name: moderate_latency
      max_tpot: 0.15
```
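The smoothing and scoring described above can be sketched in Python. The function names and the EMA smoothing factor are illustrative assumptions:

```python
def ema(previous, sample, alpha=0.3):
    """Exponential moving average over TPOT samples.
    alpha is an assumed smoothing factor, not a documented llmsoup value."""
    if previous is None:
        return sample                # first sample seeds the average
    return alpha * sample + (1.0 - alpha) * previous

def latency_signal(tpot_by_model: dict, max_tpot: float) -> dict:
    """Compare the best (lowest) smoothed TPOT against the threshold."""
    if not tpot_by_model:            # cold start: no TPOT data yet
        return {"triggered": False, "score": 0.0}
    best = min(tpot_by_model.values())
    score = min(max(1.0 - best / max_tpot, 0.0), 1.0)   # clamped confidence
    return {"triggered": best <= max_tpot, "score": score}
```

With smoothed TPOTs `{"model-a": 0.02, "model-b": 0.08}` and `max_tpot: 0.05`, the best value 0.02 triggers the signal with score `1.0 - 0.02/0.05 = 0.6`.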

Preference

Routes requests based on LLM-classified user preferences. An external LLM analyzes the conversation to determine which preference route best matches.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | Preference label (route name) |
| description | string | | Description of this preference style |

How it works:

  1. All preference signals are collected and sent to an external LLM classifier along with the conversation context.
  2. The LLM determines which preference label (signal name) best matches the request.
  3. The signal whose name matches the classifier’s output triggers.
  4. Classification results are cached per-request — multiple preference signals share one LLM call.

Important: Preference evaluation makes an external LLM call during routing, which consumes tokens and adds latency. This cost is tracked separately from the main model call.

```yaml
signals:
  preference:
    - name: concise
      description: User prefers short, direct answers
    - name: detailed
      description: User prefers thorough, comprehensive answers
    - name: creative
      description: User wants creative or exploratory responses
```

Fact check

Classifies whether content requires fact-checking using a local binary classifier.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | Must be needs_fact_check or no_fact_check_needed |
| description | string | | Optional description |

Recognized signal names:

  • needs_fact_check — triggers when the classifier predicts FACT_CHECK_NEEDED
  • no_fact_check_needed — triggers when the classifier predicts NO_FACT_CHECK_NEEDED
  • Any other name will never trigger

```yaml
signals:
  fact_check:
    - name: needs_fact_check
      description: Content requiring verification
    - name: no_fact_check_needed
      description: Creative content not needing verification
```

User feedback

Categorizes user follow-up messages using a local classifier.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | One of the recognized feedback labels |
| description | string | | Optional description |

Recognized signal names:

  • satisfied — user is satisfied with the response
  • need_clarification — user needs more detail or explanation
  • want_different — user wants a different type of response
  • wrong_answer — user indicates the response was incorrect

The evaluator uses follow_up_message context (if available) rather than the original payload.

```yaml
signals:
  user_feedback:
    - name: satisfied
    - name: need_clarification
    - name: want_different
    - name: wrong_answer
```

Conditions

Routing rules use conditions to check signal results. Each condition specifies a signal name, an operator, and a value to compare against.

equals

Compares signal results for an exact match. This is the default operator when none is specified.

  • Boolean comparison: Matches triggered state — true means the signal fired, false means it did not.
  • Numeric comparison: Matches the signal score with floating-point tolerance (±0.0001).
  • String comparison: Matches signal metadata values (e.g., detected language code).

```yaml
conditions:
  - signal: keyword.urgent
    operator: equals
    value: true      # keyword signal triggered
  - signal: language.english
    operator: equals
    value: true      # detected language is English
  - signal: embedding.semantic
    operator: equals
    value: 0.95      # exact score match (with tolerance)
```
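The three comparison modes of equals can be sketched in Python; the function name and result shape are illustrative assumptions:

```python
TOLERANCE = 1e-4   # the ±0.0001 floating-point tolerance described above

def equals_matches(signal_result: dict, expected) -> bool:
    """Dispatch on the expected value's type, as the equals operator does."""
    # check bool before numbers: in Python, bool is a subclass of int
    if isinstance(expected, bool):
        return signal_result["triggered"] == expected
    if isinstance(expected, (int, float)):
        return abs(signal_result["score"] - expected) <= TOLERANCE
    # string comparison against signal metadata values
    return expected in signal_result["metadata"].values()
```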

contains

String substring matching on signal metadata.

```yaml
conditions:
  - signal: keyword.tech
    operator: contains
    value: "router"      # "router" appears in matched keywords
  - signal: domain.science
    operator: contains
    value: "physics"     # "physics" is a substring of the predicted category
```

greater-than

Numeric comparison: matches when the signal score exceeds the value. For latency signals, the comparison is against the best TPOT value.

```yaml
conditions:
  - signal: embedding.relevance
    operator: greater-than
    value: 0.85    # similarity score > 0.85
  - signal: latency.speed
    operator: greater-than
    value: 0.1     # best TPOT > 0.1 seconds (slow models)
```

less-than

Numeric comparison: matches when the signal score is below the value. For latency signals, the comparison is against the best TPOT value.

```yaml
conditions:
  - signal: embedding.relevance
    operator: less-than
    value: 0.3     # low similarity
  - signal: latency.speed
    operator: less-than
    value: 0.05    # best TPOT < 50ms (fast models)
```

in

Checks whether the signal score or metadata value is contained in a list.

```yaml
conditions:
  - signal: domain.category
    operator: in
    value: [math, physics, chemistry]    # predicted domain in set
  - signal: language.detected
    operator: in
    value: [en, fr, de]                  # detected language in set
```

Combining conditions

Conditions within a rule are combined using the rule’s operator field:

| Operator | Behavior |
| --- | --- |
| AND (default) | All conditions must match |
| OR | At least one condition must match |
| NOR | No condition may match |

```yaml
routing:
  rules:
    - name: complex_query
      priority: 100
      operator: AND    # both conditions required
      conditions:
        - signal: keyword.technical
          operator: equals
          value: true
        - signal: embedding.complexity
          operator: greater-than
          value: 0.8
      action:
        primary_model: gpt-4
```
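Rule matching as described (priority order, first match wins, operator combination) can be sketched in Python. The rule representation here is a simplified assumption, with per-condition outcomes precomputed as a "results" list:

```python
def rule_matches(condition_results, operator="AND"):
    """Combine per-condition outcomes with the rule-level operator."""
    if operator == "OR":
        return any(condition_results)
    if operator == "NOR":
        return not any(condition_results)
    return all(condition_results)      # AND is the default

def first_matching_rule(rules):
    """Evaluate rules in descending priority order; the first match wins."""
    for rule in sorted(rules, key=lambda r: r["priority"], reverse=True):
        if rule_matches(rule["results"], rule.get("operator", "AND")):
            return rule["name"]
    return None                        # no rule matched
```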

Algorithms

When a routing rule includes an algorithm configuration, the decision engine uses it to select the optimal model from the rule’s model_refs list. Two algorithms are available: confidence and ratings.

Confidence

Selects the first model that meets a confidence threshold, with escalation to the last model if none qualify.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| threshold | float | required | Minimum confidence score (0.0–1.0) |
| confidence_method | string | null | Scoring method: null, avg_logprob, margin, or hybrid |
| hybrid_weights | object | | Weights for the hybrid method |
| escalation_order | string | | Reserved for future use (planned modes: size, cost, automix) |
| cost_quality_tradeoff | float | | Per-algorithm override (0.0 = quality, 1.0 = cost) |
| on_error | string | skip | Error handling: skip or fail |

Confidence methods:

| Method | How the score is computed |
| --- | --- |
| null (default) | Maximum signal score from the matched rule’s conditions |
| avg_logprob | Normalized average log-probability from model response tokens |
| margin | Normalized average margin between top-2 token probabilities |
| hybrid | Weighted combination of logprob and margin scores |

Logprob normalization: Average log-probability is mapped from [-3.0, 0.0] to [0.0, 1.0]:

normalized = (avg_logprob + 3.0) / 3.0 (clamped to 0.0–1.0)

Margin normalization: Average margin between top-1 and top-2 token log-probabilities:

normalized = 1.0 - exp(-avg_margin / 3.0) (clamped to 0.0–1.0)

When fewer than 2 top logprobs are available for a token, a default margin of 2.0 is used.

Hybrid method: Combines logprob and margin scores:

score = logprob_weight × normalized_logprob + margin_weight × normalized_margin

```yaml
hybrid_weights:
  logprob_weight: 0.6
  margin_weight: 0.4
```
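The two normalizations and their hybrid combination can be sketched in Python, directly from the formulas above (function names are illustrative):

```python
import math

def normalize_logprob(avg_logprob: float) -> float:
    """Map an average log-probability from [-3, 0] onto [0, 1], clamped."""
    return min(max((avg_logprob + 3.0) / 3.0, 0.0), 1.0)

def normalize_margin(avg_margin: float) -> float:
    """Map an average top-2 margin onto [0, 1) via exponential saturation."""
    return min(max(1.0 - math.exp(-avg_margin / 3.0), 0.0), 1.0)

def hybrid_score(avg_logprob: float, avg_margin: float,
                 logprob_weight: float = 0.6,
                 margin_weight: float = 0.4) -> float:
    """Weighted combination of the two normalized scores."""
    return (logprob_weight * normalize_logprob(avg_logprob)
            + margin_weight * normalize_margin(avg_margin))
```

For example, a perfectly confident model (average log-probability 0.0) normalizes to 1.0, while an average log-probability of -1.5 normalizes to 0.5.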

Selection behavior:

  1. Each model in model_refs receives a confidence score.
  2. If cost-aware routing is enabled, a cost penalty is subtracted from each score.
  3. The first model meeting the threshold is selected.
  4. If no model meets the threshold, the last model in model_refs is selected (escalation).
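The selection loop above can be sketched in Python; the function signature and the per-model penalty mapping are illustrative assumptions:

```python
def select_by_confidence(model_refs: list[str], scores: dict,
                         threshold: float, cost_penalty=None) -> str:
    """Threshold-first selection with last-model escalation."""
    penalty = cost_penalty or {}
    for model in model_refs:
        # cost penalty (if cost-aware routing is on) is subtracted first
        adjusted = max(scores[model] - penalty.get(model, 0.0), 0.0)
        if adjusted >= threshold:
            return model             # first model meeting the threshold wins
    return model_refs[-1]            # escalation: last model in model_refs
```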

Error handling (on_error):

  • skip (default) — if the confidence method (e.g., avg_logprob) is unavailable, fall back to signal-based scoring
  • fail — return an error immediately

```yaml
algorithm:
  type: confidence
  confidence:
    threshold: 0.7
    confidence_method: margin
    cost_quality_tradeoff: 0.5
    on_error: skip
model_refs:
  - gpt-3.5-turbo
  - gpt-4
  - gpt-4-turbo
```

In this example, if no model reaches 0.7 confidence, gpt-4-turbo (last) is selected as escalation.

Ratings

Computes a score per model using a configurable policy and selects the highest-rated model.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| policy | string | highest | Scoring policy |
| cost_quality_tradeoff | float | | Per-algorithm override (0.0 = quality, 1.0 = cost) |
| on_error | string | skip | Error handling: skip or fail |

Scoring policies:

| Policy | Behavior |
| --- | --- |
| highest (default) | Select the model with the highest signal score |
| lowest | Select the model with the lowest signal score (inverted: 1.0 - score) |
| prefer_first | Tie-break favoring earlier models in model_refs order |
| prefer_last | Tie-break favoring later models in model_refs order |

prefer_first and prefer_last add a tiny position-based bonus (0.0001 per position) to break ties deterministically.

Selection behavior:

  1. Each model receives a score from the policy.
  2. If cost-aware routing is enabled, a cost penalty is subtracted.
  3. The model with the highest final score is selected.
  4. Ties are broken by model_refs order.

```yaml
algorithm:
  type: ratings
  ratings:
    policy: highest
    cost_quality_tradeoff: 0.3
model_refs:
  - claude-3-haiku
  - gpt-3.5-turbo
  - gpt-4
```
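The policies and the 0.0001 position-based tie-break can be sketched in Python. The function name and penalty mapping are illustrative assumptions:

```python
def select_by_ratings(model_refs: list[str], scores: dict,
                      policy: str = "highest", cost_penalty=None) -> str:
    """Score each model under the policy and pick the highest final score.
    Ties fall back to model_refs order (strict > keeps the earlier model)."""
    penalty = cost_penalty or {}
    best_model, best_score = None, -1.0
    for pos, model in enumerate(model_refs):
        score = scores[model]
        if policy == "lowest":
            score = 1.0 - score                       # inverted score
        if policy == "prefer_first":
            score += 0.0001 * (len(model_refs) - 1 - pos)
        elif policy == "prefer_last":
            score += 0.0001 * pos
        score = max(score - penalty.get(model, 0.0), 0.0)
        if score > best_score:                        # ties keep earlier model
            best_model, best_score = model, score
    return best_model
```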

Cost-aware routing

Both algorithms support cost-aware model selection. When enabled, model pricing is factored into the selection score.

Tradeoff resolution order:

  1. Per-algorithm cost_quality_tradeoff (overrides everything)
  2. Global defaults.cost_quality_tradeoff
  3. Default: 0.3 (when cost-aware routing is enabled but no tradeoff specified)

Tradeoff values:

  • 0.0 — pure quality (no cost penalty)
  • 0.3 — default balance (slight cost preference)
  • 0.5 — equal weight to quality and cost
  • 1.0 — pure cost minimization

Cost penalty formula:

cost_penalty = normalized_cost × tradeoff
adjusted_score = max(confidence_score - cost_penalty, 0.0)

Cost normalization: avg(prompt_per_1m, completion_per_1m) / 100.0, clamped to 0.0–1.0.

```yaml
defaults:
  cost_aware_routing: true
  cost_quality_tradeoff: 0.3
routing:
  rules:
    - name: budget_query
      action:
        algorithm:
          type: confidence
          confidence:
            threshold: 0.6
            cost_quality_tradeoff: 0.8    # override: prefer cheap models
```
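The penalty and normalization formulas above can be sketched in Python (function names are illustrative):

```python
def cost_penalty(prompt_per_1m: float, completion_per_1m: float,
                 tradeoff: float) -> float:
    """Cost penalty: normalized average price scaled by the tradeoff."""
    avg_cost = (prompt_per_1m + completion_per_1m) / 2.0
    normalized_cost = min(max(avg_cost / 100.0, 0.0), 1.0)   # clamp to [0, 1]
    return normalized_cost * tradeoff

def adjusted_score(confidence: float, penalty: float) -> float:
    """Penalized score, floored at zero."""
    return max(confidence - penalty, 0.0)
```

For example, a model priced at $30/$60 per 1M tokens with tradeoff 0.5 gets a penalty of `avg(30, 60) / 100 * 0.5 = 0.225`; at tradeoff 0.0 the penalty vanishes, so selection is purely quality-driven.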

Strategies

After model selection, the execution strategy determines how the request is sent to model endpoint(s).

Default

Routes to a single model (the primary model). This is the simplest strategy.

Behavior:

  1. Send request to the primary model.
  2. Return response or timeout error.

```yaml
action:
  strategy: default
  primary_model: gpt-4
```

Fallback

Tries the primary model first, then falls back to alternative models sequentially on failure.

Behavior:

  1. Send request to the primary model.
  2. If primary fails (error or timeout), try the first fallback model.
  3. Continue through fallback models in order until one succeeds.
  4. If all models fail, return a combined error.

The response indicates whether a fallback was used via metadata.

```yaml
action:
  strategy: fallback
  primary_model: gpt-4-turbo
  fallback_models:
    - gpt-4
    - gpt-3.5-turbo
```

Parallel

Sends the request to the primary model and all fallbacks concurrently, returning the first successful response with deterministic priority.

Behavior:

  1. Launch concurrent requests to all models (primary + fallbacks).
  2. As responses arrive, select the winner using priority order (primary first).
  3. A later model only wins if all earlier models have failed.
  4. Abort remaining requests once a winner is determined.

Determinism: The primary model always wins over fallbacks if both succeed, regardless of which responds first. This prevents non-deterministic routing behavior.

```yaml
action:
  strategy: parallel
  primary_model: gpt-4-turbo
  fallback_models:
    - gpt-4
    - claude-3-opus
```
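The deterministic winner rule can be sketched in Python. This simplified sketch assumes all outcomes are already known; the real router decides incrementally as responses arrive and aborts the losers:

```python
def pick_winner(priority_order: list[str], outcomes: dict):
    """Deterministic selection for the parallel strategy: a later model
    wins only when every earlier model in priority order has failed,
    regardless of which response arrived first."""
    for model in priority_order:
        if outcomes.get(model, False):   # first success in priority order
            return model
    return None                          # all models failed
```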

Graceful degradation

llmsoup is designed to continue routing even when individual components fail. Non-critical operations — metrics recording, cache reads/writes, and tracing — are wrapped in error isolation boundaries.

Signal evaluation failures:

  • If a signal evaluator fails (model loading error, timeout, invalid context), the signal is excluded from the evaluation results.
  • Routing continues with the remaining successful signals.
  • The failure is logged as a warning but does not produce a hard error.

Non-critical operation failures:

  • Metrics recording — if Prometheus metrics fail to record, the request proceeds normally.
  • Cache operations — if embedding or response cache fails, the request falls through to a direct model call (cache miss behavior).
  • Tracing — if distributed tracing fails, request processing is unaffected.

What does cause hard errors:

  • Authentication failure (invalid or missing token)
  • No model available to handle the request (no matching rule and no default model)
  • All models in a strategy execution fail (all fallbacks exhausted)

This design ensures that observability and optimization features never block request processing. A signal failure simply means routing has less information — it does not mean the request fails.