
Signals & Routing

llmsoup evaluates configurable signals on every incoming request, matches routing rules against the signal results, optionally selects the best model with a selection algorithm, and executes the request through a chosen strategy. This page explains the entire pipeline in depth.

For YAML configuration syntax and field-level details, see the Configuration Reference.

Every chat completion request follows this pipeline:

HTTP request
Auth middleware ── rejects if token invalid
Signal evaluation ── all signals run concurrently
Rule matching ── rules evaluated in priority order (highest first)
Algorithm (optional) ── confidence or ratings model selection
Strategy execution ── default, fallback, or parallel
Model HTTP call ── forwarded as OpenAI-compatible request
Plugin execution ── pre/post-routing plugins, if configured (see Plugins Reference)
Response ── returned in OpenAI format
  1. Signal evaluation — All configured signals evaluate concurrently against the request payload. Each signal produces a score (0.0–1.0), a triggered boolean, and metadata. Signal failures are isolated — routing continues without the failed signal (see Graceful degradation).
  2. Rule matching — The decision engine evaluates rules in descending priority order. Conditions within a rule use AND, OR, or NOR logic with five condition operators. The first matching rule wins.
  3. Algorithm selection — If the matched rule specifies an algorithm (confidence or ratings), it selects the optimal model from the rule’s model_refs list.
  4. Strategy execution — The selected model(s) are called according to the execution strategy: single call (default), sequential retry (fallback), or concurrent race (parallel).
  5. Response — The model response is returned in OpenAI-compatible format.

Signals

Signals are the building blocks of routing decisions. Each signal evaluates one aspect of the incoming request and produces a result that conditions can match against.

All signals share these result properties:

| Property | Type | Description |
| --- | --- | --- |
| score | 0.0–1.0 | Confidence or similarity score |
| triggered | boolean | Whether the signal matched |
| metadata | key-value pairs | Signal-specific details for debugging |

Keyword

Performs substring matching against the request payload.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | Signal identifier |
| keywords | list | required | Phrases to match against |
| case_sensitive | boolean | false | Enable case-sensitive matching |
| operator | string | OR | Match logic: AND, OR, or NOR |

Operators:

  • OR — triggers if any keyword is found in the payload
  • AND — triggers if all keywords are found in the payload
  • NOR — triggers if no keywords are found (inverse of OR)

Score: 1.0 when triggered, 0.0 otherwise.

```yaml
signals:
  keyword:
    - name: urgent
      keywords: [urgent, critical, emergency]
      case_sensitive: false
      operator: OR
    - name: coding
      keywords: [code, program, function, debug]
```
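The operator semantics described above can be sketched in Python. This is an illustrative sketch only; the function name and result shape are assumptions, not llmsoup's internal API:

```python
def keyword_signal(payload: str, keywords: list[str],
                   operator: str = "OR", case_sensitive: bool = False) -> dict:
    """Sketch of keyword-signal semantics: substring matching with
    OR / AND / NOR combination logic."""
    text = payload if case_sensitive else payload.lower()
    kws = keywords if case_sensitive else [k.lower() for k in keywords]
    hits = [k for k in kws if k in text]        # substring matches found
    if operator == "AND":
        triggered = len(hits) == len(kws)       # every keyword present
    elif operator == "NOR":
        triggered = len(hits) == 0              # no keyword present
    else:                                       # OR (default)
        triggered = len(hits) > 0               # any keyword present
    return {"triggered": triggered,
            "score": 1.0 if triggered else 0.0,
            "metadata": {"matched": hits}}
```

For example, `keyword_signal("This is URGENT", ["urgent", "critical"])` triggers with score 1.0, while the same payload with `operator="AND"` does not, because "critical" is absent.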

Embedding

Computes semantic similarity between the request payload and reference text or candidate phrases using BERT embeddings (via Candle).

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | Signal identifier |
| model | string | required | HuggingFace model ID or local path |
| threshold | float | required | Similarity threshold (0.0–1.0) |
| context_window | integer | 512 | Maximum token length |
| reference_text | string | | Single reference phrase |
| candidates | list | | Multiple candidate phrases |
| aggregation_method | string | max | How candidate scores combine: max, avg, or any |

How it works:

  1. The payload and reference/candidates are embedded using the configured model.
  2. Cosine similarity is computed between the payload and each reference.
  3. The raw cosine similarity (range -1.0 to 1.0) is normalized to 0.0–1.0 using (cosine + 1.0) / 2.0.
  4. The normalized score is compared against the threshold.

Aggregation methods (when using candidates):

| Method | Behavior |
| --- | --- |
| max | Highest similarity score across all candidates (default) |
| avg | Average similarity score across all candidates |
| any | Triggers if any single candidate exceeds the threshold |

The key difference: max and avg produce a single aggregate score compared to the threshold, while any checks each candidate individually — the signal triggers as soon as one candidate passes.
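The normalization and aggregation rules above can be sketched in Python. The helper names are illustrative assumptions, not llmsoup internals:

```python
def normalize(cosine: float) -> float:
    """Map a raw cosine similarity in [-1, 1] onto [0, 1]."""
    return (cosine + 1.0) / 2.0

def aggregate(scores: list[float], method: str, threshold: float):
    """Combine normalized candidate scores and decide triggering.
    Returns (aggregate_score, triggered)."""
    if method == "avg":
        agg = sum(scores) / len(scores)         # single aggregate score
        return agg, agg >= threshold
    if method == "any":
        # each candidate is checked individually against the threshold
        return max(scores), any(s >= threshold for s in scores)
    agg = max(scores)                           # "max" (default)
    return agg, agg >= threshold
```

With candidate scores `[0.9, 0.5, 0.4]` and threshold 0.8, both `max` and `any` trigger (one candidate reaches 0.9), while `avg` does not (mean 0.6 is below the threshold).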

Model loading: Models are loaded lazily on first use. Supported sources:

  • HuggingFace repository ID (downloaded automatically, e.g., sentence-transformers/all-MiniLM-L12-v2)
  • Local file path (must contain config.json, tokenizer.json, and weight files)

Caching: Embeddings are cached per-request (shared across evaluators) and per-evaluator (LRU). Batch precomputation runs before individual evaluators for efficiency.

```yaml
signals:
  embedding:
    - name: semantic-match
      model: sentence-transformers/all-MiniLM-L12-v2
      threshold: 0.8
      candidates:
        - technical documentation
        - API reference
        - code examples
      aggregation_method: max
```

Domain

Classifies the request into subject-matter categories using a BERT-based classifier trained on MMLU categories.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | Domain name to match |
| examples | list | | Example phrases for the domain |
| description | string | | Human-readable domain description |
| mmlu_categories | list | | MMLU category aliases to match |
| threshold | float | see below | Minimum confidence for triggering |

Threshold resolution order:

  1. Signal-level threshold field
  2. Global classifier.category_model.threshold
  3. Default: 0.5
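The resolution order above amounts to a simple fallback chain; a minimal sketch (function name is an assumption):

```python
def resolve_threshold(signal_threshold=None, global_threshold=None) -> float:
    """Resolve a domain signal's confidence threshold."""
    if signal_threshold is not None:
        return signal_threshold      # 1. signal-level threshold field
    if global_threshold is not None:
        return global_threshold      # 2. classifier.category_model.threshold
    return 0.5                       # 3. built-in default
```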

How it works:

  1. The classifier predicts an MMLU category for the request payload.
  2. The predicted category is compared against the signal’s name and mmlu_categories aliases.
  3. If the category matches and confidence meets the threshold, the signal triggers.

All domain signals share a single classifier instance for memory efficiency.

```yaml
classifier:
  category_model:
    model_id: LLM-Semantic-Router/lora_intent_classifier_bert-base-uncased_model
    category_mapping_path: tests/data/domain/category_mapping.json
    threshold: 0.5
signals:
  domain:
    - name: math
      mmlu_categories: [math, abstract_algebra, college_mathematics]
      threshold: 0.7
    - name: science
      examples: [physics, chemistry, biology]
```

Language

Detects the natural language of the request payload using the whatlang library (no ML model required).

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | Signal identifier |
| language | string | required | Expected ISO 639-1 code (case-insensitive) |

How it works:

  1. The payload text is analyzed by whatlang for language detection.
  2. The detected ISO 639-3 code is mapped to ISO 639-1 (e.g., eng → en).
  3. The signal triggers when the detected language matches the configured language.

Score: whatlang confidence score (0.0–1.0). Returns 0.0 if detection fails (e.g., very short or ambiguous text).

Supported languages (ISO 639-1 codes):

| Code | Language | Code | Language | Code | Language |
| --- | --- | --- | --- | --- | --- |
| af | Afrikaans | hi | Hindi | pa | Punjabi |
| ak | Akan | hr | Croatian | pl | Polish |
| am | Amharic | hu | Hungarian | pt | Portuguese |
| ar | Arabic | hy | Armenian | ro | Romanian |
| az | Azerbaijani | id | Indonesian | ru | Russian |
| be | Belarusian | it | Italian | si | Sinhala |
| bg | Bulgarian | ja | Japanese | sk | Slovak |
| bn | Bengali | jv | Javanese | sl | Slovenian |
| ca | Catalan | ka | Georgian | sn | Shona |
| cs | Czech | km | Khmer | sr | Serbian |
| cy | Welsh | kn | Kannada | sv | Swedish |
| da | Danish | ko | Korean | ta | Tamil |
| de | German | la | Latin | te | Telugu |
| el | Greek | lt | Lithuanian | th | Thai |
| en | English | lv | Latvian | tk | Turkmen |
| eo | Esperanto | mk | Macedonian | tl | Tagalog |
| es | Spanish | ml | Malayalam | tr | Turkish |
| et | Estonian | mr | Marathi | uk | Ukrainian |
| fa | Persian | my | Myanmar | ur | Urdu |
| fi | Finnish | nb | Norwegian Bokmål | uz | Uzbek |
| fr | French | ne | Nepali | vi | Vietnamese |
| gu | Gujarati | nl | Dutch | yi | Yiddish |
| he | Hebrew | or | Odia | zh | Chinese |
| zu | Zulu | | | | |

```yaml
signals:
  language:
    - name: english
      language: en
    - name: french
      language: fr
```

Latency

Evaluates model performance based on TPOT (Time Per Output Token) metrics collected from previous requests. TPOT values are smoothed using an exponential moving average (EMA).

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | Signal identifier |
| max_tpot | float | required | TPOT threshold in seconds |

How it works:

  1. The evaluator reads TPOT metrics from the shared cache (populated by model responses).
  2. If the routing context includes candidate_models metadata, only those models are checked. Otherwise, all models with TPOT data are considered.
  3. The best (lowest) TPOT value among candidates is compared against the threshold.
  4. Triggers when the best TPOT is at or below max_tpot.

Score: Confidence is calculated as 1.0 - (best_tpot / max_tpot), clamped to 0.0–1.0. Higher scores indicate faster models relative to the threshold.

TPOT calculation: total_latency_seconds / completion_token_count per response, smoothed with EMA across requests.

Note: TPOT data is only available after at least one response from a model. On first request (cold start), no TPOT data exists and the signal returns score 0.0 without triggering.

```yaml
signals:
  latency:
    - name: fast_response
      max_tpot: 0.05
    - name: moderate_latency
      max_tpot: 0.15
```
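The smoothing and scoring described above can be sketched in Python. The function names and the EMA smoothing factor are illustrative assumptions:

```python
def ema(previous, sample, alpha=0.3):
    """Exponential moving average over TPOT samples.
    alpha is an assumed smoothing factor, not a documented llmsoup value."""
    if previous is None:
        return sample                # first sample seeds the average
    return alpha * sample + (1.0 - alpha) * previous

def latency_signal(tpot_by_model: dict, max_tpot: float) -> dict:
    """Compare the best (lowest) smoothed TPOT against the threshold."""
    if not tpot_by_model:            # cold start: no TPOT data yet
        return {"triggered": False, "score": 0.0}
    best = min(tpot_by_model.values())
    score = min(max(1.0 - best / max_tpot, 0.0), 1.0)   # clamped confidence
    return {"triggered": best <= max_tpot, "score": score}
```

With smoothed TPOTs `{"model-a": 0.02, "model-b": 0.08}` and `max_tpot: 0.05`, the best value 0.02 triggers the signal with score `1.0 - 0.02/0.05 = 0.6`.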

Preference

Routes requests based on LLM-classified user preferences. An external LLM analyzes the conversation to determine which preference route best matches.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | Preference label (route name) |
| description | string | | Description of this preference style |

How it works:

  1. All preference signals are collected and sent to an external LLM classifier along with the conversation context.
  2. The LLM determines which preference label (signal name) best matches the request.
  3. The signal whose name matches the classifier’s output triggers.
  4. Classification results are cached per-request — multiple preference signals share one LLM call.

Important: Preference evaluation makes an external LLM call during routing, which consumes tokens and adds latency. This cost is tracked separately from the main model call.

```yaml
signals:
  preference:
    - name: concise
      description: User prefers short, direct answers
    - name: detailed
      description: User prefers thorough, comprehensive answers
    - name: creative
      description: User wants creative or exploratory responses
```

Fact check

Classifies whether content requires fact-checking using a local binary classifier.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | Must be needs_fact_check or no_fact_check_needed |
| description | string | | Optional description |

Recognized signal names:

  • needs_fact_check — triggers when the classifier predicts FACT_CHECK_NEEDED
  • no_fact_check_needed — triggers when the classifier predicts NO_FACT_CHECK_NEEDED
  • Any other name will never trigger

```yaml
signals:
  fact_check:
    - name: needs_fact_check
      description: Content requiring verification
    - name: no_fact_check_needed
      description: Creative content not needing verification
```

User feedback

Categorizes user follow-up messages using a local classifier.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | required | One of the recognized feedback labels |
| description | string | | Optional description |

Recognized signal names:

  • satisfied — user is satisfied with the response
  • need_clarification — user needs more detail or explanation
  • want_different — user wants a different type of response
  • wrong_answer — user indicates the response was incorrect

The evaluator uses follow_up_message context (if available) rather than the original payload.

```yaml
signals:
  user_feedback:
    - name: satisfied
    - name: need_clarification
    - name: want_different
    - name: wrong_answer
```

Conditions

Routing rules use conditions to check signal results. Each condition specifies a signal name, an operator, and a value to compare against.

equals

Compares signal results for an exact match. This is the default operator when none is specified.

  • Boolean comparison: Matches triggered state — true means the signal fired, false means it did not.
  • Numeric comparison: Matches the signal score with floating-point tolerance (±0.0001).
  • String comparison: Matches signal metadata values (e.g., detected language code).

```yaml
conditions:
  - signal: keyword.urgent
    operator: equals
    value: true      # keyword signal triggered
  - signal: language.english
    operator: equals
    value: true      # detected language is English
  - signal: embedding.semantic
    operator: equals
    value: 0.95      # exact score match (with tolerance)
```
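The three comparison modes of equals can be sketched in Python; the function name and result shape are illustrative assumptions:

```python
TOLERANCE = 1e-4   # the ±0.0001 floating-point tolerance described above

def equals_matches(signal_result: dict, expected) -> bool:
    """Dispatch on the expected value's type, as the equals operator does."""
    # check bool before numbers: in Python, bool is a subclass of int
    if isinstance(expected, bool):
        return signal_result["triggered"] == expected
    if isinstance(expected, (int, float)):
        return abs(signal_result["score"] - expected) <= TOLERANCE
    # string comparison against signal metadata values
    return expected in signal_result["metadata"].values()
```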

contains

String substring matching on signal metadata.

```yaml
conditions:
  - signal: keyword.tech
    operator: contains
    value: "router"      # "router" appears in matched keywords
  - signal: domain.science
    operator: contains
    value: "physics"     # "physics" is a substring of the predicted category
```

greater-than

Numeric comparison: matches when the signal score exceeds the value. For latency signals, the comparison is against the best TPOT value.

```yaml
conditions:
  - signal: embedding.relevance
    operator: greater-than
    value: 0.85    # similarity score > 0.85
  - signal: latency.speed
    operator: greater-than
    value: 0.1     # best TPOT > 0.1 seconds (slow models)
```

less-than

Numeric comparison: matches when the signal score is below the value. For latency signals, the comparison is against the best TPOT value.

```yaml
conditions:
  - signal: embedding.relevance
    operator: less-than
    value: 0.3     # low similarity
  - signal: latency.speed
    operator: less-than
    value: 0.05    # best TPOT < 50ms (fast models)
```

in

Checks whether the signal score or metadata value is contained in a list.

```yaml
conditions:
  - signal: domain.category
    operator: in
    value: [math, physics, chemistry]    # predicted domain in set
  - signal: language.detected
    operator: in
    value: [en, fr, de]                  # detected language in set
```

Combining conditions

Conditions within a rule are combined using the rule’s operator field:

| Operator | Behavior |
| --- | --- |
| AND (default) | All conditions must match |
| OR | At least one condition must match |
| NOR | No condition may match |

```yaml
routing:
  rules:
    - name: complex_query
      priority: 100
      operator: AND    # both conditions required
      conditions:
        - signal: keyword.technical
          operator: equals
          value: true
        - signal: embedding.complexity
          operator: greater-than
          value: 0.8
      action:
        primary_model: gpt-4
```
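Rule matching as described (priority order, first match wins, operator combination) can be sketched in Python. The rule representation here is a simplified assumption, with per-condition outcomes precomputed as a "results" list:

```python
def rule_matches(condition_results, operator="AND"):
    """Combine per-condition outcomes with the rule-level operator."""
    if operator == "OR":
        return any(condition_results)
    if operator == "NOR":
        return not any(condition_results)
    return all(condition_results)      # AND is the default

def first_matching_rule(rules):
    """Evaluate rules in descending priority order; the first match wins."""
    for rule in sorted(rules, key=lambda r: r["priority"], reverse=True):
        if rule_matches(rule["results"], rule.get("operator", "AND")):
            return rule["name"]
    return None                        # no rule matched
```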

Algorithms

When a routing rule includes an algorithm configuration, the decision engine uses it to select the optimal model from the rule’s model_refs list. Two algorithms are available: confidence and ratings.

Confidence

Selects the first model that meets a confidence threshold, with escalation to the last model if none qualify.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| threshold | float | required | Minimum confidence score (0.0–1.0) |
| confidence_method | string | null | Scoring method: null, avg_logprob, margin, or hybrid |
| hybrid_weights | object | | Weights for the hybrid method |
| escalation_order | string | | Reserved for future use (planned modes: size, cost, automix) |
| cost_quality_tradeoff | float | | Per-algorithm override (0.0 = quality, 1.0 = cost) |
| on_error | string | skip | Error handling: skip or fail |

Confidence methods:

| Method | How the score is computed |
| --- | --- |
| null (default) | Maximum signal score from the matched rule’s conditions |
| avg_logprob | Normalized average log-probability from model response tokens |
| margin | Normalized average margin between top-2 token probabilities |
| hybrid | Weighted combination of logprob and margin scores |

Logprob normalization: Average log-probability is mapped from [-3.0, 0.0] to [0.0, 1.0]:

normalized = (avg_logprob + 3.0) / 3.0 (clamped to 0.0–1.0)

Margin normalization: Average margin between top-1 and top-2 token log-probabilities:

normalized = 1.0 - exp(-avg_margin / 3.0) (clamped to 0.0–1.0)

When fewer than 2 top logprobs are available for a token, a default margin of 2.0 is used.

Hybrid method: Combines logprob and margin scores:

score = logprob_weight × normalized_logprob + margin_weight × normalized_margin

```yaml
hybrid_weights:
  logprob_weight: 0.6
  margin_weight: 0.4
```
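The two normalizations and their hybrid combination can be sketched in Python, directly from the formulas above (function names are illustrative):

```python
import math

def normalize_logprob(avg_logprob: float) -> float:
    """Map an average log-probability from [-3, 0] onto [0, 1], clamped."""
    return min(max((avg_logprob + 3.0) / 3.0, 0.0), 1.0)

def normalize_margin(avg_margin: float) -> float:
    """Map an average top-2 margin onto [0, 1) via exponential saturation."""
    return min(max(1.0 - math.exp(-avg_margin / 3.0), 0.0), 1.0)

def hybrid_score(avg_logprob: float, avg_margin: float,
                 logprob_weight: float = 0.6,
                 margin_weight: float = 0.4) -> float:
    """Weighted combination of the two normalized scores."""
    return (logprob_weight * normalize_logprob(avg_logprob)
            + margin_weight * normalize_margin(avg_margin))
```

For example, a perfectly confident model (average log-probability 0.0) normalizes to 1.0, while an average log-probability of -1.5 normalizes to 0.5.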

Selection behavior:

  1. Each model in model_refs receives a confidence score.
  2. If cost-aware routing is enabled, a cost penalty is subtracted from each score.
  3. The first model meeting the threshold is selected.
  4. If no model meets the threshold, the last model in model_refs is selected (escalation).
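The selection loop above can be sketched in Python; the function signature and the per-model penalty mapping are illustrative assumptions:

```python
def select_by_confidence(model_refs: list[str], scores: dict,
                         threshold: float, cost_penalty=None) -> str:
    """Threshold-first selection with last-model escalation."""
    penalty = cost_penalty or {}
    for model in model_refs:
        # cost penalty (if cost-aware routing is on) is subtracted first
        adjusted = max(scores[model] - penalty.get(model, 0.0), 0.0)
        if adjusted >= threshold:
            return model             # first model meeting the threshold wins
    return model_refs[-1]            # escalation: last model in model_refs
```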

Error handling (on_error):

  • skip (default) — if the confidence method (e.g., avg_logprob) is unavailable, fall back to signal-based scoring
  • fail — return an error immediately

```yaml
algorithm:
  type: confidence
  confidence:
    threshold: 0.7
    confidence_method: margin
    cost_quality_tradeoff: 0.5
    on_error: skip
model_refs:
  - gpt-3.5-turbo
  - gpt-4
  - gpt-4-turbo
```

In this example, if no model reaches 0.7 confidence, gpt-4-turbo (last) is selected as escalation.

Ratings

Computes a score per model using a configurable policy and selects the highest-rated model.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| policy | string | highest | Scoring policy |
| cost_quality_tradeoff | float | | Per-algorithm override (0.0 = quality, 1.0 = cost) |
| on_error | string | skip | Error handling: skip or fail |

Scoring policies:

| Policy | Behavior |
| --- | --- |
| highest (default) | Select the model with the highest signal score |
| lowest | Select the model with the lowest signal score (inverted: 1.0 - score) |
| prefer_first | Tie-break favoring earlier models in model_refs order |
| prefer_last | Tie-break favoring later models in model_refs order |

prefer_first and prefer_last add a tiny position-based bonus (0.0001 per position) to break ties deterministically.

Selection behavior:

  1. Each model receives a score from the policy.
  2. If cost-aware routing is enabled, a cost penalty is subtracted.
  3. The model with the highest final score is selected.
  4. Ties are broken by model_refs order.

```yaml
algorithm:
  type: ratings
  ratings:
    policy: highest
    cost_quality_tradeoff: 0.3
model_refs:
  - claude-3-haiku
  - gpt-3.5-turbo
  - gpt-4
```
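The policies and the 0.0001 position-based tie-break can be sketched in Python. The function name and penalty mapping are illustrative assumptions:

```python
def select_by_ratings(model_refs: list[str], scores: dict,
                      policy: str = "highest", cost_penalty=None) -> str:
    """Score each model under the policy and pick the highest final score.
    Ties fall back to model_refs order (strict > keeps the earlier model)."""
    penalty = cost_penalty or {}
    best_model, best_score = None, -1.0
    for pos, model in enumerate(model_refs):
        score = scores[model]
        if policy == "lowest":
            score = 1.0 - score                       # inverted score
        if policy == "prefer_first":
            score += 0.0001 * (len(model_refs) - 1 - pos)
        elif policy == "prefer_last":
            score += 0.0001 * pos
        score = max(score - penalty.get(model, 0.0), 0.0)
        if score > best_score:                        # ties keep earlier model
            best_model, best_score = model, score
    return best_model
```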

Cost-aware routing

Both algorithms support cost-aware model selection. When enabled, model pricing is factored into the selection score.

Tradeoff resolution order:

  1. Per-algorithm cost_quality_tradeoff (overrides everything)
  2. Global defaults.cost_quality_tradeoff
  3. Default: 0.3 (when cost-aware routing is enabled but no tradeoff specified)

Tradeoff values:

  • 0.0 — pure quality (no cost penalty)
  • 0.3 — default balance (slight cost preference)
  • 0.5 — equal weight to quality and cost
  • 1.0 — pure cost minimization

Cost penalty formula:

cost_penalty = normalized_cost × tradeoff
adjusted_score = max(confidence_score - cost_penalty, 0.0)

Cost normalization: avg(prompt_per_1m, completion_per_1m) / 100.0, clamped to 0.0–1.0.

```yaml
defaults:
  cost_aware_routing: true
  cost_quality_tradeoff: 0.3
routing:
  rules:
    - name: budget_query
      action:
        algorithm:
          type: confidence
          confidence:
            threshold: 0.6
            cost_quality_tradeoff: 0.8    # override: prefer cheap models
```
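The penalty and normalization formulas above can be sketched in Python (function names are illustrative):

```python
def cost_penalty(prompt_per_1m: float, completion_per_1m: float,
                 tradeoff: float) -> float:
    """Cost penalty: normalized average price scaled by the tradeoff."""
    avg_cost = (prompt_per_1m + completion_per_1m) / 2.0
    normalized_cost = min(max(avg_cost / 100.0, 0.0), 1.0)   # clamp to [0, 1]
    return normalized_cost * tradeoff

def adjusted_score(confidence: float, penalty: float) -> float:
    """Penalized score, floored at zero."""
    return max(confidence - penalty, 0.0)
```

For example, a model priced at $30/$60 per 1M tokens with tradeoff 0.5 gets a penalty of `avg(30, 60) / 100 * 0.5 = 0.225`; at tradeoff 0.0 the penalty vanishes, so selection is purely quality-driven.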

Strategies

After model selection, the execution strategy determines how the request is sent to model endpoint(s).

Default

Routes to a single model (the primary model). This is the simplest strategy.

Behavior:

  1. Send request to the primary model.
  2. Return response or timeout error.

```yaml
action:
  strategy: default
  primary_model: gpt-4
```

Fallback

Tries the primary model first, then falls back to alternative models sequentially on failure.

Behavior:

  1. Send request to the primary model.
  2. If primary fails (error or timeout), try the first fallback model.
  3. Continue through fallback models in order until one succeeds.
  4. If all models fail, return a combined error.

The response indicates whether a fallback was used via metadata.

```yaml
action:
  strategy: fallback
  primary_model: gpt-4-turbo
  fallback_models:
    - gpt-4
    - gpt-3.5-turbo
```

Parallel

Sends the request to the primary model and all fallbacks concurrently, returning the first successful response with deterministic priority.

Behavior:

  1. Launch concurrent requests to all models (primary + fallbacks).
  2. As responses arrive, select the winner using priority order (primary first).
  3. A later model only wins if all earlier models have failed.
  4. Abort remaining requests once a winner is determined.

Determinism: The primary model always wins over fallbacks if both succeed, regardless of which responds first. This prevents non-deterministic routing behavior.

```yaml
action:
  strategy: parallel
  primary_model: gpt-4-turbo
  fallback_models:
    - gpt-4
    - claude-3-opus
```
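The deterministic winner rule can be sketched in Python. This simplified sketch assumes all outcomes are already known; the real router decides incrementally as responses arrive and aborts the losers:

```python
def pick_winner(priority_order: list[str], outcomes: dict):
    """Deterministic selection for the parallel strategy: a later model
    wins only when every earlier model in priority order has failed,
    regardless of which response arrived first."""
    for model in priority_order:
        if outcomes.get(model, False):   # first success in priority order
            return model
    return None                          # all models failed
```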

Graceful degradation

llmsoup is designed to continue routing even when individual components fail. Non-critical operations — metrics recording, cache reads/writes, and tracing — are wrapped in error isolation boundaries.

Signal evaluation failures:

  • If a signal evaluator fails (model loading error, timeout, invalid context), the signal is excluded from the evaluation results.
  • Routing continues with the remaining successful signals.
  • The failure is logged as a warning but does not produce a hard error.

Non-critical operation failures:

  • Metrics recording — if Prometheus metrics fail to record, the request proceeds normally.
  • Cache operations — if embedding or response cache fails, the request falls through to a direct model call (cache miss behavior).
  • Tracing — if distributed tracing fails, request processing is unaffected.

What does cause hard errors:

  • Authentication failure (invalid or missing token)
  • No model available to handle the request (no matching rule and no default model)
  • All models in a strategy execution fail (all fallbacks exhausted)

This design ensures that observability and optimization features never block request processing. A signal failure simply means routing has less information — it does not mean the request fails.