Introduction

Most teams using large language models start the same way: pick a model, wire up an API call, and ship. It works — until it doesn’t. Costs creep up. A cheaper model would handle 80% of requests just fine, but there’s no clean way to split traffic. You add a second provider for redundancy, and suddenly your routing logic is a growing tangle of if-else branches scattered across application code.

llmsoup exists to move that decision out of your application and into infrastructure where it belongs.

llmsoup is a lightweight routing proxy that sits between your application and your LLM providers. It accepts standard OpenAI-compatible requests, evaluates a set of configurable signals on each incoming message — keyword patterns, semantic similarity, domain classification, language detection, latency targets — and combines them with cost-aware routing to send each request to the best model for the job, balancing quality, speed, and price.

Your application makes the same API call it always has. llmsoup handles the rest: which model to use, whether to try a cheaper option first, when to fall back to a more capable one, and how to track what happened.

One binary. One YAML config file. No code changes in your application.
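
To make that concrete, here is what a minimal configuration might look like. This is a sketch only; the key names and structure are illustrative, not taken from the actual schema.

    # Illustrative sketch, not the real schema: two models and a
    # default decision that tries the cheaper one first.
    models:
      - name: fast-cheap
        provider: openai
        model: gpt-4o-mini
      - name: accurate
        provider: openai
        model: gpt-4o

    decisions:
      - name: default
        model: fast-cheap
        fallback: accurate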

The standard approach to multi-model deployments is hardcoded routing: “coding questions go to Model A, everything else goes to Model B.” This works until requirements shift — and they always do. New models appear. Pricing changes. A model that was fast becomes slow under load. Hardcoded rules can’t adapt.

llmsoup takes a different approach: signal-based routing. Instead of mapping request categories to models by hand, you define signals that evaluate each request in real time, and rules that combine those signals into routing decisions. The system adapts to what the request actually contains, not what you guessed it might contain when you wrote the routing logic.

This means you can express nuanced policies in configuration rather than code:

  • Route requests containing medical terminology to a high-accuracy model, everything else to a cost-efficient one
  • Detect when a user is asking in French and prefer a model with strong multilingual performance
  • Fall back to a secondary provider when the primary one exceeds your latency threshold
  • Use semantic similarity to catch questions about a specific domain even when the keywords don’t match

The routing logic lives in a YAML file that you can version, review, and update without redeploying your application.
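
As a sketch of how the first policy above might read (the signal type and field names are assumptions, not the documented schema):

    # Illustrative only: requests matching medical terminology go to a
    # high-accuracy model; everything else takes the default decision.
    signals:
      - name: medical-terms
        type: keyword
        utterances: ["diagnosis", "dosage", "contraindication"]

    decisions:
      - name: medical
        when: medical-terms
        model: accurate
      - name: default
        model: fast-cheap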

llmsoup’s design is heavily influenced by Aurelio Labs’ semantic-router project, which pioneered the idea of routing LLM traffic based on semantic signals rather than simple pattern matching.

Several core concepts come directly from semantic-router: the structure of decisions (routing rules composed of signal conditions), the concept of signals as first-class evaluation units, utterances for keyword-based matching, and the way model references are organized. llmsoup’s YAML configuration schema maintains backward compatibility with semantic-router’s format, so existing configurations can be migrated with minimal changes.

Where llmsoup diverges is in implementation:

  • Rust implementation — llmsoup is a ground-up rewrite in Rust for low overhead and predictable performance. Routing decisions add single-digit milliseconds of latency, and the proxy idles at well under 500 MB of memory.
  • Built-in authentication — llmsoup adds token-based authentication middleware on all endpoints, so access control is handled at the proxy level without external dependencies.
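
Since access control lives in the proxy, it would be configured alongside everything else; a hypothetical sketch (the auth key and its shape are assumptions):

    # Hypothetical: bearer tokens accepted on all endpoints.
    auth:
      tokens:
        - "replace-with-a-long-random-token"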

A few deliberate choices shape how llmsoup works:

Drop-in OpenAI compatibility. llmsoup exposes the same /v1/chat/completions endpoint with the same request and response format as the OpenAI API. Point your existing client at llmsoup and it works — no SDK changes, no wrapper libraries, no new abstractions.
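
In practice the only client-side change is the base URL. Assuming a server block like the hypothetical one below, any OpenAI SDK or HTTP client that lets you override its base URL (for example the OPENAI_BASE_URL environment variable in the official SDKs) can target llmsoup directly:

    # Hypothetical server block: clients send their usual requests to
    # http://<host>:8080/v1/chat/completions instead of api.openai.com.
    server:
      listen: "0.0.0.0:8080"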

Single YAML config. All routing rules, signal definitions, model endpoints, plugin configuration, and operational settings live in one configuration file. There’s no database, no admin UI, and no runtime API for changing behavior. You edit a file, restart the proxy, and the new configuration takes effect.

Local-first. llmsoup runs as a single binary with no external service dependencies. Embedding models and domain classifiers run locally via Candle. Caching uses in-process memory. The only network calls are the ones you configure — to your LLM providers.
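
A signal that needs an ML model would reference it from the same config; a hedged sketch of a semantic-similarity signal backed by a locally loaded embedding model (the model id, keys, and threshold are assumptions):

    # Hypothetical: embeddings computed in-process via Candle.
    signals:
      - name: billing-questions
        type: semantic
        model: sentence-transformers/all-MiniLM-L6-v2
        utterances: ["refund my subscription", "update my payment method"]
        threshold: 0.75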

Graceful degradation. If a signal evaluator fails — a model doesn’t load, an embedding times out — llmsoup continues routing without that signal rather than returning an error. Non-critical failures shouldn’t break your application.

Lazy model loading. ML models used for signal evaluation (embeddings, domain classification) are loaded on first use, not at startup. This keeps the memory footprint low when not all signal types are active.

Extensible via plugins. The plugin system lets you add processing steps — caching, content filtering, header mutation — without modifying routing logic. Plugins execute at the decision level and can run before or after model calls.
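
A plugin might be attached to a decision like this; a hypothetical sketch (the plugin names and fields are assumptions, though the before/after phases themselves are part of the design described above):

    # Hypothetical: an in-process cache that runs before the model call
    # and a header mutation that runs after it.
    decisions:
      - name: default
        model: fast-cheap
        plugins:
          - type: cache
            phase: before
            ttl: 300s
          - type: headers
            phase: after
            set:
              x-routed-by: llmsoup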

llmsoup is built for three audiences:

Developers who route requests across multiple LLM providers and want to stop maintaining ad-hoc routing logic in application code. If you’re writing if-else chains to pick models, or duplicating API client code across services, llmsoup replaces that with a single proxy and a config file.

Operators who manage LLM infrastructure and need visibility into how models are being used. llmsoup’s Prometheus metrics give you per-model request counts, latency distributions, cost tracking, signal evaluation breakdowns, and cache hit rates — the kind of observability you need to make informed decisions about model selection and capacity.

Teams optimizing LLM costs who want to send simple requests to cheaper models and reserve expensive ones for the tasks that need them. Cost-aware routing lets you set a quality-vs-cost tradeoff in configuration and let the proxy handle the rest; no application changes required.
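
That tradeoff might be a single knob in the routing section; a hypothetical sketch (the key name and scale are assumptions):

    # Hypothetical: 0.0 always picks the cheapest viable model,
    # 1.0 always picks the highest-quality one.
    routing:
      cost_bias: 0.3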

If you’re running a single model with no plans to change, llmsoup adds complexity you don’t need. It’s designed for environments where multiple models coexist and the routing between them matters.

Ready to try it? The Getting Started guide walks you through installation, creating your first configuration, and routing your first request — all in under five minutes.