Deployment

Deploy llmsoup in production — from a single binary to a Docker Compose stack with monitoring.

llmsoup is a single static binary with no runtime dependencies. There are three ways to deploy it:

Method  | Best for                           | Requirements
Binary  | Simplest setup, bare-metal servers | Pre-built binary via install script
Docker  | Container orchestration, CI/CD     | Docker + Docker Compose
systemd | Long-running Linux services        | Linux with systemd

All methods use the same YAML configuration file and environment variables.

Terminal window
curl -fsSL https://llmsoup.insideapp.fr/install.sh | sh

The install script downloads the latest pre-built binary for your platform (Linux / macOS).

Terminal window
# Generate a starter config
llmsoup prepare
# Validate the config
llmsoup validate
# Start the server
llmsoup serve

Before starting the server, generate and customize a configuration file:

Terminal window
# Generate config.yaml with all options commented
llmsoup prepare
# Validate config before starting
llmsoup validate

The server resolves the config file in this order:

  1. --config CLI flag (e.g., llmsoup serve --config /etc/llmsoup/config.yaml)
  2. LLMSOUP_CONFIG environment variable
  3. config.yaml in the current directory (default)

For the full list of configuration options, see the Configuration Reference.

All llmsoup environment variables use the LLMSOUP_ prefix.

Variable                      | Purpose                                             | Default
LLMSOUP_CONFIG                | Path to YAML configuration file                     | config.yaml
LLMSOUP_HOST                  | Server bind address                                 | 127.0.0.1
LLMSOUP_PORT                  | Server bind port                                    | 8080
LLMSOUP_LOG                   | Log level filter (trace, debug, info, warn, error)  | info
LLMSOUP_ONNX_MODELS_DIR       | Directory for ML model storage                      | ~/.llmsoup/models
LLMSOUP_ALLOW_COMMAND_SECRETS | Enable command-based secret resolution              | disabled
LLMSOUP_PREPARE_OUTPUT        | Override output path for llmsoup prepare            | config.yaml (cwd)

Terminal window
export LLMSOUP_HOST=0.0.0.0
export LLMSOUP_PORT=9000
export LLMSOUP_LOG=debug
export LLMSOUP_CONFIG=/etc/llmsoup/config.yaml
llmsoup serve

Create a Dockerfile:

FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates curl && rm -rf /var/lib/apt/lists/*
RUN curl -fsSL https://llmsoup.insideapp.fr/install.sh | sh
ENTRYPOINT ["llmsoup"]
CMD ["serve"]

docker-compose.yml:

services:
  llmsoup:
    build: .
    container_name: llmsoup
    ports:
      - "8080:8080"
    volumes:
      - ./config.yaml:/etc/llmsoup/config.yaml:ro
    environment:
      - LLMSOUP_CONFIG=/etc/llmsoup/config.yaml
      - LLMSOUP_HOST=0.0.0.0
      - LLMSOUP_PORT=8080
      - LLMSOUP_LOG=info
    healthcheck:
      test: ["CMD", "curl", "-sf", "http://localhost:8080/metrics"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
    restart: unless-stopped
Terminal window
# Build and start
docker compose up -d
# Check logs
docker compose logs -f llmsoup
# Stop
docker compose down

Full stack: llmsoup + Prometheus + Grafana


This example deploys llmsoup with authentication, Prometheus metrics collection, and a Grafana dashboard — all in one compose file.

Directory structure:

deploy/
├── docker-compose.yml
├── config.yaml              # llmsoup configuration
├── tokens.yaml              # auth tokens (YAML format)
├── prometheus/
│   └── prometheus.yml       # Prometheus scrape config
└── grafana/
    ├── provisioning/
    │   ├── datasources/
    │   │   └── datasource.yml
    │   └── dashboards/
    │       └── dashboard.yml
    └── dashboards/
        └── llmsoup.json     # Grafana dashboard

docker-compose.yml:

services:
  llmsoup:
    build: ..
    container_name: llmsoup
    ports:
      - "8080:8080"
    volumes:
      - ./config.yaml:/etc/llmsoup/config.yaml:ro
      - ./tokens.yaml:/etc/llmsoup/tokens.yaml:ro
    environment:
      - LLMSOUP_CONFIG=/etc/llmsoup/config.yaml
      - LLMSOUP_HOST=0.0.0.0
      - LLMSOUP_PORT=8080
      - LLMSOUP_LOG=info
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LLMSOUP_TOKEN_A=${LLMSOUP_TOKEN_A}
      - LLMSOUP_TOKEN_B=${LLMSOUP_TOKEN_B}
    healthcheck:
      test: ["CMD", "curl", "-sf", "http://localhost:8080/metrics"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:v2.53.3
    container_name: llmsoup-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    depends_on:
      llmsoup:
        condition: service_healthy
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.4.0
    container_name: llmsoup-grafana
    ports:
      - "3000:3000"
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
      - grafana_data:/var/lib/grafana
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer
      - GF_DASHBOARDS_DEFAULT_HOME_DASHBOARD_PATH=/var/lib/grafana/dashboards/llmsoup.json
    depends_on:
      - prometheus
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

prometheus/prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: llmsoup
    metrics_path: /metrics
    static_configs:
      - targets:
          - llmsoup:8080

grafana/provisioning/datasources/datasource.yml:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

grafana/provisioning/dashboards/dashboard.yml:

apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: ""
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: false

config.yaml (with auth and secret resolution):

auth:
  enabled: true
  tokens_file: "/etc/llmsoup/tokens.yaml"

models:
  - name: gpt-4o
    provider: openai
    access_key:
      env: OPENAI_API_KEY
    endpoints:
      - url: https://api.openai.com/v1/chat/completions
  - name: claude-sonnet
    provider: anthropic
    access_key:
      env: ANTHROPIC_API_KEY
    endpoints:
      - url: https://api.anthropic.com/v1/chat/completions

tokens.yaml:

tokens:
  - id: service-a
    description: "Service A access"
    secret:
      env: LLMSOUP_TOKEN_A
  - id: service-b
    description: "Service B access"
    secret:
      env: LLMSOUP_TOKEN_B

Start the full stack:

Terminal window
# Set your provider API keys and auth tokens
export OPENAI_API_KEY=<your-openai-key>
export ANTHROPIC_API_KEY=<your-anthropic-key>
export LLMSOUP_TOKEN_A=<your-service-a-token>
export LLMSOUP_TOKEN_B=<your-service-b-token>
# Start everything
cd deploy
docker compose up -d
# Verify llmsoup is healthy
curl -s http://localhost:8080/metrics | head -5
# Make an authenticated request
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer <your-service-a-token>" \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "Hello"}]}'
# Open Grafana dashboard
open http://localhost:3000

Use environment variables or mounted files for API keys — never put secrets directly in config.yaml:

services:
  llmsoup:
    environment:
      - OPENAI_API_KEY=<your-api-key>
      - ANTHROPIC_API_KEY=<your-api-key>
      - LLMSOUP_TOKEN_A=<your-token>
      - LLMSOUP_TOKEN_B=<your-token>
    volumes:
      - ./config.yaml:/etc/llmsoup/config.yaml:ro
      - ./tokens.yaml:/etc/llmsoup/tokens.yaml:ro

In your config.yaml, reference secrets with structured YAML fields:

models:
  - name: gpt-4o
    provider: openai
    access_key:
      env: OPENAI_API_KEY
    endpoints:
      - url: https://api.openai.com/v1/chat/completions

See the Configuration Reference — Secret Resolution for all resolution methods (env, file, vault, command).

llmsoup requires token-based authentication on all endpoints except /metrics.

In config.yaml, define tokens via secret references or an external tokens file:

# Inline tokens (via environment variables)
auth:
  enabled: true
  tokens:
    - env: SERVICE_A_TOKEN
    - env: SERVICE_B_TOKEN

# External tokens file (YAML format with user IDs)
auth:
  enabled: true
  tokens_file: "/etc/llmsoup/tokens.yaml"

Include the resolved token value in the Authorization header:

Terminal window
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer <your-token-value>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

To rotate tokens:

  1. Update the environment variables or secret files referenced by your token configuration
  2. Restart the llmsoup process (configuration is re-read on startup)
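
For the Docker Compose stack above, a rotation could look like the following sketch (assuming the service-a token is supplied through the LLMSOUP_TOKEN_A variable, as in the earlier example):

Terminal window
# Point the token reference at a new secret value (here: an environment variable)
export LLMSOUP_TOKEN_A=<new-service-a-token>
# Recreate the container so the configuration is re-read with the new value
docker compose up -d --force-recreate llmsoup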

llmsoup exposes a Prometheus-compatible metrics endpoint at /metrics.

  • URL: GET /metrics
  • Authentication: None required (bypasses auth middleware)
  • Format: Prometheus text exposition format
  • Metric prefix: llmsoup_ with snake_case names and unit suffixes (_total, _seconds, _bytes)

Key metrics include request counts, routing latency, signal evaluation timing, cache hit rates, model errors, and cost tracking.
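
A quick way to see the naming convention in practice is to filter the endpoint output for the prefix (the exact metric names depend on your configuration):

Terminal window
# Show a few metrics exposed by a running instance (all share the llmsoup_ prefix)
curl -s http://localhost:8080/metrics | grep '^llmsoup_' | head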

Add llmsoup as a scrape target in your prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: llmsoup
    metrics_path: /metrics
    static_configs:
      - targets:
          - host.docker.internal:8080

For dashboarding, the full-stack Docker Compose example above already includes Prometheus and Grafana services with an auto-provisioned datasource and a pre-loaded dashboard.

Access the Grafana dashboard at http://localhost:3000. Anonymous viewer access is enabled by default for local development.
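
For anything beyond local development you will likely want to disable anonymous access and require a login instead. One way to do that is to override the Grafana environment in the compose file (GF_SECURITY_ADMIN_PASSWORD is a standard Grafana variable; GRAFANA_ADMIN_PASSWORD here is a placeholder you supply):

  grafana:
    environment:
      # Disable anonymous access and set an admin password instead
      - GF_AUTH_ANONYMOUS_ENABLED=false
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}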

Plan your deployment resources based on which signals are configured.

State                | Expected RSS | Notes
Idle (no models)     | ~50–100 MB   | Binary only, no signal evaluators loaded
Idle (models loaded) | ≤ 500 MB     | After first request triggers lazy model loading
Under load (50 rps)  | ≤ 1 GB       | Includes response caches, embedding caches, active connections

Memory usage depends on which signals are enabled in your configuration:

  • No embedding/domain signals — the server stays well under 100 MB because no ML models are loaded.
  • Embedding signals — each embedding model adds 30–250 MB depending on the model variant (light, flash, or pro).
  • Domain classification signal — the BERT-based classifier adds approximately 150–200 MB.

Models are loaded lazily on first use rather than at startup, so idle memory remains low until the first request that triggers a signal evaluation.

llmsoup uses the Tokio async runtime and spends most CPU time on:

  • ML inference (embedding generation, domain classification) — the primary CPU consumer
  • HTTP request handling — minimal overhead per request
  • Routing decisions — rule matching and signal evaluation are fast (p95 < 10 ms)

For production workloads, 2–4 CPU cores are sufficient. CPU-bound ML inference runs on Tokio’s blocking thread pool and does not block the async event loop.
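
If you want to encode this sizing guidance directly in the Compose file, resource limits along these lines are a reasonable starting point (the numbers are assumptions based on the figures above; tune them to your traffic):

  llmsoup:
    deploy:
      resources:
        limits:
          # Upper bounds derived from the sizing table above
          cpus: "4"
          memory: 1G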

When you start llmsoup serve, the server automatically downloads ML models from HuggingFace Hub if embedding or domain signals are configured in your config.yaml. Downloads happen once and are cached locally.

What triggers downloads:

Signal type       | Model                                                               | Approximate size
embedding (light) | sentence-transformers/all-MiniLM-L12-v2                             | ~33 MB
embedding (pro)   | Qwen/Qwen3-Embedding-0.6B                                           | ~250 MB
embedding (flash) | google/embeddinggemma-300m                                          | ~100 MB
domain            | LLM-Semantic-Router/lora_intent_classifier_bert-base-uncased_model | ~150–200 MB

Models are stored at ~/.llmsoup/models/ by default, or wherever LLMSOUP_ONNX_MODELS_DIR points.
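
In containerized deployments it is worth persisting that directory so models are not re-downloaded every time the container is recreated. A sketch for the Compose setup above (the /models container path is an arbitrary choice; any writable path works):

  llmsoup:
    environment:
      # Point the model cache at a path backed by a named volume
      - LLMSOUP_ONNX_MODELS_DIR=/models
    volumes:
      - llmsoup_models:/models

volumes:
  # Add alongside any existing named volumes
  llmsoup_models: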

  1. Server starts and loads configuration
  2. If embedding/domain signals are configured, models are downloaded (first startup only)
  3. Server begins accepting requests
  4. Models are loaded into memory lazily on first use (not at startup)

Set LLMSOUP_SKIP_MODEL_DOWNLOAD=true to skip all model downloads. This is primarily used for CI/CD and testing — the server will start without embedding or domain signal support.

After startup, models are loaded lazily — the first request that triggers an embedding or domain signal will be slower while the model initializes in memory.

Recommended warm-up approach:

Terminal window
# After starting the server, send a warm-up request
curl http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer <your-token>" \
-H "Content-Type: application/json" \
-d '{"model": "auto", "messages": [{"role": "user", "content": "warm up"}]}'

This triggers lazy loading of any configured ML models so subsequent requests are served at full speed.

llmsoup does not expose a dedicated /health or /ready endpoint. Use the /metrics endpoint as a health probe:

Property         | Value
URL              | GET /metrics
Authentication   | None required (bypasses auth middleware)
Healthy response | HTTP 200 with Prometheus text format
Unhealthy        | Connection refused or timeout

Recommended probe settings:

Setting      | Value | Rationale
Interval     | 30 s  | Balances responsiveness with low overhead
Timeout      | 5 s   | Generous enough for loaded servers
Retries      | 3     | Tolerates transient blips
Start period | 10 s  | Allows time for model downloads on first start

The Docker Compose examples in this guide already include a health check configuration:

healthcheck:
  test: ["CMD", "curl", "-sf", "http://localhost:8080/metrics"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 10s

For Kubernetes, a liveness probe against the same endpoint works the same way:

livenessProbe:
  httpGet:
    path: /metrics
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3

For systemd deployments, verify that the server is responding after startup and restart it on failure:

[Service]
ExecStartPost=/bin/sh -c 'sleep 5 && curl -sf http://localhost:8080/metrics > /dev/null'
Restart=on-failure
RestartSec=5
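
A complete unit file is not shown in this guide, but a minimal sketch built around these directives, saved for example as /etc/systemd/system/llmsoup.service, could look like this (the binary path, service user, and config location are assumptions; adjust them to your install):

[Unit]
Description=llmsoup LLM router
After=network-online.target
Wants=network-online.target

[Service]
# User, paths, and bind address below are examples, not defaults
User=llmsoup
Environment=LLMSOUP_CONFIG=/etc/llmsoup/config.yaml
Environment=LLMSOUP_HOST=0.0.0.0
Environment=LLMSOUP_PORT=8080
ExecStart=/usr/local/bin/llmsoup serve
ExecStartPost=/bin/sh -c 'sleep 5 && curl -sf http://localhost:8080/metrics > /dev/null'
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target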

llmsoup uses the LLMSOUP_LOG environment variable to control log verbosity. It defaults to info.

Level | Output                  | Use case
error | Errors only             | Minimal output, alerts-only monitoring
warn  | Errors + warnings       | Production baseline — includes graceful degradation notices
info  | Normal operational logs | Recommended for production — request flow, startup, config
debug | Detailed internal state | Troubleshooting routing decisions and signal evaluation
trace | Everything              | Deep debugging — includes raw request/response data

Terminal window
# Production
export LLMSOUP_LOG=info
# Troubleshooting
export LLMSOUP_LOG=debug
# Selective filtering (advanced)
export LLMSOUP_LOG="llmsoup=debug,tower_http=info"

The LLMSOUP_LOG value is passed directly to the tracing-subscriber EnvFilter, which supports the full env_filter syntax. This means you can set different levels per module — useful for isolating noisy components during debugging.

llmsoup is designed to keep routing requests even when non-critical components fail. Signal evaluation failures do not crash the server — routing continues without the failed signal.

What degrades gracefully:

  • Signal evaluation — if an individual signal (embedding, domain, language, etc.) fails to evaluate, the routing engine skips that signal and makes a decision based on the remaining signals.
  • Model downloads — if a model fails to download (network error, unauthorized), the server starts without that signal type and logs a warning.
  • Metrics collection — Prometheus metric recording failures are caught and logged; they never block request processing.
  • Caching — cache read/write failures return None and fall back to direct computation.

What does NOT degrade:

  • Routing engine — core rule matching and decision logic must succeed.
  • Authentication — token validation failures correctly reject requests (this is security, not degradation).
  • Configuration loading — invalid configuration prevents startup (fail-fast by design).

Use this checklist before going live:

  • Authentication is enabled with strong tokens
  • Secrets are resolved from environment variables or files (not hardcoded in config)
  • LLMSOUP_ALLOW_COMMAND_SECRETS is disabled unless explicitly needed
  • API keys for upstream providers are stored securely
  • Running the official pre-built binary (not a debug build)
  • ML models pre-downloaded to avoid first-request latency (run a warm-up request after startup)
  • Appropriate cache TTLs configured for your workload
  • Prometheus is scraping the /metrics endpoint
  • Log level set appropriately (info for production, debug for troubleshooting)
  • Grafana dashboard configured for monitoring
  • Restart policy configured (restart: unless-stopped for Docker, Restart=on-failure for systemd)
  • Health checks configured (Docker healthcheck or external monitoring on /metrics)
  • Config validated before deployment (llmsoup validate)
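
A minimal pre-deploy gate covering that last item might look like this (assuming the deploy/ layout from the full-stack example, so config.yaml is picked up from the working directory):

Terminal window
cd deploy
# Refuse to deploy if the configuration does not validate
llmsoup validate && docker compose up -d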