# Deployment
Deploy llmsoup in production — from a single binary to a Docker Compose stack with monitoring.
## Overview

llmsoup is a single static binary with no runtime dependencies. There are three ways to deploy it:
| Method | Best for | Requirements |
|---|---|---|
| Binary | Simplest setup, bare-metal servers | Pre-built binary via install script |
| Docker | Container orchestration, CI/CD | Docker + Docker Compose |
| systemd | Long-running Linux services | Linux with systemd |
All methods use the same YAML configuration file and environment variables.
## Binary Deployment

### Install

```sh
curl -fsSL https://llmsoup.insideapp.fr/install.sh | sh
```

The install script downloads the latest pre-built binary for your platform (Linux / macOS).

```sh
# Generate a starter config
llmsoup prepare

# Validate the config
llmsoup validate

# Start the server
llmsoup serve
```
### Configuration

Before starting the server, generate and customize a configuration file:

```sh
# Generate config.yaml with all options commented
llmsoup prepare

# Validate config before starting
llmsoup validate
```

The server resolves the config file in this order:

1. `--config` CLI flag (e.g., `llmsoup serve --config /etc/llmsoup/config.yaml`)
2. `LLMSOUP_CONFIG` environment variable
3. `config.yaml` in the current directory (default)
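These rules translate directly into the following invocations (all three assume a valid config exists at the referenced path):

```sh
# 1. Explicit flag wins over everything else
llmsoup serve --config /etc/llmsoup/config.yaml

# 2. Otherwise the environment variable is used
LLMSOUP_CONFIG=/etc/llmsoup/config.yaml llmsoup serve

# 3. Otherwise ./config.yaml in the current directory is used
llmsoup serve
```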
For the full list of configuration options, see the Configuration Reference.
## Environment Variables

All llmsoup environment variables use the `LLMSOUP_` prefix.
| Variable | Purpose | Default |
|---|---|---|
| `LLMSOUP_CONFIG` | Path to YAML configuration file | `config.yaml` |
| `LLMSOUP_HOST` | Server bind address | `127.0.0.1` |
| `LLMSOUP_PORT` | Server bind port | `8080` |
| `LLMSOUP_LOG` | Log level filter (trace, debug, info, warn, error) | `info` |
| `LLMSOUP_ONNX_MODELS_DIR` | Directory for ML model storage | `~/.llmsoup/models` |
| `LLMSOUP_ALLOW_COMMAND_SECRETS` | Enable command-based secret resolution | disabled |
| `LLMSOUP_PREPARE_OUTPUT` | Override output path for `llmsoup prepare` | `config.yaml` (cwd) |
### Example: Custom bind address and port

```sh
export LLMSOUP_HOST=0.0.0.0
export LLMSOUP_PORT=9000
export LLMSOUP_LOG=debug
export LLMSOUP_CONFIG=/etc/llmsoup/config.yaml

llmsoup serve
```

## Docker Compose Deployment
### Dockerfile

Create a Dockerfile:

```dockerfile
FROM debian:bookworm-slim

RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates curl \
    && rm -rf /var/lib/apt/lists/*

RUN curl -fsSL https://llmsoup.insideapp.fr/install.sh | sh

ENTRYPOINT ["llmsoup"]
CMD ["serve"]
```

### docker-compose.yml
```yaml
services:
  llmsoup:
    build: .
    container_name: llmsoup
    ports:
      - "8080:8080"
    volumes:
      - ./config.yaml:/etc/llmsoup/config.yaml:ro
    environment:
      - LLMSOUP_CONFIG=/etc/llmsoup/config.yaml
      - LLMSOUP_HOST=0.0.0.0
      - LLMSOUP_PORT=8080
      - LLMSOUP_LOG=info
    healthcheck:
      test: ["CMD", "curl", "-sf", "http://localhost:8080/metrics"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
    restart: unless-stopped
```

### Start the stack
```sh
# Build and start
docker compose up -d

# Check logs
docker compose logs -f llmsoup

# Stop
docker compose down
```

### Full stack: llmsoup + Prometheus + Grafana
This example deploys llmsoup with authentication, Prometheus metrics collection, and a Grafana dashboard — all in one compose file.
Directory structure:
```
deploy/
├── docker-compose.yml
├── config.yaml              # llmsoup configuration
├── tokens.yaml              # auth tokens (YAML format)
├── prometheus/
│   └── prometheus.yml       # Prometheus scrape config
└── grafana/
    ├── provisioning/
    │   ├── datasources/
    │   │   └── datasource.yml
    │   └── dashboards/
    │       └── dashboard.yml
    └── dashboards/
        └── llmsoup.json     # Grafana dashboard
```

docker-compose.yml:
```yaml
services:
  llmsoup:
    build: ..
    container_name: llmsoup
    ports:
      - "8080:8080"
    volumes:
      - ./config.yaml:/etc/llmsoup/config.yaml:ro
      - ./tokens.yaml:/etc/llmsoup/tokens.yaml:ro
    environment:
      - LLMSOUP_CONFIG=/etc/llmsoup/config.yaml
      - LLMSOUP_HOST=0.0.0.0
      - LLMSOUP_PORT=8080
      - LLMSOUP_LOG=info
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LLMSOUP_TOKEN_A=${LLMSOUP_TOKEN_A}
      - LLMSOUP_TOKEN_B=${LLMSOUP_TOKEN_B}
    healthcheck:
      test: ["CMD", "curl", "-sf", "http://localhost:8080/metrics"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:v2.53.3
    container_name: llmsoup-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    depends_on:
      llmsoup:
        condition: service_healthy
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.4.0
    container_name: llmsoup-grafana
    ports:
      - "3000:3000"
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
      - grafana_data:/var/lib/grafana
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer
      - GF_DASHBOARDS_DEFAULT_HOME_DASHBOARD_PATH=/var/lib/grafana/dashboards/llmsoup.json
    depends_on:
      - prometheus
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
```

prometheus/prometheus.yml:
```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: llmsoup
    metrics_path: /metrics
    static_configs:
      - targets:
          - llmsoup:8080
```

grafana/provisioning/datasources/datasource.yml:
```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
```

grafana/provisioning/dashboards/dashboard.yml:
```yaml
apiVersion: 1

providers:
  - name: default
    orgId: 1
    folder: ""
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: false
```

config.yaml (with auth and secret resolution):
```yaml
auth:
  enabled: true
  tokens_file: "/etc/llmsoup/tokens.yaml"

models:
  - name: gpt-4o
    provider: openai
    access_key:
      env: OPENAI_API_KEY
    endpoints:
      - url: https://api.openai.com/v1/chat/completions
  - name: claude-sonnet
    provider: anthropic
    access_key:
      env: ANTHROPIC_API_KEY
    endpoints:
      - url: https://api.anthropic.com/v1/chat/completions
```

tokens.yaml:
```yaml
tokens:
  - id: service-a
    description: "Service A access"
    secret:
      env: LLMSOUP_TOKEN_A
  - id: service-b
    description: "Service B access"
    secret:
      env: LLMSOUP_TOKEN_B
```

Start the full stack:
```sh
# Set your provider API keys and auth tokens
export OPENAI_API_KEY=<your-openai-key>
export ANTHROPIC_API_KEY=<your-anthropic-key>
export LLMSOUP_TOKEN_A=<your-service-a-token>
export LLMSOUP_TOKEN_B=<your-service-b-token>

# Start everything
cd deploy
docker compose up -d

# Verify llmsoup is healthy
curl -s http://localhost:8080/metrics | head -5

# Make an authenticated request
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer <your-service-a-token>" \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "Hello"}]}'

# Open Grafana dashboard
open http://localhost:3000
```

### Passing secrets
Use environment variables or mounted files for API keys — never put secrets directly in `config.yaml`:
```yaml
services:
  llmsoup:
    environment:
      - OPENAI_API_KEY=<your-api-key>
      - ANTHROPIC_API_KEY=<your-api-key>
      - LLMSOUP_TOKEN_A=<your-token>
      - LLMSOUP_TOKEN_B=<your-token>
    volumes:
      - ./config.yaml:/etc/llmsoup/config.yaml:ro
      - ./tokens.yaml:/etc/llmsoup/tokens.yaml:ro
```

In your config.yaml, reference secrets with structured YAML fields:
```yaml
models:
  - name: gpt-4o
    provider: openai
    access_key:
      env: OPENAI_API_KEY
    endpoints:
      - url: https://api.openai.com/v1/chat/completions
```

See the Configuration Reference — Secret Resolution for all resolution methods (env, file, vault, command).
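Because the full-stack compose file references `${OPENAI_API_KEY}`-style variables, Docker Compose will also substitute them from a `.env` file next to `docker-compose.yml`; keep that file out of version control. A sketch with placeholder values:

```sh
# deploy/.env (read automatically by docker compose for ${VAR} substitution)
OPENAI_API_KEY=<your-openai-key>
ANTHROPIC_API_KEY=<your-anthropic-key>
LLMSOUP_TOKEN_A=<your-service-a-token>
LLMSOUP_TOKEN_B=<your-service-b-token>
```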
## Authentication Setup

llmsoup requires token-based authentication on all endpoints except `/metrics`.
### Enable authentication

In config.yaml, define tokens via secret references or an external tokens file:

```yaml
# Inline tokens (via environment variables)
auth:
  enabled: true
  tokens:
    - env: SERVICE_A_TOKEN
    - env: SERVICE_B_TOKEN
```

```yaml
# External tokens file (YAML format with user IDs)
auth:
  enabled: true
  tokens_file: "/etc/llmsoup/tokens.yaml"
```

### Making authenticated requests
Include the resolved token value in the Authorization header:

```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer <your-token-value>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

### Token rotation
To rotate tokens:
- Update the environment variables or secret files referenced by your token configuration
- Restart the llmsoup process (configuration is re-read on startup)
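As a concrete sketch for a systemd deployment where tokens are injected via an environment file (the file path and secret-generation command are illustrative, not part of llmsoup):

```sh
# Mint a new secret and swap it into the env file the service reads
NEW_TOKEN=$(openssl rand -hex 32)
sudo sed -i "s|^LLMSOUP_TOKEN_A=.*|LLMSOUP_TOKEN_A=${NEW_TOKEN}|" /etc/llmsoup/env

# Restart so the configuration (and with it the token) is re-read
sudo systemctl restart llmsoup

# Distribute NEW_TOKEN to service A's clients out of band
```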
## Monitoring with Prometheus

llmsoup exposes a Prometheus-compatible metrics endpoint at `/metrics`.
### Metrics endpoint

- URL: `GET /metrics`
- Authentication: none required (bypasses auth middleware)
- Format: Prometheus text exposition format
- Metric prefix: `llmsoup_` with snake_case names and unit suffixes (`_total`, `_seconds`, `_bytes`)
Key metrics include request counts, routing latency, signal evaluation timing, cache hit rates, model errors, and cost tracking.
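To see exactly which metric families your build exports, ask the endpoint itself; the `llmsoup_` prefix is documented above, while individual metric names vary by version:

```sh
# List exported llmsoup metric families with their HELP strings
curl -s http://localhost:8080/metrics | grep '^# HELP llmsoup_'
```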
### Prometheus scrape configuration

Add llmsoup as a scrape target in your `prometheus.yml`:

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: llmsoup
    metrics_path: /metrics
    static_configs:
      - targets:
          - host.docker.internal:8080
```

### Monitoring stack with Grafana
You can extend the full-stack Docker Compose example above with Grafana for dashboarding. The example already includes Prometheus and Grafana services with auto-provisioned datasources.
Access the Grafana dashboard at http://localhost:3000. Anonymous viewer access is enabled in the example configuration for local development.
## Resource Sizing

Plan your deployment resources based on which signals are configured.

### Memory

| State | Expected RSS | Notes |
|---|---|---|
| Idle (no models) | ~50–100 MB | Binary only, no signal evaluators loaded |
| Idle (models loaded) | ≤ 500 MB | After first request triggers lazy model loading |
| Under load (50 rps) | ≤ 1 GB | Includes response caches, embedding caches, active connections |
Memory usage depends on which signals are enabled in your configuration:
- No embedding/domain signals — the server stays well under 100 MB because no ML models are loaded.
- Embedding signals — each embedding model adds 30–250 MB depending on the model variant (light, flash, or pro).
- Domain classification signal — the BERT-based classifier adds approximately 150–200 MB.
Models are loaded lazily on first use rather than at startup, so idle memory remains low until the first request that triggers a signal evaluation.
### CPU

llmsoup uses the Tokio async runtime and spends most CPU time on:
- ML inference (embedding generation, domain classification) — the primary CPU consumer
- HTTP request handling — minimal overhead per request
- Routing decisions — rule matching and signal evaluation are fast (p95 < 10 ms)
For production workloads, 2–4 CPU cores are sufficient. CPU-bound ML inference runs on Tokio’s blocking thread pool and does not block the async event loop.
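Under Docker Compose, these figures translate into a resource cap like the following sketch; the limits are this guide's under-load estimates, not hard requirements:

```yaml
services:
  llmsoup:
    deploy:
      resources:
        limits:
          memory: 1G    # under-load ceiling from the memory table
          cpus: "4"     # upper end of the recommended 2-4 cores
```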
## Cold Start Behavior

### Model downloads

When you start `llmsoup serve`, the server automatically downloads ML models from HuggingFace Hub if embedding or domain signals are configured in your config.yaml. Downloads happen once and are cached locally.
What triggers downloads:
| Signal type | Model | Approximate size |
|---|---|---|
| `embedding` (light) | sentence-transformers/all-MiniLM-L12-v2 | ~33 MB |
| `embedding` (pro) | Qwen/Qwen3-Embedding-0.6B | ~250 MB |
| `embedding` (flash) | google/embeddinggemma-300m | ~100 MB |
| `domain` | LLM-Semantic-Router/lora_intent_classifier_bert-base-uncased_model | ~150–200 MB |
Models are stored at `~/.llmsoup/models/` by default, or wherever `LLMSOUP_ONNX_MODELS_DIR` points.
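In containers, the default cache path lives inside the container filesystem, so a recreated container re-downloads everything. A sketch for persisting the cache with a named volume (the volume name and mount path are illustrative):

```yaml
services:
  llmsoup:
    environment:
      - LLMSOUP_ONNX_MODELS_DIR=/models   # point the cache at the mounted volume
    volumes:
      - llmsoup_models:/models            # survives container recreation

volumes:
  llmsoup_models:
```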
### First startup timeline

1. Server starts and loads configuration
2. If embedding/domain signals are configured, models are downloaded (first startup only)
3. Server begins accepting requests
4. Models are loaded into memory lazily on first use (not at startup)
### Skipping downloads

Set `LLMSOUP_SKIP_MODEL_DOWNLOAD=true` to skip all model downloads. This is primarily used for CI/CD and testing — the server will start without embedding or domain signal support.
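For example, in a CI job:

```sh
# CI/CD: boot the server without fetching any models
LLMSOUP_SKIP_MODEL_DOWNLOAD=true llmsoup serve
```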
## Warm-Up Recommendations

After startup, models are loaded lazily — the first request that triggers an embedding or domain signal will be slower while the model initializes in memory.
Recommended warm-up approach:
```sh
# After starting the server, send a warm-up request
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer <your-token>" \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "warm up"}]}'
```

This triggers lazy loading of any configured ML models so subsequent requests are served at full speed.
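In an automated rollout you may want to block until the server is reachable before warming it. A minimal sketch, assuming the default bind address (`$TOKEN` is illustrative):

```sh
#!/bin/sh
# Wait for the server to come up, then fire one warm-up request
until curl -sf http://localhost:8080/metrics > /dev/null; do
  sleep 1
done

curl -s http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "warm up"}]}' \
  > /dev/null
```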
## Health Checks

### Endpoint

llmsoup does not expose a dedicated `/health` or `/ready` endpoint. Use the `/metrics` endpoint as a health probe:
| Property | Value |
|---|---|
| URL | GET /metrics |
| Authentication | None required (bypasses auth middleware) |
| Healthy response | HTTP 200 with Prometheus text format |
| Unhealthy | Connection refused or timeout |
### Recommended intervals

| Setting | Value | Rationale |
|---|---|---|
| Interval | 30 s | Balances responsiveness with low overhead |
| Timeout | 5 s | Generous enough for loaded servers |
| Retries | 3 | Tolerates transient blips |
| Start period | 10 s | Allows time for model downloads on first start |
### Docker healthcheck

The Docker Compose examples in this guide already include a health check configuration:

```yaml
healthcheck:
  test: ["CMD", "curl", "-sf", "http://localhost:8080/metrics"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 10s
```

### Kubernetes liveness probe
```yaml
livenessProbe:
  httpGet:
    path: /metrics
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3
```
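Since `/metrics` is the only probe target, the same endpoint can back a readiness probe as well (a sketch; the shorter period is a suggestion, not a documented requirement):

```yaml
readinessProbe:
  httpGet:
    path: /metrics
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 5
```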
### systemd health check

For systemd deployments, validate the server is running after startup and restart on failure:
```ini
[Service]
ExecStartPost=/bin/sh -c 'sleep 5 && curl -sf http://localhost:8080/metrics > /dev/null'
Restart=on-failure
RestartSec=5
```
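For reference, a complete unit file sketch that ties these settings together; the binary path, env file location, and service user are assumptions, adjust them to your install:

```ini
# /etc/systemd/system/llmsoup.service (sketch; adjust paths and user)
[Unit]
Description=llmsoup server
After=network-online.target
Wants=network-online.target

[Service]
User=llmsoup
# Holds LLMSOUP_* settings and provider API keys
EnvironmentFile=/etc/llmsoup/env
Environment=LLMSOUP_CONFIG=/etc/llmsoup/config.yaml
ExecStart=/usr/local/bin/llmsoup serve
ExecStartPost=/bin/sh -c 'sleep 5 && curl -sf http://localhost:8080/metrics > /dev/null'
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```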
## Logging

### Log levels

llmsoup uses the `LLMSOUP_LOG` environment variable to control log verbosity. It defaults to `info`.
| Level | Output | Use case |
|---|---|---|
| `error` | Errors only | Minimal output, alerts-only monitoring |
| `warn` | Errors + warnings | Production baseline — includes graceful degradation notices |
| `info` | Normal operational logs | Recommended for production — request flow, startup, config |
| `debug` | Detailed internal state | Troubleshooting routing decisions and signal evaluation |
| `trace` | Everything | Deep debugging — includes raw request/response data |
### Setting the log level

```sh
# Production
export LLMSOUP_LOG=info

# Troubleshooting
export LLMSOUP_LOG=debug

# Selective filtering (advanced)
export LLMSOUP_LOG="llmsoup=debug,tower_http=info"
```

The `LLMSOUP_LOG` value is passed directly to the tracing-subscriber `EnvFilter`, which supports the full env_filter syntax. This means you can set different levels per module — useful for isolating noisy components during debugging.
## Graceful Degradation

llmsoup is designed to keep routing requests even when non-critical components fail. Signal evaluation failures do not crash the server — routing continues without the failed signal.
What degrades gracefully:
- Signal evaluation — if an individual signal (embedding, domain, language, etc.) fails to evaluate, the routing engine skips that signal and makes a decision based on the remaining signals.
- Model downloads — if a model fails to download (network error, unauthorized), the server starts without that signal type and logs a warning.
- Metrics collection — Prometheus metric recording failures are caught and logged; they never block request processing.
- Caching — cache read/write failures return `None` and fall back to direct computation.
What does NOT degrade:
- Routing engine — core rule matching and decision logic must succeed.
- Authentication — token validation failures correctly reject requests (this is security, not degradation).
- Configuration loading — invalid configuration prevents startup (fail-fast by design).
## Production Checklist

Use this checklist before going live:
### Security

- Authentication is enabled with strong tokens
- Secrets are resolved from environment variables or files (not hardcoded in config)
- `LLMSOUP_ALLOW_COMMAND_SECRETS` is disabled unless explicitly needed
- API keys for upstream providers are stored securely
### Performance

- Running the official pre-built binary (not a debug build)
- ML models pre-downloaded to avoid first-request latency (run a warm-up request after startup)
- Appropriate cache TTLs configured for your workload
### Observability

- Prometheus is scraping the `/metrics` endpoint
- Log level set appropriately (`info` for production, `debug` for troubleshooting)
- Grafana dashboard configured for monitoring
### Reliability

- Restart policy configured (`restart: unless-stopped` for Docker, `Restart=on-failure` for systemd)
- Health checks configured (Docker healthcheck or external monitoring on `/metrics`)
- Config validated before deployment (`llmsoup validate`)
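The last two reliability items can be wrapped in a small pre-flight script, for example as a CI step (a sketch; adjust host and port to your deployment):

```sh
#!/bin/sh
set -e

# Config must validate before anything ships
llmsoup validate

# The /metrics endpoint doubles as a health probe
if curl -sf http://localhost:8080/metrics > /dev/null; then
  echo "llmsoup healthy"
else
  echo "llmsoup unreachable" >&2
  exit 1
fi
```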