Deployment

Deploy llmsoup in production — from a single binary to a Docker Compose stack with monitoring.

llmsoup is a single static binary with no runtime dependencies. There are three ways to deploy it:

Method  | Best for                           | Requirements
Binary  | Simplest setup, bare-metal servers | Pre-built binary via install script
Docker  | Container orchestration, CI/CD     | Docker + Docker Compose
systemd | Long-running Linux services        | Linux with systemd

All methods use the same YAML configuration file and environment variables.

Terminal window
curl -fsSL https://llmsoup.insideapp.fr/install.sh | sh

The install script downloads the latest pre-built binary for your platform (Linux / macOS).

Terminal window
# Generate a starter config
llmsoup prepare
# Validate the config
llmsoup validate
# Start the server
llmsoup serve

Before starting the server, generate and customize a configuration file:

Terminal window
# Generate config.yaml with all options commented
llmsoup prepare
# Validate config before starting
llmsoup validate

The server resolves the config file in this order:

  1. --config CLI flag (e.g., llmsoup serve --config /etc/llmsoup/config.yaml)
  2. LLMSOUP_CONFIG environment variable
  3. config.yaml in the current directory (default)

For the full list of configuration options, see the Configuration Reference.

All llmsoup environment variables use the LLMSOUP_ prefix.

Variable                      | Purpose                                             | Default
LLMSOUP_CONFIG                | Path to YAML configuration file                     | config.yaml
LLMSOUP_HOST                  | Server bind address                                 | 127.0.0.1
LLMSOUP_PORT                  | Server bind port                                    | 8080
LLMSOUP_LOG                   | Log level filter (trace, debug, info, warn, error)  | info
LLMSOUP_ONNX_MODELS_DIR       | Directory for ML model storage                      | ~/.llmsoup/models
LLMSOUP_ALLOW_COMMAND_SECRETS | Enable command-based secret resolution              | disabled
LLMSOUP_PREPARE_OUTPUT        | Override output path for llmsoup prepare            | config.yaml (cwd)

Terminal window
export LLMSOUP_HOST=0.0.0.0
export LLMSOUP_PORT=9000
export LLMSOUP_LOG=debug
export LLMSOUP_CONFIG=/etc/llmsoup/config.yaml
llmsoup serve

Create a Dockerfile:

FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates curl && rm -rf /var/lib/apt/lists/*
RUN curl -fsSL https://llmsoup.insideapp.fr/install.sh | sh
ENTRYPOINT ["llmsoup"]
CMD ["serve"]

docker-compose.yml:

services:
  llmsoup:
    build: .
    container_name: llmsoup
    ports:
      - "8080:8080"
    volumes:
      - ./config.yaml:/etc/llmsoup/config.yaml:ro
    environment:
      - LLMSOUP_CONFIG=/etc/llmsoup/config.yaml
      - LLMSOUP_HOST=0.0.0.0
      - LLMSOUP_PORT=8080
      - LLMSOUP_LOG=info
    healthcheck:
      test: ["CMD", "curl", "-sf", "http://localhost:8080/metrics"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
    restart: unless-stopped
Terminal window
# Build and start
docker compose up -d
# Check logs
docker compose logs -f llmsoup
# Stop
docker compose down

Full stack: llmsoup + Prometheus + Grafana


This example deploys llmsoup with authentication, Prometheus metrics collection, and a Grafana dashboard — all in one compose file.

Directory structure:

deploy/
├── docker-compose.yml
├── config.yaml              # llmsoup configuration
├── tokens.yaml              # auth tokens (YAML format)
├── prometheus/
│   └── prometheus.yml       # Prometheus scrape config
└── grafana/
    ├── provisioning/
    │   ├── datasources/
    │   │   └── datasource.yml
    │   └── dashboards/
    │       └── dashboard.yml
    └── dashboards/
        └── llmsoup.json     # Grafana dashboard

docker-compose.yml:

services:
  llmsoup:
    build: ..
    container_name: llmsoup
    ports:
      - "8080:8080"
    volumes:
      - ./config.yaml:/etc/llmsoup/config.yaml:ro
      - ./tokens.yaml:/etc/llmsoup/tokens.yaml:ro
    environment:
      - LLMSOUP_CONFIG=/etc/llmsoup/config.yaml
      - LLMSOUP_HOST=0.0.0.0
      - LLMSOUP_PORT=8080
      - LLMSOUP_LOG=info
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - LLMSOUP_TOKEN_A=${LLMSOUP_TOKEN_A}
      - LLMSOUP_TOKEN_B=${LLMSOUP_TOKEN_B}
    healthcheck:
      test: ["CMD", "curl", "-sf", "http://localhost:8080/metrics"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:v2.53.3
    container_name: llmsoup-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    depends_on:
      llmsoup:
        condition: service_healthy
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.4.0
    container_name: llmsoup-grafana
    ports:
      - "3000:3000"
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
      - grafana_data:/var/lib/grafana
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer
      - GF_DASHBOARDS_DEFAULT_HOME_DASHBOARD_PATH=/var/lib/grafana/dashboards/llmsoup.json
    depends_on:
      - prometheus
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

prometheus/prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: llmsoup
    metrics_path: /metrics
    static_configs:
      - targets:
          - llmsoup:8080

grafana/provisioning/datasources/datasource.yml:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

grafana/provisioning/dashboards/dashboard.yml:

apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: ""
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: false

config.yaml (with auth and secret resolution):

auth:
  enabled: true
  tokens_file: "/etc/llmsoup/tokens.yaml"

models:
  - name: gpt-4o
    provider: openai
    access_key:
      env: OPENAI_API_KEY
    endpoints:
      - url: https://api.openai.com/v1/chat/completions
  - name: claude-sonnet
    provider: anthropic
    access_key:
      env: ANTHROPIC_API_KEY
    endpoints:
      - url: https://api.anthropic.com/v1/chat/completions

tokens.yaml:

tokens:
  - id: service-a
    description: "Service A access"
    secret:
      env: LLMSOUP_TOKEN_A
  - id: service-b
    description: "Service B access"
    secret:
      env: LLMSOUP_TOKEN_B

Start the full stack:

Terminal window
# Set your provider API keys and auth tokens
export OPENAI_API_KEY=<your-openai-key>
export ANTHROPIC_API_KEY=<your-anthropic-key>
export LLMSOUP_TOKEN_A=<your-service-a-token>
export LLMSOUP_TOKEN_B=<your-service-b-token>
# Start everything
cd deploy
docker compose up -d
# Verify llmsoup is healthy
curl -s http://localhost:8080/metrics | head -5
# Make an authenticated request
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer <your-service-a-token>" \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "Hello"}]}'
# Open Grafana dashboard
open http://localhost:3000

Use environment variables or mounted files for API keys — never put secrets directly in config.yaml:

services:
  llmsoup:
    environment:
      - OPENAI_API_KEY=<your-api-key>
      - ANTHROPIC_API_KEY=<your-api-key>
      - LLMSOUP_TOKEN_A=<your-token>
      - LLMSOUP_TOKEN_B=<your-token>
    volumes:
      - ./config.yaml:/etc/llmsoup/config.yaml:ro
      - ./tokens.yaml:/etc/llmsoup/tokens.yaml:ro

In your config.yaml, reference secrets with structured YAML fields:

models:
  - name: gpt-4o
    provider: openai
    access_key:
      env: OPENAI_API_KEY
    endpoints:
      - url: https://api.openai.com/v1/chat/completions

See the Configuration Reference — Secret Resolution for all resolution methods (env, file, vault, command).

llmsoup requires token-based authentication on all endpoints except /metrics.

In config.yaml, define tokens via secret references or an external tokens file:

# Inline tokens (via environment variables)
auth:
  enabled: true
  tokens:
    - env: SERVICE_A_TOKEN
    - env: SERVICE_B_TOKEN

# External tokens file (YAML format with user IDs)
auth:
  enabled: true
  tokens_file: "/etc/llmsoup/tokens.yaml"

Include the resolved token value in the Authorization header:

Terminal window
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer <your-token-value>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

To rotate tokens:

  1. Update the environment variables or secret files referenced by your token configuration
  2. Restart the llmsoup process (configuration is re-read on startup)
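
For the Docker Compose stack above, a rotation could look like the following sketch (assuming the service-a token is supplied through the LLMSOUP_TOKEN_A variable, as in the earlier example):

Terminal window
# Point the token reference at a new secret value (here: an environment variable)
export LLMSOUP_TOKEN_A=<new-service-a-token>
# Recreate the container so the configuration is re-read with the new value
docker compose up -d --force-recreate llmsoup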

llmsoup exposes a Prometheus-compatible metrics endpoint at /metrics.

  • URL: GET /metrics
  • Authentication: None required (bypasses auth middleware)
  • Format: Prometheus text exposition format
  • Metric prefix: llmsoup_ with snake_case names and unit suffixes (_total, _seconds, _bytes)

Key metrics include request counts, routing latency, signal evaluation timing, cache hit rates, model errors, and cost tracking.
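
A quick way to see the naming convention in practice is to filter the endpoint output for the prefix (the exact metric names depend on your configuration):

Terminal window
# Show a few metrics exposed by a running instance (all share the llmsoup_ prefix)
curl -s http://localhost:8080/metrics | grep '^llmsoup_' | head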

Add llmsoup as a scrape target in your prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: llmsoup
    metrics_path: /metrics
    static_configs:
      - targets:
          - host.docker.internal:8080

For dashboarding, the full-stack Docker Compose example above already includes Prometheus and Grafana services with an auto-provisioned datasource and a pre-loaded dashboard.

Access the Grafana dashboard at http://localhost:3000. Anonymous viewer access is enabled by default for local development.
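
For anything beyond local development you will likely want to disable anonymous access and require a login instead. One way to do that is to override the Grafana environment in the compose file (GF_SECURITY_ADMIN_PASSWORD is a standard Grafana variable; GRAFANA_ADMIN_PASSWORD here is a placeholder you supply):

  grafana:
    environment:
      # Disable anonymous access and set an admin password instead
      - GF_AUTH_ANONYMOUS_ENABLED=false
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}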

Plan your deployment resources based on which signals are configured.

State                | Expected RSS | Notes
Idle (no models)     | ~50–100 MB   | Binary only, no signal evaluators loaded
Idle (models loaded) | ≤ 500 MB     | After first request triggers lazy model loading
Under load (50 rps)  | ≤ 1 GB       | Includes response caches, embedding caches, active connections

Memory usage depends on which signals are enabled in your configuration:

  • No embedding/domain signals — the server stays well under 100 MB because no ML models are loaded.
  • Embedding signals — each embedding model adds 30–250 MB depending on the model variant (light, flash, or pro).
  • Domain classification signal — the BERT-based classifier adds approximately 150–200 MB.

Models are loaded lazily on first use rather than at startup, so idle memory remains low until the first request that triggers a signal evaluation.

llmsoup uses the Tokio async runtime and spends most CPU time on:

  • ML inference (embedding generation, domain classification) — the primary CPU consumer
  • HTTP request handling — minimal overhead per request
  • Routing decisions — rule matching and signal evaluation are fast (p95 < 10 ms)

For production workloads, 2–4 CPU cores are sufficient. CPU-bound ML inference runs on Tokio’s blocking thread pool and does not block the async event loop.
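
If you want to encode this sizing guidance directly in the Compose file, resource limits along these lines are a reasonable starting point (the numbers are assumptions based on the figures above; tune them to your traffic):

  llmsoup:
    deploy:
      resources:
        limits:
          # Upper bounds derived from the sizing table above
          cpus: "4"
          memory: 1G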

When you start llmsoup serve, the server automatically downloads ML models from HuggingFace Hub if embedding or domain signals are configured in your config.yaml. Downloads happen once and are cached locally.

What triggers downloads:

Signal type       | Model                                                               | Approximate size
embedding (light) | sentence-transformers/all-MiniLM-L12-v2                             | ~33 MB
embedding (pro)   | Qwen/Qwen3-Embedding-0.6B                                           | ~250 MB
embedding (flash) | google/embeddinggemma-300m                                          | ~100 MB
domain            | LLM-Semantic-Router/lora_intent_classifier_bert-base-uncased_model | ~150–200 MB

Models are stored at ~/.llmsoup/models/ by default, or wherever LLMSOUP_ONNX_MODELS_DIR points.
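
In containerized deployments it is worth persisting that directory so models are not re-downloaded every time the container is recreated. A sketch for the Compose setup above (the /models container path is an arbitrary choice; any writable path works):

  llmsoup:
    environment:
      # Point the model cache at a path backed by a named volume
      - LLMSOUP_ONNX_MODELS_DIR=/models
    volumes:
      - llmsoup_models:/models

volumes:
  # Add alongside any existing named volumes
  llmsoup_models: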

  1. Server starts and loads configuration
  2. If embedding/domain signals are configured, models are downloaded (first startup only)
  3. Server begins accepting requests
  4. Models are loaded into memory lazily on first use (not at startup)

Set LLMSOUP_SKIP_MODEL_DOWNLOAD=true to skip all model downloads. This is primarily used for CI/CD and testing — the server will start without embedding or domain signal support.

After startup, models are loaded lazily — the first request that triggers an embedding or domain signal will be slower while the model initializes in memory.

Recommended warm-up approach:

Terminal window
# After starting the server, send a warm-up request
curl http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer <your-token>" \
-H "Content-Type: application/json" \
-d '{"model": "auto", "messages": [{"role": "user", "content": "warm up"}]}'

This triggers lazy loading of any configured ML models so subsequent requests are served at full speed.

llmsoup does not expose a dedicated /health or /ready endpoint. Use the /metrics endpoint as a health probe:

Property         | Value
URL              | GET /metrics
Authentication   | None required (bypasses auth middleware)
Healthy response | HTTP 200 with Prometheus text format
Unhealthy        | Connection refused or timeout

Recommended probe settings:

Setting      | Value | Rationale
Interval     | 30 s  | Balances responsiveness with low overhead
Timeout      | 5 s   | Generous enough for loaded servers
Retries      | 3     | Tolerates transient blips
Start period | 10 s  | Allows time for model downloads on first start

The Docker Compose examples in this guide already include a health check configuration:

healthcheck:
  test: ["CMD", "curl", "-sf", "http://localhost:8080/metrics"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 10s

For Kubernetes, a liveness probe against the same endpoint works the same way:

livenessProbe:
  httpGet:
    path: /metrics
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3

For systemd deployments, verify that the server is responding after startup and restart it on failure:

[Service]
ExecStartPost=/bin/sh -c 'sleep 5 && curl -sf http://localhost:8080/metrics > /dev/null'
Restart=on-failure
RestartSec=5
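
A complete unit file is not shown in this guide, but a minimal sketch built around these directives, saved for example as /etc/systemd/system/llmsoup.service, could look like this (the binary path, service user, and config location are assumptions; adjust them to your install):

[Unit]
Description=llmsoup LLM router
After=network-online.target
Wants=network-online.target

[Service]
# User, paths, and bind address below are examples, not defaults
User=llmsoup
Environment=LLMSOUP_CONFIG=/etc/llmsoup/config.yaml
Environment=LLMSOUP_HOST=0.0.0.0
Environment=LLMSOUP_PORT=8080
ExecStart=/usr/local/bin/llmsoup serve
ExecStartPost=/bin/sh -c 'sleep 5 && curl -sf http://localhost:8080/metrics > /dev/null'
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target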

llmsoup uses the LLMSOUP_LOG environment variable to control log verbosity. It defaults to info.

Level | Output                  | Use case
error | Errors only             | Minimal output, alerts-only monitoring
warn  | Errors + warnings       | Production baseline — includes graceful degradation notices
info  | Normal operational logs | Recommended for production — request flow, startup, config
debug | Detailed internal state | Troubleshooting routing decisions and signal evaluation
trace | Everything              | Deep debugging — includes raw request/response data

Terminal window
# Production
export LLMSOUP_LOG=info
# Troubleshooting
export LLMSOUP_LOG=debug
# Selective filtering (advanced)
export LLMSOUP_LOG="llmsoup=debug,tower_http=info"

The LLMSOUP_LOG value is passed directly to the tracing-subscriber EnvFilter, which supports the full env_filter syntax. This means you can set different levels per module — useful for isolating noisy components during debugging.

llmsoup is designed to keep routing requests even when non-critical components fail. Signal evaluation failures do not crash the server — routing continues without the failed signal.

What degrades gracefully:

  • Signal evaluation — if an individual signal (embedding, domain, language, etc.) fails to evaluate, the routing engine skips that signal and makes a decision based on the remaining signals.
  • Model downloads — if a model fails to download (network error, unauthorized), the server starts without that signal type and logs a warning.
  • Metrics collection — Prometheus metric recording failures are caught and logged; they never block request processing.
  • Caching — cache read/write failures return None and fall back to direct computation.

What does NOT degrade:

  • Routing engine — core rule matching and decision logic must succeed.
  • Authentication — token validation failures correctly reject requests (this is security, not degradation).
  • Configuration loading — invalid configuration prevents startup (fail-fast by design).

Use this checklist before going live:

  • Authentication is enabled with strong tokens
  • Secrets are resolved from environment variables or files (not hardcoded in config)
  • LLMSOUP_ALLOW_COMMAND_SECRETS is disabled unless explicitly needed
  • API keys for upstream providers are stored securely
  • Running the official pre-built binary (not a debug build)
  • ML models pre-downloaded to avoid first-request latency (run a warm-up request after startup)
  • Appropriate cache TTLs configured for your workload
  • Prometheus is scraping the /metrics endpoint
  • Log level set appropriately (info for production, debug for troubleshooting)
  • Grafana dashboard configured for monitoring
  • Restart policy configured (restart: unless-stopped for Docker, Restart=on-failure for systemd)
  • Health checks configured (Docker healthcheck or external monitoring on /metrics)
  • Config validated before deployment (llmsoup validate)
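
A minimal pre-deploy gate covering that last item might look like this (assuming the deploy/ layout from the full-stack example, so config.yaml is picked up from the working directory):

Terminal window
cd deploy
# Refuse to deploy if the configuration does not validate
llmsoup validate && docker compose up -d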