# Getting Started

Go from zero to routing LLM requests in under 5 minutes.
## Prerequisites

- Shell — Linux or macOS (the installer downloads a pre-built binary for your platform)
- At least one LLM provider API key (e.g., OpenAI, Anthropic, or a local model endpoint)
## Installation

### One-line installer

```sh
curl -fsSL https://llmsoup.insideapp.fr/install.sh | sh
```

### Download a release binary

Pre-built binaries for Linux and macOS are available from the releases page. Download the binary for your platform, make it executable, and place it on your PATH.
## Generate a config file

llmsoup uses a YAML configuration file. Generate a starter template:

```sh
llmsoup prepare
```

This creates a `config.yaml` in the current directory. The generated template is ready to use out of the box — here’s what it sets up:
```yaml
version: v0.1

# ── Global defaults ──────────────────────────────────────────────
defaults:
  request_timeout_ms: 60000      # 60s timeout for model calls
  preference_model: gpt-5-mini   # model used for preference classification
  cost_aware_routing: true       # factor cost into model selection
  cost_quality_tradeoff: 0.3     # 0.0 = pure quality, 1.0 = pure cost

# ── Models ───────────────────────────────────────────────────────
# Two models: a fast/cheap one and a larger/smarter one.
# API keys are read from environment variables — never hardcoded.
providers:
  default_model: gpt-5-mini
  models:
    - name: gpt-5-mini
      provider: openai
      access_key:
        env: OPENAI_API_KEY      # ← set this env var before starting
      endpoints:
        - url: https://api.openai.com/v1/chat/completions
      pricing:
        prompt_per_1m: 0.25
        completion_per_1m: 2.00

    - name: gpt-5.2
      provider: openai
      access_key:
        env: OPENAI_API_KEY
      endpoints:
        - url: https://api.openai.com/v1/chat/completions
      pricing:
        prompt_per_1m: 1.75
        completion_per_1m: 14.00

# ── Signals ──────────────────────────────────────────────────────
# Signals evaluate each incoming prompt. The template includes:
#   • keyword   — trigger on specific words ("calculate", "function", …)
#   • embedding — semantic similarity matching ("quick answer", "deep thinking")
#   • domain    — MMLU-based classification (math, physics, CS, law, …)
#   • language  — detect 7 languages (en, es, zh, fr, ru, de, ja)
#   • latency   — TPOT-based thresholds (50ms, 150ms per token)
#   • fact_check, user_feedback, preference — advanced classifiers
signals:
  keyword:
    - name: math_keywords
      operator: OR
      keywords: ["calculate", "equation"]
    - name: code_keywords
      operator: OR
      keywords: ["function", "class"]
  # … embedding, domain, language, latency, and more (see full file)

# ── Routing rules (decisions) ────────────────────────────────────
# Rules are evaluated by priority (highest first). Each rule matches
# one or more signals and routes to a model with an optional strategy.
# The template ships with 11 rules covering:
#   • preference-based routing (code generation, bug fixing, code review)
#   • math & physics with reasoning enabled
#   • quick answers → fast model, deep thinking → large model
#   • language-specific routes (Russian, Chinese)
#   • confidence-based escalation (try cheap model first, upgrade if unsure)
decisions:
  - name: math_problems
    priority: 190
    rules:
      operator: OR
      conditions:
        - type: keyword
          name: math_keywords
        - type: domain
          name: math
    modelRefs:
      - model: gpt-5-mini
        use_reasoning: true
        reasoning_effort: high
  # … 10 more rules (see full file)

# ── Authentication (commented out by default) ────────────────────
# Uncomment the auth section to require Bearer tokens on all
# endpoints except /metrics. See Configuration Reference for details.
# auth:
#   enabled: true
#   tokens_file: "/etc/llmsoup/tokens.yaml"
```

The only thing you need before starting is your API key as an environment variable:
```sh
export OPENAI_API_KEY="your-openai-api-key"
```

## Download ML models

The generated template includes embedding and domain classification signals that require ML models. These are downloaded from Hugging Face on first use.
To enable model downloads, set your Hugging Face access token:
```sh
export HUGGINGFACE_HUB_TOKEN="hf_your-token-here"
```

## Validate your configuration

Before starting the server, check that your config is valid:

```sh
llmsoup validate --config config.yaml
```

A clean validation means your models, signals, and routing rules are all properly configured.
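Before moving on, it helps to see what the `pricing` fields in the template imply: prices are quoted per 1M tokens, so a request's cost is simply each token count divided by one million times the matching rate. The sketch below computes that, and also shows one *hypothetical* way a `cost_quality_tradeoff`-style weight could blend quality against cost — this is an illustration of the idea, not llmsoup's actual routing algorithm:

```python
# Prices mirror the generated template (USD per 1M tokens).
PRICING = {
    "gpt-5-mini": {"prompt_per_1m": 0.25, "completion_per_1m": 2.00},
    "gpt-5.2":    {"prompt_per_1m": 1.75, "completion_per_1m": 14.00},
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the USD cost of one request from per-1M-token prices."""
    p = PRICING[model]
    return (prompt_tokens / 1e6) * p["prompt_per_1m"] + \
           (completion_tokens / 1e6) * p["completion_per_1m"]

def blended_score(quality: float, relative_cost: float, tradeoff: float) -> float:
    """Hypothetical weighting, NOT llmsoup's real formula.

    tradeoff = 0.0 scores on quality alone; 1.0 scores on cheapness alone.
    quality and relative_cost are assumed normalized to [0, 1].
    """
    return (1.0 - tradeoff) * quality + tradeoff * (1.0 - relative_cost)
```

For example, a request with 15 prompt tokens and 20 completion tokens costs roughly $0.0003 on gpt-5.2 versus about $0.00004 on gpt-5-mini, which is the gap cost-aware routing exploits.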
## Start the server

Set your API key and start llmsoup:

```sh
export OPENAI_API_KEY="your-openai-api-key"
llmsoup serve
```

You should see the branded llmsoup banner followed by:

```
listening on 127.0.0.1:8080
```

Add `--stats` to launch a live TUI dashboard showing costs, savings, per-model usage, triggered routes, and errors — all updated in real time:

```sh
llmsoup serve --stats
```

## Make your first request
With the server running, send a request using the OpenAI-compatible API:

```sh
curl -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      { "role": "user", "content": "Write a hello world function in Python" }
    ]
  }'
```

llmsoup evaluates the prompt, matches the `code_keywords` keyword signal, and routes the request to gpt-5.2. The response follows the standard OpenAI format:
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "gpt-5.2",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "def hello_world():\n    print(\"Hello, World!\")\n\nhello_world()"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 20,
    "total_tokens": 35
  }
}
```
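Because the endpoint speaks the standard OpenAI chat-completions format, you can also call it from code. Here is a minimal sketch using only the Python standard library; the URL and port match the defaults above, while `build_request` and `chat` are illustrative helper names, not part of llmsoup:

```python
import json
import urllib.request

# Default llmsoup address from this guide; adjust if you changed it.
ROUTER_URL = "http://127.0.0.1:8080/v1/chat/completions"

def build_request(prompt: str, model: str = "auto") -> urllib.request.Request:
    """Build an OpenAI-compatible chat-completions request for llmsoup."""
    body = json.dumps({
        "model": model,  # "auto" lets llmsoup pick a model via its rules
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        ROUTER_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Write a hello world function in Python"))` prints the reply from whichever model llmsoup routed to.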
## Next steps

- Configuration Reference — Deep dive into models, signals, routing rules, plugins, and authentication
- Deployment Guide — Run llmsoup in production with Docker, systemd, and monitoring