Getting Started

Go from zero to routing LLM requests in under 5 minutes.

Prerequisites:

  • A shell on Linux or macOS (the installer downloads a pre-built binary for your platform)
  • At least one LLM provider API key (e.g., OpenAI, Anthropic, or a local model endpoint)

Install llmsoup with the one-line installer:

curl -fsSL https://llmsoup.insideapp.fr/install.sh | sh

Alternatively, pre-built binaries for Linux and macOS are available from the releases page. Download the binary for your platform, make it executable, and place it on your PATH, as sketched below.
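
A sketch of those manual steps (the download URL is a placeholder; copy the real asset link for your platform from the releases page):

# Placeholder URL: substitute the actual asset link from the releases page
curl -fsSL -o llmsoup "<release-asset-url-for-your-platform>"
chmod +x llmsoup                  # make it executable
sudo mv llmsoup /usr/local/bin/   # put it on your PATH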

llmsoup uses a YAML configuration file. Generate a starter template:

llmsoup prepare

This creates a config.yaml in the current directory. The generated template is ready to use out of the box — here’s what it sets up:

version: v0.1

# ── Global defaults ──────────────────────────────────────────────
defaults:
  request_timeout_ms: 60000      # 60s timeout for model calls
  preference_model: gpt-5-mini   # model used for preference classification
  cost_aware_routing: true       # factor cost into model selection
  cost_quality_tradeoff: 0.3     # 0.0 = pure quality, 1.0 = pure cost

# ── Models ───────────────────────────────────────────────────────
# Two models: a fast/cheap one and a larger/smarter one.
# API keys are read from environment variables — never hardcoded.
providers:
  default_model: gpt-5-mini
  models:
    - name: gpt-5-mini
      provider: openai
      access_key:
        env: OPENAI_API_KEY      # ← set this env var before starting
      endpoints:
        - url: https://api.openai.com/v1/chat/completions
      pricing:
        prompt_per_1m: 0.25
        completion_per_1m: 2.00
    - name: gpt-5.2
      provider: openai
      access_key:
        env: OPENAI_API_KEY
      endpoints:
        - url: https://api.openai.com/v1/chat/completions
      pricing:
        prompt_per_1m: 1.75
        completion_per_1m: 14.00

# ── Signals ──────────────────────────────────────────────────────
# Signals evaluate each incoming prompt. The template includes:
#   • keyword   — trigger on specific words ("calculate", "function", …)
#   • embedding — semantic similarity matching ("quick answer", "deep thinking")
#   • domain    — MMLU-based classification (math, physics, CS, law, …)
#   • language  — detect 7 languages (en, es, zh, fr, ru, de, ja)
#   • latency   — TPOT-based thresholds (50ms, 150ms per token)
#   • fact_check, user_feedback, preference — advanced classifiers
signals:
  keyword:
    - name: math_keywords
      operator: OR
      keywords: ["calculate", "equation"]
    - name: code_keywords
      operator: OR
      keywords: ["function", "class"]
  # … embedding, domain, language, latency, and more (see full file)

# ── Routing rules (decisions) ────────────────────────────────────
# Rules are evaluated by priority (highest first). Each rule matches
# one or more signals and routes to a model with an optional strategy.
# The template ships with 11 rules covering:
#   • preference-based routing (code generation, bug fixing, code review)
#   • math & physics with reasoning enabled
#   • quick answers → fast model, deep thinking → large model
#   • language-specific routes (Russian, Chinese)
#   • confidence-based escalation (try cheap model first, upgrade if unsure)
decisions:
  - name: math_problems
    priority: 190
    rules:
      operator: OR
      conditions:
        - type: keyword
          name: math_keywords
        - type: domain
          name: math
    modelRefs:
      - model: gpt-5-mini
        use_reasoning: true
        reasoning_effort: high
  # … 10 more rules (see full file)

# ── Authentication (commented out by default) ────────────────────
# Uncomment the auth section to require Bearer tokens on all
# endpoints except /metrics. See Configuration Reference for details.
# auth:
#   enabled: true
#   tokens_file: "/etc/llmsoup/tokens.yaml"

The model config reads your API key from the environment, so export it now:

export OPENAI_API_KEY="your-openai-api-key"

The generated template includes embedding and domain classification signals that require ML models. These are downloaded from Hugging Face on first use.

To enable model downloads, set your Hugging Face access token:

export HUGGINGFACE_HUB_TOKEN="hf_your-token-here"
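
To sanity-check the token before first use, you can query the Hugging Face Hub's whoami endpoint directly (this hits the public Hub API, not llmsoup; a valid token returns your account details):

curl -s -H "Authorization: Bearer $HUGGINGFACE_HUB_TOKEN" https://huggingface.co/api/whoami-v2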

Before starting the server, check that your config is valid:

llmsoup validate --config config.yaml

A clean validation means your models, signals, and routing rules are all properly configured.
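
Assuming the validate command follows the usual CLI convention of exiting nonzero on failure (an assumption, not confirmed here), you can chain validation and startup in scripts:

llmsoup validate --config config.yaml && llmsoup serve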

Set your API tokens and start llmsoup:

export OPENAI_API_KEY="your-openai-api-key"
llmsoup serve

You should see the branded llmsoup banner followed by:

listening on 127.0.0.1:8080
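
The config comments above mention a /metrics endpoint that stays open even when auth is enabled; it doubles as a quick liveness check (the exact output format isn't shown here):

curl -s http://127.0.0.1:8080/metrics | head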

Add --stats to launch a live TUI dashboard showing costs, savings, per-model usage, triggered routes, and errors — all updated in real time:

llmsoup serve --stats

With the server running, send a request using the OpenAI-compatible API:

curl -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {
        "role": "user",
        "content": "Write a hello world function in Python"
      }
    ]
  }'

llmsoup evaluates the prompt, matches the code_keywords signal ("function" is one of its trigger words), and routes the request to gpt-5.2. The response follows the standard OpenAI format:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "gpt-5.2",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "def hello_world():\n    print(\"Hello, World!\")\n\nhello_world()"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 20,
    "total_tokens": 35
  }
}
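
With gpt-5.2's pricing from the config above, this call cost 15 × $1.75 per 1M prompt tokens plus 20 × $14.00 per 1M completion tokens, roughly $0.0003.

To watch a different rule fire, send a prompt that trips the math_keywords signal. Per the math_problems decision shown earlier it should route to gpt-5-mini with high reasoning effort (assuming no higher-priority rule in the full template matches first), and the routed model appears in the response's model field:

curl -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {
        "role": "user",
        "content": "Calculate the roots of the equation x^2 - 5x + 6 = 0"
      }
    ]
  }'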