Getting Started

Go from zero to routing LLM requests in under 5 minutes.

Prerequisites:

  • A shell on Linux or macOS (the installer downloads a pre-built binary for your platform)
  • At least one LLM provider API key (e.g., OpenAI, Anthropic, or a local model endpoint)

Install llmsoup with the one-line installer:

curl -fsSL https://llmsoup.insideapp.fr/install.sh | sh

Alternatively, pre-built binaries for Linux and macOS are available from the releases page. Download the binary for your platform, make it executable, and place it on your PATH, as sketched below.
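
A sketch of those manual steps (the download URL is a placeholder; copy the real asset link for your platform from the releases page):

# Placeholder URL: substitute the actual asset link from the releases page
curl -fsSL -o llmsoup "<release-asset-url-for-your-platform>"
chmod +x llmsoup                  # make it executable
sudo mv llmsoup /usr/local/bin/   # put it on your PATH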

llmsoup uses a YAML configuration file. Generate a starter template:

llmsoup prepare

This creates a config.yaml in the current directory. The generated template is ready to use out of the box — here’s what it sets up:

version: v0.1

# ── Global defaults ──────────────────────────────────────────────
defaults:
  request_timeout_ms: 60000      # 60s timeout for model calls
  preference_model: gpt-5-mini   # model used for preference classification
  cost_aware_routing: true       # factor cost into model selection
  cost_quality_tradeoff: 0.3     # 0.0 = pure quality, 1.0 = pure cost

# ── Models ───────────────────────────────────────────────────────
# Two models: a fast/cheap one and a larger/smarter one.
# API keys are read from environment variables — never hardcoded.
providers:
  default_model: gpt-5-mini
  models:
    - name: gpt-5-mini
      provider: openai
      access_key:
        env: OPENAI_API_KEY      # ← set this env var before starting
      endpoints:
        - url: https://api.openai.com/v1/chat/completions
      pricing:
        prompt_per_1m: 0.25
        completion_per_1m: 2.00
    - name: gpt-5.2
      provider: openai
      access_key:
        env: OPENAI_API_KEY
      endpoints:
        - url: https://api.openai.com/v1/chat/completions
      pricing:
        prompt_per_1m: 1.75
        completion_per_1m: 14.00

# ── Signals ──────────────────────────────────────────────────────
# Signals evaluate each incoming prompt. The template includes:
#   • keyword   — trigger on specific words ("calculate", "function", …)
#   • embedding — semantic similarity matching ("quick answer", "deep thinking")
#   • domain    — MMLU-based classification (math, physics, CS, law, …)
#   • language  — detect 7 languages (en, es, zh, fr, ru, de, ja)
#   • latency   — TPOT-based thresholds (50ms, 150ms per token)
#   • fact_check, user_feedback, preference — advanced classifiers
signals:
  keyword:
    - name: math_keywords
      operator: OR
      keywords: ["calculate", "equation"]
    - name: code_keywords
      operator: OR
      keywords: ["function", "class"]
  # … embedding, domain, language, latency, and more (see full file)

# ── Routing rules (decisions) ────────────────────────────────────
# Rules are evaluated by priority (highest first). Each rule matches
# one or more signals and routes to a model with an optional strategy.
# The template ships with 11 rules covering:
#   • preference-based routing (code generation, bug fixing, code review)
#   • math & physics with reasoning enabled
#   • quick answers → fast model, deep thinking → large model
#   • language-specific routes (Russian, Chinese)
#   • confidence-based escalation (try cheap model first, upgrade if unsure)
decisions:
  - name: math_problems
    priority: 190
    rules:
      operator: OR
      conditions:
        - type: keyword
          name: math_keywords
        - type: domain
          name: math
    modelRefs:
      - model: gpt-5-mini
        use_reasoning: true
        reasoning_effort: high
  # … 10 more rules (see full file)

# ── Authentication (commented out by default) ────────────────────
# Uncomment the auth section to require Bearer tokens on all
# endpoints except /metrics. See Configuration Reference for details.
# auth:
#   enabled: true
#   tokens_file: "/etc/llmsoup/tokens.yaml"

The model config reads your API key from the environment, so export it now:

export OPENAI_API_KEY="your-openai-api-key"

The generated template includes embedding and domain classification signals that require ML models. These are downloaded from Hugging Face on first use.

To enable model downloads, set your Hugging Face access token:

export HUGGINGFACE_HUB_TOKEN="hf_your-token-here"
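
To sanity-check the token before first use, you can query the Hugging Face Hub's whoami endpoint directly (this hits the public Hub API, not llmsoup; a valid token returns your account details):

curl -s -H "Authorization: Bearer $HUGGINGFACE_HUB_TOKEN" https://huggingface.co/api/whoami-v2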

Before starting the server, check that your config is valid:

llmsoup validate --config config.yaml

A clean validation means your models, signals, and routing rules are all properly configured.
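
Assuming the validate command follows the usual CLI convention of exiting nonzero on failure (an assumption, not confirmed here), you can chain validation and startup in scripts:

llmsoup validate --config config.yaml && llmsoup serve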

Set your API tokens and start llmsoup:

export OPENAI_API_KEY="your-openai-api-key"
llmsoup serve

You should see the branded llmsoup banner followed by:

listening on 127.0.0.1:8080
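
The config comments above mention a /metrics endpoint that stays open even when auth is enabled; it doubles as a quick liveness check (the exact output format isn't shown here):

curl -s http://127.0.0.1:8080/metrics | head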

Add --stats to launch a live TUI dashboard showing costs, savings, per-model usage, triggered routes, and errors — all updated in real time:

llmsoup serve --stats

With the server running, send a request using the OpenAI-compatible API:

curl -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {
        "role": "user",
        "content": "Write a hello world function in Python"
      }
    ]
  }'

llmsoup evaluates the prompt, matches the code_keywords signal ("function" is one of its trigger words), and routes the request to gpt-5.2. The response follows the standard OpenAI format:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "gpt-5.2",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "def hello_world():\n    print(\"Hello, World!\")\n\nhello_world()"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 20,
    "total_tokens": 35
  }
}
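
With gpt-5.2's pricing from the config above, this call cost 15 × $1.75 per 1M prompt tokens plus 20 × $14.00 per 1M completion tokens, roughly $0.0003.

To watch a different rule fire, send a prompt that trips the math_keywords signal. Per the math_problems decision shown earlier it should route to gpt-5-mini with high reasoning effort (assuming no higher-priority rule in the full template matches first), and the routed model appears in the response's model field:

curl -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {
        "role": "user",
        "content": "Calculate the roots of the equation x^2 - 5x + 6 = 0"
      }
    ]
  }'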