# API Reference
llmsoup exposes an OpenAI-compatible HTTP API. Any application or SDK that works with the OpenAI Chat Completions API can point at llmsoup with zero code changes — just swap the base URL.
## Authentication
All endpoints except `/metrics` require a Bearer token in the `Authorization` header.
```
Authorization: Bearer <your-token>
```

Tokens are configured in the llmsoup config file via `auth.tokens` (inline list) or `auth.tokens_file` (external file). When authentication is enabled and the token is missing or invalid, llmsoup returns:
```http
HTTP/1.1 401 Unauthorized
Content-Type: application/json

{
  "error": {
    "message": "Invalid or missing authentication token",
    "type": "invalid_request_error",
    "code": "invalid_api_key"
  }
}
```

If authentication is disabled in the config (`auth.enabled: false`), all requests are accepted without a token.
## Endpoints
| Method | Path | Auth Required | Description |
|---|---|---|---|
| POST | /v1/chat/completions | Yes | Chat completion (non-streaming and streaming) |
| GET | /metrics | No | Prometheus metrics |
## POST /v1/chat/completions
### Request
Send a JSON body matching the OpenAI Chat Completions format.
#### Request fields
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model name. llmsoup routes to the best backend model based on your routing rules; you can also target a specific configured model by name. |
| messages | array | Yes | Conversation messages. See Message object. |
| temperature | number | No | Sampling temperature (0–2). Passed through to the routed model. |
| top_p | number | No | Nucleus sampling parameter. Passed through to the routed model. |
| max_tokens | integer | No | Maximum tokens to generate (deprecated; use max_completion_tokens). |
| max_completion_tokens | integer | No | Maximum completion tokens. Preferred over max_tokens. |
| stream | boolean | No | Set true for Server-Sent Events streaming. Default: false. |
llmsoup tolerates additional fields (e.g., tools, tool_choice, response_format, logprobs) and passes them through to the upstream model unchanged. This means tool/function calling, structured output, and other OpenAI features work as long as the routed model supports them.
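As an example of this passthrough behavior, the sketch below uses the OpenAI Python SDK to request JSON output via `response_format`. llmsoup forwards the field unchanged, so the request only succeeds if the routed model supports structured output; host and token are placeholders.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-token")

# response_format is not interpreted by llmsoup; it is forwarded to the
# upstream model, which must support structured output for this to work.
response = client.chat.completions.create(
    model="auto",
    messages=[
        {"role": "system", "content": "Reply with a JSON object."},
        {"role": "user", "content": "List three prime numbers."},
    ],
    response_format={"type": "json_object"},
)

print(response.choices[0].message.content)
```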
#### Message object
| Field | Type | Required | Description |
|---|---|---|---|
| role | string | Yes | One of system, user, assistant, tool, or developer. |
| content | string or array | No | Text string, or an array of content parts for multimodal input. Omit for tool-call assistant messages. |
| tool_calls | array | No | Tool calls made by the assistant (assistant messages only). |
| tool_call_id | string | No | ID of the tool call being responded to (tool messages only). |
Content parts (when content is an array):
[ { "type": "text", "text": "Describe this image" }, { "type": "image_url", "image_url": { "url": "https://example.com/photo.png" } }]Multimodal content parts (images, audio) are passed through to the upstream model. Support depends on the routed model’s capabilities.
#### Tool call object
| Field | Type | Description |
|---|---|---|
| id | string | Unique identifier for the tool call. |
| type | string | Always "function". |
| function.name | string | Name of the function to call. |
| function.arguments | string | JSON-encoded arguments. |
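To show how the tool-related message fields fit together, here is a sketch of a full tool-calling round trip with the OpenAI Python SDK. The get_weather tool mirrors the curl example later on this page; the local tool execution is stubbed, and the flow assumes the routed model supports tool calling.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-token")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What is the weather in Paris?"}]
first = client.chat.completions.create(model="auto", messages=messages, tools=tools)
call = first.choices[0].message.tool_calls[0]

# "Execute" the tool locally (stubbed here), then send the result back as a
# tool message that references the original tool_call_id.
args = json.loads(call.function.arguments)
result = {"city": args["city"], "temperature_c": 18}  # stub result

messages.append(first.choices[0].message)
messages.append(
    {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}
)

final = client.chat.completions.create(model="auto", messages=messages, tools=tools)
print(final.choices[0].message.content)
```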
#### Example request
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quicksort in two sentences."}
    ],
    "temperature": 0.7,
    "max_completion_tokens": 256
  }'
```
### Non-streaming response

When `stream` is false (default), llmsoup returns a single JSON object.
#### Response fields
| Field | Type | Description |
|---|---|---|
| id | string | Unique request identifier (format: chatcmpl-{hex}-{hex}). |
| object | string | Always "chat.completion". |
| created | integer | Unix timestamp (seconds) when the completion was created. |
| model | string | The model that actually served the request. |
| choices | array | Array of completion choices (typically one). See Choice object. |
| usage | object | Token usage statistics. See Usage object. |
| service_tier | string | Service tier (optional, omitted when not applicable). |
#### Choice object
| Field | Type | Description |
|---|---|---|
| index | integer | Zero-based index of this choice. |
| message | object | The assistant’s response message. |
| message.role | string | Always "assistant". |
| message.content | string or null | Text content. null when tool calls are made. |
| message.tool_calls | array or null | Tool calls, if any. |
| finish_reason | string | Why generation stopped: "stop", "length", "content_filter", "tool_calls", or "function_call". |
| logprobs | object or absent | Log probability information. Currently omitted from responses (not present in JSON). Reserved for future use. |
#### Usage object
| Field | Type | Description |
|---|---|---|
| prompt_tokens | integer | Number of tokens in the prompt. |
| completion_tokens | integer | Number of tokens in the completion. |
| total_tokens | integer | Sum of prompt and completion tokens. |
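When using the OpenAI Python SDK, these response, choice, and usage fields are exposed as attributes on the returned object; a brief sketch (host and token are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-token")

response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Explain quicksort in two sentences."}],
)

# `model` reflects the backend that actually served the request, which may
# differ from the "auto" alias sent in the request.
print(response.model)
print(response.choices[0].finish_reason)
print(response.usage.prompt_tokens, response.usage.completion_tokens,
      response.usage.total_tokens)
```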
#### Example response
```json
{
  "id": "chatcmpl-67a3f2c6-1a",
  "object": "chat.completion",
  "created": 1738867398,
  "model": "gpt-4o-mini",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Quicksort is a divide-and-conquer algorithm that selects a pivot element and partitions the array into elements less than and greater than the pivot. It then recursively sorts each partition, achieving O(n log n) average-case performance."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 28,
    "completion_tokens": 42,
    "total_tokens": 70
  }
}
```
### Streaming response (SSE)

When `stream` is true, llmsoup returns a stream of Server-Sent Events. Each event is a `data:` line containing a JSON chunk, terminated by a `data: [DONE]` sentinel.
Implementation note: llmsoup sends the request to the upstream model with `stream=false`, receives the complete response, and then constructs the SSE stream locally. This means first-token latency is equivalent to a non-streaming request — the full response is generated before any events are emitted. Streaming is useful for progressive UI rendering on the client side, but does not reduce time-to-first-token.
#### Chunk fields
| Field | Type | Description |
|---|---|---|
| id | string | Same request identifier as non-streaming. |
| object | string | Always "chat.completion.chunk". |
| created | integer | Unix timestamp (seconds). |
| model | string | Model that served the request. |
| choices | array | Array with one chunk choice. |
#### Chunk choice fields
| Field | Type | Description |
|---|---|---|
| index | integer | Choice index (always 0). |
| delta | object | Incremental content update. |
| delta.role | string | "assistant" — present in the first chunk only. |
| delta.content | string | Content fragment. |
| delta.tool_calls | array | Tool call deltas, if any. |
| finish_reason | string or null | null until the final chunk, then "stop", "length", etc. |
#### SSE wire format
```
data: {"id":"chatcmpl-67a3f2c6-1a","object":"chat.completion.chunk","created":1738867398,"model":"gpt-4o-mini","choices":[{"index":0,"delta":{"role":"assistant","content":"Quick"},"finish_reason":null}]}

data: {"id":"chatcmpl-67a3f2c6-1a","object":"chat.completion.chunk","created":1738867398,"model":"gpt-4o-mini","choices":[{"index":0,"delta":{"content":"sort is"},"finish_reason":null}]}

data: {"id":"chatcmpl-67a3f2c6-1a","object":"chat.completion.chunk","created":1738867398,"model":"gpt-4o-mini","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```

Each `data:` line is followed by two newlines (`\n\n`). The `[DONE]` sentinel signals the end of the stream.
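If you are not using an SDK, the stream can be consumed directly. The following is a minimal sketch using Python's `requests` library (host and token are placeholders); it filters `data:` lines, stops at the `[DONE]` sentinel, and prints each content fragment.

```python
import json
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # placeholder host/port
    headers={"Authorization": "Bearer your-token"},
    json={
        "model": "auto",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
    },
    stream=True,
    timeout=60,
)
resp.raise_for_status()

for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data: "):
        continue  # skip blank separator lines between events
    payload = line[len("data: "):]
    if payload == "[DONE]":
        break  # end-of-stream sentinel
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)
print()
```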
#### Streaming curl example
```bash
curl -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```
Section titled “Streaming Python example”from openai import OpenAI
client = OpenAI( base_url="http://localhost:8080/v1", api_key="your-token",)
stream = client.chat.completions.create( model="auto", messages=[{"role": "user", "content": "Hello!"}], stream=True,)
for chunk in stream: if chunk.choices[0].delta.content: print(chunk.choices[0].delta.content, end="", flush=True)print()Error responses
### Error responses

All errors follow the OpenAI error response format:
{ "error": { "message": "Human-readable error description", "type": "error_type", "code": "error_code" }}Error types
#### Error types

| HTTP Status | type | code | When |
|---|---|---|---|
| 400 | invalid_request_error | null | Malformed JSON, missing required fields, request too large for context window |
| 401 | invalid_request_error | invalid_api_key | Missing or invalid Bearer token |
| 500 | internal_error | internal_error | Routing failure, model call failure, serialization error |
#### Common error scenarios
| Scenario | Status | Message |
|---|---|---|
| Invalid JSON body | 400 | Invalid JSON: {details} |
| Missing model field | 400 | Missing required field: model |
| Missing messages field | 400 | Missing required field: messages |
| Request exceeds context window | 400 | Request of ~{tokens} exceeds context window |
| No routing decision available | 500 | No routing decision available |
| Upstream model call failed | 500 | Model call failed: {error} |
| Response serialization error | 500 | Failed to serialize response |
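With the OpenAI Python SDK (v1+), these statuses surface as the SDK's standard exception classes, so client-side handling might look like the following sketch (host and token are placeholders):

```python
import openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-token")

try:
    response = client.chat.completions.create(
        model="auto",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)
except openai.AuthenticationError as e:
    # 401: missing or invalid Bearer token
    print("auth failed:", e.message)
except openai.BadRequestError as e:
    # 400: malformed JSON, missing fields, or request exceeds context window
    print("bad request:", e.message)
except openai.InternalServerError as e:
    # 500: routing or upstream model failure
    print("server error:", e.message)
```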
### Response headers
#### Cost headers
When `defaults.include_cost_headers` is true (the default), llmsoup adds cost and routing information to every response:
| Header | Type | Description |
|---|---|---|
| x-llmsoup-cost | string | Total request cost (model + routing) formatted to 6 decimal places (e.g., "0.001250"). Currency is set in model pricing config. |
| x-llmsoup-model | string | Name of the model that served the request (e.g., "gpt-4o-mini"). |
| x-llmsoup-tokens-prompt | string | Prompt token count (e.g., "150"). |
| x-llmsoup-tokens-completion | string | Completion token count (e.g., "42"). |
Cost headers are included on both non-streaming and streaming responses.
#### Example response headers
```http
HTTP/1.1 200 OK
content-type: application/json
x-llmsoup-cost: 0.000125
x-llmsoup-model: gpt-4o-mini
x-llmsoup-tokens-prompt: 28
x-llmsoup-tokens-completion: 42
```
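To read these headers programmatically with the OpenAI Python SDK (v1+), the raw-response interface exposes the HTTP headers alongside the parsed completion; a sketch (host and token are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-token")

# with_raw_response returns the HTTP response so custom headers can be read.
raw = client.chat.completions.with_raw_response.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello!"}],
)

print("cost:", raw.headers.get("x-llmsoup-cost"))
print("model:", raw.headers.get("x-llmsoup-model"))

completion = raw.parse()  # the regular ChatCompletion object
print(completion.choices[0].message.content)
```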
#### Plugin headers

Plugins can inject additional response headers after the standard cost headers. The built-in security plugins add the following headers when triggered:
| Header | Type | Description |
|---|---|---|
| x-llmsoup-jailbreak-blocked | string | "true" when the jailbreak detection plugin blocks a request. |
| x-llmsoup-jailbreak-confidence | string | Confidence score (e.g., "0.95") for the jailbreak detection. |
| x-llmsoup-pii-blocked | string | "true" when the PII detection plugin blocks a request. |
| x-llmsoup-pii-types | string | Comma-separated PII types detected (e.g., "email,phone"). |
See the Plugins Reference for full plugin documentation and configuration.
## GET /metrics
Returns Prometheus-formatted metrics. This endpoint does not require authentication.
```
Content-Type: text/plain; version=0.0.4
```
### Example
```bash
curl http://localhost:8080/metrics
```

```
# HELP llmsoup_requests_total Total number of requests
# TYPE llmsoup_requests_total counter
llmsoup_requests_total{method="POST",endpoint="/v1/chat/completions",status="200"} 1523

# HELP llmsoup_active_connections Current number of active connections
# TYPE llmsoup_active_connections gauge
llmsoup_active_connections 3

# HELP llmsoup_model_request_duration_seconds Model request duration
# TYPE llmsoup_model_request_duration_seconds histogram
llmsoup_model_request_duration_seconds_bucket{model="gpt-4o-mini",le="0.5"} 1200
llmsoup_model_request_duration_seconds_bucket{model="gpt-4o-mini",le="1.0"} 1490
llmsoup_model_request_duration_seconds_bucket{model="gpt-4o-mini",le="+Inf"} 1523

# HELP llmsoup_tokens_total Total tokens processed
# TYPE llmsoup_tokens_total counter
llmsoup_tokens_total{type="prompt"} 45200
llmsoup_tokens_total{type="completion"} 12800
```

All metric names use the `llmsoup_` prefix with snake_case naming and unit suffixes (`_total`, `_seconds`, `_bytes`). A full metrics reference is available in the Metrics Reference documentation page.
## curl examples
### Basic completion
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
### Streaming completion

```bash
curl -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "Write a haiku about coding."}
    ],
    "stream": true
  }'
```
### With tool calling

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "What is the weather in Paris?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get current weather for a city",
          "parameters": {
            "type": "object",
            "properties": {
              "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
          }
        }
      }
    ]
  }'
```
### Using the OpenAI Python SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-token",
)

response = client.chat.completions.create(
    model="auto",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain recursion simply."},
    ],
    temperature=0.5,
    max_completion_tokens=200,
)

print(response.choices[0].message.content)
```