# API Reference
llmsoup exposes an OpenAI-compatible HTTP API. Any application or SDK that works with the OpenAI Chat Completions API can point at llmsoup with zero code changes — just swap the base URL.
## Authentication
All endpoints except `/metrics` require a Bearer token in the `Authorization` header.
```
Authorization: Bearer <your-token>
```

Tokens are configured in the llmsoup config file via `auth.tokens` (inline list) or `auth.tokens_file` (external file). When authentication is enabled and the token is missing or invalid, llmsoup returns:
```http
HTTP/1.1 401 Unauthorized
Content-Type: application/json

{
  "error": {
    "message": "Invalid or missing authentication token",
    "type": "invalid_request_error",
    "code": "invalid_api_key"
  }
}
```

If authentication is disabled in the config (`auth.enabled: false`), all requests are accepted without a token.
## Endpoints
| Method | Path | Auth Required | Description |
|---|---|---|---|
| POST | /v1/chat/completions | Yes | Chat completion (non-streaming and streaming) |
| GET | /metrics | No | Prometheus metrics |
## POST /v1/chat/completions
### Request
Send a JSON body matching the OpenAI Chat Completions format.
#### Request fields
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model name. llmsoup routes to the best backend model based on your routing rules; you can also target a specific configured model by name. |
| messages | array | Yes | Conversation messages. See Message object. |
| temperature | number | No | Sampling temperature (0–2). Passed through to the routed model. |
| top_p | number | No | Nucleus sampling parameter. Passed through to the routed model. |
| max_tokens | integer | No | Maximum tokens to generate (deprecated; use max_completion_tokens). |
| max_completion_tokens | integer | No | Maximum completion tokens. Preferred over max_tokens. |
| stream | boolean | No | Set true for Server-Sent Events streaming. Default: false. |
llmsoup tolerates additional fields (e.g., tools, tool_choice, response_format, logprobs) and passes them through to the upstream model unchanged. This means tool/function calling, structured output, and other OpenAI features work as long as the routed model supports them.
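As an example of this passthrough behavior, the sketch below uses the OpenAI Python SDK to request JSON output via `response_format`. llmsoup forwards the field unchanged, so the request only succeeds if the routed model supports structured output; host and token are placeholders.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-token")

# response_format is not interpreted by llmsoup; it is forwarded to the
# upstream model, which must support structured output for this to work.
response = client.chat.completions.create(
    model="auto",
    messages=[
        {"role": "system", "content": "Reply with a JSON object."},
        {"role": "user", "content": "List three prime numbers."},
    ],
    response_format={"type": "json_object"},
)

print(response.choices[0].message.content)
```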
#### Message object
| Field | Type | Required | Description |
|---|---|---|---|
| role | string | Yes | One of system, user, assistant, tool, or developer. |
| content | string or array | No | Text string, or an array of content parts for multimodal input. Omit for tool-call assistant messages. |
| tool_calls | array | No | Tool calls made by the assistant (assistant messages only). |
| tool_call_id | string | No | ID of the tool call being responded to (tool messages only). |
Content parts (when content is an array):
[ { "type": "text", "text": "Describe this image" }, { "type": "image_url", "image_url": { "url": "https://example.com/photo.png" } }]Multimodal content parts (images, audio) are passed through to the upstream model. Support depends on the routed model’s capabilities.
#### Tool call object
| Field | Type | Description |
|---|---|---|
| id | string | Unique identifier for the tool call. |
| type | string | Always "function". |
| function.name | string | Name of the function to call. |
| function.arguments | string | JSON-encoded arguments. |
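To show how the tool-related message fields fit together, here is a sketch of a full tool-calling round trip with the OpenAI Python SDK. The get_weather tool mirrors the curl example later on this page; the local tool execution is stubbed, and the flow assumes the routed model supports tool calling.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-token")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What is the weather in Paris?"}]
first = client.chat.completions.create(model="auto", messages=messages, tools=tools)
call = first.choices[0].message.tool_calls[0]

# "Execute" the tool locally (stubbed here), then send the result back as a
# tool message that references the original tool_call_id.
args = json.loads(call.function.arguments)
result = {"city": args["city"], "temperature_c": 18}  # stub result

messages.append(first.choices[0].message)
messages.append(
    {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}
)

final = client.chat.completions.create(model="auto", messages=messages, tools=tools)
print(final.choices[0].message.content)
```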
#### Example request
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quicksort in two sentences."}
    ],
    "temperature": 0.7,
    "max_completion_tokens": 256
  }'
```
### Non-streaming response

When `stream` is false (default), llmsoup returns a single JSON object.
#### Response fields
| Field | Type | Description |
|---|---|---|
| id | string | Unique request identifier (format: chatcmpl-{hex}-{hex}). |
| object | string | Always "chat.completion". |
| created | integer | Unix timestamp (seconds) when the completion was created. |
| model | string | The model that actually served the request. |
| choices | array | Array of completion choices (typically one). See Choice object. |
| usage | object | Token usage statistics. See Usage object. |
| service_tier | string | Service tier (optional, omitted when not applicable). |
#### Choice object
| Field | Type | Description |
|---|---|---|
| index | integer | Zero-based index of this choice. |
| message | object | The assistant’s response message. |
| message.role | string | Always "assistant". |
| message.content | string or null | Text content. null when tool calls are made. |
| message.tool_calls | array or null | Tool calls, if any. |
| finish_reason | string | Why generation stopped: "stop", "length", "content_filter", "tool_calls", or "function_call". |
| logprobs | object or absent | Log probability information. Currently omitted from responses (not present in JSON). Reserved for future use. |
#### Usage object
| Field | Type | Description |
|---|---|---|
| prompt_tokens | integer | Number of tokens in the prompt. |
| completion_tokens | integer | Number of tokens in the completion. |
| total_tokens | integer | Sum of prompt and completion tokens. |
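When using the OpenAI Python SDK, these response, choice, and usage fields are exposed as attributes on the returned object; a brief sketch (host and token are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-token")

response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Explain quicksort in two sentences."}],
)

# `model` reflects the backend that actually served the request, which may
# differ from the "auto" alias sent in the request.
print(response.model)
print(response.choices[0].finish_reason)
print(response.usage.prompt_tokens, response.usage.completion_tokens,
      response.usage.total_tokens)
```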
#### Example response
```json
{
  "id": "chatcmpl-67a3f2c6-1a",
  "object": "chat.completion",
  "created": 1738867398,
  "model": "gpt-4o-mini",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Quicksort is a divide-and-conquer algorithm that selects a pivot element and partitions the array into elements less than and greater than the pivot. It then recursively sorts each partition, achieving O(n log n) average-case performance."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 28,
    "completion_tokens": 42,
    "total_tokens": 70
  }
}
```
### Streaming response (SSE)

When `stream` is true, llmsoup returns a stream of Server-Sent Events. Each event is a `data:` line containing a JSON chunk, terminated by a `data: [DONE]` sentinel.
Implementation note: llmsoup sends the request to the upstream model with `stream=false`, receives the complete response, and then constructs the SSE stream locally. This means first-token latency is equivalent to a non-streaming request — the full response is generated before any events are emitted. Streaming is useful for progressive UI rendering on the client side, but does not reduce time-to-first-token.
#### Chunk fields
| Field | Type | Description |
|---|---|---|
| id | string | Same request identifier as non-streaming. |
| object | string | Always "chat.completion.chunk". |
| created | integer | Unix timestamp (seconds). |
| model | string | Model that served the request. |
| choices | array | Array with one chunk choice. |
#### Chunk choice fields
| Field | Type | Description |
|---|---|---|
| index | integer | Choice index (always 0). |
| delta | object | Incremental content update. |
| delta.role | string | "assistant" — present in the first chunk only. |
| delta.content | string | Content fragment. |
| delta.tool_calls | array | Tool call deltas, if any. |
| finish_reason | string or null | null until the final chunk, then "stop", "length", etc. |
#### SSE wire format
```
data: {"id":"chatcmpl-67a3f2c6-1a","object":"chat.completion.chunk","created":1738867398,"model":"gpt-4o-mini","choices":[{"index":0,"delta":{"role":"assistant","content":"Quick"},"finish_reason":null}]}

data: {"id":"chatcmpl-67a3f2c6-1a","object":"chat.completion.chunk","created":1738867398,"model":"gpt-4o-mini","choices":[{"index":0,"delta":{"content":"sort is"},"finish_reason":null}]}

data: {"id":"chatcmpl-67a3f2c6-1a","object":"chat.completion.chunk","created":1738867398,"model":"gpt-4o-mini","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```

Each `data:` line is followed by two newlines (`\n\n`). The `[DONE]` sentinel signals the end of the stream.
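If you are not using an SDK, the stream can be consumed directly. The following is a minimal sketch using Python's `requests` library (host and token are placeholders); it filters `data:` lines, stops at the `[DONE]` sentinel, and prints each content fragment.

```python
import json
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # placeholder host/port
    headers={"Authorization": "Bearer your-token"},
    json={
        "model": "auto",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
    },
    stream=True,
    timeout=60,
)
resp.raise_for_status()

for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data: "):
        continue  # skip blank separator lines between events
    payload = line[len("data: "):]
    if payload == "[DONE]":
        break  # end-of-stream sentinel
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)
print()
```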
#### Streaming curl example
```bash
curl -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```
Section titled “Streaming Python example”from openai import OpenAI
client = OpenAI( base_url="http://localhost:8080/v1", api_key="your-token",)
stream = client.chat.completions.create( model="auto", messages=[{"role": "user", "content": "Hello!"}], stream=True,)
for chunk in stream: if chunk.choices[0].delta.content: print(chunk.choices[0].delta.content, end="", flush=True)print()Error responses
### Error responses

All errors follow the OpenAI error response format:
{ "error": { "message": "Human-readable error description", "type": "error_type", "code": "error_code" }}Error types
#### Error types

| HTTP Status | type | code | When |
|---|---|---|---|
| 400 | invalid_request_error | null | Malformed JSON, missing required fields, request too large for context window |
| 401 | invalid_request_error | invalid_api_key | Missing or invalid Bearer token |
| 500 | internal_error | internal_error | Routing failure, model call failure, serialization error |
#### Common error scenarios
| Scenario | Status | Message |
|---|---|---|
| Invalid JSON body | 400 | Invalid JSON: {details} |
| Missing model field | 400 | Missing required field: model |
| Missing messages field | 400 | Missing required field: messages |
| Request exceeds context window | 400 | Request of ~{tokens} exceeds context window |
| No routing decision available | 500 | No routing decision available |
| Upstream model call failed | 500 | Model call failed: {error} |
| Response serialization error | 500 | Failed to serialize response |
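With the OpenAI Python SDK (v1+), these statuses surface as the SDK's standard exception classes, so client-side handling might look like the following sketch (host and token are placeholders):

```python
import openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-token")

try:
    response = client.chat.completions.create(
        model="auto",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)
except openai.AuthenticationError as e:
    # 401: missing or invalid Bearer token
    print("auth failed:", e.message)
except openai.BadRequestError as e:
    # 400: malformed JSON, missing fields, or request exceeds context window
    print("bad request:", e.message)
except openai.InternalServerError as e:
    # 500: routing or upstream model failure
    print("server error:", e.message)
```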
### Response headers
#### Cost headers
When `defaults.include_cost_headers` is true (the default), llmsoup adds cost and routing information to every response:
| Header | Type | Description |
|---|---|---|
| x-llmsoup-cost | string | Total request cost (model + routing) formatted to 6 decimal places (e.g., "0.001250"). Currency is set in model pricing config. |
| x-llmsoup-model | string | Name of the model that served the request (e.g., "gpt-4o-mini"). |
| x-llmsoup-tokens-prompt | string | Prompt token count (e.g., "150"). |
| x-llmsoup-tokens-completion | string | Completion token count (e.g., "42"). |
Cost headers are included on both non-streaming and streaming responses.
#### Example response headers
```http
HTTP/1.1 200 OK
content-type: application/json
x-llmsoup-cost: 0.000125
x-llmsoup-model: gpt-4o-mini
x-llmsoup-tokens-prompt: 28
x-llmsoup-tokens-completion: 42
```
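To read these headers programmatically with the OpenAI Python SDK (v1+), the raw-response interface exposes the HTTP headers alongside the parsed completion; a sketch (host and token are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-token")

# with_raw_response returns the HTTP response so custom headers can be read.
raw = client.chat.completions.with_raw_response.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello!"}],
)

print("cost:", raw.headers.get("x-llmsoup-cost"))
print("model:", raw.headers.get("x-llmsoup-model"))

completion = raw.parse()  # the regular ChatCompletion object
print(completion.choices[0].message.content)
```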
#### Plugin headers

Plugins can inject additional response headers after the standard cost headers. The built-in security plugins add the following headers when triggered:
| Header | Type | Description |
|---|---|---|
| x-llmsoup-jailbreak-blocked | string | "true" when the jailbreak detection plugin blocks a request. |
| x-llmsoup-jailbreak-confidence | string | Confidence score (e.g., "0.95") for the jailbreak detection. |
| x-llmsoup-pii-blocked | string | "true" when the PII detection plugin blocks a request. |
| x-llmsoup-pii-types | string | Comma-separated PII types detected (e.g., "email,phone"). |
See the Plugins Reference for full plugin documentation and configuration.
## GET /metrics
Returns Prometheus-formatted metrics. This endpoint does not require authentication.
```
Content-Type: text/plain; version=0.0.4
```
### Example
```bash
curl http://localhost:8080/metrics
```

```
# HELP llmsoup_requests_total Total number of requests
# TYPE llmsoup_requests_total counter
llmsoup_requests_total{method="POST",endpoint="/v1/chat/completions",status="200"} 1523

# HELP llmsoup_active_connections Current number of active connections
# TYPE llmsoup_active_connections gauge
llmsoup_active_connections 3

# HELP llmsoup_model_request_duration_seconds Model request duration
# TYPE llmsoup_model_request_duration_seconds histogram
llmsoup_model_request_duration_seconds_bucket{model="gpt-4o-mini",le="0.5"} 1200
llmsoup_model_request_duration_seconds_bucket{model="gpt-4o-mini",le="1.0"} 1490
llmsoup_model_request_duration_seconds_bucket{model="gpt-4o-mini",le="+Inf"} 1523

# HELP llmsoup_tokens_total Total tokens processed
# TYPE llmsoup_tokens_total counter
llmsoup_tokens_total{type="prompt"} 45200
llmsoup_tokens_total{type="completion"} 12800
```

All metric names use the `llmsoup_` prefix with snake_case naming and unit suffixes (`_total`, `_seconds`, `_bytes`). A full metrics reference is available in the Metrics Reference documentation page.
## curl examples
### Basic completion
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
### Streaming completion

```bash
curl -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "Write a haiku about coding."}
    ],
    "stream": true
  }'
```
### With tool calling

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "What is the weather in Paris?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get current weather for a city",
          "parameters": {
            "type": "object",
            "properties": {
              "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
          }
        }
      }
    ]
  }'
```
### Using the OpenAI Python SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-token",
)

response = client.chat.completions.create(
    model="auto",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain recursion simply."},
    ],
    temperature=0.5,
    max_completion_tokens=200,
)

print(response.choices[0].message.content)
```