
# API Reference

llmsoup exposes an OpenAI-compatible HTTP API. Any application or SDK that works with the OpenAI Chat Completions API can point at llmsoup with zero code changes — just swap the base URL.

## Authentication

All endpoints except `/metrics` require a Bearer token in the `Authorization` header:

```
Authorization: Bearer <your-token>
```

Tokens are configured in the llmsoup config file via `auth.tokens` (inline list) or `auth.tokens_file` (external file). When authentication is enabled and the token is missing or invalid, llmsoup returns:

```http
HTTP/1.1 401 Unauthorized
Content-Type: application/json

{
  "error": {
    "message": "Invalid or missing authentication token",
    "type": "invalid_request_error",
    "code": "invalid_api_key"
  }
}
```

If authentication is disabled in the config (`auth.enabled: false`), all requests are accepted without a token.
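For example, a client can detect the 401 shape directly. A minimal sketch using the requests library (the base URL and port follow the curl examples below):

```python
import requests

# Send a request without a token; with auth enabled, llmsoup returns 401.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"model": "auto", "messages": [{"role": "user", "content": "Hi"}]},
)
if resp.status_code == 401:
    err = resp.json()["error"]
    print(f"Auth failed: {err['message']} (code={err['code']})")
```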


## Endpoints

| Method | Path | Auth required | Description |
| --- | --- | --- | --- |
| POST | `/v1/chat/completions` | Yes | Chat completion (non-streaming and streaming) |
| GET | `/metrics` | No | Prometheus metrics |

## POST /v1/chat/completions

Send a JSON body matching the OpenAI Chat Completions format.

### Request fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `model` | string | Yes | Model name. llmsoup routes to the best backend model based on your routing rules; you can also target a specific configured model by name. |
| `messages` | array | Yes | Conversation messages. See Message object. |
| `temperature` | number | No | Sampling temperature (0–2). Passed through to the routed model. |
| `top_p` | number | No | Nucleus sampling parameter. Passed through to the routed model. |
| `max_tokens` | integer | No | Maximum tokens to generate (deprecated; use `max_completion_tokens`). |
| `max_completion_tokens` | integer | No | Maximum completion tokens. Preferred over `max_tokens`. |
| `stream` | boolean | No | Set true for Server-Sent Events streaming. Default: false. |

llmsoup tolerates additional fields (e.g., `tools`, `tool_choice`, `response_format`, `logprobs`) and passes them through to the upstream model unchanged. This means tool/function calling, structured output, and other OpenAI features work as long as the routed model supports them.
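For instance, structured output requested via `response_format` is forwarded untouched. A sketch with the OpenAI Python SDK, assuming the routed model supports JSON mode:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-token")

# response_format is not interpreted by llmsoup; it is forwarded to the
# upstream model, which must support JSON mode for this to work.
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "List three primes as JSON."}],
    response_format={"type": "json_object"},
)
print(response.choices[0].message.content)
```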

### Message object

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `role` | string | Yes | One of `system`, `user`, `assistant`, `tool`, or `developer`. |
| `content` | string or array | No | Text string, or an array of content parts for multimodal input. Omit for tool-call assistant messages. |
| `tool_calls` | array | No | Tool calls made by the assistant (assistant messages only). |
| `tool_call_id` | string | No | ID of the tool call being responded to (tool messages only). |

Content parts (when `content` is an array):

```json
[
  { "type": "text", "text": "Describe this image" },
  { "type": "image_url", "image_url": { "url": "https://example.com/photo.png" } }
]
```

Multimodal content parts (images, audio) are passed through to the upstream model. Support depends on the routed model’s capabilities.
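As a sketch, a multimodal request with the OpenAI Python SDK; the image URL is the placeholder from the snippet above, and the routed model must accept image input:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-token")

# Mixed text and image content parts; llmsoup forwards them as-is.
response = client.chat.completions.create(
    model="auto",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```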

### Tool call object

| Field | Type | Description |
| --- | --- | --- |
| `id` | string | Unique identifier for the tool call. |
| `type` | string | Always `"function"`. |
| `function.name` | string | Name of the function to call. |
| `function.arguments` | string | JSON-encoded arguments. |
### Example request

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quicksort in two sentences."}
    ],
    "temperature": 0.7,
    "max_completion_tokens": 256
  }'
```

## Response (non-streaming)

When `stream` is false (the default), llmsoup returns a single JSON object.

| Field | Type | Description |
| --- | --- | --- |
| `id` | string | Unique request identifier (format: `chatcmpl-{hex}-{hex}`). |
| `object` | string | Always `"chat.completion"`. |
| `created` | integer | Unix timestamp (seconds) when the completion was created. |
| `model` | string | The model that actually served the request. |
| `choices` | array | Array of completion choices (typically one). See Choice object. |
| `usage` | object | Token usage statistics. See Usage object. |
| `service_tier` | string | Service tier (optional; omitted when not applicable). |
### Choice object

| Field | Type | Description |
| --- | --- | --- |
| `index` | integer | Zero-based index of this choice. |
| `message` | object | The assistant's response message. |
| `message.role` | string | Always `"assistant"`. |
| `message.content` | string or null | Text content. null when tool calls are made. |
| `message.tool_calls` | array or null | Tool calls, if any. |
| `finish_reason` | string | Why generation stopped: `"stop"`, `"length"`, `"content_filter"`, `"tool_calls"`, or `"function_call"`. |
| `logprobs` | object or absent | Log probability information. Currently omitted from responses; reserved for future use. |
### Usage object

| Field | Type | Description |
| --- | --- | --- |
| `prompt_tokens` | integer | Number of tokens in the prompt. |
| `completion_tokens` | integer | Number of tokens in the completion. |
| `total_tokens` | integer | Sum of prompt and completion tokens. |
### Example response

```json
{
  "id": "chatcmpl-67a3f2c6-1a",
  "object": "chat.completion",
  "created": 1738867398,
  "model": "gpt-4o-mini",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Quicksort is a divide-and-conquer algorithm that selects a pivot element and partitions the array into elements less than and greater than the pivot. It then recursively sorts each partition, achieving O(n log n) average-case performance."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 28,
    "completion_tokens": 42,
    "total_tokens": 70
  }
}
```

## Streaming

When `stream` is true, llmsoup returns a stream of Server-Sent Events. Each event is a `data:` line containing a JSON chunk, terminated by a `data: [DONE]` sentinel.

Implementation note: llmsoup sends the request to the upstream model with stream=false, receives the complete response, and then constructs the SSE stream locally. This means first-token latency is equivalent to a non-streaming request — the full response is generated before any events are emitted. Streaming is useful for progressive UI rendering on the client side, but does not reduce time-to-first-token.
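One way to see this behavior is to time the arrival of the first chunk. A sketch with the OpenAI Python SDK (the `auto` model name follows the examples in this document):

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-token")

start = time.monotonic()
stream = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    # With llmsoup's buffered streaming, this delay is roughly the full
    # generation time, not the upstream model's time-to-first-token.
    print(f"first chunk after {time.monotonic() - start:.2f}s")
    break
```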

### Chunk object

| Field | Type | Description |
| --- | --- | --- |
| `id` | string | Same request identifier as non-streaming. |
| `object` | string | Always `"chat.completion.chunk"`. |
| `created` | integer | Unix timestamp (seconds). |
| `model` | string | Model that served the request. |
| `choices` | array | Array with one chunk choice. |
### Chunk choice object

| Field | Type | Description |
| --- | --- | --- |
| `index` | integer | Choice index (always 0). |
| `delta` | object | Incremental content update. |
| `delta.role` | string | `"assistant"`; present in the first chunk only. |
| `delta.content` | string | Content fragment. |
| `delta.tool_calls` | array | Tool call deltas, if any. |
| `finish_reason` | string or null | null until the final chunk, then `"stop"`, `"length"`, etc. |
data: {"id":"chatcmpl-67a3f2c6-1a","object":"chat.completion.chunk","created":1738867398,"model":"gpt-4o-mini","choices":[{"index":0,"delta":{"role":"assistant","content":"Quick"},"finish_reason":null}]}
data: {"id":"chatcmpl-67a3f2c6-1a","object":"chat.completion.chunk","created":1738867398,"model":"gpt-4o-mini","choices":[{"index":0,"delta":{"content":"sort is"},"finish_reason":null}]}
data: {"id":"chatcmpl-67a3f2c6-1a","object":"chat.completion.chunk","created":1738867398,"model":"gpt-4o-mini","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]

Each `data:` line is followed by two newlines (`\n\n`). The `[DONE]` sentinel signals the end of the stream.
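If you are not using an SDK, the stream is straightforward to parse by hand. A minimal sketch with the requests library, assuming exactly the framing described above (error handling omitted):

```python
import json
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    headers={"Authorization": "Bearer your-token"},
    json={
        "model": "auto",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
    },
    stream=True,
)
for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue  # skip the blank lines between events
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        break  # end-of-stream sentinel
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)
print()
```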

### Streaming with curl

```bash
curl -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```
### Streaming with the OpenAI Python SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-token",
)

stream = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```

## Errors

All errors follow the OpenAI error response format:

```json
{
  "error": {
    "message": "Human-readable error description",
    "type": "error_type",
    "code": "error_code"
  }
}
```
| HTTP status | type | code | When |
| --- | --- | --- | --- |
| 400 | `invalid_request_error` | null | Malformed JSON, missing required fields, request too large for context window |
| 401 | `invalid_request_error` | `invalid_api_key` | Missing or invalid Bearer token |
| 500 | `internal_error` | `internal_error` | Routing failure, model call failure, serialization error |
### Error messages

| Scenario | Status | Message |
| --- | --- | --- |
| Invalid JSON body | 400 | `Invalid JSON: {details}` |
| Missing `model` field | 400 | `Missing required field: model` |
| Missing `messages` field | 400 | `Missing required field: messages` |
| Request exceeds context window | 400 | `Request of ~{tokens} exceeds context window` |
| No routing decision available | 500 | `No routing decision available` |
| Upstream model call failed | 500 | `Model call failed: {error}` |
| Response serialization error | 500 | `Failed to serialize response` |
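Because the error body matches OpenAI's format, the OpenAI SDK's error types work unchanged. A sketch with the Python SDK; the exception class is the SDK's own, not llmsoup-specific:

```python
from openai import OpenAI, APIStatusError

client = OpenAI(base_url="http://localhost:8080/v1", api_key="wrong-token")

try:
    client.chat.completions.create(
        model="auto",
        messages=[{"role": "user", "content": "Hello!"}],
    )
except APIStatusError as e:
    # e.status_code is the HTTP status; the body follows the format above.
    print(f"HTTP {e.status_code}: {e.response.json()['error']['message']}")
```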

## Cost headers

When `defaults.include_cost_headers` is true (the default), llmsoup adds cost and routing information to every response:

| Header | Type | Description |
| --- | --- | --- |
| `x-llmsoup-cost` | string | Total request cost (model + routing) formatted to 6 decimal places (e.g., `"0.001250"`). Currency is set in the model pricing config. |
| `x-llmsoup-model` | string | Name of the model that served the request (e.g., `"gpt-4o-mini"`). |
| `x-llmsoup-tokens-prompt` | string | Prompt token count (e.g., `"150"`). |
| `x-llmsoup-tokens-completion` | string | Completion token count (e.g., `"42"`). |

Cost headers are included on both non-streaming and streaming responses.

```http
HTTP/1.1 200 OK
content-type: application/json
x-llmsoup-cost: 0.000125
x-llmsoup-model: gpt-4o-mini
x-llmsoup-tokens-prompt: 28
x-llmsoup-tokens-completion: 42
```
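To read these headers from the OpenAI Python SDK, use the SDK's raw-response wrapper. A sketch (the header names are the ones documented above):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-token")

raw = client.chat.completions.with_raw_response.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello!"}],
)
print("cost:", raw.headers.get("x-llmsoup-cost"))
print("model:", raw.headers.get("x-llmsoup-model"))

response = raw.parse()  # the usual ChatCompletion object
print(response.choices[0].message.content)
```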

### Plugin headers

Plugins can inject additional response headers after the standard cost headers. The built-in security plugins add the following headers when triggered:

| Header | Type | Description |
| --- | --- | --- |
| `x-llmsoup-jailbreak-blocked` | string | `"true"` when the jailbreak detection plugin blocks a request. |
| `x-llmsoup-jailbreak-confidence` | string | Confidence score (e.g., `"0.95"`) for the jailbreak detection. |
| `x-llmsoup-pii-blocked` | string | `"true"` when the PII detection plugin blocks a request. |
| `x-llmsoup-pii-types` | string | Comma-separated PII types detected (e.g., `"email,phone"`). |

See the Plugins Reference for full plugin documentation and configuration.


## GET /metrics

Returns Prometheus-formatted metrics. This endpoint does not require authentication.

```
Content-Type: text/plain; version=0.0.4
```

```bash
curl http://localhost:8080/metrics
```

```
# HELP llmsoup_requests_total Total number of requests
# TYPE llmsoup_requests_total counter
llmsoup_requests_total{method="POST",endpoint="/v1/chat/completions",status="200"} 1523

# HELP llmsoup_active_connections Current number of active connections
# TYPE llmsoup_active_connections gauge
llmsoup_active_connections 3

# HELP llmsoup_model_request_duration_seconds Model request duration
# TYPE llmsoup_model_request_duration_seconds histogram
llmsoup_model_request_duration_seconds_bucket{model="gpt-4o-mini",le="0.5"} 1200
llmsoup_model_request_duration_seconds_bucket{model="gpt-4o-mini",le="1.0"} 1490
llmsoup_model_request_duration_seconds_bucket{model="gpt-4o-mini",le="+Inf"} 1523

# HELP llmsoup_tokens_total Total tokens processed
# TYPE llmsoup_tokens_total counter
llmsoup_tokens_total{type="prompt"} 45200
llmsoup_tokens_total{type="completion"} 12800
```

All metric names use the `llmsoup_` prefix with snake_case naming and unit suffixes (`_total`, `_seconds`, `_bytes`). A full metrics reference is available in the Metrics Reference documentation.
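As a sketch, the endpoint can be scraped and parsed with the prometheus_client library's text parser (the metric name is taken from the sample output above):

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

text = requests.get("http://localhost:8080/metrics").text

# Iterate over metric families and print request counts per label set.
for family in text_string_to_metric_families(text):
    if family.name == "llmsoup_requests_total":
        for sample in family.samples:
            print(sample.labels, sample.value)
```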


## More examples

### Basic completion

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
### Streaming request

```bash
curl -N -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "Write a haiku about coding."}
    ],
    "stream": true
  }'
```
### Tool calling

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "What is the weather in Paris?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get current weather for a city",
          "parameters": {
            "type": "object",
            "properties": {
              "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
          }
        }
      }
    ]
  }'
```
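When the routed model decides to call the tool, the response carries `tool_calls` instead of text content. A sketch of handling that with the OpenAI Python SDK, reusing the hypothetical `get_weather` tool from the request above:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-token")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:
    for call in message.tool_calls:
        # function.arguments is a JSON-encoded string, per the tool call object above.
        args = json.loads(call.function.arguments)
        print(f"model wants {call.function.name}({args})")
```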
### Python SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-token",
)

response = client.chat.completions.create(
    model="auto",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain recursion simply."},
    ],
    temperature=0.5,
    max_completion_tokens=200,
)
print(response.choices[0].message.content)
```