# OpenAI-Compatible API
SwarmLLM provides a drop-in replacement for the OpenAI API. All endpoints require Bearer token authentication.
## POST /v1/chat/completions
Chat completions with streaming support.
```bash
curl http://localhost:8800/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder-7b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is Rust?"}
    ],
    "stream": true,
    "max_tokens": 512,
    "temperature": 0.7
  }'
```
### Request Body
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `model` | string | yes | — | Model name, or `"auto"` to use the first available model |
| `messages` | array | yes | — | Chat messages (role + content). Roles: `system`, `user`, `assistant`, `tool` |
| `stream` | boolean | no | `false` | Enable SSE streaming |
| `max_tokens` | integer | no | `2048` | Max tokens to generate (clamped to 1–32768) |
| `temperature` | float | no | `0.7` | Sampling temperature (0.0–2.0) |
| `top_p` | float | no | `1.0` | Nucleus sampling threshold |
| `stop` | string or array | no | — | Stop sequence(s), 1–256 chars each, max 16 |
| `frequency_penalty` | float | no | `0.0` | Frequency penalty (-2.0 to 2.0) |
| `presence_penalty` | float | no | `0.0` | Presence penalty (-2.0 to 2.0) |
| `tools` | array | no | — | Tool/function definitions for function calling |
| `tool_choice` | string or object | no | — | `"none"`, `"auto"`, `"required"`, or `{"type":"function","function":{"name":"..."}}` |
| `logprobs` | boolean | no | `false` | Return log probabilities for output tokens. Supported on split-model (candle) inference paths |
| `top_logprobs` | integer | no | — | Number of top log probabilities per token (0–20; requires `logprobs: true`). Computed from pre-sampling (raw) logits, per the OpenAI spec |
| `session_id` | string | no | — | Reuse the KV cache from a previous request (see the example below) |
| `lora_adapter` | string | no | — | LoRA adapter ID for fine-tuned inference (see the example below) |
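`session_id` and `lora_adapter` are not part of the OpenAI spec, so the official Python client won't accept them as keyword arguments; pass them through `extra_body`, which the client forwards verbatim in the request JSON. A minimal sketch (the session and adapter IDs are hypothetical):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8800/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="qwen2.5-coder-7b",
    messages=[{"role": "user", "content": "Continue where we left off."}],
    max_tokens=512,
    extra_body={
        "session_id": "sess-1234",     # hypothetical ID from an earlier request
        "lora_adapter": "my-adapter",  # hypothetical adapter ID
    },
)
```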
### Response (non-streaming)
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "qwen2.5-coder-7b",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "Rust is a systems programming language..."},
    "finish_reason": "stop",
    "logprobs": null
  }],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 42,
    "total_tokens": 57
  }
}
```
### Response with logprobs

When `logprobs: true` and `top_logprobs: 3`:
```json
{
  "choices": [{
    "message": {"role": "assistant", "content": "Hello"},
    "finish_reason": "stop",
    "logprobs": {
      "content": [{
        "token": "Hello",
        "logprob": -0.234,
        "bytes": null,
        "top_logprobs": [
          {"token": "Hello", "logprob": -0.234, "bytes": null},
          {"token": "Hi", "logprob": -1.456, "bytes": null},
          {"token": "Hey", "logprob": -2.012, "bytes": null}
        ]
      }]
    }
  }]
}
```
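Requesting the same from the Python client is a matter of two extra fields; a short sketch (assumes the `client` set up earlier):

```python
response = client.chat.completions.create(
    model="qwen2.5-coder-7b",
    messages=[{"role": "user", "content": "Say hello"}],
    logprobs=True,
    top_logprobs=3,
)

# One entry per generated token, each carrying its top-3 alternatives.
for entry in response.choices[0].logprobs.content:
    alts = ", ".join(f"{t.token!r}: {t.logprob:.3f}" for t in entry.top_logprobs)
    print(f"{entry.token!r} ({entry.logprob:.3f})  top: {alts}")
```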
### Response with tool_calls

When the model calls a tool, `finish_reason` is `"tool_calls"` and `content` is `null`:
```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\":\"NYC\"}"
        }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}
```
### Streaming (SSE)

When `stream: true`, responses arrive as Server-Sent Events:
data: {"id":"chatcmpl-abc123","choices":[{"delta":{"content":"Rust"},"index":0}]}
data: {"id":"chatcmpl-abc123","choices":[{"delta":{"content":" is"},"index":0}]}
data: [DONE]
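Each `data:` line is a JSON chunk; concatenating the `delta.content` fragments reconstructs the full message. A minimal sketch that parses the stream with nothing but `requests`, for cases where the OpenAI client isn't available:

```python
import json
import requests

resp = requests.post(
    "http://localhost:8800/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "qwen2.5-coder-7b",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
    },
    stream=True,
)

for line in resp.iter_lines():
    if not line.startswith(b"data: "):
        continue  # skip blank keep-alive lines
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    if chunk.get("choices"):
        delta = chunk["choices"][0]["delta"]
        print(delta.get("content") or "", end="", flush=True)
```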
## GET /v1/models

List available models.
```bash
curl http://localhost:8800/v1/models \
  -H "Authorization: Bearer YOUR_API_KEY"
```

```json
{
  "object": "list",
  "data": [
    {
      "id": "qwen2.5-coder-7b",
      "object": "model",
      "owned_by": "swarmllm"
    }
  ]
}
```
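The same listing through the Python client; taking the first returned ID is roughly what `model: "auto"` does server-side:

```python
models = client.models.list()
for m in models.data:
    print(m.id)

# "auto" picks the first available model; the client-side equivalent:
first_model = models.data[0].id
```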
## GET /v1/status

Node status (SwarmLLM extension).
```bash
curl http://localhost:8800/v1/status \
  -H "Authorization: Bearer YOUR_API_KEY"
```
## Using with OpenAI Client Libraries
### Python (openai)
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8800/v1",
    api_key="YOUR_API_KEY",
)

# Basic streaming
response = client.chat.completions.create(
    model="qwen2.5-coder-7b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in response:
    # Guard against chunks with empty choices (e.g. a trailing usage chunk).
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
### Python (function calling)
```python
response = client.chat.completions.create(
    model="qwen2.5-coder-7b",
    messages=[{"role": "user", "content": "What's the weather in NYC?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }],
    tool_choice="auto",
)

if response.choices[0].finish_reason == "tool_calls":
    for tc in response.choices[0].message.tool_calls:
        print(f"Call {tc.function.name}({tc.function.arguments})")
```
### JavaScript (openai)
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8800/v1",
  apiKey: "YOUR_API_KEY",
});

const stream = await client.chat.completions.create({
  model: "qwen2.5-coder-7b",
  messages: [{ role: "user", content: "Hello!" }],
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
```
### curl (streaming)

`-N` disables curl's output buffering so tokens print as they arrive:

```bash
curl -N http://localhost:8800/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen2.5-coder-7b","messages":[{"role":"user","content":"Hello!"}],"stream":true}'
```
## POST /v1/embeddings

Returns `503 Service Unavailable`. Text embeddings are not supported via the subprocess inference path. Use a dedicated embedding provider or the OpenAI embeddings API directly.
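One workaround is a second client pointed at a provider that does serve embeddings, keeping chat traffic on SwarmLLM. A sketch assuming an OpenAI key and the `text-embedding-3-small` model:

```python
from openai import OpenAI

chat = OpenAI(base_url="http://localhost:8800/v1", api_key="YOUR_API_KEY")
embed = OpenAI(api_key="YOUR_OPENAI_KEY")  # defaults to api.openai.com

vec = embed.embeddings.create(
    model="text-embedding-3-small",
    input="Rust is a systems programming language.",
)
print(len(vec.data[0].embedding))  # embedding dimensionality
```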
## GET /v1/providers

List configured cloud providers and their available models.
```bash
curl http://localhost:8800/v1/providers \
  -H "Authorization: Bearer YOUR_API_KEY"
```
Returns an array of `{ name, models: [...] }` objects, one per configured provider.
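As with `/v1/status`, there is no client wrapper; a direct GET works. This sketch assumes `models` is a list of model-ID strings:

```python
import requests

providers = requests.get(
    "http://localhost:8800/v1/providers",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
).json()

for p in providers:
    print(p["name"], "->", ", ".join(p["models"]))
```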