SwarmLLM
Run AI together — for free. A single Rust binary that turns your computer into a node in a peer-to-peer LLM inference network. Pool hardware with others to run models too large for any single machine, with no API tokens, no cloud fees, and end-to-end encryption between every peer.
This site is the long-form reference. For source code, releases, and issues, head to enapt/SwarmLLM.
What you can do with it
- Chat with AI locally — open `localhost:8800` after running the binary; the dashboard auto-detects your hardware and walks you through downloading a model.
- Use it as a drop-in API — OpenAI-compatible `/v1/chat/completions`, the Anthropic Messages API at `/v1/messages` (full Claude Code support), an MCP server with seven tools, plus 12 cloud providers reachable through one endpoint.
- Pool hardware — your phone with 2 GB of RAM can host a few shards of a 70B model and contribute alongside someone else's GPU. Shards download individually via byte-range requests; no node ever needs the full file.
- Stay private — every P2P hop uses X25519 + ChaCha20-Poly1305 with forward secrecy. The optional boomerang pipeline ensures no remote node ever sees plaintext.
Single-node performance (RTX 3070 Laptop, 8 GB VRAM)
| Model | GPU | CPU |
|---|---|---|
| TinyLlama 1.1B Q4 | 27.2 tok/s | 4.2 tok/s |
| Gemma-2 2B Q4 | 20.6 tok/s | 3.5 tok/s |
| Phi-3.5 3.8B Q4 | 46.4 tok/s | 1.8 tok/s |
| Qwen2.5-Coder 7B Q4 | 29.0 tok/s | 2.4 tok/s |
Distributed-inference speedups (all default-on): prefix-caching, batched prefill, the Parallax scheduler, and cross-node KV sharing. The cross-node prefix-KV benchmark (2026-04-20) measured a 12.9× iter-1 TTFT speedup on a 672-token Qwen-7B prompt when a peer had the same prefix already cached (151.7 s → 11.8 s, CPU-CPU, localhost). Each knob is documented in Performance & Inference Speedups.
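The headline multiple follows directly from the two quoted times (a quick sanity check, not an additional benchmark):

```python
# Cross-node prefix-KV benchmark figures quoted above (Qwen-7B, 672-token prompt)
iter1_local_prefill_s = 151.7  # no peer cache: full local prefill
iter1_prefix_hit_s = 11.8      # a peer had the same prefix cached

speedup = iter1_local_prefill_s / iter1_prefix_hit_s
assert round(speedup, 1) == 12.9  # the reported 12.9x iter-1 TTFT speedup
```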
How a node fits together
┌──────────────────────────────────────────────────────────────┐
│ Your computer (port 8800) │
│ │
│ P2P node HTTP server Web dashboard │
│ TCP+QUIC OpenAI · Anthropic (embedded) │
│ Noise+Yamux MCP · Admin 21 languages │
│ │
│ ───────────────────────────────────────────────────────── │
│ 11 Tokio subsystems · DashMap shared state · redb storage │
└──────────────────────────────────────────────────────────────┘
Each node simultaneously: connects over TCP and QUIC, serves four HTTP API surfaces (OpenAI · Anthropic · MCP · admin) on the same port, hosts shard files for popular models, participates in distributed inference pipelines, and ships an embedded web dashboard.
Where to go next
Status
Alpha — actively developed and moving into broader testing. Distributed inference is stable across multi-node deployments. Windows release binaries reach Linux parity (Round 8, 2026-04-23). 887 lib tests + 75 integration tests run on every PR; continuous security sweeps. Report issues.
Platform support
| Platform | Status | GPU |
|---|---|---|
| Linux x86_64 | Available | CUDA |
| Windows x86_64 | Available | CUDA |
| macOS aarch64 (Apple Silicon) | Binary available; compile-validated | CPU only (Metal planned) |
| macOS x86_64 (Intel) | Best-effort | CPU only |
| Linux aarch64 | Best-effort | CPU only |
macOS aarch64 runs `cargo test --lib` + `cargo clippy` on `macos-15` in CI. Integration tests stay Linux-only for now.
All binaries live on the Releases page.
Getting Started
SwarmLLM lets you combine your hardware with others to run AI models too large for any single machine — for free, with no API tokens or cloud fees. It's open-source and your conversations are end-to-end encrypted.
This guide walks you through installation, downloading your first model, and chatting.
Prerequisites
- A computer running Windows, macOS, or Linux
- At least 4 GB of RAM (8+ GB recommended)
- At least 2 GB of free disk space (more for larger models)
- An internet connection (for downloading models and connecting to peers)
Chapters
- Installation — Download and run SwarmLLM on your platform
- First Model — Download and chat with your first AI model
- Joining the Network — Connect to peers for distributed inference
Quick Commands
./swarmllm run # Start the node (default port 8800)
./swarmllm run -p 9000 # Start on a different port
./swarmllm run -v # Start with verbose logging
./swarmllm status # Check if the node is running
./swarmllm chat # Interactive CLI chat
./swarmllm bench # Benchmark inference performance
./swarmllm peers # List connected peers
./swarmllm version # Show version number
Installation
Download
Download the right file for your system from the GitHub Releases page:
| Your Computer | File Name |
|---|---|
| Windows (most PCs) | SwarmLLM-Setup.exe (installer — auto-detects GPU) |
| Windows (raw binary, GPU) | swarmllm-windows-x86_64-gpu.zip |
| Windows (raw binary, CPU) | swarmllm-windows-x86_64-cpu.zip |
| Mac (M1/M2/M3/M4) | swarmllm-macos-aarch64.tar.gz (compile-validated) |
| Mac (older Intel) | Best-effort — build from source |
| Linux (most distros) | swarmllm-linux-x86_64.tar.gz |
| Linux (NVIDIA GPU) | swarmllm-linux-x86_64-cuda.tar.gz |
Not sure which Mac? Apple menu > "About This Mac." If it says "Apple M1" (or M2/M3/etc.), pick Apple Silicon. If it says "Intel," pick Intel.
Install & Run
Windows
Recommended — installer: double-click SwarmLLM-Setup.exe. It detects your GPU (NVIDIA / AMD / Intel) and installs the matching binary. If SmartScreen warns you, click More info > Run anyway.
Raw binary alternative: download swarmllm-windows-x86_64-gpu.zip (Vulkan + CUDA static) or swarmllm-windows-x86_64-cpu.zip (CPU-only fallback), extract, and run swarmllm.exe.
From PowerShell on a raw binary:
cd Downloads\swarmllm-windows-x86_64-gpu
.\swarmllm.exe run
macOS
cd ~/Downloads
tar xzf swarmllm-macos-aarch64.tar.gz
cd swarmllm-macos-aarch64
chmod +x swarmllm
./swarmllm run
Note: macOS aarch64 binaries are compile-validated and exercised in CI (test + clippy on `macos-15`); integration tests stay Linux-only for now. Intel Mac users should build from source. If macOS blocks the binary on first launch: System Settings > Privacy & Security > click Open Anyway next to SwarmLLM.
Linux
cd ~/Downloads
tar xzf swarmllm-linux-x86_64.tar.gz
cd swarmllm-linux-x86_64
chmod +x swarmllm
./swarmllm run
Docker
The fastest way to get running on any Linux server:
# 1. Get the compose file and example env
curl -LO https://raw.githubusercontent.com/enapt/SwarmLLM/main/docker-compose.yml
curl -LO https://raw.githubusercontent.com/enapt/SwarmLLM/main/.env.example
# 2. Configure (add API keys, change ports, etc.)
cp .env.example .env
nano .env
# 3. Start
docker compose up -d
For NVIDIA GPU support (requires NVIDIA Container Toolkit):
docker compose --profile gpu up -d
Pre-built images on GHCR:
| Image | Description |
|---|---|
| ghcr.io/enapt/swarmllm:latest | CPU-only |
| ghcr.io/enapt/swarmllm:latest-cuda | NVIDIA GPU (CUDA 12.4) |
| ghcr.io/enapt/swarmllm:0.1.0 | Pinned version (CPU) |
| ghcr.io/enapt/swarmllm:0.1.0-cuda | Pinned version (GPU) |
Data is persisted in Docker volumes. Model shards are stored in the swarmllm-models volume (or bind-mount a host directory via SWARMLLM_MODELS_DIR in .env).
View logs with docker compose logs -f. The API key is printed on first startup.
Cargo Install
Requires Rust 1.80+:
cargo install --git https://github.com/enapt/SwarmLLM.git --tag v0.1.0
swarmllm run
Building from Source
git clone https://github.com/enapt/SwarmLLM.git
cd SwarmLLM
cargo build --release
./target/release/swarmllm run
For CUDA GPU support:
cargo build --release --features candle-cuda
For Apple Silicon: the default build runs on CPU. A Metal-accelerated build is on the roadmap but not yet implemented (no `metal` Cargo feature exists yet); until then, use the default `cargo build --release`.
Open the Dashboard
Once running, open http://localhost:8800 in your browser. The setup wizard will walk you through initial configuration.
Your First Model
You need at least one AI model before you can chat.
Download via Dashboard
- Open the Dashboard at `http://localhost:8800`
- Click Browse HuggingFace in the Models section
- Search for a model (try `TinyLlama` for a small, fast model)
- Choose a quantization variant (Q4_K_M recommended for most hardware)
- Click Add to node — the node downloads its fair share of shards, and peers with auto-manage enabled auto-acquire the rest
- The dashboard auto-refreshes when downloads complete (no page reload needed)
Download via CLI
# Smart distribution: node downloads its fair share, peers get the rest
curl -X POST http://localhost:8800/api/admin/hf/download-shards \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"repo_id": "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF", "filename": "qwen2.5-coder-7b-instruct.Q4_K_M.gguf", "peer_fair_share": true}'
# Or download specific shards manually:
curl -X POST http://localhost:8800/api/admin/hf/download-shards \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"repo_id": "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF", "filename": "qwen2.5-coder-7b-instruct.Q4_K_M.gguf", "shards": [0, 1, 2]}'
Recommended Models by Hardware
| Hardware | Model | Size |
|---|---|---|
| Any (testing) | TinyLlama 1.1B Q4_K_M | ~700 MB |
| 8 GB RAM, no GPU | Qwen2.5-3B Q4_K_M | ~2 GB |
| 8 GB VRAM | Qwen2.5-7B Q4_K_M | ~4.5 GB |
| 16+ GB VRAM | Llama-3-13B Q4_K_M | ~7 GB |
On-Demand Loading
You do not need to pre-load models into VRAM. When you send an inference request for a model whose shards are on disk but not loaded, SwarmLLM automatically loads the model on the fly. If VRAM is full, the least-recently-used model is evicted to make room. The first request to a cold model may take a few extra seconds while loading completes.
Start Chatting
Web UI:
- Click the Chat tab
- Select your model from the dropdown
- Type a message and press Enter
CLI:
./swarmllm chat
# Or with a specific model:
./swarmllm chat --model-name "qwen2.5-coder-7b"
API:
curl http://localhost:8800/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5-coder-7b",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'
What Are Shards?
Large AI models are split into smaller pieces called shards (~512 MB each) so they can be distributed across the network. Each shard contains a subset of the model's transformer layers. SwarmLLM handles this automatically — you just pick a model and download.
A node never needs all shards of a model. In distributed inference, each node loads only the layers it's responsible for.
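The arithmetic behind the shard layout is simple to sketch. Assuming the default 512 MB shard size and ceiling division, the helpers below (`shard_count` and `shard_byte_range` are illustrative names, not SwarmLLM's API) show how a full GGUF maps to shard indices and byte-range requests:

```python
def shard_count(model_bytes: int, shard_mb: int = 512) -> int:
    """Number of fixed-size shards a model file splits into (ceiling division)."""
    shard_bytes = shard_mb * 1024 * 1024
    return -(-model_bytes // shard_bytes)

def shard_byte_range(index: int, model_bytes: int, shard_mb: int = 512) -> tuple[int, int]:
    """Half-open [start, end) byte range a shard's HTTP range request would cover."""
    shard_bytes = shard_mb * 1024 * 1024
    start = index * shard_bytes
    return start, min(start + shard_bytes, model_bytes)

size_7b_q4 = int(4.5 * 1024**3)       # ~4.5 GB full GGUF, as quoted above
assert shard_count(size_7b_q4) == 9   # splits into nine 512 MB shards
assert shard_byte_range(0, size_7b_q4) == (0, 512 * 1024 * 1024)
```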
Joining the Network
SwarmLLM works standalone, but connecting to peers unlocks distributed inference for larger models.
Automatic Discovery
SwarmLLM finds peers automatically:
- Same network (LAN): mDNS discovers peers on the same Wi-Fi/LAN in seconds.
- Returning users: Previously-seen peers are remembered and reconnected on startup.
- Peer exchange: Connected peers share their peer lists with you.
Invite Codes (Easiest)
- In the Dashboard, click "Share Network Code"
- Copy the encrypted code and share it with a friend
- They paste it into the "Join Network" field and click Join
- Both nodes connect immediately and start discovering the wider network
Invite codes are encrypted (ChaCha20-Poly1305) — your IP address is not visible in the code itself. Anyone with the full code can decode it, but the IP can't be extracted by casual inspection.
Manual Bootstrap
./swarmllm run --bootstrap "/ip4/203.0.113.50/udp/8800/quic-v1/p2p/12D3KooW..."
Or in your config file:
[network]
bootstrap_peers = ["/ip4/203.0.113.50/udp/8800/quic-v1/p2p/12D3KooW..."]
Private Networks
To run a private cluster that doesn't mix with the public network:
[network]
gossip_network_id = "my-private-network"
Only nodes with the same gossip_network_id can communicate.
Firewall
SwarmLLM needs TCP port 8810 (P2P primary transport) and optionally UDP port 8800 (QUIC) open. If you're behind a router, either:
- Set up port forwarding (TCP 8810 + UDP 8800 to your machine's local IP)
- Rely on SwarmLLM's built-in relay (works automatically in most cases)
Configuration
SwarmLLM works out of the box with sensible defaults. This section covers customization.
Config Priority
Settings are read from four sources, in order of priority:
- Command-line flags (highest) — e.g., `--port 9000`
- Environment variables — e.g., `SWARMLLM_NODE_LISTEN_PORT=9000`
- Config file — `config.toml` in your data directory
- Built-in defaults (lowest)
Provider API keys have an additional source: a .env file in the data directory or current working directory. Standard env var names are used (OPENAI_API_KEY, ANTHROPIC_API_KEY, DEEPSEEK_API_KEY, etc.). The .env file does not override existing environment variables or keys already set via the dashboard.
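The no-override rule can be illustrated with a minimal sketch (`load_dotenv_no_override` is a hypothetical helper mimicking the documented behavior, not SwarmLLM code):

```python
def load_dotenv_no_override(lines: list[str], env: dict[str, str]) -> None:
    """Sketch of the documented .env semantics: a key from the file is applied
    only when it is not already present in the environment."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        env.setdefault(key.strip(), value.strip())  # existing values win

env = {"OPENAI_API_KEY": "sk-from-shell"}
load_dotenv_no_override(["OPENAI_API_KEY=sk-from-file", "DEEPSEEK_API_KEY=sk-new"], env)
assert env["OPENAI_API_KEY"] == "sk-from-shell"  # shell value not overridden
assert env["DEEPSEEK_API_KEY"] == "sk-new"       # missing key filled from file
```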
Config File Location
| OS | Path |
|---|---|
| Linux | ~/.local/share/swarmllm/config.toml |
| macOS | ~/Library/Application Support/swarmllm/config.toml |
| Windows | %APPDATA%\swarmllm\config.toml |
Specify a custom path: --config /path/to/config.toml
Minimal Example
[node]
listen_port = 8800
contribution = "moderate"
[resources]
max_disk_mb = 50000
[identity]
region = "US"
[inference]
gpu_layers = 35
[auto_manage]
enabled = true
Chapters
- Config File Reference — Every option explained
- Shard-Only Mode — Distributed inference with partial models
- CLI Flags & Environment Variables — Command-line and env var reference
Config File Reference
Every configuration option, organized by section.
[node] — Basic Node Settings
| Option | Type | Default | Description |
|---|---|---|---|
| listen_port | integer | 8800 | Port for web dashboard and P2P networking |
| data_dir | path | Platform-specific | Where SwarmLLM stores data |
| contribution | string | "minimal" | Resource contribution: "minimal", "moderate", "maximum" |
[resources] — Resource Limits
| Option | Type | Default | Description |
|---|---|---|---|
| max_gpu_vram_mb | integer | 0 | Max GPU memory in MB. 0 = auto-detect |
| max_ram_mb | integer | 0 | Max system RAM in MB. 0 = auto |
| max_disk_mb | integer | 50000 | Max disk space in MB for model storage |
| max_bandwidth_mbps | integer | 0 | Max upload bandwidth. 0 = unlimited |
[resources.schedule] — Usage Schedule
| Option | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Enable scheduled resource reduction |
| reduced_hours_start | integer | 22 | Hour (0-23) to start reduced mode |
| reduced_hours_end | integer | 8 | Hour (0-23) to end reduced mode |
| reduced_contribution | string | "minimal" | Contribution level during reduced hours |
| prune_aggressiveness | string | "normal" | Shard pruning during reduced hours: "normal", "aggressive", "conservative" |
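Under one plausible reading of these defaults (the window wraps past midnight when start > end), the schedule check looks like:

```python
def in_reduced_hours(hour: int, start: int = 22, end: int = 8) -> bool:
    """True if `hour` (0-23) falls inside the reduced window.
    With the defaults 22 -> 8 the window wraps past midnight."""
    if start <= end:
        return start <= hour < end
    return hour >= start or hour < end

assert in_reduced_hours(23)      # 11 pm: reduced
assert in_reduced_hours(3)       # 3 am: reduced
assert not in_reduced_hours(12)  # noon: full contribution
```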
[network] — Networking
| Option | Type | Default | Description |
|---|---|---|---|
| bootstrap_peers | list | [] | Peer addresses to connect on startup |
| enable_mdns | boolean | true | LAN peer discovery |
| gossip_network_id | string | none | Custom network ID for private networks |
| peer_exchange | boolean | true | Share peer lists with connected nodes |
| enable_relay | boolean | true | Act as relay for peers behind firewalls |
| enable_relay_client | boolean | true | Use relays when behind a firewall |
| max_peers | integer | 200 | Max simultaneous peer connections |
| auto_relay | boolean | true | Auto-use relay when NAT detected |
| relay_max_circuit_duration_secs | integer | 3600 | Max relay circuit duration |
| relay_max_circuits | integer | 16 | Max relay circuits to serve |
| enable_encryption | boolean | true | E2E encryption for tensor forwards and control messages |
| enable_autonat | boolean | true | NAT detection. Disable on WSL2 to reduce noise |
| enable_dcutr | boolean | true | Hole punching. Disable on WSL2 to reduce noise |
| tensor_compression | boolean | true | Zstd compression for tensor payloads |
| prefix_kv_compression | boolean | false | Zstd compression for cross-node prefix-KV snapshot wire frames. Default off — meaningful win on WAN where wire size is the bottleneck; roughly neutral on localhost. Receivers always decompress regardless of this flag. |
| tensor_compress_level | integer | 1 | Zstd compression level (1-22, 1 = fastest). Shared between tensor and prefix-KV. |
| tensor_compress_threshold | integer | 1024 | Min payload bytes before compression. Shared between tensor and prefix-KV. |
[inference] — AI Model Inference
| Option | Type | Default | Description |
|---|---|---|---|
| default_model | string | "" | Default model. Empty = first available |
| session_timeout_seconds | integer | 600 | Chat session memory lifetime (10 min) |
| max_concurrent_requests | integer | 10 | Max parallel requests |
| model_path | path | none | Path to a GGUF model file |
| gpu_layers | integer | 0 | Layers to offload to GPU. 0 = CPU only |
| kv_cache_ttl_secs | integer | 600 | KV-cache lifetime |
| max_batch_size | integer | 1 | Max request batch size. 1 = no batching. When > 1, both local and remote forward requests batch together via BatchForwarder, filling pipeline bubbles in distributed inference |
| batch_timeout_ms | integer | 50 | Ms to wait for additional requests before dispatching a partial batch. 0 = dispatch immediately (purely opportunistic batching) |
| speculative_decoding | boolean | false | Enable speculative decoding |
| speculative_gamma | integer | 4 | Draft tokens per verification step |
| draft_model_path | path | none | Path to draft model |
| max_split_model_memory_mb | integer | none | Max GPU memory for split model cache |
| tp_max_latency_ms | integer | 10 | Max peer latency (ms) for tensor parallelism groups |
| local_embedding_privacy | boolean | false | Embed tokens locally before sending to first segment. Remote nodes never see raw token IDs |
| encrypted_pipeline | boolean | false | Force first+last segment to local node (boomerang topology). No remote sees plaintext. Adds ~1 RTT/token. Per-model override via API. Requires shard 0 + final shard locally |
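How `max_batch_size` and `batch_timeout_ms` interact can be sketched with a toy queue. This illustrates opportunistic batching in general, not the actual BatchForwarder implementation:

```python
import queue
import time

def collect_batch(q: "queue.Queue[str]", max_batch: int = 4, timeout_ms: int = 50) -> list[str]:
    """Block for the first request, then wait up to timeout_ms for more,
    dispatching early once max_batch requests have arrived."""
    batch = [q.get()]  # always block for at least one request
    deadline = time.monotonic() + timeout_ms / 1000
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timeout_ms elapsed: dispatch the partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

q: "queue.Queue[str]" = queue.Queue()
for req in ("req1", "req2", "req3"):
    q.put(req)
assert collect_batch(q) == ["req1", "req2", "req3"]  # all three fit in the window
```

With `timeout_ms=0` the helper returns immediately after the first request, matching the documented "purely opportunistic" behavior.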
[logging] — Log Output
| Option | Type | Default | Description |
|---|---|---|---|
| level | string | "info" | Log level: "error", "warn", "info", "debug", "trace" |
| format | string | "pretty" | Log format: "pretty" or "json" |
| file | path | none | Write logs to file |
[ui] — Web Interface
| Option | Type | Default | Description |
|---|---|---|---|
| open_browser_on_start | boolean | true | Open dashboard on launch |
| theme | string | "dark" | Color theme: "dark" or "light" |
[api] — API Authentication
| Option | Type | Default | Description |
|---|---|---|---|
| api_key | string | none | Bearer token. Empty = auto-generated |
| rate_limit_rpm | integer | 60 | Rate limit for /v1/ endpoints (requests/min) |
| rate_limit_admin_rpm | integer | 200 | Rate limit for /api/admin/ endpoints (requests/min) |
[model] — Model Storage
| Option | Type | Default | Description |
|---|---|---|---|
| shard_size_mb | integer | 512 | Shard size in MB. Range: 64-2048 |
[auto_manage] — Automatic Shard Management
| Option | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | true | Auto-download popular shards (only for models at DemandVerified+ or Pinned trust level) |
| max_storage_mb | integer | 0 | Max disk for auto-downloads. 0 = 50% of max_disk_mb |
| interval_minutes | integer | 5 | Check interval for new shards |
| max_shards | integer | 0 | Max shards. 0 = unlimited |
| max_concurrent_downloads | integer | 3 | Max parallel downloads |
| prune_enabled | boolean | true | Auto-remove over-replicated shards |
| min_replicas | integer | 2 | Min network replicas before pruning |
| prune_cooldown_secs | integer | 300 | Seconds between prune actions per model |
| max_holder_load_for_prune | integer | 3 | Block pruning if holders are busy |
[pool] — Device Pool
| Option | Type | Default | Description |
|---|---|---|---|
| max_pool_size | integer | 10 | Max devices in a pool |
| invitation_ttl_hours | integer | 24 | Invitation validity period |
| rate_limit_per_hour | integer | 10 | Max pool operations per hour |
| gossip_interval_secs | integer | 600 | Pool state gossip interval |
| private_mode | bool | false | Restrict inference to pool members only. Toggleable at runtime via API/UI |
| private_mode_allow_lan | bool | true | Also allow LAN peers (mDNS-discovered) when private mode is on |
| offline_mode | bool | false | Air-gapped: no bootstrap peers, no HF downloads, mDNS-only discovery |
[pool.credit_rates] — Credit Rates
| Option | Type | Default | Description |
|---|---|---|---|
| inference_serve | integer | 10 | Credits earned per layer per token served |
| inference_consume | integer | 10 | Credits spent per layer per token consumed |
| shard_hosting | integer | 1 | Credits per GB per hour hosting |
| shard_seeding | integer | 5 | Credits per GB seeding |
| relay_service | integer | 2 | Credits per connection hour relaying |
| penalty_serve_failure | integer | 50 | Credits deducted per failure |
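Read literally, the serving rate compounds per layer and per token. A hypothetical worked example (the helper name is illustrative, not SwarmLLM's API):

```python
INFERENCE_SERVE_RATE = 10  # default credits per layer per token served

def serve_credits(layers: int, tokens: int, rate: int = INFERENCE_SERVE_RATE) -> int:
    """Credits a node earns for serving `layers` of a pipeline across
    `tokens` generated tokens, under a literal reading of the rate table."""
    return layers * tokens * rate

# A node hosting 14 layers of a 28-layer model, for a 100-token completion:
assert serve_credits(14, 100) == 14_000
```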
[updates] — Auto-Update
| Option | Type | Default | Description |
|---|---|---|---|
| auto_update | string | "stable" | Policy: "disabled", "stable", "all" |
| check_interval_hours | integer | 6 | Update check frequency |
[identity] — Your Identity
| Option | Type | Default | Description |
|---|---|---|---|
| region | string | none | Country code for network map (e.g., "US") |
[providers.claude_subscription] — Claude Subscription (feature-gated)
Requires `--features claude-subscription` at build time. Managed via the dashboard or `PUT /api/admin/providers`.
| Option | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Route claude-* model requests through the local CLI |
| claude_binary | string | "claude" | Path to the claude binary |
| default_model | string | none | Override model for all requests |
| max_concurrent | integer | 3 | Maximum concurrent subprocess invocations |
| timeout_secs | integer | 300 | Per-request timeout in seconds |
| working_dir | string | (temp dir) | Working directory for the subprocess. Empty or "none" uses system temp dir (recommended for API proxy use). Set to a project path for context-aware responses. |
Shard-Only Mode
SwarmLLM supports shard-only operation — a node only needs individual shard files (~512 MB each) plus a small GGUF header (~6 MB), not the full model file.
How It Works
A model directory in shard-only mode:
~/.local/share/swarmllm/models/qwen2.5-coder-7b/
├── manifest.json # Model metadata + shard layout
├── gguf_header.bin # First ~6MB of GGUF (metadata + tensor index)
├── shard_000.bin # 512MB shard
├── shard_001.bin
├── shard_002.bin
└── ...
SwarmLLM automatically extracts gguf_header.bin from shard_000.bin when first needed. The ShardReader constructs a virtual GGUF from header + shard files, so the model parser works exactly as if the full GGUF were present.
Why This Matters
- A 7B model is ~4.5 GB as a full GGUF, but a single shard is only ~512 MB
- Nodes only load the layers they're assigned — no wasted disk or VRAM
- You can participate in inference for a 70B model on a machine with 8 GB VRAM by hosting just a few shards
Manual Shard Assignment (--shards)
For multi-node split inference, assign each node a subset of shards:
./swarmllm run --shards "0-3" # This node handles shards 0, 1, 2, 3
The range is persisted to the database and restored on subsequent runs. Start without --shards to clear.
Behavior when --shards is set:
- The node only advertises the specified shard indices
- Auto-manage prioritizes downloading missing shards in the range (100x scoring bonus)
- Smart pruning never removes shards in the configured range
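The documented examples use simple inclusive ranges. A sketch of that grammar (`parse_shard_range` is illustrative; the binary's actual parser may accept more forms):

```python
def parse_shard_range(spec: str) -> list[int]:
    """Expand a --shards style spec like "0-3" into explicit shard indices.
    Single values ("5") are accepted too; ranges are inclusive on both ends,
    matching the documented example "0-3" -> shards 0, 1, 2, 3."""
    if "-" in spec:
        lo, hi = spec.split("-", 1)
        return list(range(int(lo), int(hi) + 1))
    return [int(spec)]

assert parse_shard_range("0-3") == [0, 1, 2, 3]
assert parse_shard_range("4-7") == [4, 5, 6, 7]
assert parse_shard_range("5") == [5]
```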
Multi-Node Example
Run a 7B model across two machines:
# Machine A (shards 0-3, layers 0-13):
./swarmllm run --shards "0-3" --bootstrap "/ip4/MACHINE_B_IP/udp/8800/quic-v1/p2p/PEER_ID"
# Machine B (shards 4-7, layers 14-27):
./swarmllm run --shards "4-7" --bootstrap "/ip4/MACHINE_A_IP/udp/8800/quic-v1/p2p/PEER_ID"
Both nodes discover each other, assemble a distributed pipeline, and forward hidden-state activations between them. The pipeline is assembled automatically by the InferenceRouter.
Without --shards
If you don't specify --shards, the node auto-detects and advertises all local shards. This is the normal mode for most users — --shards is only needed when you want explicit control over which layers a node handles.
CLI Flags & Environment Variables
CLI Flags
| Flag | Short | Description |
|---|---|---|
| --port <PORT> | -p | Listen port |
| --data-dir <PATH> | -d | Data directory |
| --config <PATH> | -c | Config file path |
| --model <PATH> | -m | Path to a GGUF model file |
| --gpu-layers <N> | | Layers to offload to GPU |
| --bootstrap <ADDR> | | Bootstrap peer address (repeatable) |
| --shards <RANGE> | | Shard range for split inference (e.g., "0-4") |
| --verbose | -v | Increase log verbosity (-v, -vv, -vvv) |
Subcommands
| Command | Description |
|---|---|
| run | Start the daemon (default if no subcommand) |
| status | Query running daemon status |
| chat | Interactive CLI chat with streaming |
| bench | Benchmark inference (tokens/sec, TTFT) |
| peers | List connected peers |
| pool | Device pool management (link your machines) |
| test-split | Test split inference locally (diagnostic) |
| version | Print version |
chat Options
| Flag | Default | Description |
|---|---|---|
| --model-name <NAME> | auto-detect | Model to chat with |
| --system <TEXT> | none | System prompt |
| --max-tokens <N> | 2048 | Max tokens per response |
| --temperature <F> | 0.7 | Sampling temperature |
bench Options
| Flag | Default | Description |
|---|---|---|
| --model-name <NAME> | auto-detect | Model to benchmark |
| --prompt <TEXT> | "Write a short essay..." | Benchmark prompt |
| --max-tokens <N> | 128 | Tokens to generate |
| --iterations <N> | 1 | Number of benchmark runs |
pool Subcommands
Link your personal devices so credits are combined on one main machine.
| Command | Description |
|---|---|
| pool create --name "My Devices" | Create a device group (this machine becomes the main device) |
| pool invite-code | Generate an 8-character invite code to share |
| pool join <CODE> | Link this device using a code from your main machine |
| pool status | Show linked devices, credits, and online status |
| pool leave | Unlink this device from the group |
Example flow:
# Main device:
swarmllm pool create --name "My Devices"
swarmllm pool invite-code # → A3F7K2M9
# On each other device:
swarmllm pool join A3F7K2M9
Note: This links YOUR own devices. It's different from connecting to the SwarmLLM network (which uses `swarm://` peer addresses).
Environment Variables
Every config option can be set via SWARMLLM_ prefix:
| Config Path | Environment Variable |
|---|---|
| node.listen_port | SWARMLLM_NODE_LISTEN_PORT |
| node.data_dir | SWARMLLM_NODE_DATA_DIR |
| logging.level | SWARMLLM_LOGGING_LEVEL |
| inference.model_path | SWARMLLM_INFERENCE_MODEL_PATH |
| inference.gpu_layers | SWARMLLM_INFERENCE_GPU_LAYERS |
Example:
SWARMLLM_NODE_LISTEN_PORT=9000 SWARMLLM_LOGGING_LEVEL=debug ./swarmllm run
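The mapping in the table is mechanical: uppercase the config path and swap dots for underscores under a `SWARMLLM_` prefix. A one-line sketch:

```python
def env_var_for(config_path: str) -> str:
    """Derive the SWARMLLM_ environment variable name for a config path,
    following the pattern shown in the table above."""
    return "SWARMLLM_" + config_path.upper().replace(".", "_")

assert env_var_for("node.listen_port") == "SWARMLLM_NODE_LISTEN_PORT"
assert env_var_for("inference.gpu_layers") == "SWARMLLM_INFERENCE_GPU_LAYERS"
```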
Provider API Keys via Environment
Cloud provider API keys use standard environment variable names:
| Provider | Environment Variable |
|---|---|
| OpenAI | OPENAI_API_KEY |
| Anthropic | ANTHROPIC_API_KEY |
| DeepSeek | DEEPSEEK_API_KEY |
| Mistral | MISTRAL_API_KEY |
| Groq | GROQ_API_KEY |
| NVIDIA NIM | NVIDIA_NIM_API_KEY |
| Cerebras | CEREBRAS_API_KEY |
| SambaNova | SAMBANOVA_API_KEY |
| Fireworks | FIREWORKS_API_KEY |
| Together | TOGETHER_API_KEY |
| DeepInfra | DEEPINFRA_API_KEY |
| Moonshot/Kimi | MOONSHOT_API_KEY |
These can also be placed in a .env file in your data directory:
# ~/.local/share/swarmllm/.env
OPENAI_API_KEY=sk-proj-...
DEEPSEEK_API_KEY=sk-...
NVIDIA_NIM_API_KEY=nvapi-...
The .env file is loaded at startup. It does not override existing environment variables or keys already configured via the dashboard/database. The dashboard settings UI shows "From .env" for keys loaded this way.
Troubleshooting
Can't Connect to Peers
Check the bootstrap address format:
/ip4/203.0.113.50/udp/8800/quic-v1/p2p/12D3KooW...
Firewall: SwarmLLM needs TCP port 8810 (P2P) and optionally UDP port 8800 (QUIC) open.
- Linux: `sudo ufw allow 8810/tcp && sudo ufw allow 8800/udp`
- Windows: Windows Defender Firewall > Inbound Rules > New > Port > TCP 8810 + UDP 8800
- macOS: System Settings > Network > Firewall > allow SwarmLLM
Same LAN? Use local IP (e.g., 192.168.1.x). LAN peers should be found automatically via mDNS.
Model Download Stuck
- Check disk space — a 7B model needs ~4-5 GB free
- Verify internet access to https://huggingface.co
- Cancel and retry from the Dashboard
- Start with `-v` for verbose logs: `./swarmllm run -v`
- Try a smaller model first (TinyLlama, ~700 MB)
GPU Not Detected
- Verify GPU works: `nvidia-smi`
- Install NVIDIA drivers if needed
- Enable GPU offloading: `./swarmllm run --gpu-layers 99`
WSL2 users: The CUDA driver comes from your Windows NVIDIA driver. Check that /usr/lib/wsl/lib/libcuda.so.1 exists and add to your ~/.bashrc:
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/wsl/lib:$LD_LIBRARY_PATH
Port Already in Use
./swarmllm run --port 9000 # Use a different port
lsof -i :8800 # Find what's using 8800
./swarmllm status # Check if another instance is running
Slow First Request
If the first inference request to a model takes noticeably longer than subsequent ones, this is expected. SwarmLLM uses on-demand model loading — models whose shards are on disk but not loaded into VRAM are loaded when first requested. If VRAM is full, an LRU eviction occurs first. Subsequent requests to the same model will be fast.
Slow Inference
- GPU vs CPU: CPU is 5-20x slower. Check Dashboard for GPU status.
- Model too large: Use Q4 quantization, match model size to VRAM.
- Enable batching: Set `max_batch_size = 4` in config.
Database Corrupted
# Back up first
cp -r ~/.local/share/swarmllm ~/.local/share/swarmllm-backup
# Delete database (models and config are preserved)
rm ~/.local/share/swarmllm/db.redb
# Restart
./swarmllm run
GPU Out of Memory
If a model exceeds your GPU's VRAM, SwarmLLM automatically falls back to CPU inference. You'll see this in the logs:
WARN GPU OOM detected, retrying on CPU
CPU inference is 5-20x slower but works for any model size. To avoid OOM:
- Use smaller quantizations (Q4 instead of Q8)
- Use a model that fits in VRAM (check model size vs available VRAM in the dashboard)
- For models too large for one GPU, use distributed inference across multiple nodes
Distributed Inference Issues
Peers visible but inference fails:
- Ensure both nodes have the required shards loaded (check Dashboard > Models)
- Verify P2P TCP connectivity: port `<base_port> + 10` must be reachable
- Run with `-vv` and filter: `./swarmllm run -vv 2>&1 | grep "DIAG:"`
- Check for `DIAG: segment TIMED OUT` — indicates a network or compute bottleneck
High latency per token:
- Distributed inference adds ~20-130ms per token for network round-trips
- Use TCP bootstrap addresses (not QUIC) for lowest latency
- Ensure nodes are on the same LAN for tensor parallelism
Pipeline assembly fails:
- The scheduler needs enough shard coverage to build a complete pipeline
- Check
DIAG: assemble_pipeline_forfor candidate counts
Inference fails with "peer never acknowledged" or "silent drop":
- A `SendDirectMessage` was issued but neither a Response nor an `OutboundFailure` event arrived from libp2p within 10 s (`RR_ACK_TIMEOUT_SECS`). Treated as a transient failure: the router automatically retries once with a fresh pipeline assembly that filters out the unreachable peer. If the retry also fails, the user sees the error within ~20 s (vs the 120 s `FIRST_TOKEN_TIMEOUT`).
- Most common cause: the target peer was killed or partitioned and the local libp2p connection state hasn't yet caught up.
- Look for `DIAG: rr ACK timeout — closing streaming caller` in the logs to confirm the fast-fail path engaged.
Concurrent requests stall when only some get dispatched:
- Per-tier concurrency caps come from `inference.max_concurrent_requests` (default 10): Bronze=2, Silver=5, Gold=10, Platinum=20. Excess requests queue until prior ones complete. To raise: bump the config knob or earn credits to climb tiers.
- If queued requests don't dispatch even after others complete, check for a missed `queue_notify.notify_one()` after `active_count.fetch_sub(1)` (should never happen on `main`; was a real regression fixed in `da6f485`).
Cross-Node Prefix-KV Sharing
The cross-node prefix fetch is default-on. Expected logs on a successful first hit of a peer's cached prefix:
B: DIAG: cross-node prefix HIT — hydrated KV matched_tokens=N total_tokens=M
A: DIAG: served PrefixKvFetch ... hit=true
I never see cross-node prefix HIT:
- Only fires on iter 1 of a prompt whose prefix your local node hasn't prefilled yet. Iter 2/3 hit the local cache (populated by iter 1).
- Check that the peer even announced the prefix: look for `DIAG: PrefixCacheAnnounce indexed node_id=... blocks=N` in your log. No announce → the peer's gossip never reached you (check `grep 'Published message to GossipSub' | grep 'swarm/models'`).
- Check that the peer passes the trust gate: the default `cross_node_prefix_trust_min = 0.5` equals `DEFAULT_TRUST`, so a freshly-seen peer should just barely pass. Any misbehavior drops it below.
I see `prefix-probe: fetch timed out`:
- The peer didn't return a snapshot inside the worker-probe window (3000 ms by default). On a large model (7B+) with cold CPU this can happen if the snapshot is >100 MB. The path degrades to local prefill — no worse than not having the feature. The current 3000/2500/2000 ms chained timeouts are sized for 7B-class snapshots; the older 500/400/500 ms values were TinyLlama-sized and forced a fallback to local prefill on larger models.
I see `rejected KV snapshot — penalizing peer trust`:
- The returned snapshot failed BLAKE3 reverification or contained NaN/Inf. Three rejection reasons:
  - `hash_chain_mismatch` → `prefix_cache_block_tokens` differs between nodes (default 64, common alternatives 32/128)
  - `non_finite_tensors` → GPU overflow on the serving side
  - `deserialize_failed` → wire corruption — open an issue
Disable cross-node fetch entirely:
Set `inference.cross_node_prefix_trust_min = 2.0` in `config.toml`. The probe never fires because no peer passes the trust gate.
Running the Test Suite
SwarmLLM ships 943 lib tests + 75 integration tests + VLM E2E.
# Run all tests (release, used in CI)
cargo test --release
# Unit tests only (fastest feedback loop)
cargo test --lib
# Integration tests only
cargo test --test '*'
# A specific test by name substring
cargo test --release prefix_cache
# With CUDA features on (requires NVIDIA GPU)
cargo test --release --features candle-cuda
If a test fails, the release build shows the name + line; rerun with
--nocapture to see its stderr:
cargo test failing_test_name -- --nocapture
Integration tests under tests/integration/ simulate multi-node P2P on
loopback — they're the slow ones, and CI runs them with
--test-threads=1 to avoid port contention.
See Benchmarking for reproducing the performance benchmarks and Performance for which knobs turn each speedup on/off.
Model Trust
Models go through trust levels: Discovered → Pinned → DemandVerified → NetworkPopular. Auto-manage only downloads shards for models at sufficient trust levels.
Model stuck at "Discovered":
- Pin it manually from the Dashboard to promote to "Pinned"
- Models reach "DemandVerified" after receiving inference requests
- Models reach "NetworkPopular" when enough peers host them
Still Stuck?
- Run with full diagnostics: `./swarmllm run -vv 2>&1 | grep "DIAG:"`
- See the Diagnostics Guide for detailed log instrumentation
- Check GitHub Issues
- Open a new issue with: OS, hardware, `./swarmllm version` output, and logs from `-vv`
System Overview
SwarmLLM is a single Rust binary that simultaneously functions as:
- A P2P network node — connects to peers over TCP (Noise+Yamux) and QUIC/UDP using libp2p
- An HTTP API server — serves OpenAI + Anthropic-compatible endpoints, MCP server, and cloud provider proxy via Axum
- A web dashboard — embedded frontend (component-based vanilla HTML/CSS/JS, 11 HTML templates, no build step)
All three share a single port (default 8800) and a common Arc<SharedState>.
┌──────────────────────────────────────────────────────────┐
│ swarmllm binary │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ P2P │ │ HTTP API │ │ Admin UI │ │
│ │ Node │ │ Server │ │ (embedded) │ │
│ │(TCP+QUIC)│ │ (Axum) │ │ │ │
│ └────┬─────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ ┌────┴───────────────┴─────────────────┴─────────────┐ │
│ │ Shared State (Arc) │ │
│ │ DashMap<NodeId, PeerInfo> — peer registry │ │
│ │ ModelRegistry — models + shards │ │
│ │ state.events (EventBus) — activity + dashboard│ │
│ │ state.credits (CreditPool) — balance + pool │ │
│ │ state.models (ModelMgmt) — acquisition + trust │ │
│ │ state.metrics (MetricsProviders)— stats + providers │ │
│ └────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
Key Design Decisions
- Config priority: CLI flags > env vars (`SWARMLLM_` prefix) > `config.toml` > defaults
- Data directory: `~/.local/share/swarmllm/` (Linux), `~/Library/Application Support/swarmllm/` (macOS), `%APPDATA%\swarmllm\` (Windows)
- Port layout: HTTP API on TCP:port, P2P TCP on port+10 (Noise+Yamux), P2P QUIC on UDP:port
- Shard-only: nodes never need a full GGUF; shards are downloaded individually
- No blockchain: the credit system uses dual-signed transactions, not a token or chain
Technology Stack
| Component | Library |
|---|---|
| Async runtime | Tokio (multi-threaded) |
| P2P networking | libp2p 0.56 (Kademlia, GossipSub, QUIC) |
| HTTP server | Axum 0.8 |
| Tensor compute | candle-core/candle-transformers |
| GGUF inference | llama-cpp-2 (optional backend) |
| Cryptography | ed25519-dalek, x25519-dalek, chacha20poly1305 |
| Content hashing | BLAKE3 |
| Database | redb (pure-Rust, ACID, single-file) |
| Concurrent maps | DashMap 6 |
Daemon & Subsystems
The daemon spawns 12 Tokio tasks wired together with mpsc channels:
┌──────────────┐
│ daemon/ │
│ (bootstrap) │
└──────┬───────┘
│ spawns tokio tasks
┌───────┬───────┬───────┬───────┼───────┬──────────┬──────────┬──────────┬──────────┬──────────┬─────┐
▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼
Network Infer Credit Health API Rebal- Acquisi- Message Pool AutoShrd HfWat- Update
Manager Router Ledger Monitor Server ancer tion Mgr Dispatch Manager Manager cher Checker
Subsystem Responsibilities
| Subsystem | File | Role |
|---|---|---|
| NetworkManager | src/network/manager/ | libp2p swarm: Kademlia DHT + GossipSub + request/response |
| InferenceRouter | src/inference/router/ | Request queuing, pipeline assembly, execution coordination |
| MessageDispatcher | src/daemon/dispatch/mod.rs | Routes inbound network messages to appropriate subsystems |
| CreditLedger | src/credit/ledger.rs | Credit balance tracking, transaction signing, gossip |
| HealthMonitor | src/health/monitor.rs | Periodic health pings, rebalancing triggers |
| ShardRebalancer | src/health/rebalancer.rs | Shard redistribution on node join/leave |
| AcquisitionManager | src/model/acquisition.rs | BLAKE3-verified model downloads from peers and HuggingFace |
| ApiServer | src/api/server.rs | Axum HTTP: OpenAI + Anthropic APIs + MCP server + admin dashboard + WebSocket |
| PoolManager | src/pool/manager/ | Device pool management, credit forwarding |
| AutoShardManager | src/model/auto_manage/ | VRAM-aware shard acquisition + smart pruning (manager, scoring, download, prune, scan, vram, wishlist). R111: refreshes the user-visible wishlist at the end of every tick. |
| HfWatcher (R112) | src/model/huggingface/watcher.rs | Background task polling HuggingFace's trending GGUF feed once per hour. Caches the snapshot on state.models.hf_trending_cache (consumed by the wishlist scorer) and auto-promotes models above 100k downloads + 24h age from Discovered to DemandVerified. NonCritical — HF outages don't escalate to a daemon crash. Opt-out via auto_manage.hf_watcher_enabled = false. |
| UpdateChecker | src/update.rs | Periodic GitHub release polling, SHA256-verified binary download, atomic apply. Skipped entirely when auto_update = "disabled" (default until binary signing C1 lands), so the supervisor doesn't log a misleading "exited unexpectedly" warning. |
Channel Layout
| From | To | Message Types |
|---|---|---|
| NetworkManager | MessageDispatcher | All inbound SwarmMessage variants |
| MessageDispatcher | InferenceRouter | InferenceRequest, LayerForward, LayerResult |
| InferenceRouter | NetworkManager | Outgoing P2P messages |
| HealthMonitor | ShardRebalancer | RebalanceEvent |
| ApiServer | InferenceRouter | RouterCommand (from HTTP) |
| ApiServer | AcquisitionManager | AcquisitionCommand |
| AutoShardManager | AcquisitionManager | AcquisitionCommand |
| CreditLedger | NetworkManager | CreditGossip, CreditTransaction |
| MessageDispatcher | (spawned task) | VisionEncodeRequest → handler → VisionEncodeResponse |
Broadcast Channels
| Channel | Type | Subscribers | Purpose |
|---|---|---|---|
activity_tx | broadcast::Sender<ActivityEvent> (256) | WebSocket | Unified event bus — all subsystem events (shard ops, downloads, inference, pool, config changes). Events carry toast_level for frontend toast control. History replayed to new WS clients. |
dashboard_tx | broadcast::Sender<DashboardSignal> (32) | WebSocket | Dashboard refresh signals — PeersChanged (peer connect/disconnect), ModelsChanged (shard download/load/prune), UpdateAvailable(UpdateInfo) (new version). |
Note: former separate channels (`prune_events_tx`, `models_changed_tx`, `lan_discovery_tx`, `system_notify_tx`, `peer_list_changed_tx`, `update_tx`) were consolidated into these two in the event-system unification.
Startup Sequence
- Parse CLI args (clap)
- Initialize tracing subscriber
- Load/create config (TOML + env + defaults + CLI overrides)
- Ensure data directory exists
- Load/generate Ed25519 identity
- Open redb database
- Build `Daemon { config, identity, db }`
- Initialize ModelExecutor (load GGUF if `--model` provided)
- Build `Arc<SharedState>` (includes ModelRegistry from DB)
- Scan local shards, register in registries
- Create mpsc channels
- Spawn all 12 tasks
- Open browser if configured
- `tokio::select!` on Ctrl+C or task exit
- Graceful shutdown: save peer cache, flush database
Graceful Shutdown
Shutdown is triggered by Ctrl+C (SIGINT/SIGTERM) or any task exiting:
- A `watch` channel signals all subsystems
- Peer cache is saved to redb
- Database is flushed
- Open connections are drained
Networking & Discovery
Transport Stack
libp2p Swarm
├── Kademlia (DHT) — distributed hash table for peer/shard/model lookup
├── GossipSub — pub/sub for shard/health/credits/identity/pools/regions
├── request_response — unified protocol (/swarmllm/1.0.0, 600s timeout)
├── mDNS — optional LAN peer discovery
├── connection_limits — max 1/peer (>1 causes rr round-robin to dead connections), 500 total
├── Identify — protocol identification
├── AutoNAT — NAT detection
├── DCUtR — hole punching
└── relay::client — circuit relay
Protocol Format
The unified protocol uses a type-tag byte on every frame
(src/network/protocol/mod.rs):
| Tag | Constant | Use |
|---|---|---|
0x00 | WIRE_TAG_JSON | JSON control message (SwarmMessage, ShardRequest/ShardResponse) |
0x01 | WIRE_TAG_TENSOR | Binary tensor payload (LayerForward, LayerResult), f16 |
0x02 | WIRE_TAG_TENSOR_COMPRESSED | Q8_0 activation frame (flag-gated activation_compression) — ~3.76× smaller than 0x01 |
0x03 | WIRE_TAG_SHARD | Raw shard bytes (ShardResponse payload, 32 MB max — bypasses the 4 MB JSON cap) |
0x04 | WIRE_TAG_PREFIX_KV | Cross-node prefix-KV snapshot. Frame body's flag byte: 0 = miss, 1 = raw f32, 2 = zstd-compressed f32 (gated on NetworkConfig::prefix_kv_compression, default off). Receivers always decompress regardless of the send-side flag. |
Receivers auto-dispatch on the leading byte; senders choose based on
config + request kind. Only the 0x00 frame carries a JSON body; the
rest use binary framing with length prefixes.
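The receiver-side dispatch on the leading byte can be sketched as below. The tag constants match the table; the `Frame` enum and parsing are illustrative, not the real wire code.

```rust
// Dispatch an inbound frame on its leading type-tag byte.
// Tag values are from src/network/protocol/mod.rs; everything else
// here is an illustrative sketch.

const WIRE_TAG_JSON: u8 = 0x00;
const WIRE_TAG_TENSOR: u8 = 0x01;
const WIRE_TAG_TENSOR_COMPRESSED: u8 = 0x02;
const WIRE_TAG_SHARD: u8 = 0x03;
const WIRE_TAG_PREFIX_KV: u8 = 0x04;

#[derive(Debug, PartialEq)]
enum Frame<'a> {
    Json(&'a [u8]),
    Tensor(&'a [u8]),
    TensorCompressed(&'a [u8]),
    Shard(&'a [u8]),
    PrefixKv(&'a [u8]),
}

fn dispatch(frame: &[u8]) -> Option<Frame<'_>> {
    let (&tag, body) = frame.split_first()?;
    match tag {
        WIRE_TAG_JSON => Some(Frame::Json(body)),
        WIRE_TAG_TENSOR => Some(Frame::Tensor(body)),
        WIRE_TAG_TENSOR_COMPRESSED => Some(Frame::TensorCompressed(body)),
        WIRE_TAG_SHARD => Some(Frame::Shard(body)),
        WIRE_TAG_PREFIX_KV => Some(Frame::PrefixKv(body)),
        _ => None, // unknown tag: drop the frame
    }
}
```

An unknown tag yields `None`, matching the "drop at the wire" behavior described for unmatched variants.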
Discovery Stack
SwarmLLM uses 5 independent discovery layers:
- mDNS — discovers LAN peers in seconds. Config: `enable_mdns = true`
- Persistent Peer Cache — saves up to 200 peers every 5 min + on shutdown. Fastest reconnect.
- Invite Codes — format: `swarm://<base64url(key‖nonce‖encrypted_multiaddr)>`, encrypted with ChaCha20-Poly1305.
- Peer Exchange (PEX) — on each connection, exchanges up to 20 known peers.
- Kademlia DHT — Bootstrap flag + periodic re-bootstrap every 60s.
GossipSub Topics
Six topics, all subscribed at startup in discovery::subscribe_topics:
| Topic | Constant | Content |
|---|---|---|
swarm/models | TOPIC_MODELS | ShardAnnounce, ModelManifest, PrefixCacheAnnounce (cross-node prefix-KV index) |
swarm/health | TOPIC_HEALTH | HealthPing, NodeCapability (includes observed per-layer latencies for the Parallax scheduler), TpAllReduceResponse |
swarm/credits | TOPIC_CREDITS | CreditGossip, CreditTransaction |
swarm/identity | TOPIC_IDENTITY | NicknameGossip (signed) |
swarm/pools | TOPIC_POOLS | PoolMessage (PoolState, PoolInvitation, CreditForward) |
swarm/regions | TOPIC_REGIONS | RegionShardSummary (per-region shard availability for routing locality) |
The topic match in NetworkManager::handle_broadcast is
contract-not-default: a SwarmMessage variant with no topic arm falls
through _ => return and silently drops at the wire. Adding a new
gossip variant requires updating the match — an early multi-node test
caught PrefixCacheAnnounce missing from the TOPIC_MODELS arm, which
had silently dropped every cross-node prefix-cache announce at the
network layer until a two-daemon run flushed it out.
Messages older than 5 minutes are rejected (replay protection).
Cross-Node Prefix KV Sharing Dispatch
The cross-node prefix-cache fetch path uses the request_response
protocol, not gossip. The gossip layer only broadcasts which blocks
each peer holds (PrefixCacheAnnounce on swarm/models); the actual
snapshot transfer is a direct bilateral exchange:
- The requesting daemon sends `SwarmRequest::PrefixKvFetch` to the peer chosen by the probe resolver (trust-gated by `cross_node_prefix_trust_min`, default 0.5)
- The serving daemon runs `fetch_local_snapshot` against its own worker over IPC (2000 ms timeout) and gets the serialized bytes or `None`
- The serving daemon returns `SwarmResponse::PrefixKvData { present, payload }` with the bytes wrapped in the `WIRE_TAG_PREFIX_KV` frame on the binary payload slot (not in the JSON header — `serde_json` inflates `Vec<u8>` ~5× and blows past the 64 MiB IPC cap)
- The requesting daemon BLAKE3-reverifies + NaN/Inf-scans the bytes, then hands them to its worker to hydrate a `KvCacheEntry`
See Inference > Prefix-Cache KV Sharing for the full pipeline and measured numbers.
Anti-Gaming
- Subnet clustering detection: >5 nodes per /24 triggers a 25% spot-check rate (up from 5%)
- `SubnetClustering` trust penalty (-0.03 per cycle)
- Signed balance reports with timestamp freshness (5 min window)
- Gossip replay rejection (5 min window)
- `cross_node_prefix_trust_min` gates fetch peers at a minimum trust score (default 0.5, equal to `DEFAULT_TRUST`; set to 2.0 to disable cross-node fetch entirely)
Inference Pipeline
Subprocess-Per-Model Isolation
Each loaded model runs in its own swarmllm model-worker subprocess (Ollama-style). When a model is unloaded, the subprocess is killed and the OS + CUDA driver immediately reclaim all GPU memory — no daemon restart required.
Main daemon model-worker subprocess (one per model)
─────────────────────────────── ───────────────────────────────────────
ModelProcessPool.generate() ─────► loads shards from disk on first request
ModelProcessPool.forward() ─────► runs forward passes / full decode loop
◄───── streams WorkerMsg::Token / LayerResult
unload_model() ─────► kill process → OS frees all VRAM
IPC: Unix domain socket with binary framing — [4B json_len][json header][4B payload_len][raw tensor bytes]. JSON carries message metadata; the payload carries raw activation bytes to avoid base64 overhead.
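A minimal sketch of that frame layout, using only std. Little-endian length prefixes are an assumption here; check `worker_ipc.rs` for the actual byte order.

```rust
// IPC frame: [4B json_len][json header][4B payload_len][raw tensor bytes].
// Illustrative encode/decode pair; byte order is assumed little-endian.

fn encode_frame(json: &[u8], payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(8 + json.len() + payload.len());
    buf.extend_from_slice(&(json.len() as u32).to_le_bytes());
    buf.extend_from_slice(json);
    buf.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    buf.extend_from_slice(payload);
    buf
}

fn decode_frame(buf: &[u8]) -> Option<(&[u8], &[u8])> {
    let json_len = u32::from_le_bytes(buf.get(..4)?.try_into().ok()?) as usize;
    let json = buf.get(4..4 + json_len)?;
    let off = 4 + json_len;
    let payload_len =
        u32::from_le_bytes(buf.get(off..off + 4)?.try_into().ok()?) as usize;
    let payload = buf.get(off + 4..off + 4 + payload_len)?;
    Some((json, payload))
}
```

Keeping the tensor bytes in a separate binary slot is what avoids the base64 overhead mentioned above.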
Message types (src/inference/worker_ipc.rs):
| Message | Direction | Purpose |
|---|---|---|
DaemonMsg::Forward | daemon → worker | Single-step LayerForward (distributed inference) |
DaemonMsg::Generate | daemon → worker | Full prompt→tokens decode loop (API inference) |
DaemonMsg::Unload | daemon → worker | Drop a layer range (partial memory reclaim) |
DaemonMsg::Shutdown | daemon → worker | Graceful worker exit |
WorkerMsg::Token | worker → daemon | Streaming decoded token |
WorkerMsg::LayerResult | worker → daemon | Activation result for pipeline forwarding |
SplitModelEntry is metadata-only — it caches eos_tokens, vocab, chat_template, bos_token, and eos_token_str from the GGUF header without loading model weights. The weights live exclusively in the worker subprocess.
Worker granularity: one process per ModelId (not per shard). A single worker handles all layer ranges for a model and owns its own KvCacheStore. Individual shard unload uses DaemonMsg::Unload; the process exits only when all shards are released.
Split Inference Engine
The split inference engine (src/inference/split/) enables distributed inference using candle for direct tensor computation with quantized GGUF weights. Each node loads only its assigned transformer layers (in the worker subprocess), forwarding hidden-state activations between nodes. The module is split into: model.rs (SplitModel struct + accessors), loader.rs (GGUF/shard load), executor.rs (forward pass + tensor-parallel), kv_cache.rs, entry.rs, gguf_meta.rs, shard_reader.rs, rope.rs, prefix_cache.rs.
Client → API Server → InferenceRouter → Pipeline Assembly
│
┌───────────────────────┘
▼
┌──────────────────────┐
│ Pipeline Segment │ Token IDs (prefill)
│ Node A: Layers 0-15 │──── LayerForward ──►
└──────────────────────┘ │
┌───────────────┘
▼
┌──────────────────────┐
│ Pipeline Segment │
│ Node B: Layers 16-27 │── sample token ──►
└──────────────────────┘
Pipeline Assembly
- Fetch model manifest to determine layer ranges
- Pipeline affinity check: if multi-turn session has a previous pipeline and all nodes are still connected, reuse it (KV cache locality)
- Query model_registry.shard_holders for hosting nodes
- Liveness filter: drop holders that aren't in `connected_node_ids` (the libp2p truth — the DHT can re-inject providers for peers that just disconnected, and `peer_registry` is intentionally preserved across mid-pipeline disconnects for reconnect attempts)
- Fetch node load/latency from `peer_registry`
- Parallax scheduler: shortest-path dynamic programming over observed per-layer latencies (EMA over recent forwards), rather than a greedy latency-only sort. Cross-gossips the top-32 observed latencies via `NodeCapability.observed_latencies` so every node has a current view of the network's compute profile
- Assignment: widest contiguous layer range per node, merging on same-node
- Identify standby nodes per segment (failover)
- Send PipelineAssignment, wait for ACKs, begin forwarding
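The shortest-path idea behind the scheduler can be sketched as a toy dynamic program. Here `dp[l]` is the cheapest cost to cover layers `0..l`, extended by any candidate `(start, end, cost)` segment that begins at `l`. The real Parallax scheduler also folds in network hops, EMA-smoothed per-layer latencies, and node load; this is a simplification under those assumptions.

```rust
// Toy DP over layer coverage: find the minimum-cost chain of segments
// covering layers 0..num_layers. Segment = (start, end_exclusive, cost).

fn min_cost_pipeline(num_layers: usize, segments: &[(usize, usize, f64)]) -> Option<f64> {
    let mut dp = vec![f64::INFINITY; num_layers + 1];
    dp[0] = 0.0;
    for l in 0..num_layers {
        if dp[l].is_infinite() {
            continue; // layers 0..l not reachable yet
        }
        for &(start, end, cost) in segments {
            if start == l && end > start && end <= num_layers {
                let cand = dp[l] + cost;
                if cand < dp[end] {
                    dp[end] = cand; // cheaper way to cover 0..end
                }
            }
        }
    }
    if dp[num_layers].is_finite() { Some(dp[num_layers]) } else { None }
}
```

If no combination of segments covers every layer, the function returns `None`, mirroring "the scheduler needs enough shard coverage to build a complete pipeline."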
Failure Handling
The router applies a single retry on transient remote failures
(silent rr drops, OutboundFailure, remote-generate timeouts). The
retry passes preferred_pipeline = None so the scheduler re-runs
and the dead/dropped peer is filtered out via the liveness oracle
above. Failure of the second attempt propagates to the user with a
"try again" hint.
Independently, streaming-tracked SendDirectMessage sends carry a
delivery_request_id; if the receiver doesn't ACK within
RR_ACK_TIMEOUT_SECS (10s), the daemon closes the caller's
streaming channel — converting a 120s FIRST_TOKEN_TIMEOUT hang
into a fast-fail in ~10–20s. This handles the rare case where
libp2p request_response accepts a send_request call but never
delivers it (no OutboundFailure event fires).
Concurrent Request Throttling
Per-tier concurrency caps come from max_concurrent_requests
(default 10): Bronze=¼, Silver=½, Gold=1×, Platinum=2×. Requests
beyond the cap queue in the router. The queue is event-driven:
every active_count.fetch_sub(1) on completion is paired with
queue_notify.notify_one() so drain_queue wakes immediately.
Without that pairing, queued requests would sit indefinitely until
the next Submit arrived (a real bug found in stress testing — fix
in commit da6f485).
Pipeline affinity means that multi-turn conversations (with session_id) prefer to route through the same nodes, preserving KV-cache state and avoiding cold restarts on every turn.
The Parallax allocator also runs offline in AutoShardManager (Phase C.2)
with a soft acquire/prune bias driven by a per-shard stability counter
(≥3 consistent ticks of "this shard wants to move here" before it acts).
Hard constraints (pinning, trust gates, VRAM caps) always win.
Architecture Detection
The SplitModel loader reads general.architecture from GGUF metadata and applies per-architecture handling:
| Architecture | RoPE | QKV Biases | Special Handling |
|---|---|---|---|
| Llama | Interleaved | No | Default EOS=2 |
| Llama 4 | iRoPE (NoPE every 4th) | No | MoE FFN |
| Qwen2 | Contiguous | Yes | EOS 151643+151645 |
| Qwen 3.5 | Contiguous | No | Hybrid SSM+attention (Gated Delta Networks) |
| Gemma/Gemma2 | Interleaved | No | Embedding scaling (sqrt(d)), Gemma RmsNorm (+1), EOS 107, attention + final logit softcapping, Gemma chat template fallback |
| Phi-3 | Su/YaRN | Yes | Fused QKV/FFN tensors |
| Mistral | Interleaved | No | GQA |
| DeepSeek-V2/V3 | Contiguous | No | MLA attention + MoE FFN |
| GLM-4 | Contiguous | No | Partial RoPE, extreme GQA (16:1) |
| Starcoder2 | Interleaved | Yes | Code-optimized |
KV-Cache Management
- Per-request isolation via `DashMap<(ModelKey, RequestId), Cache>`
- Multi-turn reuse: `session_id` tracks conversations; prefix matching skips redundant prefill
- Configurable TTL (default 10 min)
- VRAM-aware LRU eviction for the split-model cache
Prefix-Cache KV Sharing (Cross-Node)
Each worker stores a local prefix-cache keyed by BLAKE3 chained hashes
over fixed-size token blocks (prefix_cache_block_tokens, default 64).
Blocks are announced to peers via SwarmMessage::PrefixCacheAnnounce
on the swarm/models gossipsub topic and indexed in
state.models.cross_node_prefix_index.
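The chained-hash keying can be illustrated with a short sketch: each block's key is a hash of the previous key plus the block's tokens, so two prompts share a key exactly as far as their token prefixes agree on block boundaries. Std's `DefaultHasher` stands in for BLAKE3 here purely for illustration; the key-derivation shape is an assumption about the real scheme.

```rust
// Illustrative chained block hashing over fixed-size token blocks.
// key[i] = H(key[i-1] ‖ block[i]); partial trailing blocks are not keyed.
// DefaultHasher is a stand-in for BLAKE3.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn chained_block_keys(tokens: &[u32], block_tokens: usize) -> Vec<u64> {
    let mut keys = Vec::new();
    let mut prev: u64 = 0;
    for block in tokens.chunks_exact(block_tokens) {
        let mut h = DefaultHasher::new();
        prev.hash(&mut h);
        block.hash(&mut h);
        prev = h.finish();
        keys.push(prev);
    }
    keys
}
```

The chaining is what makes `hash_chain_mismatch` possible when `prefix_cache_block_tokens` differs between nodes: the same prompt produces entirely different key sequences.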
When a local worker sees a prompt whose prefix it hasn't prefilled, it
emits WorkerMsg::PrefixFetchProbe; the daemon walks the index
(longest-match first), trust-gates candidate peers by
cross_node_prefix_trust_min (default 0.5), and issues a
SendPrefixKvFetch request-response to the best holder. The serving
daemon re-issues DaemonMsg::ExportPrefixSnapshot to its worker, which
narrows a stored KvSnapshot to the requested block boundary and returns
the serialized bytes in the IPC binary-payload slot. Back on the
requesting side, the bytes are BLAKE3-reverified against the requested
hash and NaN/Inf-scanned before hydrating a new KvCacheEntry for the
in-flight request, which then only has to prefill the suffix beyond the
cached block boundary.
Three chained timeouts (worker probe 3000 ms, daemon network 2500 ms, serving IPC 2000 ms — sized for 7B-class f32 snapshots) guarantee that a stuck peer degrades to a clean miss rather than blocking the request. See the Performance chapter for measured TTFT numbers on TinyLlama (GPU, corner case where fetch is slightly slower than prefill) vs Qwen2.5-7B (12.9× iter-1 TTFT speedup on CPU-CPU localhost).
Advanced Features
- Speculative Decoding — draft model proposes K tokens, target verifies in one pass (flag-gated `speculative_distributed`)
- SWIFT self-speculative — target model acts as its own draft by skipping a layer range (flag-gated `swift_self_speculative`)
- DSD (Decentralized Speculative Decoding) — multi-segment pipeline with γ-token speculation woven in (flag-gated `decentralized_spec_decoding`)
- Chunked Prefill — Sarathi-style: each Prefilling slot advances by `prefill_chunk_tokens` (default 128) per decode tick so a long admission can't block decode
- Continuous Batching — default-on: concurrent `Generate` requests share one `forward_batch` per decode tick; GPU uses a fused kernel, CPU falls through to sequential
- Batched Prefill Forward — default-on: fuses concurrent same-shape Prefilling chunks into one `forward_batch` call
- Remote-generate Fast Path — default-on: single-segment distributed inference runs the full decode loop on the remote worker instead of per-token coordinator round-trips (measured 1.93× decode speedup)
- Cross-request Prefix Cache — default-on: see "Prefix-Cache KV Sharing" above for the cross-node extension; the local cache alone is a 29.4× wall-clock win on prompt re-submission
- Activation Compression (Q8_0) — intermediate pipeline activations wire-quantized ~3.76× smaller (flag-gated `activation_compression`)
- Flash Attention — CPU and GPU fast paths (GQA-native, no `repeat_kv`)
- PagedAttention — deferred; the `paged-attn` feature flag is reserved for future use (module removed, never wired to production)
- Logprobs — per-token log probabilities via `sample_token_with_params_and_logprobs()`. When `logprobs: true` is set in the request, the sampling layer collects top-N token probabilities and returns them in the OpenAI-compatible response. Available on split-model (candle) inference paths
- Pipeline Error Broadcast — on distributed inference failure, `broadcast_pipeline_error()` notifies all participants so peers can update shard availability and route around failures
- Local Embedding Privacy — when `local_embedding_privacy: true`, the requesting node performs token→embedding locally (~1 ms) and sends pre-embedded hidden-state activations instead of raw token IDs to the first pipeline segment. Remote nodes never see the plaintext prompt. See Security > Local Embedding Privacy
- Encrypted Pipeline — when enabled (per-model or global), forces a "boomerang" topology: the requesting node handles both the first segment (embedding) and last segment (token sampling). Remote nodes only process intermediate activations — no remote node ever sees plaintext input or output. See Security > Encrypted Pipeline
Vision Language Models (VLM)
Distributed mmproj
The mmproj (vision encoder) is modeled as a sentinel shard (index = u32::MAX) decoupled from the text pipeline. Any node with mmproj can encode images — the router selects local → first-segment → any holder.
Image → JPEG compress → VisionEncodeRequest (remote) or encode locally
→ zstd+FP16 compressed embeddings
→ attached to first LayerForward (vision_embeddings field)
→ text pipeline processes as normal
Key types: VisionEncodeRequest, VisionEncodeResponse, LayerForward.vision_embeddings.
If no node has mmproj loaded, the API returns HTTP 503 (VisionEncoderUnavailable).
Tensor Wire Format
[4B ndim][4B×ndim shape][4B dtype_tag][f32 data]
For a 7B model (hidden_dim=3584):
- Prefill (14 tokens): ~200 KB
- Decode (1 token): ~14 KB
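An illustrative encoder for that layout, assuming little-endian fields and f32 data. The sizes above check out: one decode token at hidden_dim 3584 is 3584 × 4 ≈ 14 KB of data plus a 16-byte header.

```rust
// Tensor wire format sketch: [4B ndim][4B×ndim shape][4B dtype_tag][data].
// Byte order (little-endian) is an assumption.

fn encode_tensor(shape: &[u32], dtype_tag: u32, data: &[u8]) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(&(shape.len() as u32).to_le_bytes()); // ndim
    for &dim in shape {
        buf.extend_from_slice(&dim.to_le_bytes()); // one u32 per dimension
    }
    buf.extend_from_slice(&dtype_tag.to_le_bytes());
    buf.extend_from_slice(data); // raw tensor bytes
    buf
}
```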
Credit System
Credits are SwarmLLM's fairness mechanism — no blockchain, no token, just local accounting with dual-signed transactions. The system ensures contributors are rewarded and free-riders are deprioritized.
Earning & Spending
| Action | Credits | Notes |
|---|---|---|
| Serve inference (per token) | +10 | Balanced with consume side |
| Host shard (per GB per hour) | +1 | Hourly tick in CreditLedger |
| Seed shard data (per GB transferred) | +5 | Atomic counter, periodic drain |
| Relay traffic (per connection hour) | +2 | Circuit open/close tracking |
| Consume inference (per token) | -10 | Balanced with earn side |
| Distributed inference failure | -50 | Automatic penalty |
Balanced rates: Both earn and spend use rate × tokens — no layer multiplier. A 22-layer model serving 100 tokens earns the same as it costs to consume, preventing credit inflation.
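The invariant can be stated in two lines of code: serving N tokens earns exactly what consuming N tokens costs, with no layer multiplier on either side. Constants mirror the table above; function names are illustrative.

```rust
// Balanced-rate sketch: earn and spend are both rate × tokens.
const RATE_SERVE_PER_TOKEN: i64 = 10;   // +10 per served token
const RATE_CONSUME_PER_TOKEN: i64 = 10; // -10 per consumed token

fn serve_earnings(tokens: i64) -> i64 {
    RATE_SERVE_PER_TOKEN * tokens // no layer multiplier
}

fn consume_cost(tokens: i64) -> i64 {
    RATE_CONSUME_PER_TOKEN * tokens
}
```

With a layer multiplier on the earn side, a 22-layer model would mint 22× more credits than its consumers spend; equal rates keep the pool zero-sum per request.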
All rates are configurable per pool via [pool.credit_rates] in config.
Minimum Balance Enforcement
Nodes with balance below -1000 credits have remote inference requests rejected. They receive a clear error message telling them to contribute (host shards, serve inference, seed data).
- Local API requests (from localhost) are always allowed regardless of balance
- This prevents free-riders from endlessly consuming without contributing
- The floor is configurable via the `MIN_BALANCE_FOR_INFERENCE` constant
Priority Tiers
Tiers are calculated from your credit balance relative to the network:
| Tier | Requirement | Concurrent Limit |
|---|---|---|
| Platinum | ≥90th percentile and balance > 0 | 2× base max |
| Gold | ≥70th percentile and balance > 0 | base max |
| Silver | Positive balance | ½ base max |
| Bronze | Zero or negative | ¼ base max (min 1) |
How it works: On each inference request, the router computes your network percentile from peer credit gossip data (deduplicated by NodeId to prevent Sybil stuffing) and calls calculate_tier(). Higher tiers dequeue first. Bronze nodes are never fully blocked — they get deprioritized but always get at least 1 concurrent slot.
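The tier mapping and slot caps from the table translate directly to code. This is a sketch of the logic the doc attributes to `calculate_tier()`; the function signatures here are assumptions.

```rust
// Tier from (percentile, balance), and concurrent-slot cap per tier.
#[derive(Debug, PartialEq)]
enum Tier { Bronze, Silver, Gold, Platinum }

fn calculate_tier(percentile: f64, balance: i64) -> Tier {
    if balance <= 0 {
        Tier::Bronze // zero or negative balance
    } else if percentile >= 90.0 {
        Tier::Platinum
    } else if percentile >= 70.0 {
        Tier::Gold
    } else {
        Tier::Silver
    }
}

fn concurrent_cap(tier: &Tier, base_max: usize) -> usize {
    match tier {
        Tier::Platinum => base_max * 2,
        Tier::Gold => base_max,
        Tier::Silver => (base_max / 2).max(1),
        Tier::Bronze => (base_max / 4).max(1), // never fully blocked
    }
}
```

With the default `max_concurrent_requests = 10` this reproduces the troubleshooting section's Bronze=2, Silver=5, Gold=10, Platinum=20.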
Anti-Abuse Mechanisms
- Anti-Sybil deduplication: Peer balance gossip is deduplicated by NodeId — a single peer can't stuff the percentile distribution by re-gossiping
- Atomic accumulation: forward-participation credits use an `AtomicI64` accumulator, flushed every 60s — no credits lost under high concurrency
- AntiGaming rate limiter: max 100 credit transactions per node per 5-minute window
- Self-dealing rejection: Transactions from/to same node are rejected
- Signed balance reports: Ed25519 signatures with 5-minute freshness window
Failure Penalties
When distributed inference fails:
- The requesting node is penalized (configurable `penalty_serve_failure`, default 50 credits)
- A `broadcast_pipeline_error()` message is sent to all pipeline participants
Transaction Security
- Every transaction requires dual Ed25519 signatures (serving node + requesting node)
- UUID deduplication prevents replay attacks (checked against DB)
- Balance arithmetic uses `saturating_add` (no overflow panics)
- Peer balance gossip rejects implausible values (abs > 100M)
Escrow
For large requests (above configurable threshold), credits are held in escrow:
- `create_escrow()` → `release_escrow()` (success) or `refund_escrow()` (failure)
- Balance is deducted BEFORE the escrow is persisted (crash-safe: losing credits beats minting free credits)
- Refunds are persisted to DB immediately
- Entries expire after 10 minutes with automatic refund
- Escrow and direct charge are mutually exclusive (no double-billing)
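The deduct-before-persist ordering can be sketched as a tiny state machine. Struct and method names here are illustrative, not the real ledger interface.

```rust
// Escrow lifecycle sketch: deduct first, persist second, then release
// or refund exactly once. A crash between the two steps loses credits
// rather than minting free ones.

struct Escrow { amount: i64, released: bool }

struct Ledger { balance: i64, escrows: Vec<Escrow> }

impl Ledger {
    fn create_escrow(&mut self, amount: i64) -> usize {
        self.balance = self.balance.saturating_sub(amount); // deduct FIRST
        self.escrows.push(Escrow { amount, released: false }); // then persist
        self.escrows.len() - 1
    }

    fn release_escrow(&mut self, id: usize) {
        self.escrows[id].released = true; // success: credits go to the server
    }

    fn refund_escrow(&mut self, id: usize) {
        let e = &mut self.escrows[id];
        if !e.released {
            self.balance = self.balance.saturating_add(e.amount);
            e.released = true; // refund at most once — no double-billing
        }
    }
}
```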
Device Pool Credit Forwarding
When devices are linked in a pool, member devices forward their earnings to the owner:
- Credit split configurable: 0-50% kept by member, rest forwarded
- Dual-signed `PoolCreditForward` (member signature + owner co-signature)
- Forwarded amount deducted from member balance before persisting
- Owner's `PoolManager` validates and applies credits atomically
Security & Encryption
Three Encryption Tiers
Tier 1: Pairwise Sessions (Unicast)
For direct peer-to-peer communication:
- Ed25519 → X25519 → ECDH → ChaCha20-Poly1305
- Forward secrecy via ephemeral X25519 re-keying every 10 minutes
- Nonce reuse prevented by session clearing on disconnect (`remove_session()`)
- Replay protection: RFC 6479 sliding window (128-bit bitmap) — allows packet reordering within the window while rejecting duplicates
- Nonce state updated only after successful decryption (prevents DoS)
- Pending ephemeral keys expire after 60 seconds (prevents memory exhaustion from unanswered re-keys)
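The RFC 6479-style window can be sketched with a `u128` bitmap: fresh sequence numbers advance the window, older ones are accepted only if inside the 128-entry window and not yet seen. This is a minimal illustration, not the session-layer code.

```rust
// Sliding-window replay check: bit i of `bitmap` marks that sequence
// number (max_seen - i) was already accepted.

struct ReplayWindow {
    max_seen: u64,
    bitmap: u128,
}

impl ReplayWindow {
    const WINDOW: u64 = 128;

    fn new() -> Self {
        Self { max_seen: 0, bitmap: 0 }
    }

    /// Returns true if `seq` is fresh, recording it as seen.
    fn check_and_update(&mut self, seq: u64) -> bool {
        if seq > self.max_seen {
            let shift = seq - self.max_seen;
            // Slide the window forward; a big jump clears it entirely.
            self.bitmap = if shift >= Self::WINDOW { 0 } else { self.bitmap << shift };
            self.bitmap |= 1;
            self.max_seen = seq;
            return true;
        }
        let offset = self.max_seen - seq;
        if offset >= Self::WINDOW {
            return false; // older than the window: reject
        }
        let bit = 1u128 << offset;
        if self.bitmap & bit != 0 {
            return false; // duplicate: replay
        }
        self.bitmap |= bit; // out-of-order but fresh: accept
        true
    }
}
```

Note the "update only after successful decryption" rule above: in the real path, `check_and_update` state must only be committed once the AEAD tag verifies, or an attacker could burn window slots with garbage packets.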
Tier 2: Pipeline Sealing (Inference)
For inference prompts and responses:
- Per-request ephemeral key
- Sealed prompt/response
- Wire tag: `TENSOR_TAG_ENCRYPTED = 0x10`
Pipeline sealing is active: the final segment encrypts output token IDs with the requester's X25519 public key. The final-segment node can see the sampled tokens before encryption — this is inherent to the architecture since sampling happens on that node. Intermediate nodes process activation tensors (protected by Tier 1 in transit) but never see the final plaintext output. See Pipeline Privacy Model for a full breakdown of what each node can see.
Tier 3: Sealed Gossip (Broadcasts)
For GossipSub messages:
- Epoch-based group key + mandatory Ed25519 origin signature
- All gossip messages MUST be `seal_signed()` — unsigned messages are rejected
- Sender authenticity is verified before processing
- 1-hour rotation cycle
Transport-Authenticated Dispatch
All inbound network messages carry transport-authenticated sender identity:
- libp2p Noise protocol authenticates peers at the transport layer
- `AuthenticatedMessage` wrapper carries the verified `NodeId` of the sender
- `MessageDispatcher` validates sender identity against message claims:
  - ShardAnnounce: sender must match `announce.node_id`
  - CreditTransaction: sender must be a party (from or to)
  - CreditGossip, NicknameGossip: sender must match claimed `node_id`
  - HealthPing/Pong: sender must match claimed `node_id`
  - EphemeralKeyExchange: sender must match `exchange.node_id`
- Mismatched messages are logged and dropped
Signed DHT Records
Kademlia DHT records are Ed25519-signed to prevent poisoning:
- Format: `[32B pubkey][64B signature][payload]`
- `start_providing_shards()` signs records with node identity
- Active verification: `verify_dht_value()` is called on all `GetRecordOk` results in NetworkManager — records with invalid or missing signatures are logged and discarded
- Records expire after 1 hour with automatic re-publication
Identity
- Ed25519 keypair generated on first run, stored in `identity.key`
- Private key never leaves the machine
- Public key = Node ID (first 8 bytes hex for display)
- Nickname system: Ed25519-signed records with timestamp-wins conflict resolution
- Nickname registry capped at 10,000 entries (requires peer_registry membership)
Trust & Reputation
TrustManager tracks per-peer scores (0.0-1.0, default 0.5):
| Event | Score Change |
|---|---|
| InferenceSuccess | +0.01 |
| ValidTransaction | +0.02 |
| SpotCheckFail | -0.10 |
| InvalidGossip | -0.05 |
| SignatureViolation | -0.20 |
Scores decay toward 0.5 over time (1% per health cycle, default 30 seconds). Trust factors into pipeline scheduling and credit tier weighting.
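The bookkeeping above can be modeled in a few lines. The deltas and decay rate come from this section; the function names and exact update rule are assumptions, not SwarmLLM's `TrustManager` code.

```python
# Illustrative model of TrustManager's scoring; only the deltas, clamping
# range, and 1%-per-cycle decay are taken from the documentation.

DELTAS = {
    "InferenceSuccess": 0.01,
    "ValidTransaction": 0.02,
    "SpotCheckFail": -0.10,
    "InvalidGossip": -0.05,
    "SignatureViolation": -0.20,
}

def apply_event(score: float, event: str) -> float:
    """Apply a trust event, clamping the result to [0.0, 1.0]."""
    return min(1.0, max(0.0, score + DELTAS[event]))

def decay(score: float, rate: float = 0.01) -> float:
    """Pull the score 1% of the way back toward the 0.5 baseline per health cycle."""
    return score + (0.5 - score) * rate
```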
Sybil Resistance
- Subnet clustering detection: >5 nodes per /24 → elevated spot-check rate
- Signed-only balance reports
- Timestamp freshness checks on gossip (5 min window, rejects >5 min old)
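The /24 clustering heuristic is straightforward to sketch. The >5 threshold comes from the list above; the function name and input shape are assumptions.

```python
import ipaddress
from collections import Counter

# Sketch of the subnet-clustering check described above: more than 5 peers
# in one /24 triggers the elevated spot-check rate.

def clustered_subnets(peer_ips: list[str], threshold: int = 5) -> set[str]:
    """Return the /24 subnets hosting more than `threshold` peers."""
    counts = Counter(
        str(ipaddress.ip_network(f"{ip}/24", strict=False)) for ip in peer_ips
    )
    # Peers inside these subnets would be spot-checked more aggressively.
    return {net for net, n in counts.items() if n > threshold}
```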
API Authentication
- Auto-generated 32-byte hex Bearer token (constant-time comparison)
- Protected: `/v1/*`, `/api/admin/provider-models`, config PUT, shutdown, HF downloads, API key endpoint
- Exempt: `/`, `/health`, `/admin` (read-only dashboard), static assets
- Request body limit: 32 MB (raised from 2 MB to support VLM image payloads)
- Content-Security-Policy: `default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline'; connect-src 'self' ws: wss:; img-src 'self' data: blob:; frame-ancestors 'none'; base-uri 'self'; form-action 'self'`
- X-Content-Type-Options: nosniff
- X-Frame-Options: DENY
- Referrer-Policy: no-referrer
- WebSocket Origin validation (rejects cross-site WebSocket hijacking)
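Why constant-time comparison matters: a plain `==` short-circuits on the first differing byte, so response timing can leak how long a matching prefix is. A Python equivalent of the Bearer-token check (SwarmLLM itself does this in Rust):

```python
import hmac

# Illustrative constant-time token check using the stdlib's
# hmac.compare_digest, which does not short-circuit on mismatches.

def token_ok(presented: str, expected: str) -> bool:
    return hmac.compare_digest(presented.encode(), expected.encode())
```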
Input Validation
- Model field length: max 256 chars in OpenAI + Anthropic handlers
- Tools array: max 128 entries
- Stop sequences: max 16 entries
- HuggingFace repo_id: validated `owner/repo` format (alphanumeric, hyphens, dots, underscores, max 96 chars)
- HuggingFace filename: must end in `.gguf`, no `..`, no URL metacharacters
- Path traversal: `sanitize_path_component()` on all network-provided model IDs before filesystem operations
- Update URLs: only GitHub download URLs accepted
- Update binaries: SHA256 checksum verification mandatory
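A sketch of the HuggingFace input checks listed above. The exact patterns SwarmLLM uses may differ; in particular, the set of rejected "URL metacharacters" here is an assumption.

```python
import re

# Illustrative re-implementation of the repo_id and filename validation
# rules; not the actual SwarmLLM (Rust) code.

REPO_ID = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9._-]*$")

def valid_repo_id(repo_id: str) -> bool:
    # owner/repo, alphanumeric plus hyphens/dots/underscores, max 96 chars
    return len(repo_id) <= 96 and REPO_ID.fullmatch(repo_id) is not None

def valid_gguf_filename(name: str) -> bool:
    return (
        name.endswith(".gguf")
        and ".." not in name
        and not any(c in name for c in "/\\?#%&")  # assumed metacharacter set
    )
```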
Rate Limiting & DoS Protection
- Per-IP rate limiter with periodic cleanup (5 min intervals)
- Inference queue depth cap: 512 requests
- HTTP timeout: 5 minutes (Slowloris protection via tower-http TimeoutLayer)
- Credit transaction signature verification before ledger apply
Pipeline Privacy Model
Distributed inference splits a model across multiple nodes. This creates inherent privacy trade-offs — each node in the pipeline must process data to do its job. This section documents exactly what each node can see.
What each node sees during inference
Consider a 3-node pipeline: Requester → Node A (layers 0-10) → Node B (layers 11-21) → Node C (layers 22-27, final):
| Data | Requester | Node A (first) | Node B (middle) | Node C (last) |
|---|---|---|---|---|
| Plaintext prompt | Yes (author) | See below* | No | No |
| Raw token IDs | Yes | See below* | No | No |
| Input activations | — | Yes | Yes | Yes |
| Output activations | — | Yes | Yes | — |
| Generated token IDs | Yes (decrypted) | No | No | Yes (samples them) |
| Final plaintext response | Yes (decrypted) | No | No | Yes (before sealing) |
*Node A's visibility depends on the `local_embedding_privacy` setting — see below.
Risk: First-segment node sees raw tokens (default)
Without `local_embedding_privacy` (default): The first-segment node (Node A) receives the raw prompt text or token IDs to perform the embedding lookup. This means Node A can read the user's prompt in plaintext.
With `local_embedding_privacy: true`: The requesting node performs the embedding lookup locally and sends pre-embedded activation tensors. Node A receives floating-point vectors instead of token IDs. This is a significant privacy improvement, but not absolute — see Activation Inversion Risk below.
Risk: Final-segment node sees generated output
The final-segment node (Node C) must sample tokens from the logit distribution. This is fundamental — sampling is the act of choosing the next word, and it can only happen where the final layer's output logits exist. Node C therefore sees every generated token before encrypting them via Tier 2 pipeline sealing.
This cannot be mitigated architecturally. The node that runs the last transformer layer and samples tokens will always know what tokens were sampled. Pipeline sealing ensures the tokens are encrypted before being sent back over the network, so intermediate nodes and eavesdroppers cannot read the response — but the final-segment node itself can.
Risk: Activation inversion attacks
All intermediate nodes see hidden-state activation tensors (floating-point matrices). Research has shown that activations from early transformer layers can sometimes be partially inverted to recover input tokens, especially:
- Embedding-layer activations (layer 0 output) — most vulnerable, essentially a lookup table that can be reversed
- Early layers (1-4) — progressively harder to invert as information mixes across token positions
- Deep layers (5+) — extremely difficult to invert in practice; activations encode abstract features, not token identity
Mitigations in SwarmLLM:
- `local_embedding_privacy: true` — the requesting node performs the embedding locally, so the first segment never receives the trivially invertible embedding output. It receives post-layer-0 activations at the earliest.
- Tier 1 encryption — all inter-node tensor transfers are encrypted with ChaCha20-Poly1305, preventing network-level eavesdropping
- Pipeline scheduling preference — the scheduler prefers local segments for the first layers when possible
Risk: Byzantine tensor manipulation
A malicious node can send garbage activations instead of computing the actual transformer layers. This produces incorrect output without detection unless spot-checked. Mitigations: probabilistic spot-check validation (5% rate, 25% for subnet-clustered peers) with trust score reduction on failure.
Summary of privacy guarantees
| Configuration | Prompt privacy | Response privacy | Activation risk |
|---|---|---|---|
| Default (no privacy flags) | First segment sees plaintext | Final segment sees plaintext | Intermediate nodes see activations |
| `local_embedding_privacy: true` | No remote node sees raw tokens | Final segment sees plaintext | Reduced — no trivial embedding inversion |
| `encrypted_pipeline: true` | No remote node sees raw tokens | No remote node sees output | Only intermediate activations visible to remote nodes |
| + Tier 2 pipeline sealing | No remote node sees raw tokens | Encrypted on the wire | Reduced — no trivial embedding inversion |
| All protections enabled | Best available | Best available | Remote nodes only see intermediate activations; inversion theoretically possible but computationally expensive |
Bottom line: With `encrypted_pipeline`, no remote node sees plaintext input or output — the pipeline "boomerangs" through remote nodes and returns to the requester. This is the strongest privacy mode. Without it, `local_embedding_privacy` still protects raw token IDs, but the final-segment node sees generated output.
Local Embedding Privacy
When `local_embedding_privacy: true` is set in the `[inference]` config, the requesting node performs the token→embedding lookup locally before sending activations to the first pipeline segment. Remote nodes never see raw token IDs — only hidden-state activation tensors.
How it works:
- On startup, `LocalEmbedder` loads `token_embd.weight` from `shard_000.bin` (~64 MB for a 7B Q4 model)
- The requesting node tokenizes the prompt and performs the embedding lookup locally (~1 ms)
- The resulting hidden-state tensor (`[1, seq_len, hidden_dim]`, FP32) is sent as `LayerForward.activations` with `pre_embedded: true`
- The receiving first-segment node skips its embedding lookup and processes the pre-embedded activations directly
Wire format: The `pre_embedded` flag on `LayerForward` is `#[serde(default)]`, so old nodes receiving new-format messages default to `false` (backward compatible).
Trade-off: Pre-embedded activations are larger than raw text (e.g., 512 tokens × 4096 hidden × 4 bytes = 8 MB vs ~2 KB of text). This matches the existing inter-segment activation sizes, so it does not change the bandwidth profile of distributed inference.
Relevant code: `src/inference/local_embedder.rs`, `src/inference/pipeline/`, `src/daemon/state/mod.rs` (`local_embedders` DashMap).
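The 8 MB activation-size figure quoted in this section is easy to sanity-check:

```python
# Verifying the trade-off arithmetic from the Local Embedding Privacy section:
# 512 tokens, hidden dimension 4096, 4 bytes per FP32 value.
tokens, hidden_dim, bytes_per_fp32 = 512, 4096, 4
activation_bytes = tokens * hidden_dim * bytes_per_fp32
assert activation_bytes == 8 * 1024 * 1024  # exactly 8 MB, vs ~2 KB of raw text
```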
Encrypted Pipeline
When `encrypted_pipeline: true` is enabled (globally or per-model), the pipeline scheduler forces the requesting node to handle both the first and last segments. This creates a "boomerang" topology:
Requester (shard 0, embed) → Remote A (middle shards) → ... → Requester (final shard, decode)
No remote node ever sees plaintext — neither the raw prompt tokens nor the generated output. Remote nodes only process intermediate hidden-state activations.
Requirements:
- The requesting node must hold shard 0 (embedding table) AND the final shard (output head)
- `local_embedding_privacy` is auto-enabled when the encrypted pipeline is active
- Only useful for models with 3+ shards (2-shard models run fully locally, with no distribution)
Overhead:
- Adds ~1 extra network RTT per generated token (activations must return to the requester for final decoding)
- Latency increase depends on distance to the furthest remote segment
- No bandwidth overhead vs normal distributed inference (activation sizes are the same)
Per-model configuration:
- API: `GET/PUT /api/admin/models/{id}/encrypted-pipeline`
- Dashboard: gear icon on model card → "Encrypted pipeline" checkbox
- Global fallback: `encrypted_pipeline = true` in `[inference]` config
- Per-model overrides are persisted to the database
Relevant code: `src/inference/scheduler/mod.rs` (`greedy_assign`), `src/inference/pipeline/` (auto-enable local embedding), `src/api/admin_models/mod.rs` (API endpoints), `src/daemon/state/mod.rs` (`encrypted_pipeline_models` DashMap).
Known Limitations
These are architectural properties that cannot be fully mitigated with code changes:
- Gossip epoch key is publicly derivable — derived from "swarmllm-mainnet-v1". Gossip encryption is defense-in-depth; Ed25519 signing is the primary security mechanism.
- Final-segment output visibility — the node running the last transformer layers sees all generated tokens before pipeline sealing encrypts them. This is inherent to the architecture (see Pipeline Privacy Model).
- Activation inversion — hidden-state tensors passed between nodes can theoretically be inverted to recover input, especially from early layers. `local_embedding_privacy` eliminates the trivial case (embedding-lookup reversal); deep-layer inversion remains an open research problem.
- Byzantine tensor manipulation — malicious peers can send garbage activations. Mitigation: probabilistic spot-check validation (5% rate, 25% for subnet-clustered peers) with trust-score reduction on failure.
- Sybil credit farming — Ed25519 keys are free. Anti-gaming heuristics help but are not bulletproof.
- GGUF parser vulnerabilities — llama.cpp CVEs. BLAKE3 content hash gates shard loading but parser bugs remain upstream.
- Kademlia eclipse attacks — strategic Sybil node IDs can control DHT routing. K-bucket eviction policies help.
Storage & Data
Data Directory Layout
~/.local/share/swarmllm/
├── config.toml # User configuration
├── identity.key # Ed25519 keypair
├── api_key # Bearer token (auto-generated)
├── db.redb # redb database (migrated from sled db/ directory)
└── models/
├── qwen2.5-coder-7b/
│ ├── manifest.json
│ ├── gguf_header.bin
│ ├── shard_000.bin
│ └── shard_001.bin
└── tinyllama-1.1b/
└── ...
Database Tables (redb)
| Table | Key | Value |
|---|---|---|
| config | "config" | Config |
| config | "api_key" | Bearer token string |
| identity | "keypair" | Encrypted Ed25519 key |
| credits | "balance" | CreditBalance |
| credit_txns | {uuid} | CreditTransaction |
| peer_trust | {node_id_hex} | TrustScore |
| peer_cache | {multiaddr} | () presence key |
| shard_meta | {model_id}/{index} | ShardInfo + path |
| model_meta | {model_id} | ModelManifest |
| sessions | {session_id} | KV-cache metadata |
| nicknames | {node_id_hex} | NicknameRecord |
| pool_state | "pool" | PoolState |
| trust_scores | {node_id_hex} | f64 trust score |
| escrow | {escrow_id} | EscrowEntry |
| hf_sources | {model_id} | HfSource metadata |
| locked_shards | {shard_id_json} | bool |
| resource_schedule | "current" | ResourceSchedule |
| model_trust | {model_id} | ModelTrustEntry (level, request count, last seen) |
Model Acquisition Pipeline
Network Registry (GossipSub/DHT)
│
▼
Manifest Check ──► Reject if BLAKE3 mismatch
│
▼
Shard Selection ──► Rarest-first (BitTorrent-style)
│
▼
Download Loop ──► Atomic write to .tmp, rename to .bin
│
▼
Shard Verify ──► BLAKE3 vs manifest hash
│
▼
Model Ready
Integrity guarantees:
- Manifests verified via BLAKE3 self-hash
- Each shard verified against manifest hash
- Failed shards renamed to `.bin.quarantine`; the serving peer is penalized
- Downloads retried (3 attempts, exponential backoff)
- Atomic writes prevent corrupt partial files
- Stale `.tmp` files cleaned on startup
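The atomic-write-plus-retry behavior can be sketched as below. Function and parameter names are illustrative (the real loop is Rust), and the BLAKE3 verification step is omitted since `blake3` is not in the Python standard library.

```python
import os
import time

# Sketch of the download loop described above: write to .tmp, atomically
# rename to the final path, retry up to 3 times with exponential backoff.

def download_shard(fetch, dest_path: str, attempts: int = 3, base_delay: float = 1.0) -> None:
    delay = base_delay
    for attempt in range(attempts):
        try:
            data = fetch()
            tmp = dest_path + ".tmp"
            with open(tmp, "wb") as f:
                f.write(data)           # partial writes only ever touch .tmp
            os.replace(tmp, dest_path)  # atomic rename: readers never see a torn file
            return
        except OSError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
            delay *= 2                  # exponential backoff between retries
```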
OpenAI-Compatible API
SwarmLLM provides a drop-in replacement for the OpenAI API. All endpoints require Bearer token authentication.
POST /v1/chat/completions
Chat completions with streaming support.
curl http://localhost:8800/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5-coder-7b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Rust?"}
],
"stream": true,
"max_tokens": 512,
"temperature": 0.7
}'
Request Body
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `model` | string | yes | — | Model name (or `"auto"` for first available) |
| `messages` | array | yes | — | Chat messages (role + content). Roles: `system`, `user`, `assistant`, `tool` |
| `stream` | boolean | no | false | Enable SSE streaming |
| `max_tokens` | integer | no | 2048 | Max tokens to generate (clamped to 1–32768) |
| `temperature` | float | no | 0.7 | Sampling temperature (0.0–2.0) |
| `top_p` | float | no | 1.0 | Nucleus sampling threshold |
| `stop` | string or array | no | — | Stop sequence(s), 1–256 chars each, max 16 |
| `frequency_penalty` | float | no | 0.0 | Frequency penalty (-2.0 to 2.0) |
| `presence_penalty` | float | no | 0.0 | Presence penalty (-2.0 to 2.0) |
| `tools` | array | no | — | Tool/function definitions for function calling |
| `tool_choice` | string or object | no | — | `"none"`, `"auto"`, `"required"`, or `{"type":"function","function":{"name":"..."}}` |
| `logprobs` | boolean | no | false | Return log probabilities for output tokens. Supported on split-model (candle) inference paths |
| `top_logprobs` | integer | no | — | Number of top log probabilities per token (0–20, requires `logprobs: true`). Computed from pre-sampling (raw) logits per OpenAI spec |
| `session_id` | string | no | — | Reuse KV-cache from a previous request |
| `lora_adapter` | string | no | — | LoRA adapter ID for fine-tuned inference |
Response (non-streaming)
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"model": "qwen2.5-coder-7b",
"choices": [{
"index": 0,
"message": {"role": "assistant", "content": "Rust is a systems programming language..."},
"finish_reason": "stop",
"logprobs": null
}],
"usage": {
"prompt_tokens": 15,
"completion_tokens": 42,
"total_tokens": 57
}
}
Response with logprobs
When logprobs: true and top_logprobs: 3:
{
"choices": [{
"message": {"role": "assistant", "content": "Hello"},
"finish_reason": "stop",
"logprobs": {
"content": [{
"token": "Hello",
"logprob": -0.234,
"bytes": null,
"top_logprobs": [
{"token": "Hello", "logprob": -0.234, "bytes": null},
{"token": "Hi", "logprob": -1.456, "bytes": null},
{"token": "Hey", "logprob": -2.012, "bytes": null}
]
}]
}
}]
}
Response with tool_calls
When the model calls a tool, finish_reason is "tool_calls" and content is null:
{
"choices": [{
"message": {
"role": "assistant",
"content": null,
"tool_calls": [{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\":\"NYC\"}"
}
}]
},
"finish_reason": "tool_calls"
}]
}
Streaming (SSE)
When stream: true, responses arrive as Server-Sent Events:
data: {"id":"chatcmpl-abc123","choices":[{"delta":{"content":"Rust"},"index":0}]}
data: {"id":"chatcmpl-abc123","choices":[{"delta":{"content":" is"},"index":0}]}
data: [DONE]
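A minimal client-side accumulator for this stream shape. This is a sketch that assumes each event's JSON payload fits on a single `data:` line, as in the example above; production SSE parsers also handle multi-line events.

```python
import json

# Accumulate the delta contents of an OpenAI-style SSE stream into one string.

def collect_stream(lines: list[str]) -> str:
    out = []
    for line in lines:
        if not line.startswith("data: "):
            continue                      # ignore blank lines and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break                         # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0]["delta"]
        out.append(delta.get("content", ""))
    return "".join(out)
```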
GET /v1/models
List available models.
curl http://localhost:8800/v1/models \
-H "Authorization: Bearer YOUR_API_KEY"
{
"object": "list",
"data": [
{
"id": "qwen2.5-coder-7b",
"object": "model",
"owned_by": "swarmllm"
}
]
}
GET /v1/status
Node status (SwarmLLM extension).
curl http://localhost:8800/v1/status \
-H "Authorization: Bearer YOUR_API_KEY"
Using with OpenAI Client Libraries
Python (openai)
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8800/v1",
api_key="YOUR_API_KEY"
)
# Basic streaming
response = client.chat.completions.create(
model="qwen2.5-coder-7b",
messages=[{"role": "user", "content": "Hello!"}],
stream=True
)
for chunk in response:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="")
Python — Function calling
response = client.chat.completions.create(
model="qwen2.5-coder-7b",
messages=[{"role": "user", "content": "What's the weather in NYC?"}],
tools=[{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather",
"parameters": {
"type": "object",
"properties": {"location": {"type": "string"}},
"required": ["location"]
}
}
}],
tool_choice="auto"
)
if response.choices[0].finish_reason == "tool_calls":
for tc in response.choices[0].message.tool_calls:
print(f"Call {tc.function.name}({tc.function.arguments})")
JavaScript (openai)
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:8800/v1",
apiKey: "YOUR_API_KEY",
});
const stream = await client.chat.completions.create({
model: "qwen2.5-coder-7b",
messages: [{ role: "user", content: "Hello!" }],
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
curl (streaming)
curl -N http://localhost:8800/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"qwen2.5-coder-7b","messages":[{"role":"user","content":"Hello!"}],"stream":true}'
POST /v1/embeddings
Returns 503 Service Unavailable. Text embeddings are not supported via the subprocess inference path. Use a dedicated embedding provider or the OpenAI embeddings API directly.
GET /v1/providers
List configured cloud providers and their available models.
curl http://localhost:8800/v1/providers \
-H "Authorization: Bearer YOUR_API_KEY"
Returns an array of { name, models: [...] } objects for each configured provider.
Responses API
OpenAI's /v1/responses is the default API for o-series and gpt-5-series
models in 2026 and the replacement for the sunsetting Assistants API
(2026-08-26). SwarmLLM exposes the full v1 surface plus follow-on
features such as resumable streams, async background runs, MCP tool
integration, and conversation chaining via previous_response_id.
Endpoints
| Method | Path | Purpose |
|---|---|---|
| POST | `/v1/responses` | Create a response (streaming or not, foreground or background) |
| GET | `/v1/responses/{id}` | Fetch a stored response. With `?stream=true&starting_after=N`, resume the SSE stream from event N (V5). |
| DELETE | `/v1/responses/{id}` | Delete a stored response. |
| POST | `/v1/responses/{id}/cancel` | Cancel a background response (M9). The cancel flag is checked at completion time; per-token interruption is deferred. |
| GET | `/v1/responses/{id}/input_items` | Paginated input-item listing (V4) for chained `previous_response_id` flows. |
| GET | `/api/admin/responses` | Admin: list all stored response records (used by the dashboard). |
All endpoints accept the same Bearer-auth header as the rest of the API.
Routing
POST /v1/responses picks one of three execution paths in this order:
- Cloud proxy — when the requested `model` resolves to an OpenAI-routed provider, the request is serialized verbatim and forwarded to the upstream `/v1/responses` endpoint. Built-in tools, streaming, background, reasoning effort, `text.verbosity`, `include[]`, `previous_response_id`, and any future field round-trip via `#[serde(flatten)]` extras.
- Anthropic-Messages bridge (V3) — when the model resolves to an Anthropic provider (or the local `claude-subscription` subprocess), the Responses request is translated to an Anthropic Messages request, forwarded, and translated back. This lets Claude Code clients drive `/v1/responses` end-to-end without losing tool-call or streaming semantics.
- Local inference — translates to `/v1/chat/completions` and runs on the local model. Function tools and `tool_choice` translate through; built-in tools (`web_search`, `file_search`, `computer_use_preview`, `code_interpreter`, `image_generation`, `mcp`, `custom`) are rejected with HTTP 400 because they require backing infrastructure SwarmLLM does not run.
Capabilities
- Multimodal input (V2) — `input_image` and `input_file` (UTF-8 only) parts in the structured `input` array. Binary file payloads (PDF, docx, image bytes via `file_data`) are rejected with a clear hint pointing at `input_image`.
- Function tools — `tools` definitions and `tool_choice` translate to OpenAI Chat Completions tool semantics; assistant `tool_calls` map back to `function_call` output items.
- Streaming SSE (M6 + V1) — `stream=true` emits the full Responses event sequence (`response.created` → `response.in_progress` → `response.output_item.added` → `response.content_part.added` → per-delta `response.output_text.delta` → `response.output_text.done` → `response.content_part.done` → `response.output_item.done` → `response.completed`). The V1 fix shipped on 2026-04-25 cuts first-token latency by emitting `created` and `in_progress` before model warmup instead of after.
- Persistence (M7) — `store=true` (the OpenAI default) writes the full response object to redb with a 30-day TTL. `previous_response_id` (M8) chains follow-up requests by prepending the prior turn's messages before the new input.
- Background mode (M9 + V8) — `background=true` returns HTTP 202 with a `Location: /v1/responses/{id}` header; the client polls or, with `background=true && stream=true`, opens a resumable SSE connection at `GET /v1/responses/{id}?stream=true` that replays buffered events and then tails the live producer.
Validation (ingress)
The handler runs `validate_responses_ingress` before any routing decision, so the cloud-proxy and Anthropic-bridge paths can't forward attacker-sized strings to upstream providers (where they'd burn quota or land in log lines). Caps:
| Field | Limit |
|---|---|
| `model` | 1..=256 chars |
| `previous_response_id` | ≤64 ASCII alphanumeric (`_` / `-` allowed); generation format is `resp_<32-hex>` |
| `instructions` | ≤2 MB |
| `user` | ≤256 chars |
| `truncation`, `service_tier` | ≤64 chars each |
| `metadata` | ≤64 KB total (keys + values) |
Stop / temperature / top_p / max_tokens are clamped or validated at the sampling-params layer.
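The `previous_response_id` cap is the one validation rule with two shapes — a broad accept pattern and a narrower generation format — so it makes a good worked example. Sketch only; the actual Rust validation code may differ.

```python
import re

# Illustrative check for previous_response_id: accept up to 64 chars of
# [A-Za-z0-9_-], while SwarmLLM generates IDs in the form resp_<32-hex>.

ACCEPTED = re.compile(r"^[A-Za-z0-9_-]{1,64}$")
GENERATED = re.compile(r"^resp_[0-9a-f]{32}$")

def valid_previous_response_id(rid: str) -> bool:
    return ACCEPTED.fullmatch(rid) is not None
```

Accepting more than is generated keeps the check forward-compatible with upstream providers' ID formats.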
Dashboard
The admin dashboard exposes a Responses tab (/admin/responses) backed by
GET /api/admin/responses. It shows the most-recent stored response
records with status, model, input snippet, and per-record cancel/delete
actions.
Deferred
- `POST /v1/responses/compact` (V9) — no concrete caller has asked for it.
- Token-level cancel for background inference — the current cancel flips a flag checked at completion time; per-token interruption needs hooks in `chat_completions` that are out of v2 plan scope.
- Server-side `conversation` resource CRUD — OpenAI's `conversation` parameter forwards through the cloud proxy verbatim today; a local conversation type with its own endpoints is a separate design.
- Built-in tools on the local path — see "Local inference" above.
- `custom` tools with Lark / regex grammars — rejected on local, forwarded on cloud. Local grammar-constrained generation is a candle-side project.
- Audio input on `/v1/responses` — `input_audio` returns 400; needs a Whisper-class transcription model SwarmLLM doesn't currently expose.
- Binary file inputs in `input_file{file_data}` — UTF-8 only; PDF/docx/image-bytes payloads are rejected with a clear hint pointing at `input_image` (for images) or server-side text extraction.
Anthropic Messages API
SwarmLLM provides a full Anthropic Messages API at POST /v1/messages, enabling it to serve as a drop-in backend for Claude Code and other Anthropic-compatible clients.
Claude Code Integration
Use SwarmLLM as your Claude Code backend to access all models (local, network, and cloud) through a single endpoint:
ANTHROPIC_BASE_URL=http://localhost:8800 claude --model qwen2.5-coder-7b
Environment Variables
| Variable | Description |
|---|---|
| `ANTHROPIC_BASE_URL` | Point to your SwarmLLM node (e.g., `http://localhost:8800`) |
| `ANTHROPIC_AUTH_TOKEN` | Your node's API key (from Settings or `/api/admin/api-key`) |
| `ANTHROPIC_MODEL` | Default model to use |
POST /v1/messages
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | yes | Model name (local GGUF, network model, or cloud model like `gpt-4o`) |
| `messages` | array | yes | Chat messages with role + content |
| `max_tokens` | integer | yes | Maximum tokens to generate (clamped to 1–32768) |
| `system` | string or array | no | System prompt (supports `cache_control` blocks) |
| `stream` | boolean | no | Enable SSE streaming |
| `temperature` | float | no | Sampling temperature |
| `top_p` | float | no | Nucleus sampling |
| `stop_sequences` | array | no | Stop sequences, 1–256 chars each, max 16 |
| `tools` | array | no | Tool definitions for function calling |
| `tool_choice` | object | no | Tool selection strategy |
| `metadata` | object | no | Request metadata |
| `thinking` | object | no | Extended thinking configuration |
Content Block Types
Messages can contain these content block types:
// Text
{"type": "text", "text": "Hello, world!"}
// Image (base64)
{"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "..."}}
// Tool use (assistant response)
{"type": "tool_use", "id": "toolu_123", "name": "get_weather", "input": {"location": "NYC"}}
// Tool result (user message)
{"type": "tool_result", "tool_use_id": "toolu_123", "content": "72F, sunny"}
// Thinking (extended thinking)
{"type": "thinking", "thinking": "Let me reason about this..."}
// Redacted thinking
{"type": "redacted_thinking", "data": "..."}
Response
{
"id": "msg_abc123",
"type": "message",
"role": "assistant",
"model": "qwen2.5-coder-7b",
"content": [
{"type": "text", "text": "Here's my response..."}
],
"stop_reason": "end_turn",
"usage": {
"input_tokens": 25,
"output_tokens": 150
}
}
Model Routing
Requests are routed based on the model name:
| Model Pattern | Route | Details |
|---|---|---|
| Local GGUF model | Local inference | Tool calls and thinking blocks converted to text |
| `claude-*` | Anthropic API | Full pass-through (all fields preserved, including tools and thinking) |
| `gpt-*`, `o1-*`, `o3-*`, `o4-*` | OpenAI | Anthropic→OpenAI format translation |
| `deepseek-*` | DeepSeek | Anthropic→OpenAI format translation |
| `mistral-*`, `codestral-*`, `pixtral-*` | Mistral | Anthropic→OpenAI format translation |
| `llama-*`, `groq-*` | Groq | Anthropic→OpenAI format translation |
| `nim-*` | NVIDIA NIM | Anthropic→OpenAI format translation |
| `cerebras-*` | Cerebras | Anthropic→OpenAI format translation |
| `samba-*` | SambaNova | Anthropic→OpenAI format translation |
| `fireworks-*`, `accounts/fireworks/*` | Fireworks AI | Anthropic→OpenAI format translation |
| `together-*` | Together AI | Anthropic→OpenAI format translation |
| `deepinfra-*` | DeepInfra | Anthropic→OpenAI format translation |
| `moonshot-*`, `kimi-*` | Moonshot/Kimi | Anthropic→OpenAI format translation |
| Network model | Distributed inference | Routed through swarm P2P network |
All 12 cloud providers are supported. Configure API keys via the dashboard Settings page or by placing a `.env` file in the data directory (`~/.local/share/swarmllm/.env`) with standard variable names (e.g., `OPENAI_API_KEY`, `DEEPSEEK_API_KEY`).
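The prefix-based routing in the table above can be sketched as a simple first-match lookup. The real dispatcher is Rust, and the match order shown here is an assumption:

```python
# Illustrative first-match prefix router mirroring the model-routing table.

ROUTES = [
    (("claude-",), "anthropic"),
    (("gpt-", "o1-", "o3-", "o4-"), "openai"),
    (("deepseek-",), "deepseek"),
    (("mistral-", "codestral-", "pixtral-"), "mistral"),
    (("llama-", "groq-"), "groq"),
    (("nim-",), "nvidia-nim"),
    (("cerebras-",), "cerebras"),
    (("samba-",), "sambanova"),
    (("fireworks-", "accounts/fireworks/"), "fireworks"),
    (("together-",), "together"),
    (("deepinfra-",), "deepinfra"),
    (("moonshot-", "kimi-"), "moonshot"),
]

def route(model: str) -> str:
    for prefixes, provider in ROUTES:
        if model.startswith(prefixes):  # str.startswith accepts a tuple
            return provider
    # Anything else falls through to local GGUF or distributed swarm inference.
    return "local-or-network"
```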
System Blocks with Cache Control
Anthropic-compatible prompt caching:
{
"system": [
{"type": "text", "text": "You are a helpful assistant.", "cache_control": {"type": "ephemeral"}}
]
}
Streaming (SSE)
When stream: true, responses arrive as Server-Sent Events following the Anthropic streaming format:
event: message_start
data: {"type":"message_start","message":{"id":"msg_123","type":"message","role":"assistant","model":"qwen2.5-coder-7b","content":[]}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}
event: message_stop
data: {"type":"message_stop"}
MCP Server
SwarmLLM includes a native Model Context Protocol (MCP) server at POST /mcp. This enables AI agents like Claude Code, Cursor, VS Code Copilot, and other MCP-compatible tools to use your SwarmLLM node as a tool provider.
Protocol version: 2025-11-05 (JSON-RPC 2.0 over HTTP).
Endpoint
POST /mcp
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY
All requests use JSON-RPC 2.0 format. All tools include tool annotations (readOnlyHint, destructiveHint, etc.).
Available Tools
chat
Send a message to any model available on the node (local, network, or cloud).
{
"jsonrpc": "2.0",
"method": "tools/call",
"params": {
"name": "chat",
"arguments": {
"model": "qwen2.5-coder-7b",
"messages": [
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Explain Rust's ownership model"}
],
"temperature": 0.7,
"max_tokens": 2048
}
},
"id": 1
}
models
List all available models (local + network + cloud).
{
"jsonrpc": "2.0",
"method": "tools/call",
"params": { "name": "models", "arguments": {} },
"id": 2
}
compare
Send the same prompt to multiple models concurrently and get side-by-side results. Up to 10 models per comparison.
{
"jsonrpc": "2.0",
"method": "tools/call",
"params": {
"name": "compare",
"arguments": {
"prompt": "Write a function to check if a number is prime",
"models": ["qwen2.5-coder-7b", "gpt-4o", "claude-sonnet-4-20250514"],
"system": "Write clean, efficient code.",
"max_tokens": 1024
}
},
"id": 3
}
research
Fan out a research question to multiple models in parallel. Designed for knowledge gathering — offload questions to cheap/fast models to get diverse perspectives without using expensive model tokens. If models is omitted, auto-selects available models (local first, then cloud).
{
"jsonrpc": "2.0",
"method": "tools/call",
"params": {
"name": "research",
"arguments": {
"question": "What are the tradeoffs between ring-allreduce and star topology for tensor parallelism?",
"models": ["deepseek-chat", "gpt-4o-mini", "qwen2.5-coder-7b"],
"system": "Be concise and technical.",
"max_tokens": 2048
}
},
"id": 4
}
Response:
{
"question": "What are the tradeoffs...",
"models_queried": 3,
"successful_responses": 3,
"total_tokens_used": 1847,
"results": [
{
"model": "deepseek-chat",
"response": "Ring-allreduce...",
"input_tokens": 24,
"output_tokens": 512,
"latency_ms": 2100,
"status": "ok"
}
]
}
batch_prompts
Execute multiple independent prompts in parallel, each targeting a specific model. Ideal for offloading parallel subtasks — e.g., ask one model to summarize, another to translate, another to review code, all at once.
{
"jsonrpc": "2.0",
"method": "tools/call",
"params": {
"name": "batch_prompts",
"arguments": {
"tasks": [
{
"id": "summary",
"model": "gpt-4o-mini",
"prompt": "Summarize this error log: ...",
"max_tokens": 512
},
{
"id": "fix",
"model": "qwen2.5-coder-7b",
"prompt": "Write a fix for this bug: ...",
"max_tokens": 1024
},
{
"id": "translate",
"model": "deepseek-chat",
"prompt": "Translate to Japanese: ...",
"max_tokens": 256
}
]
}
},
"id": 5
}
Response:
{
"tasks_submitted": 3,
"tasks_completed": 3,
"results": [
{
"task_id": "summary",
"model": "gpt-4o-mini",
"content": "The error log shows...",
"latency_ms": 890,
"status": "ok"
}
]
}
delegate
Offload a task to the most appropriate model based on a tier preference. Tiers: fast picks the lowest-latency local model, cheap picks a small/free model, smart picks the most capable available model (may use cloud). Saves subscription tokens by routing routine work to local/cheap models.
{
"jsonrpc": "2.0",
"method": "tools/call",
"params": {
"name": "delegate",
"arguments": {
"prompt": "Summarize this function in one sentence: ...",
"tier": "fast",
"max_tokens": 256
}
},
"id": 6
}
Tiers:
- fast — lowest-latency local model (default)
- cheap — smallest/free model available
- smart — most capable model (may use cloud provider)
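The tier routing described above can be sketched in a few lines. This is an illustrative reconstruction only: the model records, field names, and selection heuristics here are hypothetical, not SwarmLLM's internal representation.

```python
# Hypothetical sketch of tier-based model selection. Field names
# (location, latency_ms, params_b, cost_per_token) are illustrative.
def pick_model(models, tier="fast"):
    if tier == "fast":
        # lowest-latency local model
        local = [m for m in models if m["location"] == "local"]
        return min(local, key=lambda m: m["latency_ms"])["id"]
    if tier == "cheap":
        # smallest free model available
        free = [m for m in models if m["cost_per_token"] == 0]
        return min(free, key=lambda m: m["params_b"])["id"]
    if tier == "smart":
        # most capable model, cloud allowed
        return max(models, key=lambda m: m["params_b"])["id"]
    raise ValueError("unknown tier: " + tier)

models = [
    {"id": "tinyllama-1.1b", "location": "local", "latency_ms": 40,
     "params_b": 1.1, "cost_per_token": 0},
    {"id": "qwen2.5-coder-7b", "location": "local", "latency_ms": 90,
     "params_b": 7.0, "cost_per_token": 0},
    {"id": "gpt-4o", "location": "cloud", "latency_ms": 600,
     "params_b": 200.0, "cost_per_token": 1},
]
```

With this sample list, `fast` and `cheap` both resolve to the small local model, while `smart` escalates to the cloud model.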
node_info
Get detailed information about the SwarmLLM node: loaded models, connected peers, credit balance, available cloud providers, and network status.
{
"jsonrpc": "2.0",
"method": "tools/call",
"params": { "name": "node_info", "arguments": {} },
"id": 6
}
Available Resources
swarmllm://status
Returns node status information (version, model loaded, peer count).
{
"jsonrpc": "2.0",
"method": "resources/read",
"params": { "uri": "swarmllm://status" },
"id": 7
}
IDE Integration
Claude Code
Option A: MCP tools — access SwarmLLM's tools (research, batch_prompts, compare) alongside your normal model:
claude mcp add --transport http swarmllm http://localhost:8800/mcp \
--header "Authorization: Bearer YOUR_API_KEY"
Option B: Model backend — use SwarmLLM as your inference backend (routes all requests through the swarm):
ANTHROPIC_BASE_URL=http://localhost:8800 ANTHROPIC_AUTH_TOKEN=YOUR_API_KEY \
claude --model qwen2.5-coder-7b
Option C: Both — use Claude for reasoning, SwarmLLM MCP for offloading research to cheap models:
# Add SwarmLLM as MCP server
claude mcp add --transport http swarmllm http://localhost:8800/mcp \
--header "Authorization: Bearer YOUR_API_KEY"
# Then use Claude normally — it can call research/batch/compare tools via MCP
claude
VS Code (Copilot Chat)
Add to .vscode/mcp.json in your project:
{
"servers": {
"swarmllm": {
"type": "http",
"url": "http://localhost:8800/mcp",
"headers": {
"Authorization": "Bearer YOUR_API_KEY"
}
}
}
}
Copilot Chat will discover SwarmLLM's tools automatically. Use them by asking Copilot to research, compare models, or batch prompts.
Cursor / Windsurf / Other MCP Clients
Any MCP-compatible client can connect via HTTP:
URL: http://localhost:8800/mcp
Transport: HTTP (Streamable HTTP)
Auth: Bearer token in Authorization header
Continue.dev (OpenAI API)
If your IDE extension supports the OpenAI API format, point it directly at SwarmLLM:
{
"models": [{
"title": "SwarmLLM Local",
"provider": "openai",
"model": "qwen2.5-coder-7b",
"apiBase": "http://localhost:8800/v1",
"apiKey": "YOUR_API_KEY"
}]
}
Model Compare Dashboard
The compare functionality is also available in the web dashboard via the Compare tab. Select 2-10 models, enter a prompt, and view results side-by-side with latency, token counts, and response content.
Admin API
Admin endpoints are CORS-protected. Most read-only endpoints don't require Bearer auth; write operations do.
Node Management
GET /api/admin/stats
Node statistics and hardware info.
GET /api/admin/peers
Connected peers with latency, trust scores, and hosted models.
GET /api/admin/credits
Credit balance and tier info.
GET /api/admin/network-map
Geographic distribution of peers and shards across regions. Each entry includes the total peer count for that region, per-model shard-holder counts, per-model request demand rates, coverage gaps (models with zero holders in the region), and per-model replication targets derived from pool size and demand. Includes the local node in its auto-detected or configured region.
Response:
{
"regions": {
"US": {
"total": 3,
"models": { "tinyllama-1.1b-q4-k-m": 2 },
"demand": { "tinyllama-1.1b-q4-k-m": 5 },
"coverage_gaps": [],
"replication_target": { "tinyllama-1.1b-q4-k-m": 2 }
}
}
}
GET/PUT /api/admin/config
Read or update daemon configuration. PUT requires Bearer auth.
POST /api/admin/config/reload
Hot-reload operational parameters without restart. Bearer auth required.
POST /api/admin/shutdown
Gracefully shut down the node. Localhost only, Bearer auth required.
Model Management
GET /api/admin/models
List models with shard status, VRAM estimates, and acquisition state. Each model includes:
- mmproj field with available (bool), local (bool), and holders (count) for VLM vision encoder status
- trust_level field: one of "Discovered", "Pinned", "DemandVerified", or "NetworkPopular", indicating the model's trust status (auto-manage only downloads shards for DemandVerified+ or Pinned models)
POST /api/admin/models/{id}/add
Trigger model acquisition from the network.
GET /api/admin/models/{id}/status
Check model acquisition progress.
DELETE /api/admin/models/
Remove model (shards + manifest + state).
DELETE /api/admin/models/{id}/shards/
Delete a single shard.
GET/PUT /api/admin/models/{id}/auto-manage
Per-model auto-manage policy (including prune toggle).
GET/PUT /api/admin/models/{id}/encrypted-pipeline
Per-model encrypted pipeline toggle. GET returns current status, readiness (whether local node holds first + last shard), and overhead note. PUT enables/disables with body {"enabled": true}. Requires the local node to hold shard 0 and the final shard. Returns a warning for 2-shard models (fully local, no distribution benefit). Setting is persisted to the database and survives restarts. Falls back to global encrypted_pipeline config if no per-model override is set.
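The readiness rule above (local node must hold shard 0 and the final shard, with 2-shard models flagged as fully local) is small enough to state as code. A minimal sketch, with hypothetical function names:

```python
# Sketch of the encrypted-pipeline readiness rule described above.
# Function names are illustrative, not SwarmLLM API.
def pipeline_ready(local_shards, total_shards):
    """local_shards: set of shard indices held on this node."""
    return 0 in local_shards and (total_shards - 1) in local_shards

def no_distribution_benefit(total_shards):
    # For a 2-shard model, holding first + last means holding everything,
    # so the pipeline runs fully locally (the warning case noted above).
    return total_shards <= 2
```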
PUT /api/admin/models/{id}/shards/{index}/lock
Lock/unlock a shard to prevent auto-pruning.
Storage & Shards
POST /api/admin/rescan-shards
Rescan local shard files on disk and update the model registry and network announcements without restarting the daemon. Useful after manually placing shard files in the data directory. Bearer auth required.
Response:
{ "status": "ok", "models_updated": ["model-id-1"], "count": 1 }
GET /api/admin/models/{id}/metadata
Read parsed GGUF metadata from a locally-stored model header (gguf_header.bin). Returns architecture parameters, tokenizer settings, quantization type, and all raw metadata key/value pairs (tokenizer vocabulary arrays are excluded). Returns 400 if no header file exists for the model.
Response shape:
{
"model_id": "...",
"general": { "name": "...", "architecture": "llama", "architecture_supported": true, "file_type": 11, "quantization": "Q4_K_M" },
"model": { "context_length": 4096, "block_count": 32, "embedding_length": 4096, "head_count": 32, "head_count_kv": 8, "rope_dimension_count": 128, "rope_freq_base": 500000.0, "layer_norm_rms_epsilon": 1e-5, "vocab_size": 32000 },
"tokenizer": { "model": "llama", "pre": "...", "eos_token_id": 2, "bos_token_id": 1, "padding_token_id": null },
"tensors": { "count": 291, "data_offset": 131072 },
"raw": [{ "key": "general.architecture", "value": "llama" }, ...]
}
POST /api/admin/models/{id}/shards/{index}/download
Trigger a P2P download of a specific shard that is not yet held locally. The daemon first checks for P2P peers that hold the shard (picking the best peer by LAN-proximity, latency, and trust), then falls back to returning HuggingFace source info if no peers are available. Bearer auth required.
Responses:
{ "status": "already_local", ... }— shard is already on disk{ "status": "downloading", "source": "p2p", "peer": "...", ... }— P2P download started{ "status": "use_hf", "source": "huggingface", "repo_id": "...", "filename": "...", ... }— no P2P peers, usehf/download-shardsinstead- 400 if no peers and no HuggingFace source known
POST /api/admin/models/{id}/shards/{index}/unload
Unload a single shard from memory (VRAM/RAM) without deleting the file from disk. Narrows the model's shard window to exclude this shard and restarts the worker subprocess. If this is the last loaded shard, the model is fully unloaded. Bearer auth required.
Response:
{ "status": "unloaded", "model_id": "...", "shard_index": 0, "remaining_loaded": [1, 2] }
POST /api/admin/models/{id}/shards/{index}/load
Load a shard that is on disk into memory. The shard must already be present locally (use /download first if not). Expands the model's shard window to include the shard and restarts the worker subprocess. Bearer auth required.
Response:
{ "status": "loaded", "model_id": "...", "shard_index": 0, "loaded_shards": [0, 1, 2] }
POST /api/admin/models/{id}/unload
Unload an entire model from memory (VRAM/RAM) without deleting any files from disk. Evicts all split-model entries, kills the worker subprocess, clears GGUF metadata cache, and clears the loaded-model record. Bearer auth required.
Response:
{ "status": "unloaded", "model_id": "...", "model_name": "...", "segments_removed": 2, "estimated_freed_mb": 4096 }
GET /api/admin/shard-storage
Per-model storage breakdown, disk and VRAM usage.
GET /api/admin/prune-history
Recent auto-prune events.
GET/PUT /api/admin/schedule
Resource schedule management.
HuggingFace Integration
GET /api/admin/hf/search?query=...
Search HuggingFace for GGUF models. Returns results grouped by repository with quantization variants, recommended variant, and VRAM fitness indicator.
Response format:
[{
"repo_id": "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
"downloads": 50000,
"likes": 120,
"variants": [
{ "filename": "...Q4_K_M.gguf", "size_bytes": 668000000, "quant": "Q4_K_M" },
{ "filename": "...Q8_0.gguf", "size_bytes": 1100000000, "quant": "Q8_0" }
],
"recommended_variant": "Q4_K_M",
"fits_vram": true
}]
GET /api/admin/hf/probe?repo_id=...&filename=...
Probe a remote GGUF file (size, shard layout).
POST /api/admin/hf/download-shards
Download specific shard indices from HuggingFace. Bearer auth required.
Supports peer_fair_share: true for smart distribution — the backend computes a deterministic fair share of shards using BLAKE3(node_id || model_id), and peers with auto-manage enabled auto-acquire the rest.
curl -X POST http://localhost:8800/api/admin/hf/download-shards \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"repo_id": "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF", "filename": "qwen2.5-coder-7b-instruct.Q4_K_M.gguf", "peer_fair_share": true}'
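The deterministic fair-share idea can be sketched as follows. This is an illustrative reconstruction, not SwarmLLM's actual algorithm: the hash-to-offset mapping and ceil-division share size are assumptions, and blake2b stands in for BLAKE3, which is not in Python's standard library. The point is that every node derives its own share from the same inputs, with no coordination.

```python
import hashlib

# Sketch: derive a deterministic shard share from hash(node_id || model_id).
# blake2b is a stand-in for BLAKE3 (not available in the Python stdlib).
def fair_share(node_id: str, model_id: str, total_shards: int, pool_size: int):
    digest = hashlib.blake2b((node_id + model_id).encode()).digest()
    offset = int.from_bytes(digest[:8], "big") % total_shards
    per_node = -(-total_shards // pool_size)  # ceil(total / pool)
    return sorted((offset + i) % total_shards for i in range(per_node))
```

Because the offset depends only on the node and model IDs, a node recomputes the same share on every run, and peers with auto-manage enabled pick up whatever indices remain uncovered.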
GET /api/admin/hf/source/
Look up the HuggingFace source (repo + filename) for a locally-known model. First checks the in-memory source cache and the probe cache, then auto-discovers by searching HuggingFace if neither has an entry. If found via auto-discovery the result is cached to the database and hf_source.json in the model directory.
Response:
{ "model_id": "...", "repo_id": "TheBloke/TinyLlama-...-GGUF", "filename": "tinyllama-...Q4_K_M.gguf" }
GET /api/admin/downloads
List the download queue with per-shard progress, speed, and source.
POST /api/admin/downloads/{model_id}/cancel
Cancel an in-progress download.
LoRA Adapters
GET /api/admin/adapters
List all registered LoRA adapters with their metadata (id, name, base model, rank, alpha, path).
Response: { "adapters": [ { "id": "...", "name": "...", "base_model": "...", "rank": 16, "alpha": 32.0, "path": "..." } ] }
POST /api/admin/adapters
Register a LoRA adapter from a safetensors file. Bearer auth required. Path traversal is blocked. If id is omitted, a UUID is generated.
Request body:
{ "id": "my-adapter", "name": "My Adapter", "base_model": "tinyllama-...", "rank": 16, "alpha": 32.0, "path": "adapters/my-adapter.safetensors" }
path may be absolute or relative to <data_dir>/adapters/.
Response: { "status": "ok", "adapter": { ... } }
DELETE /api/admin/adapters/
Unregister a LoRA adapter. Does not delete the file from disk. Bearer auth required. Returns 400 if the id is not found.
Response: { "status": "ok", "message": "Adapter 'my-adapter' removed" }
Cloud Providers
GET /api/admin/providers
List configured cloud providers (name + configured flag, no keys exposed).
PUT /api/admin/providers
Update cloud provider API keys. Bearer auth required. Keys are encrypted at rest.
GET /api/admin/provider-models
List available models from all configured cloud providers. Results are cached for 60 seconds; stale results are returned immediately and refreshed in the background. Includes models from OpenAI, Anthropic (static list), DeepSeek, Mistral, Groq, NVIDIA NIM, Cerebras, SambaNova, Fireworks, Together AI, DeepInfra, and Moonshot/Kimi.
Response: { "models": [ { "id": "gpt-4o", "name": "GPT-4o", "provider": "openai" } ] }
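The 60-second cache behaviour described above (return stale results immediately, refresh in the background) is a stale-while-revalidate pattern. A minimal sketch of the cache half, with the background refresh left to the caller; this is illustrative, not SwarmLLM's implementation:

```python
import time

# Stale-while-revalidate cache sketch: get() never blocks on a refresh,
# it just reports whether the cached value has gone stale.
class SwrCache:
    def __init__(self, ttl=60.0):
        self.ttl = ttl
        self.value = None
        self.fetched_at = None

    def get(self):
        """Returns (value, needs_refresh)."""
        if self.value is None:
            return None, True
        stale = (time.monotonic() - self.fetched_at) > self.ttl
        return self.value, stale

    def put(self, value):
        self.value = value
        self.fetched_at = time.monotonic()
```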
GET /api/admin/provider-health
Probe each configured provider by sending a tiny max_tokens=1 inference request (using a suitable test model per provider). All probes run in parallel with a connect timeout.
Response:
{ "providers": [ { "provider": "openai", "status": "up", "latency_ms": 320, "detail": "" } ] }
Status values: up, rate_limited, overloaded, timeout, unreachable, error_<code>.
POST /api/admin/provider-model-status
Probe availability and latency for a list of specific cloud model IDs (up to 20 per request). Sends a max_tokens=1 request to each model's provider endpoint. Anthropic models are skipped (no cloud proxy probing). Bearer auth not required.
Request body: { "models": ["gpt-4o", "claude-sonnet-4-6", "deepseek-chat"] }
Response:
{ "models": [ { "model": "gpt-4o", "status": "up", "latency_ms": 210 } ] }
Status values: up, rate_limited, not_found, unavailable, timeout, error.
Claude Subscription (feature-gated)
Requires building with --features claude-subscription. When the feature is not enabled, these endpoints return {"error": "claude-subscription feature not enabled"}.
GET /api/admin/claude-subscription/status
Detect whether the claude CLI is installed and authenticated on this machine. Reads version from claude --version and subscription info from ~/.claude/.credentials.json (read-only).
Response:
{
"cli_installed": true,
"cli_version": "2.1.92 (Claude Code)",
"authenticated": true,
"subscription_type": "max",
"rate_limit_tier": "default_claude_max_5x"
}
PUT /api/admin/providers (claude_subscription_enabled field)
Enable or disable the Claude subscription provider. Pass claude_subscription_enabled alongside other provider key updates.
{ "claude_subscription_enabled": true }
When enabled, claude-* model requests are routed through the local CLI subprocess instead of the Anthropic API key. When the toggle is disabled, requests fall back to the Anthropic API key (if configured).
Updates
GET /api/admin/version
Current binary version info.
POST /api/admin/update/check
Check for available updates. Returns version info and changelog if update available.
POST /api/admin/update/apply
Download and apply an update. Bearer auth required.
Discovery
GET /api/admin/network-code
Get an encrypted shareable invite code and network phase. The code embeds the node's TCP listening address encrypted with ChaCha20Poly1305 — the IP is not visible in the code.
POST /api/admin/join-network
Join the network via encrypted invite code (swarm://...) or raw multiaddr. Immediately dials the peer and saves the address to the peer cache.
Responses API listing
GET /api/admin/responses
List stored Responses-API records (backs the dashboard's Responses tab).
Optional query params: ?limit=N (cap on returned records, default 100,
max 500) and ?status=... (filter by completed / in_progress /
cancelled / failed / queued). See Responses API
for the user-facing surface.
Authentication
GET /api/admin/api-key
Retrieve the API key. Bearer auth required.
WebSocket
GET /api/admin/ws
WebSocket for live updates. Pushes the following event types:
| Event | Trigger | Data |
|---|---|---|
activity_event | Any subsystem event | kind, model_id, message, timestamp, toast_level |
stats_update | Every 2s | Peer count, credits, acquisitions, shard registry, swarm_capacity (R110), wishlist (R111) |
peer_list | Peer connect/disconnect | Full peer snapshot |
models_changed | Shard download/load/prune | (none — signals dashboard to refresh) |
update_available | New version detected | Version info, changelog |
Claude Subscription Provider
Use your existing Claude Pro, Max, Team, or Enterprise subscription to access Claude models through SwarmLLM — no API key or per-token charges needed.
Feature-gated: Build with --features claude-subscription to enable. This feature is isolated behind a compile-time flag for easy removal.
How It Works
When enabled, SwarmLLM spawns the claude CLI as a subprocess for each Claude model request:
Client Request (OpenAI or Anthropic format)
→ SwarmLLM API (openai.rs / anthropic/mod.rs)
→ Provider resolution: model starts with "claude-"
→ Claude subscription enabled? → Spawn subprocess
→ Else: use Anthropic API key (existing behavior)
→ claude -p --output-format stream-json --model <model> "<prompt>"
→ Parse NDJSON → Translate to API format → Return response
Both the OpenAI (/v1/chat/completions) and Anthropic (/v1/messages) endpoints are supported, with streaming and non-streaming modes.
Setup
1. Install the Claude CLI
npm install -g @anthropic-ai/claude-code
2. Log in with your subscription
claude login
This opens a browser window. Sign in with your Claude Pro/Max/Team/Enterprise account.
3. Build SwarmLLM with the feature
cargo build --no-default-features --features dev,claude-subscription
4. Enable via the dashboard
Open Settings → Cloud Providers → Claude Subscription, click "Check Status" to verify your CLI is detected, then enable the toggle.
Or via API:
curl -X PUT http://localhost:8800/api/admin/providers \
-H "Authorization: Bearer <your-api-key>" \
-H "Content-Type: application/json" \
-d '{"claude_subscription_enabled": true}'
5. Send requests
# OpenAI format
curl http://localhost:8800/v1/chat/completions \
-H "Authorization: Bearer <your-api-key>" \
-H "Content-Type: application/json" \
-d '{
"model": "claude-sonnet-4-6",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'
# Anthropic format
curl http://localhost:8800/v1/messages \
-H "x-api-key: <your-api-key>" \
-H "Content-Type: application/json" \
-d '{
"model": "claude-sonnet-4-6",
"max_tokens": 100,
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'
Multi-Turn Conversations
Multi-turn conversations work by serializing the full message history into the prompt on each request. The format uses XML tags that Claude understands natively:
- System messages → <system>...</system>
- Assistant messages → <previous_response>...</previous_response>
- User messages → bare text
This is the same stateless approach used by OpenAI-compatible APIs — the client sends the full conversation every time, and the server doesn't maintain session state.
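The tag mapping above can be sketched as a serializer. The join separator and exact prompt assembly are assumptions; only the role-to-tag mapping comes from the list in this section.

```python
# Sketch of the multi-turn prompt serialization described above.
def serialize(messages):
    parts = []
    for m in messages:
        if m["role"] == "system":
            parts.append("<system>" + m["content"] + "</system>")
        elif m["role"] == "assistant":
            parts.append("<previous_response>" + m["content"] + "</previous_response>")
        else:  # user messages pass through as bare text
            parts.append(m["content"])
    return "\n\n".join(parts)
```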
Configuration
All configuration is in the providers.claude_subscription section, manageable via the admin API or dashboard:
| Field | Default | Description |
|---|---|---|
enabled | false | Route Claude requests through the CLI |
claude_binary | "claude" | Path to the claude binary |
default_model | (from request) | Override model for all requests |
max_concurrent | 3 | Max concurrent subprocess invocations |
timeout_secs | 300 | Timeout per request (seconds) |
working_dir | (system temp) | Working directory for the subprocess |
Working Directory
By default, the subprocess runs in the system temp directory to avoid loading project-specific CLAUDE.md files, hooks, and MCP servers. Set working_dir to a project path if you want Claude to have project context for its responses.
Routing Priority
When a claude-* model is requested:
- Claude subscription (if enabled and CLI detected) — subprocess path, uses subscription
- Anthropic API key (if configured) — direct API proxy, pay-per-token
- Error — no provider available
The subscription provider takes priority over the API key. Disable the subscription toggle to fall back to API key billing.
Rate Limits
Subscription rate limits are per rolling 5-hour window (not per-second RPM like API keys). The concurrency limiter (default 3) prevents spawning too many concurrent processes. Community reports suggest ~3-5 parallel Opus sessions before degradation.
Rate limit info is returned in the NDJSON output and logged. The GET /api/admin/claude-subscription/status endpoint shows the current rate limit tier.
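The concurrency limiter is a standard semaphore pattern: at most max_concurrent subprocess invocations run at once, and further requests queue. A minimal sketch with the subprocess replaced by a sleep (illustrative only):

```python
import asyncio

# Semaphore-based concurrency cap, as described above (default 3).
async def run_request(sem, active, peak):
    async with sem:
        active[0] += 1
        peak[0] = max(peak[0], active[0])
        await asyncio.sleep(0.01)  # stands in for the claude CLI subprocess
        active[0] -= 1

async def main(n_requests=10, max_concurrent=3):
    sem = asyncio.Semaphore(max_concurrent)
    active, peak = [0], [0]
    await asyncio.gather(*(run_request(sem, active, peak)
                           for _ in range(n_requests)))
    return peak[0]  # highest number of simultaneous "subprocesses"
```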
Removal
If this feature needs to be removed:
git rm src/api/claude_sub.rs
# Remove "claude-subscription = []" from Cargo.toml
grep -rn 'claude.subscription\|claude_sub' src/ frontend/
# Remove the ~6 #[cfg] blocks found by grep
Single commit, clean removal. No deep dependencies on the rest of the codebase.
Identity & Device Pool API
Identity
GET /api/identity/nickname
Get the current node's nickname.
PUT /api/identity/nickname
Set a nickname. Body: {"nickname": "my-node"}
DELETE /api/identity/nickname
Clear the nickname.
GET /api/identity/leaderboard
Network-wide credit leaderboard.
GET /api/identity/peers
Peer identity directory (nicknames, regions, tiers).
Device Pools ("My Devices")
Link multiple devices owned by the same user. Credits earned by all linked devices are combined into one balance on the main (owner) device.
Terminology: "Linked Devices" in the UI. This is different from connecting to the SwarmLLM network — linking devices groups your own hardware, while the network connects you with other people.
Quick Start (CLI)
# On your main device:
swarmllm pool create --name "My Devices"
swarmllm pool invite-code
# → A3F7K2M9
# On each other device:
swarmllm pool join A3F7K2M9
# Check status:
swarmllm pool status
Invite Code System
Instead of exchanging raw 64-character node IDs, device pools use 8-character invite codes (e.g., A3F7K2M9):
- Owner generates a code → POST /api/pool/generate-code
- Code shared verbally, via QR, or copy-paste
- Member enters code → POST /api/pool/join → broadcasts join request over gossip
- Owner's node auto-validates code and creates invitation
- Member auto-accepts → pool established
Security: Codes use a 32-character alphabet (no 0/O/1/I), are one-time use, expire in 24h, and the code itself is never transmitted over the network — only its BLAKE3 hash.
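The code format above can be sketched directly: a 32-character alphabet with the ambiguous glyphs 0/O/1/I removed, 8 random characters, and only a hash transmitted. This is illustrative; blake2b stands in for BLAKE3, which is not in Python's standard library.

```python
import hashlib
import secrets

# 32-character alphabet: digits 2-9 plus letters, minus the ambiguous 0/O/1/I.
ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZ"

def generate_code(length=8):
    # secrets gives cryptographic randomness, unlike random.choice
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

def code_hash(code):
    # only this hash crosses the network, never the code itself
    # (blake2b stands in for BLAKE3 here)
    return hashlib.blake2b(code.encode()).hexdigest()
```

With 8 positions over 32 symbols the space is 32^8 ≈ 1.1 × 10^12 codes, matching the figure quoted in Pool Security below; one-time use and the 24 h expiry do the rest.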
API Endpoints
GET /api/pool/state
Current pool membership state. Returns in_pool, member list with device names, online status, per-device stats, credit split percentage.
POST /api/pool/create
Create a new device pool. Body: {"name": "My Devices"}
POST /api/pool/generate-code
Generate an invite code (owner only). Returns: {"code": "A3F7K2M9"}. Max 5 active codes.
POST /api/pool/join
Join a pool using an invite code. Body: {"code": "A3F7K2M9"}
POST /api/pool/invite
Invite a specific node by ID (advanced). Body: {"node_id": "abc123..."}
POST /api/pool/accept
Accept a pool invitation. Body: {"invitation_id": "..."}
POST /api/pool/remove
Remove a member (owner only). Body: {"node_id": "..."}
POST /api/pool/leave
Leave the current pool.
POST /api/pool/device-name
Set this device's nickname. Body: {"name": "Gaming PC"}
PUT /api/pool/credit-split
Set credit split percentage (owner only). Body: {"pct": 20} (0-50)
PUT /api/pool/contribution
Set per-member contribution level override. Body: {"node_id": "...", "level": 75} (integer 0–100)
GET /api/pool/invitations
List pending invitations for this node.
GET /api/pool/leaderboard
Pool member contribution rankings.
GET/PUT /api/admin/pools/:id/rates
Per-pool credit rate overrides.
Private Mode
Restrict inference to your device pool for maximum privacy. Your prompts never leave your devices.
GET /api/pool/private-mode
Current state + coverage summary. Returns enabled, allow_lan, offline_mode, and coverage object.
PUT /api/pool/private-mode
Toggle private mode. Body: {"enabled": true} or {"enabled": true, "offline_mode": true}.
Returns coverage summary so the UI can show trade-offs immediately.
GET /api/pool/coverage
Per-model coverage breakdown: total_shards, pool_shards, coverage_pct, missing indices, est_download_mb. Also returns disk_budget_mb and disk_used_mb.
Shard Pinning
GET /api/pool/pins
List current shard pins.
POST /api/pool/pin
Pin a model to a specific device (owner only). Body: {"model_id": "...", "target_node_id": "hex..."}.
Optional shard_indices array for specific shards (empty = all shards).
DELETE /api/pool/pin
Remove a shard pin. Same body format as POST.
Pool Features
- Device nicknames: Name each device for easy identification
- Online/offline status: Tracked via health pings, displayed with last-seen timestamps
- Per-device stats: VRAM, shards hosted, forwards served, uptime, models hosted
- Combined VRAM: Aggregate GPU memory across all linked devices
- Credit split: Owner configures what percentage (0-50%) members keep vs forward
- Private Mode: Restrict inference to pool devices only. Toggle via UI or API
- Shard Pinning: Assign specific models to specific devices. Auto-manage respects pins
- Offline Mode: Air-gapped LAN operation with mDNS-only discovery
- Coverage Dashboard: Per-model availability bars showing pool shard coverage
- Max 10 devices per pool (configurable), 10 pool operations per hour rate limit
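The credit-split arithmetic is simple enough to state. A sketch under the assumption that the configured percentage is the share the member keeps (integer credits, remainder forwarded to the owner); the function name is hypothetical:

```python
# Sketch of the 0-50% credit split described in the features list above.
def split_credits(earned: int, member_keep_pct: int):
    """Returns (kept_by_member, forwarded_to_owner)."""
    if not 0 <= member_keep_pct <= 50:
        raise ValueError("split must be between 0 and 50")
    kept = earned * member_keep_pct // 100
    return kept, earned - kept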
Pool Security
- Invite codes: 32^8 ≈ 1.1 trillion combos, one-time use, 24h expiry
- Join requests signed with Ed25519 (transport-layer sender authentication)
- Credit forwarding uses dual-signed PoolCreditForward (member + owner)
- Pool state gossip verifies each member's acceptance signature
- Blinded invitation broadcast (SEC-M18): network observers can't see who's invited
Prometheus Metrics
SwarmLLM exposes a Prometheus-compatible metrics endpoint at GET /metrics. No authentication required (standard convention for metrics endpoints).
Available Metrics
Core Metrics
| Metric | Type | Description |
|---|---|---|
swarmllm_peers_connected | gauge | Number of connected peers |
swarmllm_inference_requests_total | counter | Total inference requests processed |
swarmllm_credits_balance | gauge | Current credit balance |
swarmllm_shards_hosted | gauge | Number of locally hosted shards |
swarmllm_inference_latency_seconds | histogram | Inference request latency |
Channel Metrics
Internal channel health metrics for monitoring backpressure:
| Metric | Type | Description |
|---|---|---|
swarmllm_channel_capacity{channel="..."} | gauge | Channel buffer capacity |
swarmllm_channel_sent_total{channel="..."} | counter | Messages sent through channel |
swarmllm_channel_dropped_total{channel="..."} | counter | Messages dropped due to backpressure |
Histogram Buckets
The latency histogram uses these bucket boundaries (in seconds):
0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, +Inf
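Prometheus histogram buckets are cumulative: an observation increments every bucket whose upper bound (le) is at or above the observed value. A sketch of how a latency sample lands in the buckets listed above:

```python
# Cumulative bucket accounting for the latency histogram above.
BUCKETS = [0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, float("inf")]

def observe(counts, latency_s):
    # increment every bucket with le >= latency (Prometheus semantics)
    for i, le in enumerate(BUCKETS):
        if latency_s <= le:
            counts[i] += 1

counts = [0] * len(BUCKETS)
observe(counts, 0.3)  # a 300 ms request lands in the 0.5 bucket and above
```

This cumulative layout is what lets histogram_quantile() interpolate percentiles from bucket counts in the queries below.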
Scraping Configuration
Add to your prometheus.yml:
scrape_configs:
- job_name: "swarmllm"
static_configs:
- targets: ["localhost:8800"]
Example Queries
# Request rate (requests per second over 5 minutes)
rate(swarmllm_inference_requests_total[5m])
# P50 latency
histogram_quantile(0.50, rate(swarmllm_inference_latency_seconds_bucket[5m]))
# P99 latency
histogram_quantile(0.99, rate(swarmllm_inference_latency_seconds_bucket[5m]))
# Average latency
rate(swarmllm_inference_latency_seconds_sum[5m]) / rate(swarmllm_inference_latency_seconds_count[5m])
Health Check
GET /health/ready
Readiness probe returning subsystem status. Returns 200 when ready, 503 otherwise. No auth required.
{
"ready": true,
"subsystems": {
"network": true,
"inference_router": true,
"api_server": true,
...
}
}
Deployment Guide
Single Node
The simplest deployment — just run the binary:
./swarmllm run
This starts the daemon on port 8800 with default settings.
Production Configuration
For production use, create a config file:
[node]
listen_port = 8800
contribution = "maximum"
[resources]
max_gpu_vram_mb = 0 # Auto-detect
max_disk_mb = 100000 # 100 GB
[inference]
gpu_layers = 99 # Offload all layers to GPU
max_concurrent_requests = 20
max_batch_size = 4
session_timeout_seconds = 600
[auto_manage]
enabled = true
max_storage_mb = 50000
max_concurrent_downloads = 5
[logging]
level = "info"
format = "json" # Structured logs for production
file = "/var/log/swarmllm.log"
[ui]
open_browser_on_start = false
[identity]
region = "US"
Systemd Service
Create /etc/systemd/system/swarmllm.service:
[Unit]
Description=SwarmLLM P2P Inference Node
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=swarmllm
ExecStart=/usr/local/bin/swarmllm run --config /etc/swarmllm/config.toml
Restart=on-failure
RestartSec=10
LimitNOFILE=65536
# Security hardening
NoNewPrivileges=yes
ProtectSystem=strict
ReadWritePaths=/var/lib/swarmllm /var/log
[Install]
WantedBy=multi-user.target
sudo systemctl enable --now swarmllm
Docker
Quick Start (Recommended)
# Download compose file and env template
curl -LO https://raw.githubusercontent.com/enapt/SwarmLLM/main/docker-compose.yml
curl -LO https://raw.githubusercontent.com/enapt/SwarmLLM/main/.env.example
cp .env.example .env
# CPU
docker compose up -d
# GPU (requires NVIDIA Container Toolkit)
docker compose --profile gpu up -d
Pre-built Images
| Image | Description |
|---|---|
ghcr.io/enapt/swarmllm:latest | CPU-only (Debian bookworm-slim) |
ghcr.io/enapt/swarmllm:latest-cuda | NVIDIA GPU (CUDA 12.4 runtime) |
Versioned tags follow semver: 0.1.0, 0.1.0-cuda, 0.1, 0.1-cuda.
Manual Docker Run
# CPU
docker run -d \
--name swarmllm \
--restart unless-stopped \
-p 8800:8800/tcp \
-p 8810:8810/tcp \
-p 8800:8800/udp \
-v swarmllm-data:/data \
-v /path/to/models:/data/models \
--env-file .env \
ghcr.io/enapt/swarmllm:latest
# GPU
docker run -d \
--gpus all \
--name swarmllm \
--restart unless-stopped \
-p 8800:8800/tcp \
-p 8810:8810/tcp \
-p 8800:8800/udp \
-v swarmllm-data:/data \
-v /path/to/models:/data/models \
--env-file .env \
ghcr.io/enapt/swarmllm:latest-cuda
Build from Source
# CPU
docker build -t swarmllm .
# CUDA
docker build -f Dockerfile.cuda -t swarmllm:cuda .
Multi-Node Dev Cluster
For development and testing, a 3-node compose file is available:
docker compose -f docker-compose.dev.yml up
Nodes are at localhost:8800, localhost:8801, localhost:8802. Add GPU support:
docker compose -f docker-compose.dev.yml -f docker-compose.cuda.dev.yml up
Multi-Node Cluster
Same LAN
Nodes on the same network discover each other automatically via mDNS. Just start multiple instances on different ports:
# Node 1
./swarmllm run -p 8800
# Node 2
./swarmllm run -p 8801 -d ~/.local/share/swarmllm-node2
Across Networks
Use bootstrap peers or invite codes:
# Node 1 (get its address from the dashboard or logs)
./swarmllm run
# Node 2 (connect to Node 1)
./swarmllm run --bootstrap "/ip4/NODE1_IP/udp/8800/quic-v1/p2p/PEER_ID"
Split Inference Cluster
For a dedicated split-inference setup across multiple machines:
# Machine A: shards 0-3
./swarmllm run --shards "0-3" --bootstrap "/ip4/MACHINE_B/udp/8800/quic-v1/p2p/..."
# Machine B: shards 4-7
./swarmllm run --shards "4-7" --bootstrap "/ip4/MACHINE_A/udp/8800/quic-v1/p2p/..."
Firewall
Open TCP port 8800 (HTTP API), TCP port 8810 (P2P), and optionally UDP port 8800 (QUIC):
# Linux (ufw)
sudo ufw allow 8800/tcp # HTTP API
sudo ufw allow 8810/tcp # P2P (Noise+Yamux, primary transport)
sudo ufw allow 8800/udp # P2P (QUIC, optional)
# Linux (iptables)
sudo iptables -A INPUT -p tcp --dport 8800 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 8810 -j ACCEPT
sudo iptables -A INPUT -p udp --dport 8800 -j ACCEPT
Reverse Proxy (Optional)
If you want to put the HTTP API behind nginx:
server {
listen 443 ssl;
server_name swarmllm.example.com;
location / {
proxy_pass http://127.0.0.1:8800;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
}
Note: The reverse proxy only handles HTTP traffic. P2P (QUIC/UDP) must still be accessible directly on port 8800.
Cloud Provider API Keys
To use cloud model fallback, configure provider API keys via:
- Dashboard: Settings page in the web UI
- Environment file: Place a .env file in the data directory with standard variable names:
# ~/.local/share/swarmllm/.env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
DEEPSEEK_API_KEY=sk-...
MISTRAL_API_KEY=...
GROQ_API_KEY=gsk_...
NVIDIA_API_KEY=nvapi-...
CEREBRAS_API_KEY=...
SAMBANOVA_API_KEY=...
FIREWORKS_API_KEY=...
TOGETHER_API_KEY=...
DEEPINFRA_API_KEY=...
MOONSHOT_API_KEY=...
- Shell environment: Export the same variables before starting the daemon
Performance & Inference Speedups
SwarmLLM's distributed inference path ships with a stack of optimizations that are on by default — you get them without touching a config. This chapter names each one, explains what it does, and shows the measured win so you can tell which levers matter for your workload.
A few are flag-gated because the win is workload-dependent or the path is still being hardened; those are documented at the bottom so you can turn them on intentionally.
The full design notes live in docs/plans/archive/distributed_inference_speedup.md, with benchmark recipes in docs/plans/benchmarks/.
The default-on stack
Continuous batching
Concurrent /v1/chat/completions requests for the same model share one
forward pass per decode tick instead of running serially. GPU builds use a
fused forward_batch kernel; CPU workers fall through to sequential with
no regression.
- Measured: 1.34–1.55× GPU throughput at batch 2–8 on RTX 3070 + TinyLlama Q4
- Config: inference.continuous_batching = true (default)
Remote-generate fast path
For single-segment distributed inference (the common case: one remote node owns the whole model, requester does embedding + sampling), skip the per-token coordinator round-trips and run the decode loop end-to-end on the remote worker. Tokens stream back as they're sampled.
- Measured: 1.93× decode speedup
- Config: default-on — no flag, triggered automatically on single-segment pipelines
Cross-request prefix cache
Each worker keeps an LRU cache of prefill KV snapshots keyed by the prompt's token prefix. A re-submission with the same system prompt (different user turn) skips prefill for the shared prefix and only forwards the suffix.
- Measured: 29.4× wall-clock speedup on re-submission of the same 513-token prompt (single node, TinyLlama)
- Config: inference.prefix_cache_enabled = true (default), inference.prefix_cache_block_tokens = 64 (default — block granularity), inference.prefix_cache_max_entries = 16 (default — per model)
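As a mental model, the block-aligned lookup works like this sketch (purely illustrative: the class and field names are assumptions, not SwarmLLM's internals):

```python
BLOCK_TOKENS = 64  # mirrors inference.prefix_cache_block_tokens

class PrefixCache:
    """LRU of prefill KV snapshots keyed by block-aligned token prefixes."""

    def __init__(self, max_entries=16):  # mirrors prefix_cache_max_entries
        self.max_entries = max_entries
        self.entries = {}  # prefix tuple -> KV snapshot (opaque here)
        self.order = []    # LRU order, oldest first

    def insert(self, tokens, kv_snapshot):
        n = (len(tokens) // BLOCK_TOKENS) * BLOCK_TOKENS  # whole blocks only
        if n == 0:
            return
        key = tuple(tokens[:n])
        if key in self.entries:
            self.order.remove(key)
        elif len(self.entries) >= self.max_entries:
            del self.entries[self.order.pop(0)]  # evict least recently used
        self.entries[key] = kv_snapshot
        self.order.append(key)

    def longest_prefix(self, tokens):
        # Walk down block boundaries until a cached prefix matches.
        n = (len(tokens) // BLOCK_TOKENS) * BLOCK_TOKENS
        while n > 0:
            key = tuple(tokens[:n])
            if key in self.entries:
                self.order.remove(key)
                self.order.append(key)  # refresh LRU position
                return n, self.entries[key]
            n -= BLOCK_TOKENS
        return 0, None
```

On a re-submission of the 513-token prompt from the benchmark above, the first 512 tokens (eight blocks) hit the cache and only the one-token suffix needs prefill.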
Batched prefill + chunked prefill
Sarathi-style chunked prefill: a long admission advances by
prefill_chunk_tokens (default 128) per decode tick, so new requests
don't wait behind a full prior prefill. Phase 4 adds
batched_prefill_forward = true (default), which fuses concurrent
same-shape prefill chunks into one forward_batch call.
- Measured (Phases 1+2): 17–23× TTFT fairness at concurrency 2/4/8 on RTX 3070 + TinyLlama Q4 vs serial prefill
- Measured (Phase 4): 1.57× aggregate tok/s at c=4 with uniform 180/180/180 ms TTFT (vs pre-fix 52/235/447 ms spread)
- Config: inference.continuous_batching = true, inference.prefill_chunk_tokens = 128, inference.batched_prefill_forward = true (all default)
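The fairness effect is easy to see in a toy simulation (illustrative only; it models one chunk of prefill budget per decode tick, served round-robin, versus serial prefill):

```python
def serial_prefill_ttft(prompt_lens, chunk=128):
    # Baseline: each request's whole prefill runs before the next starts.
    ttft, elapsed = [], 0
    for n in prompt_lens:
        elapsed += -(-n // chunk)  # ceil-div: ticks this prefill occupies
        ttft.append(elapsed)
    return ttft

def chunked_prefill_ttft(prompt_lens, chunk=128):
    # Sarathi-style: one chunk of budget per decode tick, round-robin
    # across pending requests, so short admits aren't stuck behind long ones.
    remaining = list(prompt_lens)
    ttft = [None] * len(prompt_lens)
    tick, i = 0, 0
    while any(r > 0 for r in remaining):
        tick += 1
        while remaining[i % len(remaining)] == 0:
            i += 1  # skip requests whose prefill already finished
        j = i % len(remaining)
        i += 1
        remaining[j] = max(0, remaining[j] - chunk)
        if remaining[j] == 0:
            ttft[j] = tick
    return ttft
```

A 1280-token prompt arriving alongside a 128-token prompt: serially the short request waits 11 ticks for first token; chunked, it gets it on tick 2, while the long request pays only one extra tick.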
Cross-node prefix-KV sharing
When node B receives a prompt whose prefix was already prefilled by peer A, B fetches A's KV snapshot over the wire instead of re-prefilling locally. The pipeline is:
A prefills → inserts prefix-cache block → gossips PrefixCacheAnnounce
B receives prompt → local cache miss → probe daemon → walk index
B sends SendPrefixKvFetch to A → A's worker exports snapshot
B verifies BLAKE3 + NaN/Inf → hydrates KV → prefill suffix only
- Measured (TinyLlama, GPU-GPU): fetched path is ~100 ms slower than local prefill — the 28 MB f32 snapshot takes ~260 ms to ship while the local prefill it replaces is only ~460 ms. TinyLlama is too small to demonstrate the win on localhost + fast GPU.
- Measured (Qwen2.5-Coder-7B, CPU-CPU): 12.9× TTFT speedup on iter 1 — control full-prefill = 151.7 s, fetched path = 11.8 s. The 73 MB f32 snapshot transfers in ~1 s while 640-token Qwen-7B CPU prefill runs ~150 s.
- Config: inference.cross_node_prefix_trust_min = 0.5 (default — gates peers by trust score; set to 2.0 to disable the fetch path entirely).
The fetch path uses three chained timeouts (worker probe 3000 ms, daemon network 2500 ms, serving IPC 2000 ms) sized for 7B-class f32 snapshots. Missing the window degrades to a clean miss — no worse than not having the feature. See the two-daemon loopback bench recipe for reproduction details.
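Receiver-side verification (the last pipeline step) amounts to an integrity hash plus a float sanity sweep. A minimal sketch, assuming a raw little-endian f32 payload and using stdlib blake2b as a stand-in for BLAKE3 (which is not in the Python standard library):

```python
import hashlib
import math
import struct

def verify_kv_snapshot(payload: bytes, expected_digest: bytes) -> bool:
    """Accept a fetched KV snapshot only if it hashes and sanity-checks."""
    # 1. Integrity: the digest must match before we trust the contents.
    #    (SwarmLLM uses BLAKE3; blake2b stands in for this stdlib sketch.)
    if hashlib.blake2b(payload).digest() != expected_digest:
        return False
    # 2. Sanity: reject any NaN/Inf, since a poisoned KV cache would
    #    silently corrupt every subsequent decode step.
    floats = struct.unpack(f"<{len(payload) // 4}f", payload)
    return all(math.isfinite(x) for x in floats)
```

On failure the caller treats it as a clean miss and falls back to local prefill, matching the degrade-to-miss behavior described above.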
Parallax scheduler
Pipeline assignment uses shortest-path dynamic programming over observed
per-layer latencies (EMA over recent forwards) rather than a greedy
pick-the-closest-peer heuristic. Cross-gossip of top-32 observed
latencies via NodeCapability.observed_latencies lets every node keep
a current view of the network's compute profile. A soft acquire/prune
bias in AutoShardManager driven by a per-shard stability counter
(≥3 consistent ticks before it acts) drifts shards toward where they're
actually used without violating existing hard constraints.
- Measured: 10 routing + 7 allocator + 2 scheduler integration tests passing; real-world improvements depend on network heterogeneity. The biggest impact is in asymmetric setups where a cheap peer's low observed latency should beat a high-VRAM peer's big shard slot.
- Config: default-on. Multi-pipeline concurrency is deferred.
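The segment-assignment DP can be illustrated in miniature (names and the cost model are assumptions for the sketch; the real scheduler works over EMA-smoothed observed latencies and respects hard placement constraints):

```python
import math

def plan_pipeline(num_layers, segments):
    """Cover layers [0, num_layers) with peer segments at minimum cost.

    segments: (peer, first_layer, last_layer, per_layer_ms, hop_ms) tuples.
    """
    dp = [math.inf] * (num_layers + 1)  # dp[l]: best cost for layers < l
    choice = [None] * (num_layers + 1)
    dp[0] = 0.0
    for l in range(num_layers):
        if dp[l] == math.inf:
            continue
        for peer, a, b, per_layer, hop in segments:
            if a == l:  # a segment must start where the pipeline left off
                cost = dp[l] + hop + per_layer * (b - a + 1)
                if cost < dp[b + 1]:
                    dp[b + 1] = cost
                    choice[b + 1] = (peer, a, b)
    if choice[num_layers] is None:
        return math.inf, []  # no full cover exists
    path, l = [], num_layers
    while l > 0:  # walk the chosen segments back from the last layer
        peer, a, _ = choice[l]
        path.append(peer)
        l = a
    return dp[num_layers], list(reversed(path))
```

For an 8-layer model, one big peer offering all layers at 5 ms/layer behind a 20 ms hop costs 60 ms, while two cheap peers at 3 ms/layer behind 5 ms hops cost 34 ms combined: the asymmetric case where a greedy biggest-shard pick loses.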
Flag-gated features
Turn these on when you've measured that they match your workload.
Distributed speculative decoding (speculative_distributed)
Draft model proposes γ tokens locally; target verifies all γ in one remote forward pass.
- Status: End-to-end verified. 40–52% accept rate in a llama-cpp-draft / candle-target pairing (cross-backend numerical mismatch caps accept rate).
- Config: inference.speculative_distributed = true, inference.draft_model_path = "path/to/draft.gguf", inference.speculative_gamma = 4 (tokens per verify round)
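The γ-token propose/verify loop can be sketched with toy stand-in models (greedy exact-match acceptance for clarity; real speculative decoding accepts probabilistically against the target distribution):

```python
def speculative_step(context, draft, target, gamma=4):
    """One propose/verify round; returns the tokens committed this round."""
    # Draft proposes gamma tokens autoregressively (cheap model, many calls).
    proposed, ctx = [], list(context)
    for _ in range(gamma):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)
    # Target checks every proposed position. In the real system all gamma
    # positions are verified in ONE remote forward pass, not a Python loop.
    accepted, ctx = [], list(context)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            accepted.append(expected)  # commit the target's correction
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted
```

Even a partial match pays off: if the draft diverges after two tokens, the round still commits three (two accepted plus the target's correction) for a single verify round-trip, which is why the 40–52% accept rate above still yields a net win.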
SWIFT self-speculative decoding (swift_self_speculative)
The target model acts as its own draft by skipping a contiguous range of layers on the proposal pass. No external draft model needed.
- Status: Landed behind flag. Structurally slower than baseline on candle CPU until flash-attn-with-mask lands (attention kernel mismatch on multi-position verify). Shelved on CPU; may help on GPU.
- Config: inference.swift_self_speculative = true, inference.swift_skip_ratio = 0.45 (fraction of layers to skip on the draft pass)
DSD — decentralized speculative decoding (decentralized_spec_decoding)
Multi-segment distributed inference with speculative decoding woven in.
A γ-token decode on the last-segment worker plus KV truncation primitives
plus a coordinator loop in pipeline/dsd.rs.
- Status: All phases landed 2026-04-18 behind flag. End-to-end multi-segment WAN benchmark pending.
- Config: inference.decentralized_spec_decoding = true
Activation compression Q8_0 (activation_compression)
Intermediate pipeline hidden-state activations are quantized from f16 to Q8_0 before going over the wire. Receivers auto-dispatch on the dtype tag.
- Status: Codec verified. ~3.76× wire compression, RMS error <0.005. End-to-end multi-segment benchmark pending.
- Config: inference.activation_compression = true
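The Q8_0 layout is simple enough to sketch end-to-end: 32-value blocks, one scale per block, int8 payload. (The real codec stores an f16 scale, giving 34 bytes per 128-byte f32 block and hence the ~3.76× figure; this sketch keeps a Python float for clarity.)

```python
BLOCK = 32  # values per Q8_0 block

def q8_0_quantize(values):
    """f32 -> per-block (scale, int8 list). Scale maps the block max to 127."""
    blocks = []
    for i in range(0, len(values), BLOCK):
        chunk = values[i:i + BLOCK]
        amax = max(abs(v) for v in chunk) or 1.0  # avoid div-by-zero
        scale = amax / 127.0
        qs = [max(-127, min(127, round(v / scale))) for v in chunk]
        blocks.append((scale, qs))
    return blocks

def q8_0_dequantize(blocks):
    """int8 payload back to floats; error is bounded by half a scale step."""
    return [q * scale for scale, qs in blocks for q in qs]
```

The receiver-side dispatch on the dtype tag then just picks this dequantize path instead of the raw-f16 one.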
Persistent pipeline stream (persistent_pipeline_stream)
Replace per-token request/response with one long-lived libp2p bidirectional stream per pipeline session.
- Status: Landed behind flag. Wire-level verified; no measured latency win because the bottleneck was elsewhere (solved by remote-generate + batched prefill).
- Config: inference.persistent_pipeline_stream = true
Debugging slow inference
Default verbosity (-v) gives an INFO-level stream. Bump to -vv to
see per-request DIAG: logs, which include the per-feature speedup
signals:
./swarmllm run -vv 2>&1 | grep "DIAG:"
Key DIAG kinds:
- DIAG: prefix-cache HIT — local prefix cache hit
- DIAG: cross-node prefix HIT — cross-node prefix-KV fetch succeeded
- DIAG: prefix-probe: fetch timed out — cross-node fetch missed the window (see Troubleshooting for timeout sizing on 7B+ models)
- DIAG: served PrefixKvFetch ... hit=true — this node served a cross-node fetch
- DIAG: BatchGenerate — batched-prefill slot table activity
- DIAG: chunk fused batch_size=N — fused prefill chunks (Phase 4)
- DIAG: Parallax — Parallax scheduler decisions
For the full DIAG taxonomy and what each line means, see
docs/DIAGNOSTICS.md.
When should I turn a speedup off?
Almost never. The default-on features degrade cleanly under edge cases — the prefix cache falls through to full prefill on a miss, cross-node fetch falls through to local prefill on a timeout, batched prefill falls back to sequential when concurrency is 1. If you suspect one is the cause of a regression:
- Prefix cache off: inference.prefix_cache_enabled = false
- Cross-node fetch off: inference.cross_node_prefix_trust_min = 2.0 (gates every peer out)
- Continuous batching off: inference.continuous_batching = false (also disables Phase 4 fusion)
- Phase 4 fusion off, keep continuous batching: inference.batched_prefill_forward = false
Please open an issue if a speedup is costing you — the benchmarks above are RTX 3070 + WSL2 + a specific set of models, so real-world workloads will surface corners the benches miss.
Benchmarking
SwarmLLM ships with a built-in bench command and a set of reproducible
recipes under docs/plans/benchmarks/. This chapter covers both.
Quick: swarmllm bench
The bench subcommand runs a real /v1/chat/completions workload against
a daemon and reports latency + throughput.
./swarmllm bench \
--max-tokens 100 \
--iterations 5 \
--concurrency 1 \
--stream \
--model-id tinyllama-1.1b-chat-v1.0.q4-k-m \
--json
Key flags:
- --max-tokens — tokens to generate per request (default 100)
- --iterations — sequential iterations per concurrency level (default 5)
- --concurrency — concurrent requests for throughput tests (default 1)
- --stream — use streaming chat completions and report TTFT (time-to-first-token) per request. TTFT is the signal that captures the batched-prefill and cross-node-fetch wins; non-streaming bench rolls prefill + decode into one total time and hides the difference.
- --prompt — custom prompt; default is a short prompt about relativity that won't stress prefix caching. Pass a longer prompt (≥500 tokens) to exercise prefix cache paths.
- --model-id — target a specific model when several are registered; otherwise uses the first one from /v1/models.
- --json — machine-readable output
The bench reads the API key from the daemon's data dir, so run it with
the same SWARMLLM_NODE_DATA_DIR or -d as the daemon.
Single-node baselines
Reference numbers on an AMD Ryzen 7 5800H + RTX 3070 Laptop 8 GB VRAM (WSL2, release build):
| Model | Params | Quant | GPU | CPU |
|---|---|---|---|---|
| TinyLlama 1.1B Chat | 1.1B | Q4_K_M | 27.2 tok/s | 4.2 tok/s |
| Gemma-2 2B IT | 2.5B | Q4_K_M | 20.6 tok/s | 3.5 tok/s |
| Phi-3.5 Mini | 3.8B | Q4_K_M | 46.4 tok/s | 1.8 tok/s |
| Qwen2.5-Coder 7B | 7.6B | Q4_K_M | 29.0 tok/s | 2.4 tok/s |
Single-node numbers are largely about your hardware. The interesting benchmarks are distributed.
Reproducing the performance benchmarks
Each performance optimization has a written benchmark recipe in
docs/plans/benchmarks/.
Most require two local daemons on loopback; a couple need three.
Batched prefill — TTFT fairness
docs/plans/benchmarks/round4.md
Measures TTFT at concurrency 2/4/8 with Phases 1+2 on vs off. The win is fairness, not aggregate throughput: Sarathi chunked prefill prevents new admits from waiting behind the full prior prefill.
Batched chunked prefill (Phase 4)
docs/plans/benchmarks/round5.md
Measures aggregate tok/s and per-request TTFT spread with
batched_prefill_forward on vs off. The on-config fuses concurrent
same-shape prefill chunks so TTFT lands tightly clustered instead of
spreading.
Cross-node prefix-KV sharing
docs/plans/benchmarks/round6.md
Two-daemon loopback TCP. Measures iter-1 TTFT with the cross-node fetch
path enabled vs gated off (via cross_node_prefix_trust_min = 2.0).
Same recipe runs against TinyLlama (fast-GPU corner case: fetch is
slightly slower than prefill) and Qwen-7B (12.9× TTFT speedup on
CPU-CPU because 7B CPU prefill is slow enough that the ~1 s fetch +
verify + hydrate buys back ~150 s of local prefill).
Sketch of the recipe:
# Node A on 8800
SWARMLLM_NODE_DATA_DIR=/tmp/swarm_a ./target/release/swarmllm run -p 8800 -v > /tmp/swarm_a.log 2>&1 &
# Node B on 8900, bootstrapped off A
A_MADDR=$(grep -oE "peer_id=12D3KooW[A-Za-z0-9]+" /tmp/swarm_a.log | \
head -1 | sed 's/peer_id=/\/ip4\/127.0.0.1\/tcp\/8810\/p2p\//')
SWARMLLM_NODE_DATA_DIR=/tmp/swarm_b ./target/release/swarmllm run \
-p 8900 -v --bootstrap "$A_MADDR" > /tmp/swarm_b.log 2>&1 &
# Copy shards into both data dirs (or download via /api/admin/hf/download-shards)
cp -r ~/.local/share/swarmllm/models/<model-id> /tmp/swarm_a/models/
cp -r ~/.local/share/swarmllm/models/<model-id> /tmp/swarm_b/models/
# Warm A with the long prompt (populates A's prefix cache, announces to B)
./swarmllm bench -p 8800 --stream --iterations 3 --max-tokens 100 \
--prompt "$(cat long-prompt.txt)" --model-id <model-id>
# Measure B TTFT — iter 1 should fire the cross-node fetch
./swarmllm bench -p 8900 --stream --iterations 3 --max-tokens 100 \
--prompt "$(cat long-prompt.txt)" --model-id <model-id> --json
Check B's log for DIAG: cross-node prefix HIT — hydrated KV matched_tokens=... bytes=...
to confirm the fetch path fired.
Caveats
- WSL2 localhost bandwidth is much higher than any real network — localhost benches are the best case for compute-bound paths and the worst case for fetch paths. WAN numbers will be different.
- TinyLlama is too small to show some speedups — cross-node prefix-KV sharing in particular needs a larger model (Phi-3.5, Qwen-7B) to flip the sign between fetch-cost and prefill-cost. See the round6 benchmark notes for the cross-over math.
- VRAM fit matters — Qwen-7B Q4 weights fit in 8 GB but batched attention kernel scratch does not. CPU-mode works but the baseline numbers above change.
- Pre-warm before measuring TTFT — iter 1 of a model includes disk read + weight load + first CUDA context init; exclude this by pre-warming with a short unrelated prompt before the real measurement.
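The fetch-vs-prefill cross-over behind the TinyLlama caveat reduces to simple arithmetic: the fetch path wins whenever snapshot transfer plus verify/hydrate costs less than the local prefill it replaces. A back-of-envelope check using the numbers quoted earlier (the fixed verify+hydrate overhead is an assumed constant, not a measured one):

```python
def fetch_wins(snapshot_mb, link_mbps, local_prefill_s, overhead_s=0.3):
    """True when shipping the KV snapshot beats re-running the prefill."""
    fetch_s = snapshot_mb * 8.0 / link_mbps + overhead_s  # transfer + fixed cost
    return fetch_s < local_prefill_s

# Qwen-7B on CPU: 73 MB snapshot vs ~150 s of local prefill -> fetch wins.
# TinyLlama on a fast GPU: 28 MB snapshot vs ~0.46 s of prefill -> fetch loses,
# even on a gigabit link, matching the round6 result.
```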
Standard pre-push gate is cargo fmt && cargo clippy --all-targets -- -D warnings && cargo test.
If you add a benchmark, add it under docs/plans/benchmarks/roundN.md
with the recipe + results + interpretation, and link it from here.
Tailscale & WAN Access
SwarmLLM works over any IP-routable network, including VPN overlays like Tailscale, WireGuard, and ZeroTier. This guide covers how to access your node remotely and connect peers across the internet.
Use Cases
- Remote access — Chat with your home GPU from your laptop at a coffee shop
- Multi-site cluster — Connect nodes at home and work into one swarm
- Team deployment — Share a private swarm across your team without exposing ports to the internet
- Cloud + local hybrid — Connect a cloud GPU instance to your local network
Quick Setup with Tailscale
1. Install Tailscale on all machines
# Linux
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up
# macOS
brew install tailscale
tailscale up
# Windows — download from https://tailscale.com/download
Each machine gets a stable 100.x.x.x IP address on the Tailscale network.
2. Start SwarmLLM normally
# On each machine — no special flags needed
./swarmllm run
SwarmLLM binds to 0.0.0.0 by default, which includes the Tailscale interface.
3. Connect peers via bootstrap
Since mDNS doesn't work across Tailscale (it's link-local only), use one of these methods:
Option A: Invite code (easiest)
On Node A, copy the invite code from the dashboard (http://localhost:8800). On Node B, paste it into the "Join Network" field. The invite code contains the node's addresses — including the Tailscale IP if it's listening on 0.0.0.0.
Option B: Bootstrap peers in config
# ~/.local/share/swarmllm/config.toml on Node B
[network]
bootstrap_peers = [
"/ip4/100.64.0.5/tcp/8810", # Node A's Tailscale IP
]
Option C: CLI flag
./swarmllm run --bootstrap /ip4/100.64.0.5/tcp/8810
4. Access the dashboard remotely
Once connected via Tailscale, open the dashboard from any machine:
http://100.64.0.5:8800
The API is also accessible at that address:
curl http://100.64.0.5:8800/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"model": "tinyllama", "messages": [{"role": "user", "content": "Hello!"}]}'
Recommended Config for WAN / Tailscale
[network]
enable_mdns = false # mDNS is LAN-only, won't work through Tailscale
enable_autonat = false # Tailscale handles NAT, disable noisy probes
enable_dcutr = false # Hole punching unnecessary on Tailscale
enable_relay = true # Keep as fallback for robustness
enable_quic = true # QUIC works well on Tailscale (low-latency UDP)
bootstrap_peers = [
"/ip4/100.64.0.5/tcp/8810", # Replace with your peer's Tailscale IP
]
For higher latency links (cross-continent), you may also want:
[inference]
tp_max_latency_ms = 50 # Relax tensor parallelism latency threshold (default: 10ms)
Binding to a Specific Interface
If you only want SwarmLLM accessible via Tailscale (not the local network):
[network]
listen_address = "100.64.0.5" # Bind only to Tailscale interface
Or bind to localhost only and use Tailscale's Funnel or port forwarding:
[network]
listen_address = "127.0.0.1"
WireGuard / ZeroTier / Other VPNs
The same approach works with any VPN overlay:
- Install the VPN on all machines
- Start SwarmLLM with default config (listen_address = "0.0.0.0")
- Use the VPN IP as a bootstrap peer address
- Disable mDNS if peers aren't on the same physical LAN
Security Notes
- API key still required — remote access to inference endpoints requires Bearer token auth, even over Tailscale
- E2E encryption is independent of VPN — SwarmLLM encrypts all P2P traffic with X25519 + ChaCha20-Poly1305 regardless of whether you use a VPN. The VPN adds a second layer of encryption at the network level
- Dashboard is not auth-protected — the admin dashboard at /admin doesn't require authentication. If exposing to untrusted networks, use Tailscale ACLs to restrict access or bind to 127.0.0.1 and use SSH tunneling
Troubleshooting
Peers don't connect:
- Verify Tailscale is running: tailscale status
- Check that port 8810 (TCP) and 8800 (UDP/QUIC) are reachable: tailscale ping 100.64.0.5
- Try with --bootstrap /ip4/<TAILSCALE_IP>/tcp/8810 explicitly
- Check logs with -vv for connection errors
Slow inference across WAN:
- Pipeline parallelism (splitting layers across nodes) works best on low-latency links (<50ms)
- Tensor parallelism requires LAN-like latency (<10ms) — increase tp_max_latency_ms or let SwarmLLM use pipeline mode instead
- Consider having each site run its own models for local inference, with the swarm as fallback
Stale peer cache after IP change:
- If your Tailscale IP changes, old cached addresses will fail. Delete the database to clear the cache:
rm ~/.local/share/swarmllm/db.redb
Monitoring with Grafana
SwarmLLM ships with a pre-built Grafana dashboard and Prometheus configuration in the monitoring/ directory.
Quick Start
cd monitoring/
docker compose up -d
This starts:
- Prometheus at http://localhost:9090 — scrapes SwarmLLM metrics
- Grafana at http://localhost:3000 — visualizes metrics (login: admin / admin)
The SwarmLLM dashboard is auto-provisioned on first start.
Dashboard Panels
The Grafana dashboard includes:
Node Overview
- Connected Peers (stat)
- Total Inference Requests (stat)
- Credit Balance (stat)
- Shards Hosted (stat)
Inference
- Request Rate (req/s over time)
- Latency Percentiles (p50, p90, p99)
- Latency Distribution (histogram)
- Average Inference Latency (gauge)
Network & Peers
- Connected Peers Over Time
Storage & Shards
- Hosted Shards Over Time
Credits
- Credit Balance Over Time
Manual Setup
If you already have Prometheus and Grafana running:
1. Configure Prometheus
Add to prometheus.yml:
scrape_configs:
- job_name: "swarmllm"
static_configs:
- targets: ["localhost:8800"]
2. Import Dashboard
- Open Grafana → Dashboards → Import
- Upload monitoring/grafana-dashboard.json
- Select your Prometheus data source
- Click Import
Multi-Node Monitoring
For monitoring multiple SwarmLLM nodes, add all targets:
scrape_configs:
- job_name: "swarmllm"
static_configs:
- targets:
- "node1:8800"
- "node2:8800"
- "node3:8800"
Or use file-based service discovery:
scrape_configs:
- job_name: "swarmllm"
file_sd_configs:
- files: ["swarmllm-targets.json"]
refresh_interval: 30s
Alerting
Example alert rules for Prometheus:
groups:
- name: swarmllm
rules:
- alert: NoPeersConnected
expr: swarmllm_peers_connected == 0
for: 5m
labels:
severity: warning
annotations:
summary: "SwarmLLM node has no connected peers"
- alert: HighInferenceLatency
expr: histogram_quantile(0.99, rate(swarmllm_inference_latency_seconds_bucket[5m])) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "p99 inference latency exceeds 10 seconds"
- alert: NegativeCreditBalance
expr: swarmllm_credits_balance < 0
for: 1h
labels:
severity: info
annotations:
summary: "Node has negative credit balance (Bronze tier)"