SwarmLLM

Run AI together — for free. A single Rust binary that turns your computer into a node in a peer-to-peer LLM inference network. Pool hardware with others to run models too large for any single machine, with no API tokens, no cloud fees, and end-to-end encryption between every peer.

This site is the long-form reference. For source code, releases, and issues, head to enapt/SwarmLLM.

What you can do with it

  • Chat with AI locally — open localhost:8800 after running the binary; the dashboard auto-detects your hardware and walks you through downloading a model.
  • Use it as a drop-in API — OpenAI-compatible /v1/chat/completions, the Anthropic Messages API at /v1/messages (full Claude Code support), an MCP server with seven tools, plus 12 cloud providers reachable through one endpoint.
  • Pool hardware — your phone with 2 GB of RAM can host a few shards of a 70B model and contribute alongside someone else's GPU. Shards download individually via byte-range requests; no node ever needs the full file.
  • Stay private — every P2P hop uses X25519 + ChaCha20-Poly1305 with forward secrecy. The optional boomerang pipeline ensures no remote node ever sees plaintext.

Single-node performance (RTX 3070 Laptop, 8 GB VRAM)

| Model | GPU | CPU |
|---|---|---|
| TinyLlama 1.1B Q4 | 27.2 tok/s | 4.2 tok/s |
| Gemma-2 2B Q4 | 20.6 tok/s | 3.5 tok/s |
| Phi-3.5 3.8B Q4 | 46.4 tok/s | 1.8 tok/s |
| Qwen2.5-Coder 7B Q4 | 29.0 tok/s | 2.4 tok/s |

Distributed-inference speedups (all default-on): prefix-caching, batched prefill, the Parallax scheduler, and cross-node KV sharing. The cross-node prefix-KV benchmark (2026-04-20) measured a 12.9× iter-1 TTFT speedup on a 672-token Qwen-7B prompt when a peer had the same prefix already cached (151.7 s → 11.8 s, CPU-CPU, localhost). Each knob is documented in Performance & Inference Speedups.

How a node fits together

┌──────────────────────────────────────────────────────────────┐
│                      Your computer (port 8800)                │
│                                                              │
│   P2P node          HTTP server          Web dashboard       │
│   TCP+QUIC          OpenAI · Anthropic   (embedded)          │
│   Noise+Yamux       MCP · Admin          21 languages        │
│                                                              │
│   ─────────────────────────────────────────────────────────  │
│   12 Tokio subsystems · DashMap shared state · redb storage  │
└──────────────────────────────────────────────────────────────┘

Each node simultaneously: connects over TCP and QUIC, serves four HTTP API surfaces (OpenAI · Anthropic · MCP · admin) on the same port, hosts shard files for popular models, participates in distributed inference pipelines, and ships an embedded web dashboard.

Status

Alpha — actively developed and moving into broader testing. Distributed inference is stable across multi-node deployments. Windows release binaries reached Linux parity (Round 8, 2026-04-23). 887 lib tests + 75 integration tests run on every PR; continuous security sweeps. Report issues.

Platform support

| Platform | Status | GPU |
|---|---|---|
| Linux x86_64 | Available | CUDA |
| Windows x86_64 | Available | CUDA |
| macOS aarch64 (Apple Silicon) | Binary available; compile-validated | CPU only (Metal planned) |
| macOS x86_64 (Intel) | Best-effort | CPU only |
| Linux aarch64 | Best-effort | CPU only |

macOS aarch64 runs cargo test --lib + cargo clippy on macos-15 in CI. Integration tests stay Linux-only for now.

All binaries live on the Releases page.

Getting Started

SwarmLLM lets you combine your hardware with others to run AI models too large for any single machine — for free, with no API tokens or cloud fees. It's open-source and your conversations are end-to-end encrypted.

This guide walks you through installation, downloading your first model, and chatting.

Prerequisites

  • A computer running Windows, macOS, or Linux
  • At least 4 GB of RAM (8+ GB recommended)
  • At least 2 GB of free disk space (more for larger models)
  • An internet connection (for downloading models and connecting to peers)

Quick Commands

./swarmllm run                  # Start the node (default port 8800)
./swarmllm run -p 9000          # Start on a different port
./swarmllm run -v               # Start with verbose logging
./swarmllm status               # Check if the node is running
./swarmllm chat                 # Interactive CLI chat
./swarmllm bench                # Benchmark inference performance
./swarmllm peers                # List connected peers
./swarmllm version              # Show version number

Installation

Download

Download the right file for your system from the GitHub Releases page:

| Your Computer | File Name |
|---|---|
| Windows (most PCs) | SwarmLLM-Setup.exe (installer — auto-detects GPU) |
| Windows (raw binary, GPU) | swarmllm-windows-x86_64-gpu.zip |
| Windows (raw binary, CPU) | swarmllm-windows-x86_64-cpu.zip |
| Mac (M1/M2/M3/M4) | swarmllm-macos-aarch64.tar.gz (compile-validated) |
| Mac (older Intel) | Best-effort — build from source |
| Linux (most distros) | swarmllm-linux-x86_64.tar.gz |
| Linux (NVIDIA GPU) | swarmllm-linux-x86_64-cuda.tar.gz |

Not sure which Mac? Apple menu > "About This Mac." If it says "Apple M1" (or M2/M3/etc.), pick Apple Silicon. If it says "Intel," pick Intel.

Install & Run

Windows

Recommended — installer: double-click SwarmLLM-Setup.exe. It detects your GPU (NVIDIA / AMD / Intel) and installs the matching binary. If SmartScreen warns you, click More info > Run anyway.

Raw binary alternative: download swarmllm-windows-x86_64-gpu.zip (Vulkan + CUDA static) or swarmllm-windows-x86_64-cpu.zip (CPU-only fallback), extract, and run swarmllm.exe.

From PowerShell on a raw binary:

cd Downloads\swarmllm-windows-x86_64-gpu
.\swarmllm.exe run

macOS

cd ~/Downloads
tar xzf swarmllm-macos-aarch64.tar.gz
cd swarmllm-macos-aarch64
chmod +x swarmllm
./swarmllm run

Note: macOS aarch64 binaries are compile-validated and exercised in CI (test + clippy on macos-15); integration tests stay Linux-only for now. Intel Mac users should build from source. If macOS blocks the binary on first launch: System Settings > Privacy & Security > click Open Anyway next to SwarmLLM.

Linux

cd ~/Downloads
tar xzf swarmllm-linux-x86_64.tar.gz
cd swarmllm-linux-x86_64
chmod +x swarmllm
./swarmllm run

Docker

The fastest way to get running on any Linux server:

# 1. Get the compose file and example env
curl -LO https://raw.githubusercontent.com/enapt/SwarmLLM/main/docker-compose.yml
curl -LO https://raw.githubusercontent.com/enapt/SwarmLLM/main/.env.example

# 2. Configure (add API keys, change ports, etc.)
cp .env.example .env
nano .env

# 3. Start
docker compose up -d

For NVIDIA GPU support (requires NVIDIA Container Toolkit):

docker compose --profile gpu up -d

Pre-built images on GHCR:

| Image | Description |
|---|---|
| ghcr.io/enapt/swarmllm:latest | CPU-only |
| ghcr.io/enapt/swarmllm:latest-cuda | NVIDIA GPU (CUDA 12.4) |
| ghcr.io/enapt/swarmllm:0.1.0 | Pinned version (CPU) |
| ghcr.io/enapt/swarmllm:0.1.0-cuda | Pinned version (GPU) |

Data is persisted in Docker volumes. Model shards are stored in the swarmllm-models volume (or bind-mount a host directory via SWARMLLM_MODELS_DIR in .env).

View logs with docker compose logs -f. The API key is printed on first startup.

Cargo Install

Requires Rust 1.80+:

cargo install --git https://github.com/enapt/SwarmLLM.git --tag v0.1.0
swarmllm run

Building from Source

git clone https://github.com/enapt/SwarmLLM.git
cd SwarmLLM
cargo build --release
./target/release/swarmllm run

For CUDA GPU support:

cargo build --release --features candle-cuda

For Apple Silicon: the default build runs on CPU. A Metal-accelerated build is on the roadmap but not yet implemented (no metal Cargo feature exists yet); until then, use the default cargo build --release.

Open the Dashboard

Once running, open http://localhost:8800 in your browser. The setup wizard will walk you through initial configuration.

Your First Model

You need at least one AI model before you can chat.

Download via Dashboard

  1. Open the Dashboard at http://localhost:8800
  2. Click Browse HuggingFace in the Models section
  3. Search for a model (try TinyLlama for a small, fast model)
  4. Choose a quantization variant (Q4_K_M recommended for most hardware)
  5. Click Add to node — the node downloads its fair share of shards, and peers with auto-manage enabled auto-acquire the rest
  6. The dashboard auto-refreshes when downloads complete (no page reload needed)

Download via CLI

# Smart distribution: node downloads its fair share, peers get the rest
curl -X POST http://localhost:8800/api/admin/hf/download-shards \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"repo_id": "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF", "filename": "qwen2.5-coder-7b-instruct.Q4_K_M.gguf", "peer_fair_share": true}'

# Or download specific shards manually:
curl -X POST http://localhost:8800/api/admin/hf/download-shards \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"repo_id": "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF", "filename": "qwen2.5-coder-7b-instruct.Q4_K_M.gguf", "shards": [0, 1, 2]}'

Model recommendations by hardware:

| Hardware | Model | Size |
|---|---|---|
| Any (testing) | TinyLlama 1.1B Q4_K_M | ~700 MB |
| 8 GB RAM, no GPU | Qwen2.5-3B Q4_K_M | ~2 GB |
| 8 GB VRAM | Qwen2.5-7B Q4_K_M | ~4.5 GB |
| 16+ GB VRAM | Llama-3-13B Q4_K_M | ~7 GB |

On-Demand Loading

You do not need to pre-load models into VRAM. When you send an inference request for a model whose shards are on disk but not loaded, SwarmLLM automatically loads the model on the fly. If VRAM is full, the least-recently-used model is evicted to make room. The first request to a cold model may take a few extra seconds while loading completes.

Start Chatting

Web UI:

  1. Click the Chat tab
  2. Select your model from the dropdown
  3. Type a message and press Enter

CLI:

./swarmllm chat
# Or with a specific model:
./swarmllm chat --model-name "qwen2.5-coder-7b"

API:

curl http://localhost:8800/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder-7b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

What Are Shards?

Large AI models are split into smaller pieces called shards (~512 MB each) so they can be distributed across the network. Each shard contains a subset of the model's transformer layers. SwarmLLM handles this automatically — you just pick a model and download.

A node never needs all shards of a model. In distributed inference, each node loads only the layers it's responsible for.

Joining the Network

SwarmLLM works standalone, but connecting to peers unlocks distributed inference for larger models.

Automatic Discovery

SwarmLLM finds peers automatically:

  • Same network (LAN): mDNS discovers peers on the same Wi-Fi/LAN in seconds.
  • Returning users: Previously-seen peers are remembered and reconnected on startup.
  • Peer exchange: Connected peers share their peer lists with you.

Invite Codes (Easiest)

  1. In the Dashboard, click "Share Network Code"
  2. Copy the encrypted code and share it with a friend
  3. They paste it into the "Join Network" field and click Join
  4. Both nodes connect immediately and start discovering the wider network

Invite codes are encrypted with ChaCha20Poly1305, so your IP address cannot be read by casually inspecting the code. The decryption key is embedded in the code itself (see the swarm:// format under Networking & Discovery), so anyone who receives the full code can decode it; share codes only with people you intend to let join.

Manual Bootstrap

./swarmllm run --bootstrap "/ip4/203.0.113.50/udp/8800/quic-v1/p2p/12D3KooW..."

Or in your config file:

[network]
bootstrap_peers = ["/ip4/203.0.113.50/udp/8800/quic-v1/p2p/12D3KooW..."]

Private Networks

To run a private cluster that doesn't mix with the public network:

[network]
gossip_network_id = "my-private-network"

Only nodes with the same gossip_network_id can communicate.

Firewall

SwarmLLM needs TCP port 8810 (P2P primary transport) and optionally UDP port 8800 (QUIC) open. If you're behind a router, either:

  • Set up port forwarding (TCP 8810 + UDP 8800 to your machine's local IP)
  • Rely on SwarmLLM's built-in relay (works automatically in most cases)

Configuration

SwarmLLM works out of the box with sensible defaults. This section covers customization.

Config Priority

Settings are read from four sources, in order of priority:

  1. Command-line flags (highest) — e.g., --port 9000
  2. Environment variables — e.g., SWARMLLM_NODE_LISTEN_PORT=9000
  3. Config file — config.toml in your data directory
  4. Built-in defaults (lowest)

Provider API keys have an additional source: a .env file in the data directory or current working directory. Standard env var names are used (OPENAI_API_KEY, ANTHROPIC_API_KEY, DEEPSEEK_API_KEY, etc.). The .env file does not override existing environment variables or keys already set via the dashboard.

Config File Location

| OS | Path |
|---|---|
| Linux | ~/.local/share/swarmllm/config.toml |
| macOS | ~/Library/Application Support/swarmllm/config.toml |
| Windows | %APPDATA%\swarmllm\config.toml |

Specify a custom path: --config /path/to/config.toml

Minimal Example

[node]
listen_port = 8800
contribution = "moderate"

[resources]
max_disk_mb = 50000

[identity]
region = "US"

[inference]
gpu_layers = 35

[auto_manage]
enabled = true

Config File Reference

Every configuration option, organized by section.

[node] — Basic Node Settings

| Option | Type | Default | Description |
|---|---|---|---|
| listen_port | integer | 8800 | Port for web dashboard and P2P networking |
| data_dir | path | Platform-specific | Where SwarmLLM stores data |
| contribution | string | "minimal" | Resource contribution: "minimal", "moderate", "maximum" |

[resources] — Resource Limits

| Option | Type | Default | Description |
|---|---|---|---|
| max_gpu_vram_mb | integer | 0 | Max GPU memory in MB. 0 = auto-detect |
| max_ram_mb | integer | 0 | Max system RAM in MB. 0 = auto |
| max_disk_mb | integer | 50000 | Max disk space in MB for model storage |
| max_bandwidth_mbps | integer | 0 | Max upload bandwidth. 0 = unlimited |

[resources.schedule] — Usage Schedule

| Option | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Enable scheduled resource reduction |
| reduced_hours_start | integer | 22 | Hour (0-23) to start reduced mode |
| reduced_hours_end | integer | 8 | Hour (0-23) to end reduced mode |
| reduced_contribution | string | "minimal" | Contribution level during reduced hours |
| prune_aggressiveness | string | "normal" | Shard pruning during reduced hours: "normal", "aggressive", "conservative" |

[network] — Networking

| Option | Type | Default | Description |
|---|---|---|---|
| bootstrap_peers | list | [] | Peer addresses to connect on startup |
| enable_mdns | boolean | true | LAN peer discovery |
| gossip_network_id | string | none | Custom network ID for private networks |
| peer_exchange | boolean | true | Share peer lists with connected nodes |
| enable_relay | boolean | true | Act as relay for peers behind firewalls |
| enable_relay_client | boolean | true | Use relays when behind a firewall |
| max_peers | integer | 200 | Max simultaneous peer connections |
| auto_relay | boolean | true | Auto-use relay when NAT detected |
| relay_max_circuit_duration_secs | integer | 3600 | Max relay circuit duration |
| relay_max_circuits | integer | 16 | Max relay circuits to serve |
| enable_encryption | boolean | true | E2E encryption for tensor forwards and control messages |
| enable_autonat | boolean | true | NAT detection. Disable on WSL2 to reduce noise |
| enable_dcutr | boolean | true | Hole punching. Disable on WSL2 to reduce noise |
| tensor_compression | boolean | true | Zstd compression for tensor payloads |
| prefix_kv_compression | boolean | false | Zstd compression for cross-node prefix-KV snapshot wire frames. Default off — meaningful win on WAN where wire size is the bottleneck; roughly neutral on localhost. Receivers always decompress regardless of this flag. |
| tensor_compress_level | integer | 1 | Zstd compression level (1-22, 1 = fastest). Shared between tensor and prefix-KV. |
| tensor_compress_threshold | integer | 1024 | Min payload bytes before compression. Shared between tensor and prefix-KV. |

[inference] — AI Model Inference

| Option | Type | Default | Description |
|---|---|---|---|
| default_model | string | "" | Default model. Empty = first available |
| session_timeout_seconds | integer | 600 | Chat session memory lifetime (10 min) |
| max_concurrent_requests | integer | 10 | Max parallel requests |
| model_path | path | none | Path to a GGUF model file |
| gpu_layers | integer | 0 | Layers to offload to GPU. 0 = CPU only |
| kv_cache_ttl_secs | integer | 600 | KV-cache lifetime |
| max_batch_size | integer | 1 | Max request batch size. 1 = no batching. When > 1, both local and remote forward requests batch together via BatchForwarder, filling pipeline bubbles in distributed inference |
| batch_timeout_ms | integer | 50 | Ms to wait for additional requests before dispatching a partial batch. 0 = dispatch immediately (purely opportunistic batching) |
| speculative_decoding | boolean | false | Enable speculative decoding |
| speculative_gamma | integer | 4 | Draft tokens per verification step |
| draft_model_path | path | none | Path to draft model |
| max_split_model_memory_mb | integer | none | Max GPU memory for split model cache |
| tp_max_latency_ms | integer | 10 | Max peer latency (ms) for tensor parallelism groups |
| local_embedding_privacy | boolean | false | Embed tokens locally before sending to first segment. Remote nodes never see raw token IDs |
| encrypted_pipeline | boolean | false | Force first+last segment to local node (boomerang topology). No remote sees plaintext. Adds ~1 RTT/token. Per-model override via API. Requires shard 0 + final shard locally |

[logging] — Log Output

| Option | Type | Default | Description |
|---|---|---|---|
| level | string | "info" | Log level: "error", "warn", "info", "debug", "trace" |
| format | string | "pretty" | Log format: "pretty" or "json" |
| file | path | none | Write logs to file |

[ui] — Web Interface

| Option | Type | Default | Description |
|---|---|---|---|
| open_browser_on_start | boolean | true | Open dashboard on launch |
| theme | string | "dark" | Color theme: "dark" or "light" |

[api] — API Authentication

| Option | Type | Default | Description |
|---|---|---|---|
| api_key | string | none | Bearer token. Empty = auto-generated |
| rate_limit_rpm | integer | 60 | Rate limit for /v1/ endpoints (requests/min) |
| rate_limit_admin_rpm | integer | 200 | Rate limit for /api/admin/ endpoints (requests/min) |

[model] — Model Storage

| Option | Type | Default | Description |
|---|---|---|---|
| shard_size_mb | integer | 512 | Shard size in MB. Range: 64-2048 |

[auto_manage] — Automatic Shard Management

| Option | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | true | Auto-download popular shards (only for models at DemandVerified+ or Pinned trust level) |
| max_storage_mb | integer | 0 | Max disk for auto-downloads. 0 = 50% of max_disk_mb |
| interval_minutes | integer | 5 | Check interval for new shards |
| max_shards | integer | 0 | Max shards. 0 = unlimited |
| max_concurrent_downloads | integer | 3 | Max parallel downloads |
| prune_enabled | boolean | true | Auto-remove over-replicated shards |
| min_replicas | integer | 2 | Min network replicas before pruning |
| prune_cooldown_secs | integer | 300 | Seconds between prune actions per model |
| max_holder_load_for_prune | integer | 3 | Block pruning if holders are busy |

[pool] — Device Pool

| Option | Type | Default | Description |
|---|---|---|---|
| max_pool_size | integer | 10 | Max devices in a pool |
| invitation_ttl_hours | integer | 24 | Invitation validity period |
| rate_limit_per_hour | integer | 10 | Max pool operations per hour |
| gossip_interval_secs | integer | 600 | Pool state gossip interval |
| private_mode | bool | false | Restrict inference to pool members only. Toggleable at runtime via API/UI |
| private_mode_allow_lan | bool | true | Also allow LAN peers (mDNS-discovered) when private mode is on |
| offline_mode | bool | false | Air-gapped: no bootstrap peers, no HF downloads, mDNS-only discovery |

[pool.credit_rates] — Credit Rates

| Option | Type | Default | Description |
|---|---|---|---|
| inference_serve | integer | 10 | Credits earned per layer per token served |
| inference_consume | integer | 10 | Credits spent per layer per token consumed |
| shard_hosting | integer | 1 | Credits per GB per hour hosting |
| shard_seeding | integer | 5 | Credits per GB seeding |
| relay_service | integer | 2 | Credits per connection hour relaying |
| penalty_serve_failure | integer | 50 | Credits deducted per failure |

[updates] — Auto-Update

| Option | Type | Default | Description |
|---|---|---|---|
| auto_update | string | "stable" | Policy: "disabled", "stable", "all" |
| check_interval_hours | integer | 6 | Update check frequency |

[identity] — Your Identity

| Option | Type | Default | Description |
|---|---|---|---|
| region | string | none | Country code for network map (e.g., "US") |

[providers.claude_subscription] — Claude Subscription (feature-gated)

Requires --features claude-subscription at build time. Managed via the dashboard or PUT /api/admin/providers.

| Option | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Route claude-* model requests through the local CLI |
| claude_binary | string | "claude" | Path to the claude binary |
| default_model | string | none | Override model for all requests |
| max_concurrent | integer | 3 | Maximum concurrent subprocess invocations |
| timeout_secs | integer | 300 | Per-request timeout in seconds |
| working_dir | string | (temp dir) | Working directory for the subprocess. Empty or "none" uses system temp dir (recommended for API proxy use). Set to a project path for context-aware responses. |

Shard-Only Mode

SwarmLLM supports shard-only operation — a node only needs individual shard files (~512 MB each) plus a small GGUF header (~6 MB), not the full model file.

How It Works

A model directory in shard-only mode:

~/.local/share/swarmllm/models/qwen2.5-coder-7b/
├── manifest.json        # Model metadata + shard layout
├── gguf_header.bin      # First ~6MB of GGUF (metadata + tensor index)
├── shard_000.bin        # 512MB shard
├── shard_001.bin
├── shard_002.bin
└── ...

SwarmLLM automatically extracts gguf_header.bin from shard_000.bin when first needed. The ShardReader constructs a virtual GGUF from header + shard files, so the model parser works exactly as if the full GGUF were present.
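To make the layout concrete, here is a minimal sketch (hypothetical names, not SwarmLLM's actual API) of the arithmetic behind the virtual GGUF: a global byte offset into the original file maps to a shard index plus an offset inside that shard, so tensor reads can be served from individual shard_NNN.bin files.

// Hypothetical illustration of virtual-GGUF offset translation.
// A read that crosses a shard boundary would have to stitch bytes from
// consecutive shards; that case is omitted here for brevity.
struct ShardLayout {
    shard_size: u64, // e.g. 512 MiB, taken from manifest.json
}

impl ShardLayout {
    fn locate(&self, global_offset: u64) -> (u32, u64) {
        let shard_index = (global_offset / self.shard_size) as u32;
        let offset_in_shard = global_offset % self.shard_size;
        (shard_index, offset_in_shard)
    }
}

fn main() {
    let layout = ShardLayout { shard_size: 512 * 1024 * 1024 };
    // A tensor starting ~1.2 GiB into the original GGUF lives in shard_002.bin.
    let (shard, offset) = layout.locate(1_288_490_188);
    println!("shard_{shard:03}.bin @ byte {offset}");
}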

Why This Matters

  • A 7B model is ~4.5 GB as a full GGUF, but a single shard is only ~512 MB
  • Nodes only load the layers they're assigned — no wasted disk or VRAM
  • You can participate in inference for a 70B model on a machine with 8 GB VRAM by hosting just a few shards

Manual Shard Assignment (--shards)

For multi-node split inference, assign each node a subset of shards:

./swarmllm run --shards "0-3"    # This node handles shards 0, 1, 2, 3

The range is persisted to the database and restored on subsequent runs. Start without --shards to clear.
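As an illustration, here is a minimal sketch of turning a range string like "0-3" into explicit shard indices; the real CLI parser may accept other forms, so treat this as covering only the "A-B" case shown above.

// Toy parser for a --shards range of the form "A-B" (assumed inclusive).
fn parse_shard_range(s: &str) -> Option<Vec<u32>> {
    let (start, end) = s.split_once('-')?;
    let (start, end): (u32, u32) = (start.trim().parse().ok()?, end.trim().parse().ok()?);
    if start > end {
        return None;
    }
    Some((start..=end).collect())
}

fn main() {
    assert_eq!(parse_shard_range("0-3"), Some(vec![0, 1, 2, 3]));
}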

Behavior when --shards is set:

  • The node only advertises the specified shard indices
  • Auto-manage prioritizes downloading missing shards in the range (100x scoring bonus)
  • Smart pruning never removes shards in the configured range

Multi-Node Example

Run a 7B model across two machines:

# Machine A (shards 0-3, layers 0-13):
./swarmllm run --shards "0-3" --bootstrap "/ip4/MACHINE_B_IP/udp/8800/quic-v1/p2p/PEER_ID"

# Machine B (shards 4-7, layers 14-27):
./swarmllm run --shards "4-7" --bootstrap "/ip4/MACHINE_A_IP/udp/8800/quic-v1/p2p/PEER_ID"

Both nodes discover each other, assemble a distributed pipeline, and forward hidden-state activations between them. The pipeline is assembled automatically by the InferenceRouter.

Without --shards

If you don't specify --shards, the node auto-detects and advertises all local shards. This is the normal mode for most users — --shards is only needed when you want explicit control over which layers a node handles.

CLI Flags & Environment Variables

CLI Flags

| Flag | Short | Description |
|---|---|---|
| --port <PORT> | -p | Listen port |
| --data-dir <PATH> | -d | Data directory |
| --config <PATH> | -c | Config file path |
| --model <PATH> | -m | Path to a GGUF model file |
| --gpu-layers <N> | | Layers to offload to GPU |
| --bootstrap <ADDR> | | Bootstrap peer address (repeatable) |
| --shards <RANGE> | | Shard range for split inference (e.g., "0-4") |
| --verbose | -v | Increase log verbosity (-v, -vv, -vvv) |

Subcommands

| Command | Description |
|---|---|
| run | Start the daemon (default if no subcommand) |
| status | Query running daemon status |
| chat | Interactive CLI chat with streaming |
| bench | Benchmark inference (tokens/sec, TTFT) |
| peers | List connected peers |
| pool | Device pool management (link your machines) |
| test-split | Test split inference locally (diagnostic) |
| version | Print version |

chat Options

| Flag | Default | Description |
|---|---|---|
| --model-name <NAME> | auto-detect | Model to chat with |
| --system <TEXT> | none | System prompt |
| --max-tokens <N> | 2048 | Max tokens per response |
| --temperature <F> | 0.7 | Sampling temperature |

bench Options

| Flag | Default | Description |
|---|---|---|
| --model-name <NAME> | auto-detect | Model to benchmark |
| --prompt <TEXT> | "Write a short essay..." | Benchmark prompt |
| --max-tokens <N> | 128 | Tokens to generate |
| --iterations <N> | 1 | Number of benchmark runs |

pool Subcommands

Link your personal devices so credits are combined on one main machine.

| Command | Description |
|---|---|
| pool create --name "My Devices" | Create a device group (this machine becomes the main device) |
| pool invite-code | Generate an 8-character invite code to share |
| pool join <CODE> | Link this device using a code from your main machine |
| pool status | Show linked devices, credits, and online status |
| pool leave | Unlink this device from the group |

Example flow:

# Main device:
swarmllm pool create --name "My Devices"
swarmllm pool invite-code   # → A3F7K2M9

# On each other device:
swarmllm pool join A3F7K2M9

Note: This links YOUR own devices. It's different from connecting to the SwarmLLM network (which uses swarm:// peer addresses).

Environment Variables

Every config option can be set via SWARMLLM_ prefix:

| Config Path | Environment Variable |
|---|---|
| node.listen_port | SWARMLLM_NODE_LISTEN_PORT |
| node.data_dir | SWARMLLM_NODE_DATA_DIR |
| logging.level | SWARMLLM_LOGGING_LEVEL |
| inference.model_path | SWARMLLM_INFERENCE_MODEL_PATH |
| inference.gpu_layers | SWARMLLM_INFERENCE_GPU_LAYERS |

Example:

SWARMLLM_NODE_LISTEN_PORT=9000 SWARMLLM_LOGGING_LEVEL=debug ./swarmllm run

Provider API Keys via Environment

Cloud provider API keys use standard environment variable names:

| Provider | Environment Variable |
|---|---|
| OpenAI | OPENAI_API_KEY |
| Anthropic | ANTHROPIC_API_KEY |
| DeepSeek | DEEPSEEK_API_KEY |
| Mistral | MISTRAL_API_KEY |
| Groq | GROQ_API_KEY |
| NVIDIA NIM | NVIDIA_NIM_API_KEY |
| Cerebras | CEREBRAS_API_KEY |
| SambaNova | SAMBANOVA_API_KEY |
| Fireworks | FIREWORKS_API_KEY |
| Together | TOGETHER_API_KEY |
| DeepInfra | DEEPINFRA_API_KEY |
| Moonshot/Kimi | MOONSHOT_API_KEY |

These can also be placed in a .env file in your data directory:

# ~/.local/share/swarmllm/.env
OPENAI_API_KEY=sk-proj-...
DEEPSEEK_API_KEY=sk-...
NVIDIA_NIM_API_KEY=nvapi-...

The .env file is loaded at startup. It does not override existing environment variables or keys already configured via the dashboard/database. The dashboard settings UI shows "From .env" for keys loaded this way.

Troubleshooting

Can't Connect to Peers

Check the bootstrap address format:

/ip4/203.0.113.50/udp/8800/quic-v1/p2p/12D3KooW...

Firewall: SwarmLLM needs TCP port 8810 (P2P) and optionally UDP port 8800 (QUIC) open.

  • Linux: sudo ufw allow 8810/tcp && sudo ufw allow 8800/udp
  • Windows: Windows Defender Firewall > Inbound Rules > New > Port > TCP 8810 + UDP 8800
  • macOS: System Settings > Network > Firewall > allow SwarmLLM

Same LAN? Use local IP (e.g., 192.168.1.x). LAN peers should be found automatically via mDNS.

Model Download Stuck

  1. Check disk space — a 7B model needs ~4-5 GB free
  2. Verify internet access to https://huggingface.co
  3. Cancel and retry from the Dashboard
  4. Start with -v for verbose logs: ./swarmllm run -v
  5. Try a smaller model first (TinyLlama, ~700 MB)

GPU Not Detected

  1. Verify GPU works: nvidia-smi
  2. Install NVIDIA drivers if needed
  3. Enable GPU offloading: ./swarmllm run --gpu-layers 99

WSL2 users: The CUDA driver comes from your Windows NVIDIA driver. Check that /usr/lib/wsl/lib/libcuda.so.1 exists and add to your ~/.bashrc:

export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/wsl/lib:$LD_LIBRARY_PATH

Port Already in Use

./swarmllm run --port 9000    # Use a different port
lsof -i :8800                 # Find what's using 8800
./swarmllm status             # Check if another instance is running

Slow First Request

If the first inference request to a model takes noticeably longer than subsequent ones, this is expected. SwarmLLM uses on-demand model loading — models whose shards are on disk but not loaded into VRAM are loaded when first requested. If VRAM is full, an LRU eviction occurs first. Subsequent requests to the same model will be fast.

Slow Inference

  1. GPU vs CPU: CPU is 5-20x slower. Check Dashboard for GPU status.
  2. Model too large: Use Q4 quantization, match model size to VRAM.
  3. Enable batching: Set max_batch_size = 4 in config.

Database Corrupted

# Back up first
cp -r ~/.local/share/swarmllm ~/.local/share/swarmllm-backup
# Delete database (models and config are preserved)
rm ~/.local/share/swarmllm/db.redb
# Restart
./swarmllm run

GPU Out of Memory

If a model exceeds your GPU's VRAM, SwarmLLM automatically falls back to CPU inference. You'll see this in the logs:

WARN GPU OOM detected, retrying on CPU

CPU inference is 5-20x slower but works for any model size. To avoid OOM:

  • Use smaller quantizations (Q4 instead of Q8)
  • Use a model that fits in VRAM (check model size vs available VRAM in the dashboard)
  • For models too large for one GPU, use distributed inference across multiple nodes

Distributed Inference Issues

Peers visible but inference fails:

  1. Ensure both nodes have the required shards loaded (check Dashboard > Models)
  2. Verify P2P TCP connectivity: port <base_port> + 10 must be reachable
  3. Run with -vv and filter: ./swarmllm run -vv 2>&1 | grep "DIAG:"
  4. Check for DIAG: segment TIMED OUT — indicates network or compute bottleneck

High latency per token:

  • Distributed inference adds ~20-130ms per token for network round-trips
  • Use TCP bootstrap addresses (not QUIC) for lowest latency
  • Ensure nodes are on the same LAN for tensor parallelism

Pipeline assembly fails:

  • The scheduler needs enough shard coverage to build a complete pipeline
  • Check DIAG: assemble_pipeline_for for candidate counts

Inference fails with "peer never acknowledged" or "silent drop":

  • A SendDirectMessage was issued but neither a Response nor an OutboundFailure event arrived from libp2p within 10s (RR_ACK_TIMEOUT_SECS). Treated as a transient failure: the router automatically retries once with a fresh pipeline assembly that filters out the unreachable peer. If retry also fails, the user sees the error within ~20s (vs the 120s FIRST_TOKEN_TIMEOUT).
  • Most common cause: the target peer was killed or partitioned and the local libp2p connection state hasn't yet caught up.
  • Look for DIAG: rr ACK timeout — closing streaming caller in the logs to confirm the fast-fail path engaged.

Concurrent requests stall when only some get dispatched:

  • Per-tier concurrency caps come from inference.max_concurrent_requests (default 10): Bronze=2, Silver=5, Gold=10, Platinum=20. Excess requests queue until prior ones complete. To raise: bump the config knob or earn credits to climb tiers.
  • If queued requests don't dispatch even after others complete, check for a missed queue_notify.notify_one() after active_count.fetch_sub(1) (should never happen on main; was a real regression fixed in da6f485).

Cross-Node Prefix-KV Sharing

The cross-node prefix fetch is default-on. Expected logs on a successful first hit of a peer's cached prefix:

B: DIAG: cross-node prefix HIT — hydrated KV matched_tokens=N total_tokens=M
A: DIAG: served PrefixKvFetch ... hit=true

I never see cross-node prefix HIT:

  • Only fires on iter 1 of a prompt whose prefix your local node hasn't prefilled yet. Iter 2/3 hit the local cache (populated by iter 1).
  • Check the peer even announced the prefix: look for DIAG: PrefixCacheAnnounce indexed node_id=... blocks=N in your log. No announce → peer's gossip never reached you (check grep 'Published message to GossipSub' | grep 'swarm/models').
  • Check the peer passes the trust gate: default cross_node_prefix_trust_min = 0.5 equals DEFAULT_TRUST, so a freshly-seen peer should just barely pass. Any misbehavior drops it below.

I see prefix-probe: fetch timed out:

  • The peer didn't return a snapshot inside the worker-probe window (3000 ms by default). On a large model (7B+) with cold CPU this can happen if the snapshot is >100 MB. The path degrades to local prefill — no worse than not having the feature. The current 3000/2500/2000 ms chained timeouts are sized for 7B-class snapshots; the older 500/400/500 ms values were TinyLlama-sized and forced a fallback to local prefill on larger models.

I see rejected KV snapshot — penalizing peer trust:

  • The returned snapshot failed BLAKE3 reverification or contained NaN/Inf. Three rejection reasons:
    • hash_chain_mismatch → prefix_cache_block_tokens differs between nodes (default 64, common alternatives 32/128)
    • non_finite_tensors → GPU overflow on the serving side
    • deserialize_failed → wire corruption — open an issue

Disable cross-node fetch entirely: Set inference.cross_node_prefix_trust_min = 2.0 in config.toml. The probe never fires because no peer passes the trust gate.

Running the Test Suite

SwarmLLM ships 943 lib tests + 75 integration tests + VLM E2E.

# Run all tests (release, used in CI)
cargo test --release

# Unit tests only (fastest feedback loop)
cargo test --lib

# Integration tests only
cargo test --test '*'

# A specific test by name substring
cargo test --release prefix_cache

# With CUDA features on (requires NVIDIA GPU)
cargo test --release --features candle-cuda

If a test fails, the release build shows the name + line; rerun with --nocapture to see its stderr:

cargo test failing_test_name -- --nocapture

Integration tests under tests/integration/ simulate multi-node P2P on loopback — they're the slow ones, and CI runs them with --test-threads=1 to avoid port contention.

See Benchmarking for reproducing the performance benchmarks and Performance for which knobs turn each speedup on/off.

Model Trust

Models go through trust levels: Discovered → Pinned → DemandVerified → NetworkPopular. Auto-manage only downloads shards for models at sufficient trust levels.

Model stuck at "Discovered":

  • Pin it manually from the Dashboard to promote to "Pinned"
  • Models reach "DemandVerified" after receiving inference requests
  • Models reach "NetworkPopular" when enough peers host them

Still Stuck?

  • Run with full diagnostics: ./swarmllm run -vv 2>&1 | grep "DIAG:"
  • See the Diagnostics Guide for detailed log instrumentation
  • Check GitHub Issues
  • Open a new issue with: OS, hardware, ./swarmllm version, and logs from -vv

System Overview

SwarmLLM is a single Rust binary that simultaneously functions as:

  1. A P2P network node — connects to peers over TCP (Noise+Yamux) and QUIC/UDP using libp2p
  2. An HTTP API server — serves OpenAI + Anthropic-compatible endpoints, MCP server, and cloud provider proxy via Axum
  3. A web dashboard — embedded frontend (component-based vanilla HTML/CSS/JS, 11 HTML templates, no build step)

All three share a single port (default 8800) and a common Arc<SharedState>.

┌──────────────────────────────────────────────────────────┐
│                      swarmllm binary                      │
│                                                          │
│  ┌──────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │  P2P     │  │  HTTP API    │  │  Admin UI    │       │
│  │  Node    │  │  Server      │  │  (embedded)  │       │
│  │(TCP+QUIC)│  │  (Axum)      │  │              │       │
│  └────┬─────┘  └──────┬───────┘  └──────┬───────┘       │
│       │               │                 │                │
│  ┌────┴───────────────┴─────────────────┴─────────────┐  │
│  │              Shared State (Arc)                     │  │
│  │  DashMap<NodeId, PeerInfo>      — peer registry     │  │
│  │  ModelRegistry                  — models + shards   │  │
│  │  state.events (EventBus)        — activity + dashboard│ │
│  │  state.credits (CreditPool)     — balance + pool     │  │
│  │  state.models (ModelMgmt)       — acquisition + trust │  │
│  │  state.metrics (MetricsProviders)— stats + providers │  │
│  └────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘

Key Design Decisions

  • Config priority: CLI flags > env vars (SWARMLLM_ prefix) > config.toml > defaults
  • Data directory: ~/.local/share/swarmllm/ (Linux), ~/Library/Application Support/swarmllm/ (macOS), %APPDATA%\swarmllm\ (Windows)
  • Port layout: HTTP API on TCP:port, P2P TCP on port+10 (Noise+Yamux), P2P QUIC on UDP:port
  • Shard-only: Nodes never need a full GGUF. Shards are downloaded individually.
  • No blockchain: Credit system uses dual-signed transactions, not a token or chain

Technology Stack

| Component | Library |
|---|---|
| Async runtime | Tokio (multi-threaded) |
| P2P networking | libp2p 0.56 (Kademlia, GossipSub, QUIC) |
| HTTP server | Axum 0.8 |
| Tensor compute | candle-core/candle-transformers |
| GGUF inference | llama-cpp-2 (optional backend) |
| Cryptography | ed25519-dalek, x25519-dalek, chacha20poly1305 |
| Content hashing | BLAKE3 |
| Database | redb (pure-Rust, ACID, single-file) |
| Concurrent maps | DashMap 6 |

Daemon & Subsystems

The daemon spawns 12 Tokio tasks wired together with mpsc channels:

                           ┌──────────────┐
                           │  daemon/     │
                           │  (bootstrap) │
                           └──────┬───────┘
                                  │ spawns tokio tasks
  ┌───────┬───────┬───────┬───────┼───────┬──────────┬──────────┬──────────┬──────────┬──────────┬─────┐
  ▼       ▼       ▼       ▼       ▼       ▼          ▼          ▼          ▼          ▼          ▼     ▼
Network  Infer   Credit  Health   API    Rebal-   Acquisi-   Message    Pool     AutoShrd   HfWat- Update
Manager  Router  Ledger  Monitor  Server ancer    tion Mgr   Dispatch   Manager  Manager   cher   Checker

Subsystem Responsibilities

| Subsystem | File | Role |
|---|---|---|
| NetworkManager | src/network/manager/ | libp2p swarm: Kademlia DHT + GossipSub + request/response |
| InferenceRouter | src/inference/router/ | Request queuing, pipeline assembly, execution coordination |
| MessageDispatcher | src/daemon/dispatch/mod.rs | Routes inbound network messages to appropriate subsystems |
| CreditLedger | src/credit/ledger.rs | Credit balance tracking, transaction signing, gossip |
| HealthMonitor | src/health/monitor.rs | Periodic health pings, rebalancing triggers |
| ShardRebalancer | src/health/rebalancer.rs | Shard redistribution on node join/leave |
| AcquisitionManager | src/model/acquisition.rs | BLAKE3-verified model downloads from peers and HuggingFace |
| ApiServer | src/api/server.rs | Axum HTTP: OpenAI + Anthropic APIs + MCP server + admin dashboard + WebSocket |
| PoolManager | src/pool/manager/ | Device pool management, credit forwarding |
| AutoShardManager | src/model/auto_manage/ | VRAM-aware shard acquisition + smart pruning (manager, scoring, download, prune, scan, vram, wishlist). R111: refreshes the user-visible wishlist at the end of every tick. |
| HfWatcher (R112) | src/model/huggingface/watcher.rs | Background task polling HuggingFace's trending GGUF feed once per hour. Caches the snapshot on state.models.hf_trending_cache (consumed by the wishlist scorer) and auto-promotes models above 100k downloads + 24h age from Discovered to DemandVerified. NonCritical — HF outages don't escalate to a daemon crash. Opt-out via auto_manage.hf_watcher_enabled = false. |
| UpdateChecker | src/update.rs | Periodic GitHub release polling, SHA256-verified binary download, atomic apply. Skipped entirely when auto_update = "disabled" (default until binary signing C1 lands), so the supervisor doesn't log a misleading "exited unexpectedly" warning. |

Channel Layout

| From | To | Message Types |
|---|---|---|
| NetworkManager | MessageDispatcher | All inbound SwarmMessage variants |
| MessageDispatcher | InferenceRouter | InferenceRequest, LayerForward, LayerResult |
| InferenceRouter | NetworkManager | Outgoing P2P messages |
| HealthMonitor | ShardRebalancer | RebalanceEvent |
| ApiServer | InferenceRouter | RouterCommand (from HTTP) |
| ApiServer | AcquisitionManager | AcquisitionCommand |
| AutoShardManager | AcquisitionManager | AcquisitionCommand |
| CreditLedger | NetworkManager | CreditGossip, CreditTransaction |
| MessageDispatcher | (spawned task) | VisionEncodeRequest → handler → VisionEncodeResponse |

Broadcast Channels

| Channel | Type | Subscribers | Purpose |
|---|---|---|---|
| activity_tx | broadcast::Sender<ActivityEvent> (256) | WebSocket | Unified event bus — all subsystem events (shard ops, downloads, inference, pool, config changes). Events carry toast_level for frontend toast control. History replayed to new WS clients. |
| dashboard_tx | broadcast::Sender<DashboardSignal> (32) | WebSocket | Dashboard refresh signals — PeersChanged (peer connect/disconnect), ModelsChanged (shard download/load/prune), UpdateAvailable(UpdateInfo) (new version). |
Note: Former separate channels (prune_events_tx, models_changed_tx, lan_discovery_tx, system_notify_tx, peer_list_changed_tx, update_tx) were consolidated into these two in the event system unification.

Startup Sequence

  1. Parse CLI args (clap)
  2. Initialize tracing subscriber
  3. Load/create config (TOML + env + defaults + CLI overrides)
  4. Ensure data directory exists
  5. Load/generate Ed25519 identity
  6. Open redb database
  7. Build Daemon { config, identity, db }
  8. Initialize ModelExecutor (load GGUF if --model provided)
  9. Build Arc<SharedState> (includes ModelRegistry from DB)
  10. Scan local shards, register in registries
  11. Create mpsc channels
  12. Spawn all 12 tasks
  13. Open browser if configured
  14. tokio::select! on Ctrl+C or task exit
  15. Graceful shutdown: save peer cache, flush database

Graceful Shutdown

Shutdown is triggered by Ctrl+C (SIGINT/SIGTERM) or any task exiting:

  • A watch channel signals all subsystems
  • Peer cache is saved to redb
  • Database is flushed
  • Open connections are drained

Networking & Discovery

Transport Stack

libp2p Swarm
├── Kademlia (DHT) — distributed hash table for peer/shard/model lookup
├── GossipSub — pub/sub for shard/health/credits/identity/pools/regions
├── request_response — unified protocol (/swarmllm/1.0.0, 600s timeout)
├── mDNS — optional LAN peer discovery
├── connection_limits — max 1/peer (>1 causes rr round-robin to dead connections), 500 total
├── Identify — protocol identification
├── AutoNAT — NAT detection
├── DCUtR — hole punching
└── relay::client — circuit relay

Protocol Format

The unified protocol uses a type-tag byte on every frame (src/network/protocol/mod.rs):

| Tag | Constant | Use |
|---|---|---|
| 0x00 | WIRE_TAG_JSON | JSON control message (SwarmMessage, ShardRequest/ShardResponse) |
| 0x01 | WIRE_TAG_TENSOR | Binary tensor payload (LayerForward, LayerResult), f16 |
| 0x02 | WIRE_TAG_TENSOR_COMPRESSED | Q8_0 activation frame (flag-gated activation_compression) — ~3.76× smaller than 0x01 |
| 0x03 | WIRE_TAG_SHARD | Raw shard bytes (ShardResponse payload, 32 MB max — bypasses the 4 MB JSON cap) |
| 0x04 | WIRE_TAG_PREFIX_KV | Cross-node prefix-KV snapshot. Frame body's flag byte: 0 = miss, 1 = raw f32, 2 = zstd-compressed f32 (gated on NetworkConfig::prefix_kv_compression, default off). Receivers always decompress regardless of the send-side flag. |

Receivers auto-dispatch on the leading byte; senders choose based on config + request kind. Only the 0x00 frame carries a JSON body; the rest use binary framing with length prefixes.
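As a rough illustration of the receiving side, the sketch below classifies a frame by its leading tag byte using the values from the table; the enum and function are illustrative stand-ins, not SwarmLLM's actual types.

// Illustrative receiver-side classification of a unified-protocol frame.
#[derive(Debug)]
enum Frame<'a> {
    JsonControl(&'a [u8]),
    Tensor { body: &'a [u8], q8_compressed: bool },
    ShardBytes(&'a [u8]),
    PrefixKv { flag: u8, snapshot: &'a [u8] }, // flag: 0 = miss, 1 = raw f32, 2 = zstd
}

fn classify_frame(frame: &[u8]) -> Result<Frame<'_>, String> {
    let (&tag, body) = frame.split_first().ok_or("empty frame")?;
    Ok(match tag {
        0x00 => Frame::JsonControl(body),                     // WIRE_TAG_JSON
        0x01 => Frame::Tensor { body, q8_compressed: false }, // WIRE_TAG_TENSOR
        0x02 => Frame::Tensor { body, q8_compressed: true },  // WIRE_TAG_TENSOR_COMPRESSED
        0x03 => Frame::ShardBytes(body),                      // WIRE_TAG_SHARD
        0x04 => {
            let (&flag, snapshot) = body.split_first().ok_or("missing prefix-KV flag byte")?;
            Frame::PrefixKv { flag, snapshot }                // WIRE_TAG_PREFIX_KV
        }
        other => return Err(format!("unknown wire tag 0x{other:02x}")),
    })
}

fn main() {
    let frame = [0x04u8, 1, 0xde, 0xad]; // prefix-KV frame, raw f32 payload
    println!("{:?}", classify_frame(&frame).unwrap());
}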

Discovery Stack

SwarmLLM uses 5 independent discovery layers:

  1. mDNS — Discovers LAN peers in seconds. Config: enable_mdns = true
  2. Persistent Peer Cache — Saves up to 200 peers every 5 min + on shutdown. Fastest reconnect.
  3. Invite Codes — Format: swarm://<base64url(key‖nonce‖encrypted_multiaddr)>. Encrypted with ChaCha20Poly1305.
  4. Peer Exchange (PEX) — On each connection, exchanges up to 20 known peers.
  5. Kademlia DHT — Bootstrap flag + periodic re-bootstrap every 60s.

GossipSub Topics

Six topics, all subscribed at startup in discovery::subscribe_topics:

| Topic | Constant | Content |
|---|---|---|
| swarm/models | TOPIC_MODELS | ShardAnnounce, ModelManifest, PrefixCacheAnnounce (cross-node prefix-KV index) |
| swarm/health | TOPIC_HEALTH | HealthPing, NodeCapability (includes observed per-layer latencies for the Parallax scheduler), TpAllReduceResponse |
| swarm/credits | TOPIC_CREDITS | CreditGossip, CreditTransaction |
| swarm/identity | TOPIC_IDENTITY | NicknameGossip (signed) |
| swarm/pools | TOPIC_POOLS | PoolMessage (PoolState, PoolInvitation, CreditForward) |
| swarm/regions | TOPIC_REGIONS | RegionShardSummary (per-region shard availability for routing locality) |

The topic match in NetworkManager::handle_broadcast is a contract, not a default: a SwarmMessage variant with no topic arm falls through to _ => return and is silently dropped at the wire. Adding a new gossip variant therefore requires updating the match. An early multi-node test caught PrefixCacheAnnounce missing from the TOPIC_MODELS arm; every cross-node prefix-cache announce had been silently dropped at the network layer until a two-daemon run flushed it out.
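A toy sketch of that pitfall (illustrative names, not the real handle_broadcast): any variant without an explicit topic arm falls into the catch-all and is never published.

// Illustrative only: the real code publishes to GossipSub; returning None here
// stands in for "silently dropped at the wire".
#[allow(dead_code)]
enum SwarmMessage {
    ShardAnnounce,
    PrefixCacheAnnounce,
    HealthPing,
    NicknameGossip,
}

fn topic_for(msg: &SwarmMessage) -> Option<&'static str> {
    match msg {
        SwarmMessage::ShardAnnounce => Some("swarm/models"),
        SwarmMessage::PrefixCacheAnnounce => Some("swarm/models"), // the arm that was missing
        SwarmMessage::HealthPing => Some("swarm/health"),
        _ => None, // no arm: the message never reaches the network
    }
}

fn main() {
    assert_eq!(topic_for(&SwarmMessage::PrefixCacheAnnounce), Some("swarm/models"));
    assert_eq!(topic_for(&SwarmMessage::NicknameGossip), None); // would be dropped silently
}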

Messages older than 5 minutes are rejected (replay protection).

Cross-Node Prefix KV Sharing Dispatch

The cross-node prefix-cache fetch path uses the request_response protocol, not gossip. The gossip layer only broadcasts which blocks each peer holds (PrefixCacheAnnounce on swarm/models); the actual snapshot transfer is a direct bilateral exchange:

  1. Requesting daemon sends SwarmRequest::PrefixKvFetch to the peer chosen by the probe resolver (trust-gated by cross_node_prefix_trust_min, default 0.5)
  2. Serving daemon runs fetch_local_snapshot against its own worker over IPC (2000 ms timeout) and gets the serialized bytes or None
  3. Serving daemon returns SwarmResponse::PrefixKvData { present, payload } with the bytes wrapped in the WIRE_TAG_PREFIX_KV frame on the binary payload slot (not in the JSON header — serde_json inflates Vec<u8> ~5× and blows past the 64 MiB IPC cap)
  4. Requesting daemon BLAKE3-reverifies + NaN/Inf-scans, hands bytes to its worker to hydrate a KvCacheEntry

See Inference > Prefix-Cache KV Sharing for the full pipeline and measured numbers.

Anti-Gaming

  • Subnet clustering detection: >5 nodes per /24 triggers 25% spot-check rate (up from 5%)
  • SubnetClustering trust penalty (-0.03 per cycle)
  • Signed balance reports with timestamp freshness (5 min window)
  • Gossip replay rejection (5 min window)
  • cross_node_prefix_trust_min gates fetch peers at a minimum trust score (default 0.5, equal to DEFAULT_TRUST; set to 2.0 to disable cross-node fetch entirely)

Inference Pipeline

Subprocess-Per-Model Isolation

Each loaded model runs in its own swarmllm model-worker subprocess (Ollama-style). When a model is unloaded, the subprocess is killed and the OS + CUDA driver immediately reclaim all GPU memory — no daemon restart required.

Main daemon                          model-worker subprocess (one per model)
───────────────────────────────      ───────────────────────────────────────
ModelProcessPool.generate()  ─────►  loads shards from disk on first request
ModelProcessPool.forward()   ─────►  runs forward passes / full decode loop
                             ◄─────  streams WorkerMsg::Token / LayerResult
unload_model()               ─────►  kill process → OS frees all VRAM

IPC: Unix domain socket with binary framing — [4B json_len][json header][4B payload_len][raw tensor bytes]. JSON carries message metadata; the payload carries raw activation bytes to avoid base64 overhead.
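A minimal sketch of writing one such frame, assuming little-endian length prefixes (check src/inference/worker_ipc.rs for the authoritative layout):

use std::io::{self, Write};

// [4B json_len][json header][4B payload_len][raw tensor bytes]
fn write_frame<W: Write>(w: &mut W, json_header: &[u8], payload: &[u8]) -> io::Result<()> {
    w.write_all(&(json_header.len() as u32).to_le_bytes())?;
    w.write_all(json_header)?;
    w.write_all(&(payload.len() as u32).to_le_bytes())?;
    w.write_all(payload)?; // raw activation bytes, no base64 overhead
    w.flush()
}

fn main() -> io::Result<()> {
    let mut buf = Vec::new();
    write_frame(&mut buf, br#"{"type":"Forward","layer_start":0}"#, &[0u8; 8])?;
    println!("frame is {} bytes", buf.len());
    Ok(())
}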

Message types (src/inference/worker_ipc.rs):

| Message | Direction | Purpose |
|---|---|---|
| DaemonMsg::Forward | daemon → worker | Single-step LayerForward (distributed inference) |
| DaemonMsg::Generate | daemon → worker | Full prompt→tokens decode loop (API inference) |
| DaemonMsg::Unload | daemon → worker | Drop a layer range (partial memory reclaim) |
| DaemonMsg::Shutdown | daemon → worker | Graceful worker exit |
| WorkerMsg::Token | worker → daemon | Streaming decoded token |
| WorkerMsg::LayerResult | worker → daemon | Activation result for pipeline forwarding |

SplitModelEntry is metadata-only — it caches eos_tokens, vocab, chat_template, bos_token, and eos_token_str from the GGUF header without loading model weights. The weights live exclusively in the worker subprocess.

Worker granularity: one process per ModelId (not per shard). A single worker handles all layer ranges for a model and owns its own KvCacheStore. Individual shard unload uses DaemonMsg::Unload; the process exits only when all shards are released.

Split Inference Engine

The split inference engine (src/inference/split/) enables distributed inference using candle for direct tensor computation with quantized GGUF weights. Each node loads only its assigned transformer layers (in the worker subprocess), forwarding hidden-state activations between nodes. The module is split into: model.rs (SplitModel struct + accessors), loader.rs (GGUF/shard load), executor.rs (forward pass + tensor-parallel), kv_cache.rs, entry.rs, gguf_meta.rs, shard_reader.rs, rope.rs, prefix_cache.rs.

Client → API Server → InferenceRouter → Pipeline Assembly
                                              │
                      ┌───────────────────────┘
                      ▼
          ┌──────────────────────┐
          │   Pipeline Segment   │     Token IDs (prefill)
          │ Node A: Layers 0-15  │──── LayerForward ──►
          └──────────────────────┘                      │
                                        ┌───────────────┘
                                        ▼
                            ┌──────────────────────┐
                            │   Pipeline Segment   │
                            │ Node B: Layers 16-27 │── sample token ──►
                            └──────────────────────┘

Pipeline Assembly

  1. Fetch model manifest to determine layer ranges
  2. Pipeline affinity check: if multi-turn session has a previous pipeline and all nodes are still connected, reuse it (KV cache locality)
  3. Query model_registry.shard_holders for hosting nodes
  4. Liveness filter: drop holders that aren't in connected_node_ids (the libp2p truth — DHT can re-inject providers for peers that just disconnected, and peer_registry is intentionally preserved across mid-pipeline disconnects for reconnect attempts)
  5. Fetch node load/latency from peer_registry
  6. Parallax scheduler: shortest-path dynamic programming over observed per-layer latencies (EMA over recent forwards), rather than a greedy latency-only sort. Cross-gossips top-32 observed latencies via NodeCapability.observed_latencies so every node has a current view of the network's compute profile
  7. Encrypted pipeline check: if enabled for this model, force first and last segments to the local node (boomerang topology)
  8. Assignment: widest contiguous layer range per node, merging consecutive segments that land on the same node (see the sketch after this list)
  9. Identify standby nodes per segment (failover)
  10. Send PipelineAssignment, wait for ACKs, begin forwarding
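To illustrate the merging in step 8, here is a toy sketch (assumed types, not the scheduler's actual code) of collapsing consecutive segments that land on the same node into one wider contiguous layer range:

// Consecutive same-node segments become a single, wider forward pass.
#[derive(Debug, PartialEq)]
struct Segment {
    node: String,
    layers: std::ops::Range<u32>,
}

fn merge_same_node(segments: Vec<Segment>) -> Vec<Segment> {
    let mut merged: Vec<Segment> = Vec::new();
    for seg in segments {
        match merged.last_mut() {
            Some(prev) if prev.node == seg.node && prev.layers.end == seg.layers.start => {
                prev.layers.end = seg.layers.end; // widen the previous range
            }
            _ => merged.push(seg),
        }
    }
    merged
}

fn main() {
    let pipeline = vec![
        Segment { node: "A".into(), layers: 0..8 },
        Segment { node: "A".into(), layers: 8..16 },
        Segment { node: "B".into(), layers: 16..28 },
    ];
    assert_eq!(merge_same_node(pipeline).len(), 2); // A: 0..16, B: 16..28
}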

Failure Handling

The router applies a single retry on transient remote failures (silent rr drops, OutboundFailure, remote-generate timeouts). The retry passes preferred_pipeline = None so the scheduler re-runs and the dead/dropped peer is filtered out via the liveness oracle above. Failure of the second attempt propagates to the user with a "try again" hint.

Independently, streaming-tracked SendDirectMessage sends carry a delivery_request_id; if the receiver doesn't ACK within RR_ACK_TIMEOUT_SECS (10s), the daemon closes the caller's streaming channel — converting a 120s FIRST_TOKEN_TIMEOUT hang into a fast-fail in ~10–20s. This handles the rare case where libp2p request_response accepts a send_request call but never delivers it (no OutboundFailure event fires).

Concurrent Request Throttling

Per-tier concurrency caps come from max_concurrent_requests (default 10): Bronze=¼, Silver=½, Gold=1×, Platinum=2×. Requests beyond the cap queue in the router. The queue is event-driven: every active_count.fetch_sub(1) on completion is paired with queue_notify.notify_one() so drain_queue wakes immediately. Without that pairing, queued requests would sit indefinitely until the next Submit arrived (a real bug found in stress testing — fix in commit da6f485).
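A minimal sketch of that pairing with tokio primitives (illustrative names, not the router's actual types):

use std::sync::atomic::{AtomicUsize, Ordering};
use tokio::sync::Notify;

struct Throttle {
    active_count: AtomicUsize,
    queue_notify: Notify,
}

impl Throttle {
    // Called when an in-flight request finishes. The decrement must be paired
    // with a notify, otherwise queued requests sit until the next Submit arrives.
    fn finish_request(&self) {
        self.active_count.fetch_sub(1, Ordering::SeqCst);
        self.queue_notify.notify_one();
    }

    // Queue drainer: wake whenever a slot may have freed, then try to dispatch.
    async fn drain_queue(&self, cap: usize) {
        loop {
            self.queue_notify.notified().await;
            while self.active_count.load(Ordering::SeqCst) < cap && self.try_dispatch_next() {
                self.active_count.fetch_add(1, Ordering::SeqCst);
            }
        }
    }

    fn try_dispatch_next(&self) -> bool {
        // Stub: pop the next queued request and spawn it; false when the queue is empty.
        false
    }
}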

Pipeline affinity means that multi-turn conversations (with session_id) prefer to route through the same nodes, preserving KV-cache state and avoiding cold restarts on every turn.

The Parallax allocator also runs offline in AutoShardManager (Phase C.2) with a soft acquire/prune bias driven by a per-shard stability counter (≥3 consistent ticks of "this shard wants to move here" before it acts). Hard constraints (pinning, trust gates, VRAM caps) always win.

Architecture Detection

The SplitModel loader reads general.architecture from GGUF metadata and applies per-architecture handling:

| Architecture | RoPE | QKV Biases | Special Handling |
|---|---|---|---|
| Llama | Interleaved | No | Default EOS=2 |
| Llama 4 | iRoPE (NoPE every 4th) | No | MoE FFN |
| Qwen2 | Contiguous | Yes | EOS 151643+151645 |
| Qwen 3.5 | Contiguous | No | Hybrid SSM+attention (Gated Delta Networks) |
| Gemma/Gemma2 | Interleaved | No | Embedding scaling (sqrt(d)), Gemma RmsNorm (+1), EOS 107, attention + final logit softcapping, Gemma chat template fallback |
| Phi-3 | Su/YaRN | Yes | Fused QKV/FFN tensors |
| Mistral | Interleaved | No | GQA |
| DeepSeek-V2/V3 | Contiguous | No | MLA attention + MoE FFN |
| GLM-4 | Contiguous | No | Partial RoPE, extreme GQA (16:1) |
| Starcoder2 | Interleaved | Yes | Code-optimized |

KV-Cache Management

  • Per-request isolation via DashMap<(ModelKey, RequestId), Cache>
  • Multi-turn reuse: session_id tracks conversations, prefix matching skips redundant prefill
  • Configurable TTL (default 10 min)
  • VRAM-aware LRU eviction for split model cache

Prefix-Cache KV Sharing (Cross-Node)

Each worker stores a local prefix-cache keyed by BLAKE3 chained hashes over fixed-size token blocks (prefix_cache_block_tokens, default 64). Blocks are announced to peers via SwarmMessage::PrefixCacheAnnounce on the swarm/models gossipsub topic and indexed in state.models.cross_node_prefix_index.

When a local worker sees a prompt whose prefix it hasn't prefilled, it emits WorkerMsg::PrefixFetchProbe; the daemon walks the index (longest-match first), trust-gates candidate peers by cross_node_prefix_trust_min (default 0.5), and issues a SendPrefixKvFetch request-response to the best holder. The serving daemon re-issues DaemonMsg::ExportPrefixSnapshot to its worker, which narrows a stored KvSnapshot to the requested block boundary and returns the serialized bytes in the IPC binary-payload slot. Back on the requesting side, the bytes are BLAKE3-reverified against the requested hash and NaN/Inf-scanned before hydrating a new KvCacheEntry for the in-flight request, which then only has to prefill the suffix beyond the cached block boundary.

Three chained timeouts (worker probe 3000 ms, daemon network 2500 ms, serving IPC 2000 ms — sized for 7B-class f32 snapshots) guarantee that a stuck peer degrades to a clean miss rather than blocking the request. See the Performance chapter for measured TTFT numbers on TinyLlama (GPU, corner case where fetch is slightly slower than prefill) vs Qwen2.5-7B (12.9× iter-1 TTFT speedup on CPU-CPU localhost).
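A hedged sketch of the chained block hashing, assuming each block's key is BLAKE3 over the previous block's hash followed by the block's little-endian token IDs (the exact byte layout lives in prefix_cache.rs):

// Two nodes that tokenize the same prefix with the same block size derive the
// same chain of block keys, which is what makes the cross-node index possible.
fn chained_block_hashes(tokens: &[u32], block_tokens: usize) -> Vec<blake3::Hash> {
    let mut hashes = Vec::new();
    let mut prev = blake3::Hash::from([0u8; 32]); // zero seed for the chain (assumption)
    for block in tokens.chunks_exact(block_tokens) {
        let mut hasher = blake3::Hasher::new();
        hasher.update(prev.as_bytes());
        for t in block {
            hasher.update(&t.to_le_bytes());
        }
        prev = hasher.finalize();
        hashes.push(prev);
    }
    hashes // one key per complete block; a trailing partial block is not cached
}

fn main() {
    let tokens: Vec<u32> = (0..130).collect();
    let keys = chained_block_hashes(&tokens, 64);
    println!("{} cached block keys", keys.len()); // two complete 64-token blocks
}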

Advanced Features

  • Speculative Decoding — Draft model proposes K tokens, target verifies in one pass (flag-gated speculative_distributed)
  • SWIFT self-speculative — Target model acts as its own draft by skipping a layer range (flag-gated swift_self_speculative)
  • DSD (Decentralized Speculative Decoding) — Multi-segment pipeline with γ-token speculation woven in (flag-gated decentralized_spec_decoding)
  • Chunked Prefill — Sarathi-style: each Prefilling slot advances by prefill_chunk_tokens (default 128) per decode tick so a long admission can't block decode
  • Continuous Batching — default-on: concurrent Generate requests share one forward_batch per decode tick; GPU uses fused kernel, CPU falls through to sequential
  • Batched Prefill Forward — default-on: fuses concurrent same-shape Prefilling chunks into one forward_batch call
  • Remote-generate Fast Path — default-on: single-segment distributed inference runs the full decode loop on the remote worker instead of per-token coordinator round-trips (measured 1.93× decode speedup)
  • Cross-request Prefix Cache — default-on: see "Prefix-Cache KV Sharing" above for the cross-node extension; the local cache alone is a 29.4× wall-clock win on prompt re-submission
  • Activation Compression (Q8_0) — Intermediate pipeline activations wire-quantized ~3.76× (flag-gated activation_compression)
  • Flash Attention — CPU and GPU fast paths (GQA-native, no repeat_kv)
  • PagedAttention — Deferred; paged-attn feature flag reserved for future use (module removed, never wired to production)
  • Logprobs — Per-token log probabilities via sample_token_with_params_and_logprobs(). When logprobs: true in the request, the sampling layer collects top-N token probabilities and returns them in the OpenAI-compatible response. Available on split model (candle) inference paths
  • Pipeline Error Broadcast — On distributed inference failure, broadcast_pipeline_error() notifies all participants so peers can update shard availability and route around failures
  • Local Embedding Privacy — When local_embedding_privacy: true, the requesting node performs token→embedding locally (~1ms) and sends pre-embedded hidden-state activations instead of raw token IDs to the first pipeline segment. Remote nodes never see the plaintext prompt. See Security > Local Embedding Privacy
  • Encrypted Pipeline — When enabled (per-model or global), forces a "boomerang" topology: the requesting node handles both the first segment (embedding) and last segment (token sampling). Remote nodes only process intermediate activations — no remote node ever sees plaintext input or output. See Security > Encrypted Pipeline

Vision Language Models (VLM)

Distributed mmproj

The mmproj (vision encoder) is modeled as a sentinel shard (index = u32::MAX) decoupled from the text pipeline. Any node with mmproj can encode images — the router selects local → first-segment → any holder.

Image → JPEG compress → VisionEncodeRequest (remote) or encode locally
    → zstd+FP16 compressed embeddings
    → attached to first LayerForward (vision_embeddings field)
    → text pipeline processes as normal

Key types: VisionEncodeRequest, VisionEncodeResponse, LayerForward.vision_embeddings.

If no node has mmproj loaded, the API returns HTTP 503 (VisionEncoderUnavailable).

Tensor Wire Format

[4B ndim][4B×ndim shape][4B dtype_tag][f32 data]

For a 7B model (hidden_dim=3584):

  • Prefill (14 tokens): ~200 KB
  • Decode (1 token): ~14 KB
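
For intuition, a small sketch that reproduces the size arithmetic above from the wire layout; hidden_dim and token counts are the example values, and the dtype tag is assumed to denote f32.

def tensor_wire_bytes(shape):
    """[4B ndim][4B x ndim shape][4B dtype_tag][f32 data] -> total bytes on the wire."""
    ndim = len(shape)
    elems = 1
    for d in shape:
        elems *= d
    header = 4 + 4 * ndim + 4          # ndim + per-dim sizes + dtype tag
    return header + elems * 4          # f32 payload

hidden_dim = 3584                      # 7B-class model from the example above
print(tensor_wire_bytes([1, 14, hidden_dim]) / 1024)   # prefill, 14 tokens -> ~196 KiB (~200 KB)
print(tensor_wire_bytes([1, 1, hidden_dim]) / 1024)    # decode, 1 token   -> ~14 KiB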

Credit System

Credits are SwarmLLM's fairness mechanism — no blockchain, no token, just local accounting with dual-signed transactions. The system ensures contributors are rewarded and free-riders are deprioritized.

Earning & Spending

Action | Credits | Notes
Serve inference (per token) | +10 | Balanced with consume side
Host shard (per GB per hour) | +1 | Hourly tick in CreditLedger
Seed shard data (per GB transferred) | +5 | Atomic counter, periodic drain
Relay traffic (per connection hour) | +2 | Circuit open/close tracking
Consume inference (per token) | -10 | Balanced with earn side
Distributed inference failure | -50 | Automatic penalty

Balanced rates: Both earn and spend use rate × tokens — no layer multiplier. A 22-layer model serving 100 tokens earns the same as it costs to consume, preventing credit inflation.

All rates are configurable per pool via [pool.credit_rates] in config.
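
A sketch of the balanced-rate accounting, using the default rates from the table above; the function and field names are illustrative, not the CreditLedger API.

# Default per-pool rates (configurable via [pool.credit_rates]).
RATE_SERVE_PER_TOKEN = 10
RATE_CONSUME_PER_TOKEN = 10

def settle_inference(tokens, serving_balance, consuming_balance):
    """Both sides use rate * tokens: no layer multiplier, so earn == spend."""
    earned = RATE_SERVE_PER_TOKEN * tokens
    spent = RATE_CONSUME_PER_TOKEN * tokens
    return serving_balance + earned, consuming_balance - spent

server, client = settle_inference(100, serving_balance=0, consuming_balance=0)
print(server, client)  # 1000 -1000  (symmetric, regardless of how many layers were served)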

Minimum Balance Enforcement

Nodes with balance below -1000 credits have remote inference requests rejected. They receive a clear error message telling them to contribute (host shards, serve inference, seed data).

  • Local API requests (from localhost) are always allowed regardless of balance
  • This prevents free-riders from endlessly consuming without contributing
  • The floor is configurable via the MIN_BALANCE_FOR_INFERENCE constant

Priority Tiers

Tiers are calculated from your credit balance relative to the network:

Tier | Requirement | Concurrent Limit
Platinum | ≥90th percentile and balance > 0 | 2× base max
Gold | ≥70th percentile and balance > 0 | base max
Silver | Positive balance | ½ base max
Bronze | Zero or negative | ¼ base max (min 1)

How it works: On each inference request, the router computes your network percentile from peer credit gossip data (deduplicated by NodeId to prevent Sybil stuffing) and calls calculate_tier(). Higher tiers dequeue first. Bronze nodes are never fully blocked — they get deprioritized but always get at least 1 concurrent slot.
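
A hedged sketch of the tier decision, assuming the percentile has already been computed from deduplicated peer gossip; base_max is an illustrative constant, and the real calculate_tier() lives in the router.

def calculate_tier(balance, percentile, base_max=4):
    """Map (credit balance, network percentile) -> (tier, concurrent slot limit)."""
    if balance > 0 and percentile >= 90:
        return "Platinum", base_max * 2
    if balance > 0 and percentile >= 70:
        return "Gold", base_max
    if balance > 0:
        return "Silver", max(1, base_max // 2)
    return "Bronze", max(1, base_max // 4)   # never fully blocked: at least 1 slot

print(calculate_tier(balance=5000, percentile=95))  # ('Platinum', 8)
print(calculate_tier(balance=-200, percentile=40))  # ('Bronze', 1)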

Anti-Abuse Mechanisms

  • Anti-Sybil deduplication: Peer balance gossip is deduplicated by NodeId — a single peer can't stuff the percentile distribution by re-gossiping
  • Atomic accumulation: Forward participation credits use AtomicI64 accumulator, flushed every 60s — no credits lost under high concurrency
  • AntiGaming rate limiter: Max 100 credit transactions per node per 5-minute window
  • Self-dealing rejection: Transactions from/to same node are rejected
  • Signed balance reports: Ed25519 signatures with 5-minute freshness window

Failure Penalties

When distributed inference fails:

  • The requesting node is penalized (configurable penalty_serve_failure, default 50 credits)
  • A broadcast_pipeline_error() message is sent to all pipeline participants

Transaction Security

  • Every transaction requires dual Ed25519 signatures (serving node + requesting node)
  • UUID deduplication prevents replay attacks (checked against DB)
  • Balance arithmetic uses saturating_add (no overflow panics)
  • Peer balance gossip rejects implausible values (abs > 100M)

Escrow

For large requests (above configurable threshold), credits are held in escrow:

  • create_escrow() → release_escrow() (success) or refund_escrow() (failure)
  • Balance deducted BEFORE escrow persisted (crash-safe: lose credits > create free credits)
  • Refunds are persisted to DB immediately
  • Entries expire after 10 minutes with automatic refund
  • Escrow and direct charge are mutually exclusive (no double-billing)
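
A sketch of the crash-safe ordering described above (deduct before persisting the escrow entry), with an in-memory dict standing in for the redb escrow table; expiry and retry handling are simplified.

import time, uuid

balances = {"node-a": 10_000}
escrows = {}  # escrow_id -> entry, stand-in for the redb escrow table

def create_escrow(node, amount, ttl_secs=600):
    balances[node] -= amount                      # deduct BEFORE persisting:
    escrow_id = str(uuid.uuid4())                 # a crash here loses credits,
    escrows[escrow_id] = {                        # it never creates free credits
        "node": node, "amount": amount, "expires": time.time() + ttl_secs,
    }
    return escrow_id

def release_escrow(escrow_id):                    # success: credits stay spent
    escrows.pop(escrow_id, None)

def refund_escrow(escrow_id):                     # failure or expiry: refund immediately
    entry = escrows.pop(escrow_id, None)
    if entry:
        balances[entry["node"]] += entry["amount"]

eid = create_escrow("node-a", 500)
refund_escrow(eid)
print(balances["node-a"])  # 10000, refunded after the simulated failure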

Device Pool Credit Forwarding

When devices are linked in a pool, member devices forward their earnings to the owner:

  • Credit split configurable: 0-50% kept by member, rest forwarded
  • Dual-signed PoolCreditForward (member signature + owner co-signature)
  • Forwarded amount deducted from member balance before persisting
  • Owner's PoolManager validates and applies credits atomically

Security & Encryption

Three Encryption Tiers

Tier 1: Pairwise Sessions (Unicast)

For direct peer-to-peer communication:

  • Ed25519 → X25519 → ECDH → ChaCha20-Poly1305
  • Forward secrecy via ephemeral X25519 re-keying every 10 minutes
  • Nonce reuse prevented by session clearing on disconnect (remove_session())
  • Replay protection: RFC 6479 sliding window (128-bit bitmap) — allows packet reordering within window while rejecting duplicates
  • Nonce state updated only after successful decryption (prevents DoS)
  • Pending ephemeral keys expire after 60 seconds (prevents memory exhaustion from unanswered re-keys)
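
A compact sketch of an RFC 6479-style sliding window like the one used for replay protection: a 128-bit bitmap tracks recently seen counters, so reordered packets within the window are accepted while duplicates are rejected. The real implementation only updates this state after successful decryption.

WINDOW = 128  # bits

class ReplayWindow:
    def __init__(self):
        self.max_seen = 0      # highest counter accepted so far
        self.bitmap = 0        # bit i set => counter (max_seen - i) already seen

    def check_and_update(self, counter):
        """Return True and record the counter if it is fresh, False on replay or too-old."""
        if counter > self.max_seen:
            shift = counter - self.max_seen
            self.bitmap = ((self.bitmap << shift) | 1) & ((1 << WINDOW) - 1)
            self.max_seen = counter
            return True
        offset = self.max_seen - counter
        if offset >= WINDOW:
            return False                     # too old: outside the window
        if self.bitmap >> offset & 1:
            return False                     # duplicate: replay rejected
        self.bitmap |= 1 << offset           # reordered but fresh: accept
        return True

w = ReplayWindow()
print([w.check_and_update(c) for c in (1, 3, 2, 3)])  # [True, True, True, False]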

Tier 2: Pipeline Sealing (Inference)

For inference prompts and responses:

  • Per-request ephemeral key
  • Sealed prompt/response
  • Wire tag: TENSOR_TAG_ENCRYPTED = 0x10

Pipeline sealing is active: the final segment encrypts output token IDs with the requester's X25519 public key. The final-segment node can see the sampled tokens before encryption — this is inherent to the architecture since sampling happens on that node. Intermediate nodes process activation tensors (protected by Tier 1 in transit) but never see the final plaintext output. See Pipeline Privacy Model for a full breakdown of what each node can see.

Tier 3: Sealed Gossip (Broadcasts)

For GossipSub messages:

  • Epoch-based group key + mandatory Ed25519 origin signature
  • All gossip messages MUST be seal_signed() — unsigned messages are rejected
  • Verifies sender authenticity before processing
  • 1-hour rotation cycle

Transport-Authenticated Dispatch

All inbound network messages carry transport-authenticated sender identity:

  • libp2p Noise protocol authenticates peers at the transport layer
  • AuthenticatedMessage wrapper carries the verified NodeId of the sender
  • MessageDispatcher validates sender identity against message claims:
    • ShardAnnounce: sender must match announce.node_id
    • CreditTransaction: sender must be a party (from or to)
    • CreditGossip, NicknameGossip: sender must match claimed node_id
    • HealthPing/Pong: sender must match claimed node_id
    • EphemeralKeyExchange: sender must match exchange.node_id
  • Mismatched messages are logged and dropped

Signed DHT Records

Kademlia DHT records are Ed25519-signed to prevent poisoning:

  • Format: [32B pubkey][64B signature][payload]
  • start_providing_shards() signs records with node identity
  • Active verification: verify_dht_value() is called on all GetRecordOk results in NetworkManager — records with invalid or missing signatures are logged and discarded
  • Records expire after 1 hour with automatic re-publication
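
A sketch of verifying the [32B pubkey][64B signature][payload] layout with the Python cryptography package; this mirrors what verify_dht_value() does conceptually, not its exact API.

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def verify_dht_record(record: bytes):
    """Return the payload if the Ed25519 signature checks out, else None (discard)."""
    if len(record) < 32 + 64:
        return None
    pubkey, signature, payload = record[:32], record[32:96], record[96:]
    try:
        Ed25519PublicKey.from_public_bytes(pubkey).verify(signature, payload)
    except (InvalidSignature, ValueError):
        return None        # invalid or forged record: log and drop
    return payload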

Identity

  • Ed25519 keypair generated on first run, stored in identity.key
  • Private key never leaves the machine
  • Public key = Node ID (first 8 bytes hex for display)
  • Nickname system: Ed25519-signed records with timestamp-wins conflict resolution
  • Nickname registry capped at 10,000 entries (requires peer_registry membership)

Trust & Reputation

TrustManager tracks per-peer scores (0.0-1.0, default 0.5):

Event | Score Change
InferenceSuccess | +0.01
ValidTransaction | +0.02
SpotCheckFail | -0.10
InvalidGossip | -0.05
SignatureViolation | -0.20

Scores decay toward 0.5 over time (1% per health cycle, default 30 seconds). Trust factors into pipeline scheduling and credit tier weighting.
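
A sketch of the scoring arithmetic: event deltas clamped to [0.0, 1.0], plus a 1% pull toward the 0.5 neutral baseline each health cycle. The constants are taken from the table above; the function names are illustrative, not TrustManager's API.

EVENT_DELTA = {
    "InferenceSuccess":   +0.01,
    "ValidTransaction":   +0.02,
    "SpotCheckFail":      -0.10,
    "InvalidGossip":      -0.05,
    "SignatureViolation": -0.20,
}

def apply_event(score, event):
    return min(1.0, max(0.0, score + EVENT_DELTA[event]))

def decay(score, rate=0.01):
    """Each health cycle (default 30 s) the score drifts 1% back toward 0.5."""
    return score + (0.5 - score) * rate

s = 0.5
s = apply_event(s, "SpotCheckFail")     # 0.40
s = decay(s)                            # 0.401
print(round(s, 3))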

Sybil Resistance

  • Subnet clustering detection: >5 nodes per /24 → elevated spot-check rate
  • Signed-only balance reports
  • Timestamp freshness checks on gossip (5 min window, rejects >5 min old)

API Authentication

  • Auto-generated 32-byte hex Bearer token (constant-time comparison)
  • Protected: /v1/*, /api/admin/provider-models, config PUT, shutdown, HF downloads, API key endpoint
  • Exempt: /, /health, /admin (read-only dashboard), static assets
  • Request body limit: 32 MB (raised from 2 MB to support VLM image payloads)
  • Content-Security-Policy: default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline'; connect-src 'self' ws: wss:; img-src 'self' data: blob:; frame-ancestors 'none'; base-uri 'self'; form-action 'self'
  • X-Content-Type-Options: nosniff
  • X-Frame-Options: DENY
  • Referrer-Policy: no-referrer
  • WebSocket Origin validation (rejects cross-site WebSocket hijacking)

Input Validation

  • Model field length: max 256 chars in OpenAI + Anthropic handlers
  • Tools array: max 128 entries
  • Stop sequences: max 16 entries
  • HuggingFace repo_id: validated owner/repo format (alphanumeric, hyphens, dots, underscores, max 96 chars)
  • HuggingFace filename: must end in .gguf, no .., no URL metacharacters
  • Path traversal: sanitize_path_component() on all network-provided model IDs before filesystem operations
  • Update URLs: only GitHub download URLs accepted
  • Update binaries: SHA256 checksum verification mandatory
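
A hedged approximation of the HuggingFace input checks above; the authoritative rules live in the Rust validators, and the regex and metacharacter set here are simplifications.

import re

REPO_ID = re.compile(r"^[A-Za-z0-9._-]+/[A-Za-z0-9._-]+$")

def valid_repo_id(repo_id: str) -> bool:
    return len(repo_id) <= 96 and bool(REPO_ID.match(repo_id))

def valid_gguf_filename(name: str) -> bool:
    return (
        name.endswith(".gguf")
        and ".." not in name
        and not any(c in name for c in "?#&%:/\\")   # no URL metacharacters or path separators
    )

print(valid_repo_id("Qwen/Qwen2.5-Coder-7B-Instruct-GGUF"))           # True
print(valid_gguf_filename("qwen2.5-coder-7b-instruct.Q4_K_M.gguf"))   # True
print(valid_gguf_filename("../../etc/passwd"))                        # False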

Rate Limiting & DoS Protection

  • Per-IP rate limiter with periodic cleanup (5 min intervals)
  • Inference queue depth cap: 512 requests
  • HTTP timeout: 5 minutes (Slowloris protection via tower-http TimeoutLayer)
  • Credit transaction signature verification before ledger apply

Pipeline Privacy Model

Distributed inference splits a model across multiple nodes. This creates inherent privacy trade-offs — each node in the pipeline must process data to do its job. This section documents exactly what each node can see.

What each node sees during inference

Consider a 3-node pipeline: Requester → Node A (layers 0-10) → Node B (layers 11-21) → Node C (layers 22-27, final):

Data | Requester | Node A (first) | Node B (middle) | Node C (last)
Plaintext prompt | Yes (author) | See below* | No | No
Raw token IDs | Yes | See below* | No | No
Input activations | Yes | Yes | Yes
Output activations | Yes | Yes
Generated token IDs | Yes (decrypted) | No | No | Yes (samples them)
Final plaintext response | Yes (decrypted) | No | No | Yes (before sealing)

*Node A's visibility depends on the local_embedding_privacy setting — see below.

Risk: First-segment node sees raw tokens (default)

Without local_embedding_privacy (default): The first-segment node (Node A) receives the raw prompt text or token IDs to perform the embedding lookup. This means Node A can read the user's prompt in plaintext.

With local_embedding_privacy: true: The requesting node performs the embedding lookup locally and sends pre-embedded activation tensors. Node A receives floating-point vectors instead of token IDs. This is a significant privacy improvement, but not absolute — see Activation Inversion Risk below.

Risk: Final-segment node sees generated output

The final-segment node (Node C) must sample tokens from the logit distribution. This is fundamental — sampling is the act of choosing the next word, and it can only happen where the final layer's output logits exist. Node C therefore sees every generated token before encrypting them via Tier 2 pipeline sealing.

This cannot be mitigated architecturally. The node that runs the last transformer layer and samples tokens will always know what tokens were sampled. Pipeline sealing ensures the tokens are encrypted before being sent back over the network, so intermediate nodes and eavesdroppers cannot read the response — but the final-segment node itself can.

Risk: Activation inversion attacks

All intermediate nodes see hidden-state activation tensors (floating-point matrices). Research has shown that activations from early transformer layers can sometimes be partially inverted to recover input tokens, especially:

  • Embedding-layer activations (layer 0 output) — most vulnerable, essentially a lookup table that can be reversed
  • Early layers (1-4) — progressively harder to invert as information mixes across token positions
  • Deep layers (5+) — extremely difficult to invert in practice; activations encode abstract features, not token identity

Mitigations in SwarmLLM:

  1. local_embedding_privacy: true — the requesting node performs embedding locally, so the first segment never receives the trivially-invertible embedding output. It receives post-layer-0 activations at earliest.
  2. Tier 1 encryption — all inter-node tensor transfers are encrypted with ChaCha20-Poly1305, preventing network-level eavesdropping
  3. Pipeline scheduling preference — the scheduler prefers local segments for the first layers when possible

Risk: Byzantine tensor manipulation

A malicious node can send garbage activations instead of computing the actual transformer layers. This produces incorrect output without detection unless spot-checked. Mitigations: probabilistic spot-check validation (5% rate, 25% for subnet-clustered peers) with trust score reduction on failure.

Summary of privacy guarantees

Configuration | Prompt privacy | Response privacy | Activation risk
Default (no privacy flags) | First segment sees plaintext | Final segment sees plaintext | Intermediate nodes see activations
local_embedding_privacy: true | No remote node sees raw tokens | Final segment sees plaintext | Reduced — no trivial embedding inversion
encrypted_pipeline: true | No remote node sees raw tokens | No remote node sees output | Only intermediate activations visible to remote nodes
+ Tier 2 pipeline sealing | No remote node sees raw tokens | Encrypted on the wire | Reduced — no trivial embedding inversion
All protections enabled | Best available | Best available | Remote nodes only see intermediate activations; inversion theoretically possible but computationally expensive

Bottom line: With encrypted_pipeline, no remote node sees plaintext input or output — the pipeline "boomerangs" through remote nodes and returns to the requester. This is the strongest privacy mode. Without it, local_embedding_privacy still protects raw token IDs but the final-segment node sees generated output.

Local Embedding Privacy

When local_embedding_privacy: true is set in [inference] config, the requesting node performs token→embedding lookup locally before sending activations to the first pipeline segment. Remote nodes never see raw token IDs — only hidden-state activation tensors.

How it works:

  1. On startup, LocalEmbedder loads token_embd.weight from shard_000.bin (~64MB for a 7B Q4 model)
  2. The requesting node tokenizes the prompt and performs the embedding lookup locally (~1ms)
  3. The resulting hidden-state tensor ([1, seq_len, hidden_dim], FP32) is sent as LayerForward.activations with pre_embedded: true
  4. The receiving first-segment node skips its embedding lookup and processes the pre-embedded activations directly
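
A sketch of the local lookup step described above, with numpy standing in for the candle tensor ops and a randomly initialized table standing in for token_embd.weight; LocalEmbedder in src/inference/local_embedder.rs does the real work.

import numpy as np

hidden_dim = 3584
vocab_size = 32_000
embedding_table = np.random.rand(vocab_size, hidden_dim).astype(np.float32)  # stand-in for token_embd.weight

def embed_locally(token_ids):
    """Token IDs -> [1, seq_len, hidden_dim] FP32 activations; raw IDs never leave the node."""
    hidden = embedding_table[np.array(token_ids)]            # plain row lookup, ~1 ms scale
    return hidden[np.newaxis, :, :]                          # shape (1, seq_len, hidden_dim)

activations = embed_locally([1, 529, 7826, 2])
print(activations.shape, activations.nbytes)   # (1, 4, 3584) 57344 -- sent with pre_embedded: true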

Wire format: The pre_embedded flag on LayerForward is #[serde(default)], so old nodes receiving new-format messages default to false (backward compatible).

Trade-off: Pre-embedded activations are larger than raw text (e.g., 512 tokens × 4096 hidden × 4 bytes = 8MB vs ~2KB text). This matches the existing inter-segment activation sizes, so it does not change the bandwidth profile of distributed inference.

Relevant code: src/inference/local_embedder.rs, src/inference/pipeline/, src/daemon/state/mod.rs (local_embedders DashMap).

Encrypted Pipeline

When encrypted_pipeline: true is enabled (globally or per-model), the pipeline scheduler forces the requesting node to handle both the first and last segments. This creates a "boomerang" topology:

Requester (shard 0, embed) → Remote A (middle shards) → ... → Requester (final shard, decode)

No remote node ever sees plaintext — neither the raw prompt tokens nor the generated output. Remote nodes only process intermediate hidden-state activations.

Requirements:

  • The requesting node must hold shard 0 (embedding table) AND the final shard (output head)
  • local_embedding_privacy is auto-enabled when encrypted pipeline is active
  • Only useful for models with 3+ shards (2-shard models = fully local, no distribution)
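
A sketch of the boomerang constraint: the requester must be assignable to both ends of the pipeline, so the scheduler checks local shard holdings before forcing the topology. Names and the data shapes are illustrative; the real logic is greedy_assign in src/inference/scheduler/mod.rs.

def boomerang_plan(total_shards, local_shards, remote_holders):
    """Return a segment order with the requester at both ends, or None if not possible."""
    first, last = 0, total_shards - 1
    if total_shards < 3:
        return None                                  # 2-shard models: just run fully local
    if first not in local_shards or last not in local_shards:
        return None                                  # requester must hold shard 0 AND the final shard
    middle = [remote_holders[i] for i in range(1, last)]  # remote nodes take the interior shards
    return ["requester"] + middle + ["requester"]

plan = boomerang_plan(
    total_shards=4,
    local_shards={0, 3},
    remote_holders={1: "peer-a", 2: "peer-b"},
)
print(plan)  # ['requester', 'peer-a', 'peer-b', 'requester']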

Overhead:

  • Adds ~1 extra network RTT per generated token (activations must return to the requester for final decoding)
  • Latency increase depends on distance to the furthest remote segment
  • No bandwidth overhead vs normal distributed inference (activation sizes are the same)

Per-model configuration:

  • API: GET/PUT /api/admin/models/{id}/encrypted-pipeline
  • Dashboard: gear icon on model card → "Encrypted pipeline" checkbox
  • Global fallback: encrypted_pipeline = true in [inference] config
  • Per-model overrides are persisted to the database

Relevant code: src/inference/scheduler/mod.rs (greedy_assign), src/inference/pipeline/ (auto-enable local embedding), src/api/admin_models/mod.rs (API endpoints), src/daemon/state/mod.rs (encrypted_pipeline_models DashMap).

Known Limitations

These are architectural properties that cannot be fully mitigated with code changes:

  • Gossip epoch key is publicly derivable — derived from "swarmllm-mainnet-v1". Gossip encryption is defense-in-depth; Ed25519 signing is the primary security mechanism.
  • Final-segment output visibility — the node running the last transformer layers sees all generated tokens before pipeline sealing encrypts them. This is inherent to the architecture (see Pipeline Privacy Model).
  • Activation inversion — hidden-state tensors passed between nodes can theoretically be inverted to recover input, especially from early layers. local_embedding_privacy eliminates the trivial case (embedding lookup reversal). Deep-layer inversion remains an open research problem.
  • Byzantine tensor manipulation — malicious peers can send garbage activations. Mitigation: probabilistic spot-check validation (5% rate, 25% for subnet-clustered peers) with trust score reduction on failure.
  • Sybil credit farming — Ed25519 keys are free. Anti-gaming heuristics help but are not bulletproof.
  • GGUF parser vulnerabilities — llama.cpp CVEs. BLAKE3 content hash gates shard loading but parser bugs remain upstream.
  • Kademlia eclipse attacks — strategic Sybil node IDs can control DHT routing. K-bucket eviction policies help.

Storage & Data

Data Directory Layout

~/.local/share/swarmllm/
├── config.toml          # User configuration
├── identity.key         # Ed25519 keypair
├── api_key              # Bearer token (auto-generated)
├── db.redb              # redb database (migrated from sled db/ directory)
└── models/
    ├── qwen2.5-coder-7b/
    │   ├── manifest.json
    │   ├── gguf_header.bin
    │   ├── shard_000.bin
    │   └── shard_001.bin
    └── tinyllama-1.1b/
        └── ...

Database Tables (redb)

Table | Key | Value
config | "config" | Config
config | "api_key" | Bearer token string
identity | "keypair" | Encrypted Ed25519 key
credits | "balance" | CreditBalance
credit_txns | {uuid} | CreditTransaction
peer_trust | {node_id_hex} | TrustScore
peer_cache | {multiaddr} | () presence key
shard_meta | {model_id}/{index} | ShardInfo + path
model_meta | {model_id} | ModelManifest
sessions | {session_id} | KV-cache metadata
nicknames | {node_id_hex} | NicknameRecord
pool_state | "pool" | PoolState
trust_scores | {node_id_hex} | f64 trust score
escrow | {escrow_id} | EscrowEntry
hf_sources | {model_id} | HfSource metadata
locked_shards | {shard_id_json} | bool
resource_schedule | "current" | ResourceSchedule
model_trust | {model_id} | ModelTrustEntry (level, request count, last seen)

Model Acquisition Pipeline

Network Registry (GossipSub/DHT)
        │
        ▼
  Manifest Check ──► Reject if BLAKE3 mismatch
        │
        ▼
  Shard Selection ──► Rarest-first (BitTorrent-style)
        │
        ▼
  Download Loop ──► Atomic write to .tmp, rename to .bin
        │
        ▼
  Shard Verify ──► BLAKE3 vs manifest hash
        │
        ▼
  Model Ready

Integrity guarantees:

  • Manifests verified via BLAKE3 self-hash
  • Each shard verified against manifest hash
  • Failed shards renamed .bin.quarantine, serving peer penalized
  • Downloads retried (3 attempts, exponential backoff)
  • Atomic writes prevent corrupt partial files
  • Stale .tmp files cleaned on startup
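
A sketch of the verify-then-rename step using the blake3 Python package; paths, retry policy, and peer penalties are simplified relative to the real acquisition loop.

import os
import blake3

def commit_shard(tmp_path, final_path, expected_hash):
    """Verify the downloaded .tmp file against the manifest hash, then rename atomically."""
    with open(tmp_path, "rb") as f:
        actual = blake3.blake3(f.read()).hexdigest()
    if actual != expected_hash:
        os.rename(tmp_path, final_path + ".quarantine")   # keep the bad bytes for inspection
        return False                                      # caller penalizes the serving peer and retries
    os.rename(tmp_path, final_path)                       # atomic: no corrupt partial .bin files
    return True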

OpenAI-Compatible API

SwarmLLM provides a drop-in replacement for the OpenAI API. All endpoints require Bearer token authentication.

POST /v1/chat/completions

Chat completions with streaming support.

curl http://localhost:8800/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder-7b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is Rust?"}
    ],
    "stream": true,
    "max_tokens": 512,
    "temperature": 0.7
  }'

Request Body

Field | Type | Required | Default | Description
model | string | yes | - | Model name (or "auto" for first available)
messages | array | yes | - | Chat messages (role + content). Roles: system, user, assistant, tool
stream | boolean | no | false | Enable SSE streaming
max_tokens | integer | no | 2048 | Max tokens to generate (clamped to 1–32768)
temperature | float | no | 0.7 | Sampling temperature (0.0-2.0)
top_p | float | no | 1.0 | Nucleus sampling threshold
stop | string or array | no | - | Stop sequence(s), 1–256 chars each, max 16
frequency_penalty | float | no | 0.0 | Frequency penalty (-2.0 to 2.0)
presence_penalty | float | no | 0.0 | Presence penalty (-2.0 to 2.0)
tools | array | no | - | Tool/function definitions for function calling
tool_choice | string or object | no | - | "none", "auto", "required", or {"type":"function","function":{"name":"..."}}
logprobs | boolean | no | false | Return log probabilities for output tokens. Supported on split model (candle) inference paths
top_logprobs | integer | no | - | Number of top log probabilities per token (0-20, requires logprobs: true). Computed from pre-sampling (raw) logits per OpenAI spec
session_id | string | no | - | Reuse KV-cache from a previous request
lora_adapter | string | no | - | LoRA adapter ID for fine-tuned inference

Response (non-streaming)

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "qwen2.5-coder-7b",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "Rust is a systems programming language..."},
    "finish_reason": "stop",
    "logprobs": null
  }],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 42,
    "total_tokens": 57
  }
}

Response with logprobs

When logprobs: true and top_logprobs: 3:

{
  "choices": [{
    "message": {"role": "assistant", "content": "Hello"},
    "finish_reason": "stop",
    "logprobs": {
      "content": [{
        "token": "Hello",
        "logprob": -0.234,
        "bytes": null,
        "top_logprobs": [
          {"token": "Hello", "logprob": -0.234, "bytes": null},
          {"token": "Hi", "logprob": -1.456, "bytes": null},
          {"token": "Hey", "logprob": -2.012, "bytes": null}
        ]
      }]
    }
  }]
}

Response with tool_calls

When the model calls a tool, finish_reason is "tool_calls" and content is null:

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\":\"NYC\"}"
        }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}

Streaming (SSE)

When stream: true, responses arrive as Server-Sent Events:

data: {"id":"chatcmpl-abc123","choices":[{"delta":{"content":"Rust"},"index":0}]}

data: {"id":"chatcmpl-abc123","choices":[{"delta":{"content":" is"},"index":0}]}

data: [DONE]

GET /v1/models

List available models.

curl http://localhost:8800/v1/models \
  -H "Authorization: Bearer YOUR_API_KEY"
{
  "object": "list",
  "data": [
    {
      "id": "qwen2.5-coder-7b",
      "object": "model",
      "owned_by": "swarmllm"
    }
  ]
}

GET /v1/status

Node status (SwarmLLM extension).

curl http://localhost:8800/v1/status \
  -H "Authorization: Bearer YOUR_API_KEY"

Using with OpenAI Client Libraries

Python (openai)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8800/v1",
    api_key="YOUR_API_KEY"
)

# Basic streaming
response = client.chat.completions.create(
    model="qwen2.5-coder-7b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Python — Function calling

response = client.chat.completions.create(
    model="qwen2.5-coder-7b",
    messages=[{"role": "user", "content": "What's the weather in NYC?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"]
            }
        }
    }],
    tool_choice="auto"
)

if response.choices[0].finish_reason == "tool_calls":
    for tc in response.choices[0].message.tool_calls:
        print(f"Call {tc.function.name}({tc.function.arguments})")

JavaScript (openai)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8800/v1",
  apiKey: "YOUR_API_KEY",
});

const stream = await client.chat.completions.create({
  model: "qwen2.5-coder-7b",
  messages: [{ role: "user", content: "Hello!" }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

curl (streaming)

curl -N http://localhost:8800/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen2.5-coder-7b","messages":[{"role":"user","content":"Hello!"}],"stream":true}'

POST /v1/embeddings

Returns 503 Service Unavailable. Text embeddings are not supported via the subprocess inference path. Use a dedicated embedding provider or the OpenAI embeddings API directly.

GET /v1/providers

List configured cloud providers and their available models.

curl http://localhost:8800/v1/providers \
  -H "Authorization: Bearer YOUR_API_KEY"

Returns an array of { name, models: [...] } objects for each configured provider.

Responses API

OpenAI's /v1/responses is the default API for o-series and gpt-5-series models in 2026 and the replacement for the Assistants API, which sunsets on 2026-08-26. SwarmLLM exposes the full v1 surface plus follow-on features such as resumable streams, async background runs, MCP tool integration, and conversation chaining via previous_response_id.

Endpoints

Method | Path | Purpose
POST | /v1/responses | Create a response (streaming or not, foreground or background)
GET | /v1/responses/{id} | Fetch a stored response. With ?stream=true&starting_after=N, resume the SSE stream from event N (V5).
DELETE | /v1/responses/{id} | Delete a stored response.
POST | /v1/responses/{id}/cancel | Cancel a background response (M9). The cancel flag is checked at completion time; per-token interruption is deferred.
GET | /v1/responses/{id}/input_items | Paginated input-item listing (V4) for chained previous_response_id flows.
GET | /api/admin/responses | Admin: list all stored response records (used by the dashboard).

All endpoints accept the same Bearer-auth header as the rest of the API.

Routing

POST /v1/responses picks one of three execution paths in this order:

  1. Cloud proxy — when the requested model resolves to an OpenAI-routed provider, the request is serialized verbatim and forwarded to the upstream /v1/responses endpoint. Built-in tools, streaming, background, reasoning effort, text.verbosity, include[], previous_response_id, and any future field round-trip via #[serde(flatten)] extras.
  2. Anthropic-Messages bridge (V3) — when the model resolves to an Anthropic provider (or the local claude-subscription subprocess), the Responses request is translated to an Anthropic Messages request, forwarded, and translated back. This lets Claude Code clients drive /v1/responses end-to-end without losing tool-call or streaming semantics.
  3. Local inference — translates to /v1/chat/completions and runs on the local model. Function tools and tool_choice translate through; built-in tools (web_search, file_search, computer_use_preview, code_interpreter, image_generation, mcp, custom) are rejected with HTTP 400 because they require backing infrastructure SwarmLLM does not run.
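
A sketch of this routing order, with provider resolution abstracted to a lookup; the set of built-in tool types rejected on the local path is taken from step 3, and the function names are illustrative.

BUILTIN_TOOLS = {"web_search", "file_search", "computer_use_preview",
                 "code_interpreter", "image_generation", "mcp", "custom"}

def route_responses_request(model, tools, resolve_provider):
    """resolve_provider(model) -> 'openai' | 'anthropic' | 'local' (illustrative)."""
    provider = resolve_provider(model)
    if provider == "openai":
        return "cloud-proxy"                     # forwarded verbatim to upstream /v1/responses
    if provider == "anthropic":
        return "anthropic-bridge"                # translated to /v1/messages and back
    if any(t.get("type") in BUILTIN_TOOLS for t in tools):
        raise ValueError("400: built-in tools are not supported on the local path")
    return "local-chat-completions"              # translated to /v1/chat/completions

print(route_responses_request("gpt-5", [], lambda m: "openai"))
print(route_responses_request("qwen2.5-coder-7b", [{"type": "function"}], lambda m: "local"))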

Capabilities

  • Multimodal input (V2) — input_image and input_file (UTF-8 only) parts in the structured input array. Binary file payloads (PDF, docx, image bytes via file_data) are rejected with a clear hint pointing at input_image.
  • Function tools — tools definitions and tool_choice translate to OpenAI Chat Completions tool semantics; assistant tool_calls map back to function_call output items.
  • Streaming SSE (M6 + V1) — stream=true emits the full Responses event sequence (response.created → response.in_progress → response.output_item.added → response.content_part.added → per-delta response.output_text.delta → response.output_text.done → response.content_part.done → response.output_item.done → response.completed). The V1 fix shipped on 2026-04-25 cuts first-token latency by emitting created and in_progress before model warmup instead of after.
  • Persistence (M7) — store=true (the OpenAI default) writes the full response object to redb with a 30-day TTL. previous_response_id (M8) chains follow-up requests by prepending the prior turn's messages before the new input.
  • Background mode (M9 + V8) — background=true returns HTTP 202 with a Location: /v1/responses/{id} header; the client polls or, with background=true && stream=true, opens a resumable SSE connection at GET /v1/responses/{id}?stream=true that replays buffered events and then tails the live producer.

Validation (ingress)

The handler runs validate_responses_ingress BEFORE any routing decision so the cloud-proxy and Anthropic-bridge paths can't forward attacker-sized strings to upstream providers (where they'd burn quota or land in log lines). Caps:

Field | Limit
model | 1..=256 chars
previous_response_id | ≤64 ASCII alphanumeric (_ / - allowed); generation format is resp_<32-hex>
instructions | ≤2 MB
user | ≤256 chars
truncation, service_tier | ≤64 chars each
metadata | ≤64 KB total (keys + values)

Stop / temperature / top_p / max_tokens are clamped or validated at the sampling-params layer.

Dashboard

The admin dashboard exposes a Responses tab (/admin/responses) backed by GET /api/admin/responses. It shows the most-recent stored response records with status, model, input snippet, and per-record cancel/delete actions.

Deferred

  • POST /v1/responses/compact (V9) — no concrete caller has asked for it.
  • Token-level cancel for background inference — current cancel flips a flag checked at completion time; per-token interruption needs hooks in chat_completions that are out of v2 plan scope.
  • Server-side conversation resource CRUD — OpenAI's conversation parameter forwards through cloud proxy verbatim today; a local conversation type with its own endpoints is a separate design.
  • Built-in tools on the local path — see "Local inference" above.
  • custom tools with Lark / regex grammars — rejected on local, forwarded on cloud. Local grammar-constrained generation is a candle-side project.
  • Audio input on /v1/responses — input_audio returns 400; needs a Whisper-class transcription model SwarmLLM doesn't currently expose.
  • Binary file inputs in input_file (file_data) — UTF-8 only; PDF/docx/image-bytes payloads are rejected with a clear hint pointing at input_image (for images) or server-side text extraction.

Anthropic Messages API

SwarmLLM provides a full Anthropic Messages API at POST /v1/messages, enabling it to serve as a drop-in backend for Claude Code and other Anthropic-compatible clients.

Claude Code Integration

Use SwarmLLM as your Claude Code backend to access all models (local, network, and cloud) through a single endpoint:

ANTHROPIC_BASE_URL=http://localhost:8800 claude --model qwen2.5-coder-7b

Environment Variables

Variable | Description
ANTHROPIC_BASE_URL | Point to your SwarmLLM node (e.g., http://localhost:8800)
ANTHROPIC_AUTH_TOKEN | Your node's API key (from Settings or /api/admin/api-key)
ANTHROPIC_MODEL | Default model to use

POST /v1/messages

Request Body

Field | Type | Required | Description
model | string | yes | Model name (local GGUF, network model, or cloud model like gpt-4o)
messages | array | yes | Chat messages with role + content
max_tokens | integer | yes | Maximum tokens to generate (clamped to 1–32768)
system | string or array | no | System prompt (supports cache_control blocks)
stream | boolean | no | Enable SSE streaming
temperature | float | no | Sampling temperature
top_p | float | no | Nucleus sampling
stop_sequences | array | no | Stop sequences, 1–256 chars each, max 16
tools | array | no | Tool definitions for function calling
tool_choice | object | no | Tool selection strategy
metadata | object | no | Request metadata
thinking | object | no | Extended thinking configuration

Content Block Types

Messages can contain these content block types:

// Text
{"type": "text", "text": "Hello, world!"}

// Image (base64)
{"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "..."}}

// Tool use (assistant response)
{"type": "tool_use", "id": "toolu_123", "name": "get_weather", "input": {"location": "NYC"}}

// Tool result (user message)
{"type": "tool_result", "tool_use_id": "toolu_123", "content": "72F, sunny"}

// Thinking (extended thinking)
{"type": "thinking", "thinking": "Let me reason about this..."}

// Redacted thinking
{"type": "redacted_thinking", "data": "..."}

Response

{
  "id": "msg_abc123",
  "type": "message",
  "role": "assistant",
  "model": "qwen2.5-coder-7b",
  "content": [
    {"type": "text", "text": "Here's my response..."}
  ],
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 25,
    "output_tokens": 150
  }
}

Model Routing

Requests are routed based on the model name:

Model Pattern | Route | Details
Local GGUF model | Local inference | Tool calls and thinking blocks converted to text
claude-* | Anthropic API | Full pass-through (all fields preserved including tools and thinking)
gpt-*, o1-*, o3-*, o4-* | OpenAI | Anthropic→OpenAI format translation
deepseek-* | DeepSeek | Anthropic→OpenAI format translation
mistral-*, codestral-*, pixtral-* | Mistral | Anthropic→OpenAI format translation
llama-*, groq-* | Groq | Anthropic→OpenAI format translation
nim-* | NVIDIA NIM | Anthropic→OpenAI format translation
cerebras-* | Cerebras | Anthropic→OpenAI format translation
samba-* | SambaNova | Anthropic→OpenAI format translation
fireworks-*, accounts/fireworks/* | Fireworks AI | Anthropic→OpenAI format translation
together-* | Together AI | Anthropic→OpenAI format translation
deepinfra-* | DeepInfra | Anthropic→OpenAI format translation
moonshot-*, kimi-* | Moonshot/Kimi | Anthropic→OpenAI format translation
Network model | Distributed inference | Routed through swarm P2P network

All 12 cloud providers are supported. Configure API keys via the dashboard Settings page or by placing a .env file in the data directory (~/.local/share/swarmllm/.env) with standard variable names (e.g., OPENAI_API_KEY, DEEPSEEK_API_KEY).
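
A minimal sketch of the prefix matching that picks a provider from the model name; the prefix list mirrors the table above, and anything unmatched falls through to local or distributed inference. The function name is illustrative.

PREFIX_ROUTES = [
    (("claude-",), "anthropic"),
    (("gpt-", "o1-", "o3-", "o4-"), "openai"),
    (("deepseek-",), "deepseek"),
    (("mistral-", "codestral-", "pixtral-"), "mistral"),
    (("llama-", "groq-"), "groq"),
    (("nim-",), "nvidia-nim"),
    (("cerebras-",), "cerebras"),
    (("samba-",), "sambanova"),
    (("fireworks-", "accounts/fireworks/"), "fireworks"),
    (("together-",), "together"),
    (("deepinfra-",), "deepinfra"),
    (("moonshot-", "kimi-"), "moonshot"),
]

def route_model(model: str) -> str:
    for prefixes, provider in PREFIX_ROUTES:
        if model.startswith(prefixes):
            return provider
    return "local-or-swarm"     # local GGUF or distributed network model

print(route_model("claude-sonnet-4-6"))   # anthropic
print(route_model("qwen2.5-coder-7b"))    # local-or-swarm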

System Blocks with Cache Control

Anthropic-compatible prompt caching:

{
  "system": [
    {"type": "text", "text": "You are a helpful assistant.", "cache_control": {"type": "ephemeral"}}
  ]
}

Streaming (SSE)

When stream: true, responses arrive as Server-Sent Events following the Anthropic streaming format:

event: message_start
data: {"type":"message_start","message":{"id":"msg_123","type":"message","role":"assistant","model":"qwen2.5-coder-7b","content":[]}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}

event: message_stop
data: {"type":"message_stop"}

MCP Server

SwarmLLM includes a native Model Context Protocol (MCP) server at POST /mcp. This enables AI agents like Claude Code, Cursor, VS Code Copilot, and other MCP-compatible tools to use your SwarmLLM node as a tool provider.

Protocol version: 2025-11-05 (JSON-RPC 2.0 over HTTP).

Endpoint

POST /mcp
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY

All requests use JSON-RPC 2.0 format. All tools include tool annotations (readOnlyHint, destructiveHint, etc.).

Available Tools

chat

Send a message to any model available on the node (local, network, or cloud).

{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "chat",
    "arguments": {
      "model": "qwen2.5-coder-7b",
      "messages": [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Explain Rust's ownership model"}
      ],
      "temperature": 0.7,
      "max_tokens": 2048
    }
  },
  "id": 1
}

models

List all available models (local + network + cloud).

{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": { "name": "models", "arguments": {} },
  "id": 2
}

compare

Send the same prompt to multiple models concurrently and get side-by-side results. Up to 10 models per comparison.

{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "compare",
    "arguments": {
      "prompt": "Write a function to check if a number is prime",
      "models": ["qwen2.5-coder-7b", "gpt-4o", "claude-sonnet-4-20250514"],
      "system": "Write clean, efficient code.",
      "max_tokens": 1024
    }
  },
  "id": 3
}

research

Fan out a research question to multiple models in parallel. Designed for knowledge gathering — offload questions to cheap/fast models to get diverse perspectives without using expensive model tokens. If models is omitted, auto-selects available models (local first, then cloud).

{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "research",
    "arguments": {
      "question": "What are the tradeoffs between ring-allreduce and star topology for tensor parallelism?",
      "models": ["deepseek-chat", "gpt-4o-mini", "qwen2.5-coder-7b"],
      "system": "Be concise and technical.",
      "max_tokens": 2048
    }
  },
  "id": 4
}

Response:

{
  "question": "What are the tradeoffs...",
  "models_queried": 3,
  "successful_responses": 3,
  "total_tokens_used": 1847,
  "results": [
    {
      "model": "deepseek-chat",
      "response": "Ring-allreduce...",
      "input_tokens": 24,
      "output_tokens": 512,
      "latency_ms": 2100,
      "status": "ok"
    }
  ]
}

batch_prompts

Execute multiple independent prompts in parallel, each targeting a specific model. Ideal for offloading parallel subtasks — e.g., ask one model to summarize, another to translate, another to review code, all at once.

{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "batch_prompts",
    "arguments": {
      "tasks": [
        {
          "id": "summary",
          "model": "gpt-4o-mini",
          "prompt": "Summarize this error log: ...",
          "max_tokens": 512
        },
        {
          "id": "fix",
          "model": "qwen2.5-coder-7b",
          "prompt": "Write a fix for this bug: ...",
          "max_tokens": 1024
        },
        {
          "id": "translate",
          "model": "deepseek-chat",
          "prompt": "Translate to Japanese: ...",
          "max_tokens": 256
        }
      ]
    }
  },
  "id": 5
}

Response:

{
  "tasks_submitted": 3,
  "tasks_completed": 3,
  "results": [
    {
      "task_id": "summary",
      "model": "gpt-4o-mini",
      "content": "The error log shows...",
      "latency_ms": 890,
      "status": "ok"
    }
  ]
}

delegate

Offload a task to the most appropriate model based on a tier preference. Tiers: fast picks the lowest-latency local model, cheap picks a small/free model, smart picks the most capable available model (may use cloud). Saves subscription tokens by routing routine work to local/cheap models.

{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "delegate",
    "arguments": {
      "prompt": "Summarize this function in one sentence: ...",
      "tier": "fast",
      "max_tokens": 256
    }
  },
  "id": 6
}

Tiers:

  • fast — lowest-latency local model (default)
  • cheap — smallest/free model available
  • smart — most capable model (may use cloud provider)

node_info

Get detailed information about the SwarmLLM node: loaded models, connected peers, credit balance, available cloud providers, and network status.

{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": { "name": "node_info", "arguments": {} },
  "id": 6
}

Available Resources

swarmllm://status

Returns node status information (version, model loaded, peer count).

{
  "jsonrpc": "2.0",
  "method": "resources/read",
  "params": { "uri": "swarmllm://status" },
  "id": 7
}

IDE Integration

Claude Code

Option A: MCP tools — access SwarmLLM's tools (research, batch, compare) alongside your normal model:

claude mcp add --transport http swarmllm http://localhost:8800/mcp \
  --header "Authorization: Bearer YOUR_API_KEY"

Option B: Model backend — use SwarmLLM as your inference backend (routes all requests through the swarm):

ANTHROPIC_BASE_URL=http://localhost:8800 ANTHROPIC_AUTH_TOKEN=YOUR_API_KEY \
  claude --model qwen2.5-coder-7b

Option C: Both — use Claude for reasoning, SwarmLLM MCP for offloading research to cheap models:

# Add SwarmLLM as MCP server
claude mcp add --transport http swarmllm http://localhost:8800/mcp \
  --header "Authorization: Bearer YOUR_API_KEY"

# Then use Claude normally — it can call research/batch/compare tools via MCP
claude

VS Code (Copilot Chat)

Add to .vscode/mcp.json in your project:

{
  "servers": {
    "swarmllm": {
      "type": "http",
      "url": "http://localhost:8800/mcp",
      "headers": {
        "Authorization": "Bearer YOUR_API_KEY"
      }
    }
  }
}

Copilot Chat will discover SwarmLLM's tools automatically. Use them by asking Copilot to research, compare models, or batch prompts.

Cursor / Windsurf / Other MCP Clients

Any MCP-compatible client can connect via HTTP:

URL: http://localhost:8800/mcp
Transport: HTTP (Streamable HTTP)
Auth: Bearer token in Authorization header

Continue.dev (OpenAI API)

If your IDE extension supports the OpenAI API format, point it directly at SwarmLLM:

{
  "models": [{
    "title": "SwarmLLM Local",
    "provider": "openai",
    "model": "qwen2.5-coder-7b",
    "apiBase": "http://localhost:8800/v1",
    "apiKey": "YOUR_API_KEY"
  }]
}

Model Compare Dashboard

The compare functionality is also available in the web dashboard via the Compare tab. Select 2-10 models, enter a prompt, and view results side-by-side with latency, token counts, and response content.

Admin API

Admin endpoints are CORS-protected. Most read-only endpoints don't require Bearer auth; write operations do.

Node Management

GET /api/admin/stats

Node statistics and hardware info.

GET /api/admin/peers

Connected peers with latency, trust scores, and hosted models.

GET /api/admin/credits

Credit balance and tier info.

GET /api/admin/network-map

Geographic distribution of peers and shards across regions. Each entry includes the total peer count for that region, per-model shard-holder counts, per-model request demand rates, coverage gaps (models with zero holders in the region), and per-model replication targets derived from pool size and demand. Includes the local node in its auto-detected or configured region.

Response:

{
  "regions": {
    "US": {
      "total": 3,
      "models": { "tinyllama-1.1b-q4-k-m": 2 },
      "demand": { "tinyllama-1.1b-q4-k-m": 5 },
      "coverage_gaps": [],
      "replication_target": { "tinyllama-1.1b-q4-k-m": 2 }
    }
  }
}

GET/PUT /api/admin/config

Read or update daemon configuration. PUT requires Bearer auth.

POST /api/admin/config/reload

Hot-reload operational parameters without restart. Bearer auth required.

POST /api/admin/shutdown

Gracefully shut down the node. Localhost only, Bearer auth required.

Model Management

GET /api/admin/models

List models with shard status, VRAM estimates, and acquisition state. Each model includes:

  • mmproj field with available (bool), local (bool), and holders (count) for VLM vision encoder status
  • trust_level field: one of "Discovered", "Pinned", "DemandVerified", or "NetworkPopular" indicating the model's trust status (auto-manage only downloads shards for DemandVerified+ or Pinned models)

POST /api/admin/models/{id}/add

Trigger model acquisition from the network.

GET /api/admin/models/{id}/status

Check model acquisition progress.

DELETE /api/admin/models/

Remove model (shards + manifest + state).

DELETE /api/admin/models/{id}/shards/

Delete a single shard.

GET/PUT /api/admin/models/{id}/auto-manage

Per-model auto-manage policy (including prune toggle).

GET/PUT /api/admin/models/{id}/encrypted-pipeline

Per-model encrypted pipeline toggle. GET returns current status, readiness (whether local node holds first + last shard), and overhead note. PUT enables/disables with body {"enabled": true}. Requires the local node to hold shard 0 and the final shard. Returns a warning for 2-shard models (fully local, no distribution benefit). Setting is persisted to the database and survives restarts. Falls back to global encrypted_pipeline config if no per-model override is set.

PUT /api/admin/models/{id}/shards/{index}/lock

Lock/unlock a shard to prevent auto-pruning.

Storage & Shards

POST /api/admin/rescan-shards

Rescan local shard files on disk and update the model registry and network announcements without restarting the daemon. Useful after manually placing shard files in the data directory. Bearer auth required.

Response:

{ "status": "ok", "models_updated": ["model-id-1"], "count": 1 }

GET /api/admin/models/{id}/metadata

Read parsed GGUF metadata from a locally-stored model header (gguf_header.bin). Returns architecture parameters, tokenizer settings, quantization type, and all raw metadata key/value pairs (tokenizer vocabulary arrays are excluded). Returns 400 if no header file exists for the model.

Response shape:

{
  "model_id": "...",
  "general": { "name": "...", "architecture": "llama", "architecture_supported": true, "file_type": 11, "quantization": "Q4_K_M" },
  "model": { "context_length": 4096, "block_count": 32, "embedding_length": 4096, "head_count": 32, "head_count_kv": 8, "rope_dimension_count": 128, "rope_freq_base": 500000.0, "layer_norm_rms_epsilon": 1e-5, "vocab_size": 32000 },
  "tokenizer": { "model": "llama", "pre": "...", "eos_token_id": 2, "bos_token_id": 1, "padding_token_id": null },
  "tensors": { "count": 291, "data_offset": 131072 },
  "raw": [{ "key": "general.architecture", "value": "llama" }, ...]
}

POST /api/admin/models/{id}/shards/{index}/download

Trigger a P2P download of a specific shard that is not yet held locally. The daemon first checks for P2P peers that hold the shard (picking the best peer by LAN-proximity, latency, and trust), then falls back to returning HuggingFace source info if no peers are available. Bearer auth required.

Responses:

  • { "status": "already_local", ... } — shard is already on disk
  • { "status": "downloading", "source": "p2p", "peer": "...", ... } — P2P download started
  • { "status": "use_hf", "source": "huggingface", "repo_id": "...", "filename": "...", ... } — no P2P peers, use hf/download-shards instead
  • 400 if no peers and no HuggingFace source known

POST /api/admin/models/{id}/shards/{index}/unload

Unload a single shard from memory (VRAM/RAM) without deleting the file from disk. Narrows the model's shard window to exclude this shard and restarts the worker subprocess. If this is the last loaded shard, the model is fully unloaded. Bearer auth required.

Response:

{ "status": "unloaded", "model_id": "...", "shard_index": 0, "remaining_loaded": [1, 2] }

POST /api/admin/models/{id}/shards/{index}/load

Load a shard that is on disk into memory. The shard must already be present locally (use /download first if not). Expands the model's shard window to include the shard and restarts the worker subprocess. Bearer auth required.

Response:

{ "status": "loaded", "model_id": "...", "shard_index": 0, "loaded_shards": [0, 1, 2] }

POST /api/admin/models/{id}/unload

Unload an entire model from memory (VRAM/RAM) without deleting any files from disk. Evicts all split-model entries, kills the worker subprocess, clears GGUF metadata cache, and clears the loaded-model record. Bearer auth required.

Response:

{ "status": "unloaded", "model_id": "...", "model_name": "...", "segments_removed": 2, "estimated_freed_mb": 4096 }

GET /api/admin/shard-storage

Per-model storage breakdown, disk and VRAM usage.

GET /api/admin/prune-history

Recent auto-prune events.

GET/PUT /api/admin/schedule

Resource schedule management.

HuggingFace Integration

GET /api/admin/hf/search?query=...

Search HuggingFace for GGUF models. Returns results grouped by repository with quantization variants, recommended variant, and VRAM fitness indicator.

Response format:

[{
  "repo_id": "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
  "downloads": 50000,
  "likes": 120,
  "variants": [
    { "filename": "...Q4_K_M.gguf", "size_bytes": 668000000, "quant": "Q4_K_M" },
    { "filename": "...Q8_0.gguf", "size_bytes": 1100000000, "quant": "Q8_0" }
  ],
  "recommended_variant": "Q4_K_M",
  "fits_vram": true
}]

GET /api/admin/hf/probe?repo_id=...&filename=...

Probe a remote GGUF file (size, shard layout).

POST /api/admin/hf/download-shards

Download specific shard indices from HuggingFace. Bearer auth required.

Supports peer_fair_share: true for smart distribution — the backend computes a deterministic fair share of shards using BLAKE3(node_id || model_id), and peers with auto-manage enabled auto-acquire the rest.

curl -X POST http://localhost:8800/api/admin/hf/download-shards \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"repo_id": "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF", "filename": "qwen2.5-coder-7b-instruct.Q4_K_M.gguf", "peer_fair_share": true}'

GET /api/admin/hf/source/

Look up the HuggingFace source (repo + filename) for a locally-known model. First checks the in-memory source cache and the probe cache, then auto-discovers by searching HuggingFace if neither has an entry. If found via auto-discovery the result is cached to the database and hf_source.json in the model directory.

Response:

{ "model_id": "...", "repo_id": "TheBloke/TinyLlama-...-GGUF", "filename": "tinyllama-...Q4_K_M.gguf" }

GET /api/admin/downloads

List the download queue with per-shard progress, speed, and source.

POST /api/admin/downloads/{model_id}/cancel

Cancel an in-progress download.

LoRA Adapters

GET /api/admin/adapters

List all registered LoRA adapters with their metadata (id, name, base model, rank, alpha, path).

Response: { "adapters": [ { "id": "...", "name": "...", "base_model": "...", "rank": 16, "alpha": 32.0, "path": "..." } ] }

POST /api/admin/adapters

Register a LoRA adapter from a safetensors file. Bearer auth required. Path traversal is blocked. If id is omitted, a UUID is generated.

Request body:

{ "id": "my-adapter", "name": "My Adapter", "base_model": "tinyllama-...", "rank": 16, "alpha": 32.0, "path": "adapters/my-adapter.safetensors" }

path may be absolute or relative to <data_dir>/adapters/.

Response: { "status": "ok", "adapter": { ... } }

DELETE /api/admin/adapters/

Unregister a LoRA adapter. Does not delete the file from disk. Bearer auth required. Returns 400 if the id is not found.

Response: { "status": "ok", "message": "Adapter 'my-adapter' removed" }

Cloud Providers

GET /api/admin/providers

List configured cloud providers (name + configured flag, no keys exposed).

PUT /api/admin/providers

Update cloud provider API keys. Bearer auth required. Keys are encrypted at rest.

GET /api/admin/provider-models

List available models from all configured cloud providers. Results are cached for 60 seconds; stale results are returned immediately and refreshed in the background. Includes models from OpenAI, Anthropic (static list), DeepSeek, Mistral, Groq, NVIDIA NIM, Cerebras, SambaNova, Fireworks, Together AI, DeepInfra, and Moonshot/Kimi.

Response: { "models": [ { "id": "gpt-4o", "name": "GPT-4o", "provider": "openai" } ] }

GET /api/admin/provider-health

Probe each configured provider by sending a tiny max_tokens=1 inference request (using a suitable test model per provider). All probes run in parallel with a connect timeout.

Response:

{ "providers": [ { "provider": "openai", "status": "up", "latency_ms": 320, "detail": "" } ] }

Status values: up, rate_limited, overloaded, timeout, unreachable, error_<code>.

POST /api/admin/provider-model-status

Probe availability and latency for a list of specific cloud model IDs (up to 20 per request). Sends a max_tokens=1 request to each model's provider endpoint. Anthropic models are skipped (no cloud proxy probing). Bearer auth not required.

Request body: { "models": ["gpt-4o", "claude-sonnet-4-6", "deepseek-chat"] }

Response:

{ "models": [ { "model": "gpt-4o", "status": "up", "latency_ms": 210 } ] }

Status values: up, rate_limited, not_found, unavailable, timeout, error.

Claude Subscription (feature-gated)

Requires building with --features claude-subscription. When the feature is not enabled, these endpoints return {"error": "claude-subscription feature not enabled"}.

GET /api/admin/claude-subscription/status

Detect whether the claude CLI is installed and authenticated on this machine. Reads version from claude --version and subscription info from ~/.claude/.credentials.json (read-only).

Response:

{
  "cli_installed": true,
  "cli_version": "2.1.92 (Claude Code)",
  "authenticated": true,
  "subscription_type": "max",
  "rate_limit_tier": "default_claude_max_5x"
}

PUT /api/admin/providers (claude_subscription_enabled field)

Enable or disable the Claude subscription provider. Pass claude_subscription_enabled alongside other provider key updates.

{ "claude_subscription_enabled": true }

When enabled, claude-* model requests are routed through the local CLI subprocess instead of the Anthropic API key. When disabled, requests fall back to the Anthropic API key (if configured).

Updates

GET /api/admin/version

Current binary version info.

POST /api/admin/update/check

Check for available updates. Returns version info and changelog if update available.

POST /api/admin/update/apply

Download and apply an update. Bearer auth required.

Discovery

GET /api/admin/network-code

Get an encrypted shareable invite code and network phase. The code embeds the node's TCP listening address encrypted with ChaCha20Poly1305 — the IP is not visible in the code.

POST /api/admin/join-network

Join the network via encrypted invite code (swarm://...) or raw multiaddr. Immediately dials the peer and saves the address to the peer cache.

Responses API listing

GET /api/admin/responses

List stored Responses-API records (backs the dashboard's Responses tab). Optional query params: ?limit=N (cap on returned records, default 100, max 500) and ?status=... (filter by completed / in_progress / cancelled / failed / queued). See Responses API for the user-facing surface.

Authentication

GET /api/admin/api-key

Retrieve the API key. Bearer auth required.

WebSocket

GET /api/admin/ws

WebSocket for live updates. Pushes the following event types:

Event | Trigger | Data
activity_event | Any subsystem event | kind, model_id, message, timestamp, toast_level
stats_update | Every 2s | Peer count, credits, acquisitions, shard registry, swarm_capacity (R110), wishlist (R111)
peer_list | Peer connect/disconnect | Full peer snapshot
models_changed | Shard download/load/prune | (none — signals dashboard to refresh)
update_available | New version detected | Version info, changelog

Claude Subscription Provider

Use your existing Claude Pro, Max, Team, or Enterprise subscription to access Claude models through SwarmLLM — no API key or per-token charges needed.

Feature-gated: Build with --features claude-subscription to enable. This feature is isolated behind a compile-time flag for easy removal.

How It Works

When enabled, SwarmLLM spawns the claude CLI as a subprocess for each Claude model request:

Client Request (OpenAI or Anthropic format)
  → SwarmLLM API (openai.rs / anthropic/mod.rs)
    → Provider resolution: model starts with "claude-"
      → Claude subscription enabled? → Spawn subprocess
      → Else: use Anthropic API key (existing behavior)
    → claude -p --output-format stream-json --model <model> "<prompt>"
    → Parse NDJSON → Translate to API format → Return response

Both the OpenAI (/v1/chat/completions) and Anthropic (/v1/messages) endpoints are supported, with streaming and non-streaming modes.

Setup

1. Install the Claude CLI

npm install -g @anthropic-ai/claude-code

2. Log in with your subscription

claude login

This opens a browser window. Sign in with your Claude Pro/Max/Team/Enterprise account.

3. Build SwarmLLM with the feature

cargo build --no-default-features --features dev,claude-subscription

4. Enable via the dashboard

Open Settings → Cloud Providers → Claude Subscription, click "Check Status" to verify your CLI is detected, then enable the toggle.

Or via API:

curl -X PUT http://localhost:8800/api/admin/providers \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"claude_subscription_enabled": true}'

5. Send requests

# OpenAI format
curl http://localhost:8800/v1/chat/completions \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

# Anthropic format
curl http://localhost:8800/v1/messages \
  -H "x-api-key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "max_tokens": 100,
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

Multi-Turn Conversations

Multi-turn conversations work by serializing the full message history into the prompt on each request. The format uses XML tags that Claude understands natively:

  • System messages → <system>...</system>
  • Assistant messages → <previous_response>...</previous_response>
  • User messages → bare text

This is the same stateless approach used by OpenAI-compatible APIs — the client sends the full conversation every time, and the server doesn't maintain session state.
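
As an illustration (the real serializer may differ in whitespace and ordering), a short two-turn exchange would flatten to roughly:

<system>You are a concise assistant.</system>
<previous_response>Hi! How can I help?</previous_response>
What's the capital of France?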

Configuration

All configuration is in the providers.claude_subscription section, manageable via the admin API or dashboard:

Field | Default | Description
enabled | false | Route Claude requests through the CLI
claude_binary | "claude" | Path to the claude binary
default_model | (from request) | Override model for all requests
max_concurrent | 3 | Max concurrent subprocess invocations
timeout_secs | 300 | Timeout per request (seconds)
working_dir | (system temp) | Working directory for the subprocess

Working Directory

By default, the subprocess runs in the system temp directory to avoid loading project-specific CLAUDE.md files, hooks, and MCP servers. Set working_dir to a project path if you want Claude to have project context for its responses.

Routing Priority

When a claude-* model is requested:

  1. Claude subscription (if enabled and CLI detected) — subprocess path, uses subscription
  2. Anthropic API key (if configured) — direct API proxy, pay-per-token
  3. Error — no provider available

The subscription provider takes priority over the API key. Disable the subscription toggle to fall back to API key billing.

Rate Limits

Subscription rate limits are enforced per rolling 5-hour window (not per-minute request limits like API keys). The concurrency limiter (default 3) prevents spawning too many concurrent processes. Community reports suggest ~3-5 parallel Opus sessions before degradation.

Rate limit info is returned in the NDJSON output and logged. The GET /api/admin/claude-subscription/status endpoint shows the current rate limit tier.

Removal

If this feature needs to be removed:

git rm src/api/claude_sub.rs
# Remove "claude-subscription = []" from Cargo.toml
grep -rn 'claude.subscription\|claude_sub' src/ frontend/
# Remove the ~6 #[cfg] blocks found by grep

Single commit, clean removal. No deep dependencies on the rest of the codebase.

Identity & Device Pool API

Identity

GET /api/identity/nickname

Get the current node's nickname.

PUT /api/identity/nickname

Set a nickname. Body: {"nickname": "my-node"}

DELETE /api/identity/nickname

Clear the nickname.

GET /api/identity/leaderboard

Network-wide credit leaderboard.

GET /api/identity/peers

Peer identity directory (nicknames, regions, tiers).

Device Pools ("My Devices")

Link multiple devices owned by the same user. Credits earned by all linked devices are combined into one balance on the main (owner) device.

Terminology: "Linked Devices" in the UI. This is different from connecting to the SwarmLLM network — linking devices groups your own hardware, while the network connects you with other people.

Quick Start (CLI)

# On your main device:
swarmllm pool create --name "My Devices"
swarmllm pool invite-code
# → A3F7K2M9

# On each other device:
swarmllm pool join A3F7K2M9

# Check status:
swarmllm pool status

Invite Code System

Instead of exchanging raw 64-character node IDs, device pools use 8-character invite codes (e.g., A3F7K2M9):

  1. Owner generates a code → POST /api/pool/generate-code
  2. Code shared verbally, via QR, or copy-paste
  3. Member enters code → POST /api/pool/join → broadcasts join request over gossip
  4. Owner's node auto-validates code and creates invitation
  5. Member auto-accepts → pool established

Security: Codes use a 32-character alphabet (no 0/O/1/I), are one-time use, expire in 24h, and the code itself is never transmitted over the network — only its BLAKE3 hash.
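
The same flow over the raw API (Bearer auth assumed to match the rest of the HTTP surface; bodies as documented below):

# Owner: create the pool, then mint a code
curl -X POST http://localhost:8800/api/pool/create \
  -H "Authorization: Bearer <owner-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"name": "My Devices"}'

curl -X POST http://localhost:8800/api/pool/generate-code \
  -H "Authorization: Bearer <owner-api-key>"
# → {"code": "A3F7K2M9"}

# Member: join with the code
curl -X POST http://localhost:8800/api/pool/join \
  -H "Authorization: Bearer <member-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"code": "A3F7K2M9"}'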

API Endpoints

GET /api/pool/state

Current pool membership state. Returns in_pool, member list with device names, online status, per-device stats, credit split percentage.

POST /api/pool/create

Create a new device pool. Body: {"name": "My Devices"}

POST /api/pool/generate-code

Generate an invite code (owner only). Returns: {"code": "A3F7K2M9"}. Max 5 active codes.

POST /api/pool/join

Join a pool using an invite code. Body: {"code": "A3F7K2M9"}

POST /api/pool/invite

Invite a specific node by ID (advanced). Body: {"node_id": "abc123..."}

POST /api/pool/accept

Accept a pool invitation. Body: {"invitation_id": "..."}

POST /api/pool/remove

Remove a member (owner only). Body: {"node_id": "..."}

POST /api/pool/leave

Leave the current pool.

POST /api/pool/device-name

Set this device's nickname. Body: {"name": "Gaming PC"}

PUT /api/pool/credit-split

Set credit split percentage (owner only). Body: {"pct": 20} (0-50)

PUT /api/pool/contribution

Set per-member contribution level override. Body: {"node_id": "...", "level": 75} (integer 0–100)

GET /api/pool/invitations

List pending invitations for this node.

GET /api/pool/leaderboard

Pool member contribution rankings.

GET/PUT /api/admin/pools/:id/rates

Per-pool credit rate overrides.

Private Mode

Restrict inference to your device pool for maximum privacy. Your prompts never leave your devices.

GET /api/pool/private-mode

Current state + coverage summary. Returns enabled, allow_lan, offline_mode, and coverage object.

PUT /api/pool/private-mode

Toggle private mode. Body: {"enabled": true} or {"enabled": true, "offline_mode": true}. Returns coverage summary so the UI can show trade-offs immediately.
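
For example, enabling private mode together with air-gapped offline mode (Bearer auth assumed):

curl -X PUT http://localhost:8800/api/pool/private-mode \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"enabled": true, "offline_mode": true}'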

GET /api/pool/coverage

Per-model coverage breakdown: total_shards, pool_shards, coverage_pct, missing indices, est_download_mb. Also returns disk_budget_mb and disk_used_mb.

Shard Pinning

GET /api/pool/pins

List current shard pins.

POST /api/pool/pin

Pin a model to a specific device (owner only). Body: {"model_id": "...", "target_node_id": "hex..."}. Optional shard_indices array for specific shards (empty = all shards).
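
For example, pinning just shards 0 and 1 of a model to one device (IDs are placeholders; Bearer auth assumed):

curl -X POST http://localhost:8800/api/pool/pin \
  -H "Authorization: Bearer <owner-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"model_id": "<model-id>", "target_node_id": "<hex-node-id>", "shard_indices": [0, 1]}'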

DELETE /api/pool/pin

Remove a shard pin. Same body format as POST.

Pool Features

  • Device nicknames: Name each device for easy identification
  • Online/offline status: Tracked via health pings, displayed with last-seen timestamps
  • Per-device stats: VRAM, shards hosted, forwards served, uptime, models hosted
  • Combined VRAM: Aggregate GPU memory across all linked devices
  • Credit split: Owner configures what percentage (0-50%) members keep vs forward
  • Private Mode: Restrict inference to pool devices only. Toggle via UI or API
  • Shard Pinning: Assign specific models to specific devices. Auto-manage respects pins
  • Offline Mode: Air-gapped LAN operation with mDNS-only discovery
  • Coverage Dashboard: Per-model availability bars showing pool shard coverage
  • Max 10 devices per pool (configurable), 10 pool operations per hour rate limit

Pool Security

  • Invite codes: 32^8 ≈ 1.1 trillion combos, one-time use, 24h expiry
  • Join requests signed with Ed25519 (transport-layer sender authentication)
  • Credit forwarding uses dual-signed PoolCreditForward (member + owner)
  • Member removal requires Ed25519-signed removal notice with replay protection
  • Pool state gossip verifies each member's acceptance signature
  • Blinded invitation broadcast (SEC-M18): network observers can't see who's invited

Prometheus Metrics

SwarmLLM exposes a Prometheus-compatible metrics endpoint at GET /metrics. No authentication required (standard convention for metrics endpoints).
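
A quick way to eyeball the exporter output:

curl -s http://localhost:8800/metrics | grep '^swarmllm_'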

Available Metrics

Core Metrics

Metric | Type | Description
swarmllm_peers_connected | gauge | Number of connected peers
swarmllm_inference_requests_total | counter | Total inference requests processed
swarmllm_credits_balance | gauge | Current credit balance
swarmllm_shards_hosted | gauge | Number of locally hosted shards
swarmllm_inference_latency_seconds | histogram | Inference request latency

Channel Metrics

Internal channel health metrics for monitoring backpressure:

Metric | Type | Description
swarmllm_channel_capacity{channel="..."} | gauge | Channel buffer capacity
swarmllm_channel_sent_total{channel="..."} | counter | Messages sent through channel
swarmllm_channel_dropped_total{channel="..."} | counter | Messages dropped due to backpressure

Histogram Buckets

The latency histogram uses these bucket boundaries (in seconds): 0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, +Inf

Scraping Configuration

Add to your prometheus.yml:

scrape_configs:
  - job_name: "swarmllm"
    static_configs:
      - targets: ["localhost:8800"]

Example Queries

# Request rate (requests per second over 5 minutes)
rate(swarmllm_inference_requests_total[5m])

# P50 latency
histogram_quantile(0.50, rate(swarmllm_inference_latency_seconds_bucket[5m]))

# P99 latency
histogram_quantile(0.99, rate(swarmllm_inference_latency_seconds_bucket[5m]))

# Average latency
rate(swarmllm_inference_latency_seconds_sum[5m]) / rate(swarmllm_inference_latency_seconds_count[5m])

Health Check

GET /health/ready

Readiness probe returning subsystem status. Returns 200 when ready, 503 otherwise. No auth required.

{
  "ready": true,
  "subsystems": {
    "network": true,
    "inference_router": true,
    "api_server": true,
    ...
  }
}
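
Because the probe needs no auth and answers with 200/503, it slots straight into scripts and container healthchecks, e.g.:

curl -sf http://localhost:8800/health/ready > /dev/null || echo "node not ready"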

Deployment Guide

Single Node

The simplest deployment — just run the binary:

./swarmllm run

This starts the daemon on port 8800 with default settings.

Production Configuration

For production use, create a config file:

[node]
listen_port = 8800
contribution = "maximum"

[resources]
max_gpu_vram_mb = 0        # Auto-detect
max_disk_mb = 100000       # 100 GB

[inference]
gpu_layers = 99            # Offload all layers to GPU
max_concurrent_requests = 20
max_batch_size = 4
session_timeout_seconds = 600

[auto_manage]
enabled = true
max_storage_mb = 50000
max_concurrent_downloads = 5

[logging]
level = "info"
format = "json"            # Structured logs for production
file = "/var/log/swarmllm.log"

[ui]
open_browser_on_start = false

[identity]
region = "US"

Systemd Service

Create /etc/systemd/system/swarmllm.service:

[Unit]
Description=SwarmLLM P2P Inference Node
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=swarmllm
ExecStart=/usr/local/bin/swarmllm run --config /etc/swarmllm/config.toml
Restart=on-failure
RestartSec=10
LimitNOFILE=65536

# Security hardening
NoNewPrivileges=yes
ProtectSystem=strict
ReadWritePaths=/var/lib/swarmllm /var/log

[Install]
WantedBy=multi-user.target

Then enable and start the service:

sudo systemctl enable --now swarmllm

Docker

# Download compose file and env template
curl -LO https://raw.githubusercontent.com/enapt/SwarmLLM/main/docker-compose.yml
curl -LO https://raw.githubusercontent.com/enapt/SwarmLLM/main/.env.example
cp .env.example .env

# CPU
docker compose up -d

# GPU (requires NVIDIA Container Toolkit)
docker compose --profile gpu up -d

Pre-built Images

Image | Description
ghcr.io/enapt/swarmllm:latest | CPU-only (Debian bookworm-slim)
ghcr.io/enapt/swarmllm:latest-cuda | NVIDIA GPU (CUDA 12.4 runtime)

Versioned tags follow semver: 0.1.0, 0.1.0-cuda, 0.1, 0.1-cuda.

Manual Docker Run

# CPU
docker run -d \
  --name swarmllm \
  --restart unless-stopped \
  -p 8800:8800/tcp \
  -p 8810:8810/tcp \
  -p 8800:8800/udp \
  -v swarmllm-data:/data \
  -v /path/to/models:/data/models \
  --env-file .env \
  ghcr.io/enapt/swarmllm:latest

# GPU
docker run -d \
  --gpus all \
  --name swarmllm \
  --restart unless-stopped \
  -p 8800:8800/tcp \
  -p 8810:8810/tcp \
  -p 8800:8800/udp \
  -v swarmllm-data:/data \
  -v /path/to/models:/data/models \
  --env-file .env \
  ghcr.io/enapt/swarmllm:latest-cuda

Build from Source

# CPU
docker build -t swarmllm .

# CUDA
docker build -f Dockerfile.cuda -t swarmllm:cuda .

Multi-Node Dev Cluster

For development and testing, a 3-node compose file is available:

docker compose -f docker-compose.dev.yml up

Nodes are at localhost:8800, localhost:8801, localhost:8802. Add GPU support:

docker compose -f docker-compose.dev.yml -f docker-compose.cuda.dev.yml up

Multi-Node Cluster

Same LAN

Nodes on the same network discover each other automatically via mDNS. Just start multiple instances on different ports:

# Node 1
./swarmllm run -p 8800

# Node 2
./swarmllm run -p 8801 -d ~/.local/share/swarmllm-node2

Across Networks

Use bootstrap peers or invite codes:

# Node 1 (get its address from the dashboard or logs)
./swarmllm run

# Node 2 (connect to Node 1)
./swarmllm run --bootstrap "/ip4/NODE1_IP/udp/8800/quic-v1/p2p/PEER_ID"

Split Inference Cluster

For a dedicated split-inference setup across multiple machines:

# Machine A: shards 0-3
./swarmllm run --shards "0-3" --bootstrap "/ip4/MACHINE_B/udp/8800/quic-v1/p2p/..."

# Machine B: shards 4-7
./swarmllm run --shards "4-7" --bootstrap "/ip4/MACHINE_A/udp/8800/quic-v1/p2p/..."

Firewall

Open TCP port 8800 (HTTP API), TCP port 8810 (P2P), and optionally UDP port 8800 (QUIC):

# Linux (ufw)
sudo ufw allow 8800/tcp    # HTTP API
sudo ufw allow 8810/tcp    # P2P (Noise+Yamux, primary transport)
sudo ufw allow 8800/udp    # P2P (QUIC, optional)

# Linux (iptables)
sudo iptables -A INPUT -p tcp --dport 8800 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 8810 -j ACCEPT
sudo iptables -A INPUT -p udp --dport 8800 -j ACCEPT

Reverse Proxy (Optional)

If you want to put the HTTP API behind nginx:

server {
    listen 443 ssl;
    server_name swarmllm.example.com;

    location / {
        proxy_pass http://127.0.0.1:8800;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Note: The reverse proxy only handles HTTP traffic. P2P traffic (TCP port 8810 and QUIC/UDP port 8800) must still be reachable directly.

Cloud Provider API Keys

To use cloud model fallback, configure provider API keys via:

  1. Dashboard: Settings page in the web UI
  2. Environment file: Place a .env file in the data directory with standard variable names:
# ~/.local/share/swarmllm/.env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
DEEPSEEK_API_KEY=sk-...
MISTRAL_API_KEY=...
GROQ_API_KEY=gsk_...
NVIDIA_API_KEY=nvapi-...
CEREBRAS_API_KEY=...
SAMBANOVA_API_KEY=...
FIREWORKS_API_KEY=...
TOGETHER_API_KEY=...
DEEPINFRA_API_KEY=...
MOONSHOT_API_KEY=...
  3. Shell environment: Export the same variables before starting the daemon

Performance & Inference Speedups

SwarmLLM's distributed inference path ships with a stack of optimizations that are on by default — you get them without touching your config. This chapter names each one, explains what it does, and shows the measured win so you can tell which levers matter for your workload.

A few are flag-gated because the win is workload-dependent or the path is still being hardened; those are documented at the bottom so you can turn them on intentionally.

The full design notes live in docs/plans/archive/distributed_inference_speedup.md with benchmark recipes in docs/plans/benchmarks/.

The default-on stack

Continuous batching

Concurrent /v1/chat/completions requests for the same model share one forward pass per decode tick instead of running serially. GPU builds use a fused forward_batch kernel; CPU workers fall through to sequential with no regression.

  • Measured: 1.34–1.55× GPU throughput at batch 2–8 on RTX 3070 + TinyLlama Q4
  • Config: inference.continuous_batching = true (default)

Remote-generate fast path

For single-segment distributed inference (the common case: one remote node owns the whole model, requester does embedding + sampling), skip the per-token coordinator round-trips and run the decode loop end-to-end on the remote worker. Tokens stream back as they're sampled.

  • Measured: 1.93× decode speedup
  • Config: default-on — no flag, triggered automatically on single-segment pipelines

Cross-request prefix cache

Each worker keeps an LRU cache of prefill KV snapshots keyed by the prompt's token prefix. A re-submission with the same system prompt (different user turn) skips prefill for the shared prefix and only forwards the suffix.

  • Measured: 29.4× wall-clock speedup on re-submission of the same 513-token prompt (single node, TinyLlama)
  • Config: inference.prefix_cache_enabled = true (default), inference.prefix_cache_block_tokens = 64 (default — block granularity), inference.prefix_cache_max_entries = 16 (default — per model)

Batched prefill + chunked prefill

Sarathi-style chunked prefill: a long admission advances by prefill_chunk_tokens (default 128) per decode tick, so new requests don't wait behind a full prior prefill. Phase 4 adds batched_prefill_forward = true (default), which fuses concurrent same-shape prefill chunks into one forward_batch call.

  • Measured (Phases 1+2): 17–23× TTFT fairness at concurrency 2/4/8 on RTX 3070 + TinyLlama Q4 vs serial prefill
  • Measured (Phase 4): 1.57× aggregate tok/s at c=4 with uniform 180/180/180 ms TTFT (vs pre-fix 52/235/447 ms spread)
  • Config: inference.continuous_batching = true, inference.prefill_chunk_tokens = 128, inference.batched_prefill_forward = true (all default)

Cross-node prefix-KV sharing

When node B receives a prompt whose prefix was already prefilled by peer A, B fetches A's KV snapshot over the wire instead of re-prefilling locally. The pipeline is:

A prefills → inserts prefix-cache block → gossips PrefixCacheAnnounce
B receives prompt → local cache miss → probe daemon → walk index
B sends SendPrefixKvFetch to A → A's worker exports snapshot
B verifies BLAKE3 + NaN/Inf → hydrates KV → prefill suffix only

  • Measured (TinyLlama, GPU-GPU): fetched path is ~100 ms slower than local prefill — the 28 MB f32 snapshot takes ~260 ms to ship while the local prefill it replaces is only ~460 ms. TinyLlama is too small to demonstrate the win on localhost + fast GPU.
  • Measured (Qwen2.5-Coder-7B, CPU-CPU): 12.9× TTFT speedup on iter 1 — control full-prefill = 151.7 s, fetched path = 11.8 s. The 73 MB f32 snapshot transfers in ~1 s while 640-token Qwen-7B CPU prefill runs ~150 s.
  • Config: inference.cross_node_prefix_trust_min = 0.5 (default — gates peers by trust score; set to 2.0 to disable the fetch path entirely).

The fetch path uses three chained timeouts (worker probe 3000 ms, daemon network 2500 ms, serving IPC 2000 ms) sized for 7B-class f32 snapshots. Missing the window degrades to a clean miss — no worse than not having the feature. See the two-daemon loopback bench recipe for reproduction details.

Parallax scheduler

Pipeline assignment uses shortest-path dynamic programming over observed per-layer latencies (EMA over recent forwards) rather than a greedy pick-the-closest-peer heuristic. Cross-gossip of top-32 observed latencies via NodeCapability.observed_latencies lets every node keep a current view of the network's compute profile. A soft acquire/prune bias in AutoShardManager driven by a per-shard stability counter (≥3 consistent ticks before it acts) drifts shards toward where they're actually used without violating existing hard constraints.

  • Measured: 10 routing + 7 allocator + 2 scheduler integration tests passing; real-world improvements depend on network heterogeneity. The biggest impact is in asymmetric setups where a cheap peer's low observed latency should beat a high-VRAM peer's big shard slot.
  • Config: default-on. Multi-pipeline concurrency is deferred.

Flag-gated features

Turn these on when you've measured that they match your workload.

Distributed speculative decoding (speculative_distributed)

Draft model proposes γ tokens locally; target verifies all γ in one remote forward pass.

  • Status: End-to-end verified. 40–52% accept rate in a llama-cpp-draft / candle-target pairing (cross-backend numerical mismatch caps accept rate).
  • Config: inference.speculative_distributed = true, inference.draft_model_path = "path/to/draft.gguf", inference.speculative_gamma = 4 (tokens per verify round)

SWIFT self-speculative decoding (swift_self_speculative)

The target model acts as its own draft by skipping a contiguous range of layers on the proposal pass. No external draft model needed.

  • Status: Landed behind flag. Structurally slower than baseline on candle CPU until flash-attn-with-mask lands (attention kernel mismatch on multi-position verify). Shelved on CPU; may help on GPU.
  • Config: inference.swift_self_speculative = true, inference.swift_skip_ratio = 0.45 (fraction of layers to skip on the draft pass)

DSD — decentralized speculative decoding (decentralized_spec_decoding)

Multi-segment distributed inference with speculative decoding woven in. A γ-token decode on the last-segment worker plus KV truncation primitives plus a coordinator loop in pipeline/dsd.rs.

  • Status: All phases landed 2026-04-18 behind flag. End-to-end multi-segment WAN benchmark pending.
  • Config: inference.decentralized_spec_decoding = true

Activation compression Q8_0 (activation_compression)

Intermediate pipeline hidden-state activations are quantized from f16 to Q8_0 before going over the wire. Receivers auto-dispatch on the dtype tag.

  • Status: Codec verified. ~3.76× wire compression, RMS error <0.005. End-to-end multi-segment benchmark pending.
  • Config: inference.activation_compression = true

Persistent pipeline stream (persistent_pipeline_stream)

Replace per-token request/response with one long-lived libp2p bidirectional stream per pipeline session.

  • Status: Landed behind flag. Wire-level verified; no measured latency win because the bottleneck was elsewhere (solved by remote-generate + batched prefill).
  • Config: inference.persistent_pipeline_stream = true

Debugging slow inference

Default verbosity (-v) gives an INFO-level stream. Bump to -vv to see per-request DIAG: logs, which include the per-feature speedup signals:

./swarmllm run -vv 2>&1 | grep "DIAG:"

Key DIAG kinds:

  • DIAG: prefix-cache HIT — local prefix cache hit
  • DIAG: cross-node prefix HIT — cross-node prefix-KV fetch succeeded
  • DIAG: prefix-probe: fetch timed out — cross-node fetch missed the window (see Troubleshooting for timeout sizing on 7B+ models)
  • DIAG: served PrefixKvFetch ... hit=true — this node served a cross-node fetch
  • DIAG: BatchGenerate — batched-prefill slot table activity
  • DIAG: chunk fused batch_size=N — fused prefill chunks (Phase 4)
  • DIAG: Parallax — Parallax scheduler decisions

For the full DIAG taxonomy and what each line means, see docs/DIAGNOSTICS.md.

When should I turn a speedup off?

Almost never. The default-on features degrade cleanly under edge cases — the prefix cache falls through to full prefill on a miss, cross-node fetch falls through to local prefill on a timeout, batched prefill falls back to sequential when concurrency is 1. If you suspect one is the cause of a regression:

  • Prefix cache off: inference.prefix_cache_enabled = false
  • Cross-node fetch off: inference.cross_node_prefix_trust_min = 2.0 (gates every peer out)
  • Continuous batching off: inference.continuous_batching = false (also disables Phase 4 fusion)
  • Phase 4 fusion off, keep continuous batching: inference.batched_prefill_forward = false
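
As a config.toml sketch, the same switches grouped in one place (every key is documented above; the values shown are the "off" settings, not the defaults):

[inference]
prefix_cache_enabled = false
cross_node_prefix_trust_min = 2.0
continuous_batching = false
batched_prefill_forward = false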

Please open an issue if a speedup is costing you — the benchmarks above are RTX 3070 + WSL2 + a specific set of models, so real-world workloads will surface corners the benches miss.

Benchmarking

SwarmLLM ships with a built-in bench command and a set of reproducible recipes under docs/plans/benchmarks/. This chapter covers both.

Quick: swarmllm bench

The bench subcommand runs a real /v1/chat/completions workload against a daemon and reports latency + throughput.

./swarmllm bench \
    --max-tokens 100 \
    --iterations 5 \
    --concurrency 1 \
    --stream \
    --model-id tinyllama-1.1b-chat-v1.0.q4-k-m \
    --json

Key flags:

  • --max-tokens — tokens to generate per request (default 100)
  • --iterations — sequential iterations per concurrency level (default 5)
  • --concurrency — concurrent requests for throughput tests (default 1)
  • --stream — use streaming chat completions and report TTFT (time-to-first-token) per request. TTFT is the signal that captures the batched-prefill and cross-node-fetch wins; non-streaming bench rolls prefill + decode into one total time and hides the difference.
  • --prompt — custom prompt; default is a short prompt about relativity that won't stress prefix caching. Pass a longer prompt (≥500 tokens) to exercise prefix cache paths.
  • --model-id — target a specific model when several are registered; otherwise uses the first one from /v1/models.
  • --json — machine-readable output

The bench reads the API key from the daemon's data dir, so run it with the same SWARMLLM_NODE_DATA_DIR or -d as the daemon.
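
For example, against a daemon started with a custom data dir:

SWARMLLM_NODE_DATA_DIR=/tmp/swarm_a ./swarmllm bench --stream --iterations 5 --max-tokens 100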

Single-node baselines

Reference numbers on an AMD Ryzen 7 5800H + RTX 3070 Laptop 8 GB VRAM (WSL2, release build):

Model | Params | Quant | GPU | CPU
TinyLlama 1.1B Chat | 1.1B | Q4_K_M | 27.2 tok/s | 4.2 tok/s
Gemma-2 2B IT | 2.5B | Q4_K_M | 20.6 tok/s | 3.5 tok/s
Phi-3.5 Mini | 3.8B | Q4_K_M | 46.4 tok/s | 1.8 tok/s
Qwen2.5-Coder 7B | 7.6B | Q4_K_M | 29.0 tok/s | 2.4 tok/s

Single-node numbers are largely about your hardware. The interesting benchmarks are distributed.

Reproducing the performance benchmarks

Each performance optimization has a written benchmark recipe in docs/plans/benchmarks/. Most require two local daemons on loopback; a couple need three.

Batched prefill — TTFT fairness

docs/plans/benchmarks/round4.md

Measures TTFT at concurrency 2/4/8 with Phases 1+2 on vs off. The win is fairness, not aggregate throughput: Sarathi chunked prefill prevents new admits from waiting behind the full prior prefill.

Batched chunked prefill (Phase 4)

docs/plans/benchmarks/round5.md

Measures aggregate tok/s and per-request TTFT spread with batched_prefill_forward on vs off. The on-config fuses concurrent same-shape prefill chunks so TTFT lands tightly clustered instead of spreading.

Cross-node prefix-KV sharing

docs/plans/benchmarks/round6.md

Two-daemon loopback TCP. Measures iter-1 TTFT with the cross-node fetch path enabled vs gated off (via cross_node_prefix_trust_min = 2.0). Same recipe runs against TinyLlama (fast-GPU corner case: fetch is slightly slower than prefill) and Qwen-7B (12.9× TTFT speedup on CPU-CPU because 7B CPU prefill is slow enough that the ~1 s fetch + verify + hydrate buys back ~150 s of local prefill).

Sketch of the recipe:

# Node A on 8800
SWARMLLM_NODE_DATA_DIR=/tmp/swarm_a ./target/release/swarmllm run -p 8800 -v &

# Node B on 8900, bootstrapped off A
A_MADDR=$(grep -oE "peer_id=12D3KooW[A-Za-z0-9]+" /tmp/swarm_a.log | \
    head -1 | sed 's/peer_id=/\/ip4\/127.0.0.1\/tcp\/8810\/p2p\//')
SWARMLLM_NODE_DATA_DIR=/tmp/swarm_b ./target/release/swarmllm run \
    -p 8900 -v --bootstrap "$A_MADDR" &

# Copy shards into both data dirs (or download via /api/admin/hf/download-shards)
cp -r ~/.local/share/swarmllm/models/<model-id> /tmp/swarm_a/models/
cp -r ~/.local/share/swarmllm/models/<model-id> /tmp/swarm_b/models/

# Warm A with the long prompt (populates A's prefix cache, announces to B)
./swarmllm bench -p 8800 --stream --iterations 3 --max-tokens 100 \
    --prompt "$(cat long-prompt.txt)" --model-id <model-id>

# Measure B TTFT — iter 1 should fire the cross-node fetch
./swarmllm bench -p 8900 --stream --iterations 3 --max-tokens 100 \
    --prompt "$(cat long-prompt.txt)" --model-id <model-id> --json

Check B's log for DIAG: cross-node prefix HIT — hydrated KV matched_tokens=... bytes=... to confirm the fetch path fired.

Caveats

  • WSL2 localhost bandwidth is much higher than any real network — localhost benches are representative for compute-bound paths but a best case for fetch paths, whose transfer cost grows on real links. WAN numbers will be different.
  • TinyLlama is too small to show some speedups — cross-node prefix-KV sharing in particular needs a larger model (Phi-3.5, Qwen-7B) to flip the sign between fetch-cost and prefill-cost. See the round6 benchmark notes for the cross-over math.
  • VRAM fit matters — Qwen-7B Q4 weights fit in 8 GB but batched attention kernel scratch does not. CPU-mode works but the baseline numbers above change.
  • Pre-warm before measuring TTFT — iter 1 of a model includes disk read + weight load + first CUDA context init; exclude this by pre-warming with a short unrelated prompt before the real measurement.
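
A minimal pre-warm sketch using the bench flags above (the warm-up result is discarded):

# Warm: force weight load + first CUDA context init
./swarmllm bench --stream --iterations 1 --max-tokens 5 --prompt "warm up"

# Measure: the real prompt and iteration count
./swarmllm bench --stream --iterations 5 --max-tokens 100 \
    --prompt "$(cat long-prompt.txt)" --json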

Standard pre-push gate is cargo fmt && cargo clippy --all-targets -- -D warnings && cargo test. If you add a benchmark, add it under docs/plans/benchmarks/roundN.md with the recipe + results + interpretation, and link it from here.

Tailscale & WAN Access

SwarmLLM works over any IP-routable network, including VPN overlays like Tailscale, WireGuard, and ZeroTier. This guide covers how to access your node remotely and connect peers across the internet.

Use Cases

  • Remote access — Chat with your home GPU from your laptop at a coffee shop
  • Multi-site cluster — Connect nodes at home and work into one swarm
  • Team deployment — Share a private swarm across your team without exposing ports to the internet
  • Cloud + local hybrid — Connect a cloud GPU instance to your local network

Quick Setup with Tailscale

1. Install Tailscale on all machines

# Linux
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up

# macOS
brew install tailscale
tailscale up

# Windows — download from https://tailscale.com/download

Each machine gets a stable 100.x.x.x IP address on the Tailscale network.

2. Start SwarmLLM normally

# On each machine — no special flags needed
./swarmllm run

SwarmLLM binds to 0.0.0.0 by default, which includes the Tailscale interface.

3. Connect peers via bootstrap

Since mDNS doesn't work across Tailscale (it's link-local only), use one of these methods:

Option A: Invite code (easiest)

On Node A, copy the invite code from the dashboard (http://localhost:8800). On Node B, paste it into the "Join Network" field. The invite code contains the node's addresses — including the Tailscale IP if it's listening on 0.0.0.0.

Option B: Bootstrap peers in config

# ~/.local/share/swarmllm/config.toml on Node B
[network]
bootstrap_peers = [
  "/ip4/100.64.0.5/tcp/8810",    # Node A's Tailscale IP
]

Option C: CLI flag

./swarmllm run --bootstrap /ip4/100.64.0.5/tcp/8810

4. Access the dashboard remotely

Once connected via Tailscale, open the dashboard from any machine:

http://100.64.0.5:8800

The API is also accessible at that address:

curl http://100.64.0.5:8800/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "tinyllama", "messages": [{"role": "user", "content": "Hello!"}]}'
Recommended Network Config

When all peers reach each other over Tailscale, a few network options in config.toml are worth tuning:

[network]
enable_mdns = false           # mDNS is LAN-only, won't work through Tailscale
enable_autonat = false        # Tailscale handles NAT, disable noisy probes
enable_dcutr = false          # Hole punching unnecessary on Tailscale
enable_relay = true           # Keep as fallback for robustness
enable_quic = true            # QUIC works well on Tailscale (low-latency UDP)
bootstrap_peers = [
  "/ip4/100.64.0.5/tcp/8810", # Replace with your peer's Tailscale IP
]

For higher latency links (cross-continent), you may also want:

[inference]
tp_max_latency_ms = 50        # Relax tensor parallelism latency threshold (default: 10ms)

Binding to a Specific Interface

If you only want SwarmLLM accessible via Tailscale (not the local network):

[network]
listen_address = "100.64.0.5"  # Bind only to Tailscale interface

Or bind to localhost only and use Tailscale's Funnel or port forwarding:

[network]
listen_address = "127.0.0.1"

WireGuard / ZeroTier / Other VPNs

The same approach works with any VPN overlay:

  1. Install the VPN on all machines
  2. Start SwarmLLM with default config (listen_address = "0.0.0.0")
  3. Use the VPN IP as a bootstrap peer address
  4. Disable mDNS if peers aren't on the same physical LAN

Security Notes

  • API key still required — remote access to inference endpoints requires Bearer token auth, even over Tailscale
  • E2E encryption is independent of VPN — SwarmLLM encrypts all P2P traffic with X25519 + ChaCha20-Poly1305 regardless of whether you use a VPN. The VPN adds a second layer of encryption at the network level
  • Dashboard is not auth-protected — the admin dashboard at /admin doesn't require authentication. If exposing to untrusted networks, use Tailscale ACLs to restrict access or bind to 127.0.0.1 and use SSH tunneling
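
If you take the 127.0.0.1 + SSH route, a minimal tunnel looks like this (hostnames are placeholders):

ssh -L 8800:localhost:8800 user@your-node
# then browse http://localhost:8800 on the local machine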

Troubleshooting

Peers don't connect:

  • Verify Tailscale is running: tailscale status
  • Check Tailscale connectivity with tailscale ping 100.64.0.5, and make sure ports 8810 (TCP) and 8800 (UDP/QUIC) aren't blocked by a host firewall
  • Try with --bootstrap /ip4/<TAILSCALE_IP>/tcp/8810 explicitly
  • Check logs with -vv for connection errors

Slow inference across WAN:

  • Pipeline parallelism (splitting layers across nodes) works best on low-latency links (<50ms)
  • Tensor parallelism requires LAN-like latency (<10ms) — increase tp_max_latency_ms or let SwarmLLM use pipeline mode instead
  • Consider having each site run its own models for local inference, with the swarm as fallback

Stale peer cache after IP change:

  • If your Tailscale IP changes, old cached addresses will fail. Delete the database to clear the cache:
    rm ~/.local/share/swarmllm/db.redb
    

Monitoring with Grafana

SwarmLLM ships with a pre-built Grafana dashboard and Prometheus configuration in the monitoring/ directory.

Quick Start

cd monitoring/
docker compose up -d

This starts:

  • Prometheus at http://localhost:9090 — scrapes SwarmLLM metrics
  • Grafana at http://localhost:3000 — visualizes metrics (login: admin/admin)

The SwarmLLM dashboard is auto-provisioned on first start.

Dashboard Panels

The Grafana dashboard includes:

Node Overview

  • Connected Peers (stat)
  • Total Inference Requests (stat)
  • Credit Balance (stat)
  • Shards Hosted (stat)

Inference

  • Request Rate (req/s over time)
  • Latency Percentiles (p50, p90, p99)
  • Latency Distribution (histogram)
  • Average Inference Latency (gauge)

Network & Peers

  • Connected Peers Over Time

Storage & Shards

  • Hosted Shards Over Time

Credits

  • Credit Balance Over Time

Manual Setup

If you already have Prometheus and Grafana running:

1. Configure Prometheus

Add to prometheus.yml:

scrape_configs:
  - job_name: "swarmllm"
    static_configs:
      - targets: ["localhost:8800"]

2. Import Dashboard

  1. Open Grafana → Dashboards → Import
  2. Upload monitoring/grafana-dashboard.json
  3. Select your Prometheus data source
  4. Click Import

Multi-Node Monitoring

For monitoring multiple SwarmLLM nodes, add all targets:

scrape_configs:
  - job_name: "swarmllm"
    static_configs:
      - targets:
          - "node1:8800"
          - "node2:8800"
          - "node3:8800"

Or use file-based service discovery:

scrape_configs:
  - job_name: "swarmllm"
    file_sd_configs:
      - files: ["swarmllm-targets.json"]
        refresh_interval: 30s

Alerting

Example alert rules for Prometheus:

groups:
  - name: swarmllm
    rules:
      - alert: NoPeersConnected
        expr: swarmllm_peers_connected == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "SwarmLLM node has no connected peers"

      - alert: HighInferenceLatency
        expr: histogram_quantile(0.99, rate(swarmllm_inference_latency_seconds_bucket[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 inference latency exceeds 10 seconds"

      - alert: NegativeCreditBalance
        expr: swarmllm_credits_balance < 0
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Node has negative credit balance (Bronze tier)"