SwarmLLM

Run AI together — for free. A single Rust binary that turns your computer into a node in a peer-to-peer LLM inference network. Pool hardware with others to run models too large for any single machine, with no API tokens, no cloud fees, and end-to-end encryption between every peer.

This site is the long-form reference. For source code, releases, and issues, head to enapt/SwarmLLM.

What you can do with it

  • Chat with AI locally — open localhost:8800 after running the binary; the dashboard auto-detects your hardware and walks you through downloading a model.
  • Use it as a drop-in API — OpenAI-compatible /v1/chat/completions, the Anthropic Messages API at /v1/messages (full Claude Code support), an MCP server with seven tools, plus 12 cloud providers reachable through one endpoint.
  • Pool hardware — your phone with 2 GB of RAM can host a few shards of a 70B model and contribute alongside someone else's GPU. Shards download individually via byte-range requests; no node ever needs the full file.
  • Stay private — every P2P hop uses X25519 + ChaCha20-Poly1305 with forward secrecy. The optional boomerang pipeline ensures no remote node ever sees plaintext.

Single-node performance (RTX 3070 Laptop, 8 GB VRAM)

| Model | GPU | CPU |
|---|---|---|
| TinyLlama 1.1B Q4 | 27.2 tok/s | 4.2 tok/s |
| Gemma-2 2B Q4 | 20.6 tok/s | 3.5 tok/s |
| Phi-3.5 3.8B Q4 | 46.4 tok/s | 1.8 tok/s |
| Qwen2.5-Coder 7B Q4 | 29.0 tok/s | 2.4 tok/s |

Distributed-inference speedups (all default-on): prefix-caching, batched prefill, the Parallax scheduler, and cross-node KV sharing. The cross-node prefix-KV benchmark (2026-04-20) measured a 12.9× iter-1 TTFT speedup on a 672-token Qwen-7B prompt when a peer had the same prefix already cached (151.7 s → 11.8 s, CPU-CPU, localhost). Each knob is documented in Performance & Inference Speedups.

How a node fits together

┌──────────────────────────────────────────────────────────────┐
│                      Your computer (port 8800)                │
│                                                              │
│   P2P node          HTTP server          Web dashboard       │
│   TCP+QUIC          OpenAI · Anthropic   (embedded)          │
│   Noise+Yamux       MCP · Admin          21 languages        │
│                                                              │
│   ─────────────────────────────────────────────────────────  │
│   12 Tokio subsystems · DashMap shared state · redb storage  │
└──────────────────────────────────────────────────────────────┘

Each node simultaneously: connects over TCP and QUIC, serves four HTTP API surfaces (OpenAI · Anthropic · MCP · admin) on the same port, hosts shard files for popular models, participates in distributed inference pipelines, and ships an embedded web dashboard.

Status

Alpha — actively developed and moving into broader testing. Distributed inference is stable across multi-node deployments. Windows release binaries reached Linux parity (Round 8, 2026-04-23). 887 lib tests + 75 integration tests run on every PR; continuous security sweeps. Report issues.

Platform support

| Platform | Status | GPU |
|---|---|---|
| Linux x86_64 | Available | CUDA |
| Windows x86_64 | Available | CUDA |
| macOS aarch64 (Apple Silicon) | Binary available; compile-validated | CPU only (Metal planned) |
| macOS x86_64 (Intel) | Best-effort | CPU only |
| Linux aarch64 | Best-effort | CPU only |

macOS aarch64 runs cargo test --lib + cargo clippy on macos-15 in CI. Integration tests stay Linux-only for now.

All binaries live on the Releases page.

Getting Started

SwarmLLM lets you combine your hardware with others to run AI models too large for any single machine — for free, with no API tokens or cloud fees. It's open-source and your conversations are end-to-end encrypted.

This guide walks you through installation, downloading your first model, and chatting.

Prerequisites

  • A computer running Windows, macOS, or Linux
  • At least 4 GB of RAM (8+ GB recommended)
  • At least 2 GB of free disk space (more for larger models)
  • An internet connection (for downloading models and connecting to peers)

Quick Commands

./swarmllm run                  # Start the node (default port 8800)
./swarmllm run -p 9000          # Start on a different port
./swarmllm run -v               # Start with verbose logging
./swarmllm status               # Check if the node is running
./swarmllm chat                 # Interactive CLI chat
./swarmllm bench                # Benchmark inference performance
./swarmllm peers                # List connected peers
./swarmllm version              # Show version number

Installation

Download

Download the right file for your system from the GitHub Releases page:

| Your Computer | File Name |
|---|---|
| Windows (most PCs) | SwarmLLM-Setup.exe (installer — auto-detects GPU) |
| Windows (raw binary, GPU) | swarmllm-windows-x86_64-gpu.zip |
| Windows (raw binary, CPU) | swarmllm-windows-x86_64-cpu.zip |
| Mac (M1/M2/M3/M4) | swarmllm-macos-aarch64.tar.gz (compile-validated) |
| Mac (older Intel) | Best-effort — build from source |
| Linux (most distros) | swarmllm-linux-x86_64.tar.gz |
| Linux (NVIDIA GPU) | swarmllm-linux-x86_64-cuda.tar.gz |

Not sure which Mac? Apple menu > "About This Mac." If it says "Apple M1" (or M2/M3/etc.), pick Apple Silicon. If it says "Intel," pick Intel.

Install & Run

Windows

Recommended — installer: double-click SwarmLLM-Setup.exe. It detects your GPU (NVIDIA / AMD / Intel) and installs the matching binary. If SmartScreen warns you, click More info > Run anyway.

Raw binary alternative: download swarmllm-windows-x86_64-gpu.zip (Vulkan + CUDA static) or swarmllm-windows-x86_64-cpu.zip (CPU-only fallback), extract, and run swarmllm.exe.

From PowerShell on a raw binary:

cd Downloads\swarmllm-windows-x86_64-gpu
.\swarmllm.exe run

macOS

cd ~/Downloads
tar xzf swarmllm-macos-aarch64.tar.gz
cd swarmllm-macos-aarch64
chmod +x swarmllm
./swarmllm run

Note: macOS aarch64 binaries are compile-validated and exercised in CI (test + clippy on macos-15); integration tests stay Linux-only for now. Intel Mac users should build from source. If macOS blocks the binary on first launch: System Settings > Privacy & Security > click Open Anyway next to SwarmLLM.

Linux

cd ~/Downloads
tar xzf swarmllm-linux-x86_64.tar.gz
cd swarmllm-linux-x86_64
chmod +x swarmllm
./swarmllm run

Docker

The fastest way to get running on any Linux server:

# 1. Get the compose file and example env
curl -LO https://raw.githubusercontent.com/enapt/SwarmLLM/main/docker-compose.yml
curl -LO https://raw.githubusercontent.com/enapt/SwarmLLM/main/.env.example

# 2. Configure (add API keys, change ports, etc.)
cp .env.example .env
nano .env

# 3. Start
docker compose up -d

For NVIDIA GPU support (requires NVIDIA Container Toolkit):

docker compose --profile gpu up -d

Pre-built images on GHCR:

| Image | Description |
|---|---|
| ghcr.io/enapt/swarmllm:latest | CPU-only |
| ghcr.io/enapt/swarmllm:latest-cuda | NVIDIA GPU (CUDA 12.4) |
| ghcr.io/enapt/swarmllm:0.1.0 | Pinned version (CPU) |
| ghcr.io/enapt/swarmllm:0.1.0-cuda | Pinned version (GPU) |

Data is persisted in Docker volumes. Model shards are stored in the swarmllm-models volume (or bind-mount a host directory via SWARMLLM_MODELS_DIR in .env).

View logs with docker compose logs -f. The API key is printed on first startup.

Cargo Install

Requires Rust 1.80+:

cargo install --git https://github.com/enapt/SwarmLLM.git --tag v0.1.0
swarmllm run

Building from Source

git clone https://github.com/enapt/SwarmLLM.git
cd SwarmLLM
cargo build --release
./target/release/swarmllm run

For CUDA GPU support:

cargo build --release --features candle-cuda

For Apple Silicon: the default build runs on CPU. A Metal-accelerated build is on the roadmap but not yet implemented (no metal Cargo feature exists yet); until then, use the default cargo build --release.

Open the Dashboard

Once running, open http://localhost:8800 in your browser. The setup wizard will walk you through initial configuration.

Your First Model

You need at least one AI model before you can chat.

Download via Dashboard

  1. Open the Dashboard at http://localhost:8800
  2. Click Browse HuggingFace in the Models section
  3. Search for a model (try TinyLlama for a small, fast model)
  4. Choose a quantization variant (Q4_K_M recommended for most hardware)
  5. Click Add to node — the node downloads its fair share of shards, and peers with auto-manage enabled auto-acquire the rest
  6. The dashboard auto-refreshes when downloads complete (no page reload needed)

Download via CLI

# Smart distribution: node downloads its fair share, peers get the rest
curl -X POST http://localhost:8800/api/admin/hf/download-shards \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"repo_id": "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF", "filename": "qwen2.5-coder-7b-instruct.Q4_K_M.gguf", "peer_fair_share": true}'

# Or download specific shards manually:
curl -X POST http://localhost:8800/api/admin/hf/download-shards \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"repo_id": "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF", "filename": "qwen2.5-coder-7b-instruct.Q4_K_M.gguf", "shards": [0, 1, 2]}'

Model recommendations by hardware:

| Hardware | Model | Size |
|---|---|---|
| Any (testing) | TinyLlama 1.1B Q4_K_M | ~700 MB |
| 8 GB RAM, no GPU | Qwen2.5-3B Q4_K_M | ~2 GB |
| 8 GB VRAM | Qwen2.5-7B Q4_K_M | ~4.5 GB |
| 16+ GB VRAM | Llama-3-13B Q4_K_M | ~7 GB |

On-Demand Loading

You do not need to pre-load models into VRAM. When you send an inference request for a model whose shards are on disk but not loaded, SwarmLLM automatically loads the model on the fly. If VRAM is full, the least-recently-used model is evicted to make room. The first request to a cold model may take a few extra seconds while loading completes.

Start Chatting

Web UI:

  1. Click the Chat tab
  2. Select your model from the dropdown
  3. Type a message and press Enter

CLI:

./swarmllm chat
# Or with a specific model:
./swarmllm chat --model-name "qwen2.5-coder-7b"

API:

curl http://localhost:8800/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder-7b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

What Are Shards?

Large AI models are split into smaller pieces called shards (~512 MB each) so they can be distributed across the network. Each shard contains a subset of the model's transformer layers. SwarmLLM handles this automatically — you just pick a model and download.

A node never needs all shards of a model. In distributed inference, each node loads only the layers it's responsible for.

Joining the Network

SwarmLLM works standalone, but connecting to peers unlocks distributed inference for larger models.

Automatic Discovery

SwarmLLM finds peers automatically:

  • Same network (LAN): mDNS discovers peers on the same Wi-Fi/LAN in seconds.
  • Returning users: Previously-seen peers are remembered and reconnected on startup.
  • Peer exchange: Connected peers share their peer lists with you.

Invite Codes (Easiest)

  1. In the Dashboard, click "Share Network Code"
  2. Copy the encrypted code and share it with a friend
  3. They paste it into the "Join Network" field and click Join
  4. Both nodes connect immediately and start discovering the wider network

Invite codes are encrypted with ChaCha20Poly1305, so your IP address cannot be read by casually inspecting the code. The decryption key is embedded in the code itself (see the swarm:// format under Networking & Discovery), so anyone who receives the full code can decode it; share codes only with people you intend to let join.

Manual Bootstrap

./swarmllm run --bootstrap "/ip4/203.0.113.50/udp/8800/quic-v1/p2p/12D3KooW..."

Or in your config file:

[network]
bootstrap_peers = ["/ip4/203.0.113.50/udp/8800/quic-v1/p2p/12D3KooW..."]

Private Networks

To run a private cluster that doesn't mix with the public network:

[network]
gossip_network_id = "my-private-network"

Only nodes with the same gossip_network_id can communicate.

Firewall

SwarmLLM needs TCP port 8810 (P2P primary transport) and optionally UDP port 8800 (QUIC) open. If you're behind a router, either:

  • Set up port forwarding (TCP 8810 + UDP 8800 to your machine's local IP)
  • Rely on SwarmLLM's built-in relay (works automatically in most cases)

Configuration

SwarmLLM works out of the box with sensible defaults. This section covers customization.

Config Priority

Settings are read from four sources, in order of priority:

  1. Command-line flags (highest) — e.g., --port 9000
  2. Environment variables — e.g., SWARMLLM_NODE_LISTEN_PORT=9000
  3. Config file — config.toml in your data directory
  4. Built-in defaults (lowest)

Provider API keys have an additional source: a .env file in the data directory or current working directory. Standard env var names are used (OPENAI_API_KEY, ANTHROPIC_API_KEY, DEEPSEEK_API_KEY, etc.). The .env file does not override existing environment variables or keys already set via the dashboard.

Config File Location

| OS | Path |
|---|---|
| Linux | ~/.local/share/swarmllm/config.toml |
| macOS | ~/Library/Application Support/swarmllm/config.toml |
| Windows | %APPDATA%\swarmllm\config.toml |

Specify a custom path: --config /path/to/config.toml

Minimal Example

[node]
listen_port = 8800
contribution = "moderate"

[resources]
max_disk_mb = 50000

[identity]
region = "US"

[inference]
gpu_layers = 35

[auto_manage]
enabled = true

Config File Reference

Every configuration option, organized by section.

[node] — Basic Node Settings

| Option | Type | Default | Description |
|---|---|---|---|
| listen_port | integer | 8800 | Port for web dashboard and P2P networking |
| data_dir | path | Platform-specific | Where SwarmLLM stores data |
| contribution | string | "minimal" | Resource contribution: "minimal", "moderate", "maximum" |

[resources] — Resource Limits

| Option | Type | Default | Description |
|---|---|---|---|
| max_gpu_vram_mb | integer | 0 | Max GPU memory in MB. 0 = auto-detect |
| max_ram_mb | integer | 0 | Max system RAM in MB. 0 = auto |
| max_disk_mb | integer | 50000 | Max disk space in MB for model storage |
| max_bandwidth_mbps | integer | 0 | Max upload bandwidth. 0 = unlimited |

[resources.schedule] — Usage Schedule

| Option | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Enable scheduled resource reduction |
| reduced_hours_start | integer | 22 | Hour (0-23) to start reduced mode |
| reduced_hours_end | integer | 8 | Hour (0-23) to end reduced mode |
| reduced_contribution | string | "minimal" | Contribution level during reduced hours |
| prune_aggressiveness | string | "normal" | Shard pruning during reduced hours: "normal", "aggressive", "conservative" |

[network] — Networking

| Option | Type | Default | Description |
|---|---|---|---|
| bootstrap_peers | list | [] | Peer addresses to connect on startup |
| enable_mdns | boolean | true | LAN peer discovery |
| gossip_network_id | string | none | Custom network ID for private networks |
| peer_exchange | boolean | true | Share peer lists with connected nodes |
| enable_relay | boolean | true | Act as relay for peers behind firewalls |
| enable_relay_client | boolean | true | Use relays when behind a firewall |
| max_peers | integer | 200 | Max simultaneous peer connections |
| auto_relay | boolean | true | Auto-use relay when NAT detected |
| relay_max_circuit_duration_secs | integer | 3600 | Max relay circuit duration |
| relay_max_circuits | integer | 16 | Max relay circuits to serve |
| enable_encryption | boolean | true | E2E encryption for tensor forwards and control messages |
| enable_autonat | boolean | true | NAT detection. Disable on WSL2 to reduce noise |
| enable_dcutr | boolean | true | Hole punching. Disable on WSL2 to reduce noise |
| tensor_compression | boolean | true | Zstd compression for tensor payloads |
| prefix_kv_compression | boolean | false | Zstd compression for cross-node prefix-KV snapshot wire frames. Default off — meaningful win on WAN where wire size is the bottleneck; roughly neutral on localhost. Receivers always decompress regardless of this flag. |
| tensor_compress_level | integer | 1 | Zstd compression level (1-22, 1 = fastest). Shared between tensor and prefix-KV. |
| tensor_compress_threshold | integer | 1024 | Min payload bytes before compression. Shared between tensor and prefix-KV. |

[inference] — AI Model Inference

| Option | Type | Default | Description |
|---|---|---|---|
| default_model | string | "" | Default model. Empty = first available |
| session_timeout_seconds | integer | 600 | Chat session memory lifetime (10 min) |
| max_concurrent_requests | integer | 10 | Max parallel requests |
| model_path | path | none | Path to a GGUF model file |
| gpu_layers | integer | 0 | Layers to offload to GPU. 0 = CPU only |
| kv_cache_ttl_secs | integer | 600 | KV-cache lifetime |
| max_batch_size | integer | 1 | Max request batch size. 1 = no batching. When > 1, both local and remote forward requests batch together via BatchForwarder, filling pipeline bubbles in distributed inference |
| batch_timeout_ms | integer | 50 | Ms to wait for additional requests before dispatching a partial batch. 0 = dispatch immediately (purely opportunistic batching) |
| speculative_decoding | boolean | false | Enable speculative decoding |
| speculative_gamma | integer | 4 | Draft tokens per verification step |
| draft_model_path | path | none | Path to draft model |
| max_split_model_memory_mb | integer | none | Max GPU memory for split model cache |
| tp_max_latency_ms | integer | 10 | Max peer latency (ms) for tensor parallelism groups |
| local_embedding_privacy | boolean | false | Embed tokens locally before sending to first segment. Remote nodes never see raw token IDs |
| encrypted_pipeline | boolean | false | Force first+last segment to local node (boomerang topology). No remote sees plaintext. Adds ~1 RTT/token. Per-model override via API. Requires shard 0 + final shard locally |

[logging] — Log Output

| Option | Type | Default | Description |
|---|---|---|---|
| level | string | "info" | Log level: "error", "warn", "info", "debug", "trace" |
| format | string | "pretty" | Log format: "pretty" or "json" |
| file | path | none | Write logs to file |

[ui] — Web Interface

| Option | Type | Default | Description |
|---|---|---|---|
| open_browser_on_start | boolean | true | Open dashboard on launch |
| theme | string | "dark" | Color theme: "dark" or "light" |

[api] — API Authentication

| Option | Type | Default | Description |
|---|---|---|---|
| api_key | string | none | Bearer token. Empty = auto-generated |
| rate_limit_rpm | integer | 60 | Rate limit for /v1/ endpoints (requests/min) |
| rate_limit_admin_rpm | integer | 200 | Rate limit for /api/admin/ endpoints (requests/min) |

[model] — Model Storage

| Option | Type | Default | Description |
|---|---|---|---|
| shard_size_mb | integer | 512 | Shard size in MB. Range: 64-2048 |

[auto_manage] — Automatic Shard Management

| Option | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | true | Auto-download popular shards (only for models at DemandVerified+ or Pinned trust level) |
| max_storage_mb | integer | 0 | Max disk for auto-downloads. 0 = 50% of max_disk_mb |
| interval_minutes | integer | 5 | Check interval for new shards |
| max_shards | integer | 0 | Max shards. 0 = unlimited |
| max_concurrent_downloads | integer | 3 | Max parallel downloads |
| prune_enabled | boolean | true | Auto-remove over-replicated shards |
| min_replicas | integer | 2 | Min network replicas before pruning |
| prune_cooldown_secs | integer | 300 | Seconds between prune actions per model |
| max_holder_load_for_prune | integer | 3 | Block pruning if holders are busy |

[pool] — Device Pool

| Option | Type | Default | Description |
|---|---|---|---|
| max_pool_size | integer | 10 | Max devices in a pool |
| invitation_ttl_hours | integer | 24 | Invitation validity period |
| rate_limit_per_hour | integer | 10 | Max pool operations per hour |
| gossip_interval_secs | integer | 600 | Pool state gossip interval |
| private_mode | bool | false | Restrict inference to pool members only. Toggleable at runtime via API/UI |
| private_mode_allow_lan | bool | true | Also allow LAN peers (mDNS-discovered) when private mode is on |
| offline_mode | bool | false | Air-gapped: no bootstrap peers, no HF downloads, mDNS-only discovery |

[pool.credit_rates] — Credit Rates

| Option | Type | Default | Description |
|---|---|---|---|
| inference_serve | integer | 10 | Credits earned per layer per token served |
| inference_consume | integer | 10 | Credits spent per layer per token consumed |
| shard_hosting | integer | 1 | Credits per GB per hour hosting |
| shard_seeding | integer | 5 | Credits per GB seeding |
| relay_service | integer | 2 | Credits per connection hour relaying |
| penalty_serve_failure | integer | 50 | Credits deducted per failure |

[updates] — Auto-Update

| Option | Type | Default | Description |
|---|---|---|---|
| auto_update | string | "stable" | Policy: "disabled", "stable", "all" |
| check_interval_hours | integer | 6 | Update check frequency |

[identity] — Your Identity

| Option | Type | Default | Description |
|---|---|---|---|
| region | string | none | Country code for network map (e.g., "US") |

[providers.claude_subscription] — Claude Subscription (feature-gated)

Requires --features claude-subscription at build time. Managed via the dashboard or PUT /api/admin/providers.

| Option | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Route claude-* model requests through the local CLI |
| claude_binary | string | "claude" | Path to the claude binary |
| default_model | string | none | Override model for all requests |
| max_concurrent | integer | 3 | Maximum concurrent subprocess invocations |
| timeout_secs | integer | 300 | Per-request timeout in seconds |
| working_dir | string | (temp dir) | Working directory for the subprocess. Empty or "none" uses system temp dir (recommended for API proxy use). Set to a project path for context-aware responses. |

Shard-Only Mode

SwarmLLM supports shard-only operation — a node only needs individual shard files (~512 MB each) plus a small GGUF header (~6 MB), not the full model file.

How It Works

A model directory in shard-only mode:

~/.local/share/swarmllm/models/qwen2.5-coder-7b/
├── manifest.json        # Model metadata + shard layout
├── gguf_header.bin      # First ~6MB of GGUF (metadata + tensor index)
├── shard_000.bin        # 512MB shard
├── shard_001.bin
├── shard_002.bin
└── ...

SwarmLLM automatically extracts gguf_header.bin from shard_000.bin when first needed. The ShardReader constructs a virtual GGUF from header + shard files, so the model parser works exactly as if the full GGUF were present.
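To make the layout concrete, here is a minimal sketch (hypothetical names, not SwarmLLM's actual API) of the arithmetic behind the virtual GGUF: a global byte offset into the original file maps to a shard index plus an offset inside that shard, so tensor reads can be served from individual shard_NNN.bin files.

// Hypothetical illustration of virtual-GGUF offset translation.
// A read that crosses a shard boundary would have to stitch bytes from
// consecutive shards; that case is omitted here for brevity.
struct ShardLayout {
    shard_size: u64, // e.g. 512 MiB, taken from manifest.json
}

impl ShardLayout {
    fn locate(&self, global_offset: u64) -> (u32, u64) {
        let shard_index = (global_offset / self.shard_size) as u32;
        let offset_in_shard = global_offset % self.shard_size;
        (shard_index, offset_in_shard)
    }
}

fn main() {
    let layout = ShardLayout { shard_size: 512 * 1024 * 1024 };
    // A tensor starting ~1.2 GiB into the original GGUF lives in shard_002.bin.
    let (shard, offset) = layout.locate(1_288_490_188);
    println!("shard_{shard:03}.bin @ byte {offset}");
}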

Why This Matters

  • A 7B model is ~4.5 GB as a full GGUF, but a single shard is only ~512 MB
  • Nodes only load the layers they're assigned — no wasted disk or VRAM
  • You can participate in inference for a 70B model on a machine with 8 GB VRAM by hosting just a few shards

Manual Shard Assignment (--shards)

For multi-node split inference, assign each node a subset of shards:

./swarmllm run --shards "0-3"    # This node handles shards 0, 1, 2, 3

The range is persisted to the database and restored on subsequent runs. Start without --shards to clear.
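As an illustration, here is a minimal sketch of turning a range string like "0-3" into explicit shard indices; the real CLI parser may accept other forms, so treat this as covering only the "A-B" case shown above.

// Toy parser for a --shards range of the form "A-B" (assumed inclusive).
fn parse_shard_range(s: &str) -> Option<Vec<u32>> {
    let (start, end) = s.split_once('-')?;
    let (start, end): (u32, u32) = (start.trim().parse().ok()?, end.trim().parse().ok()?);
    if start > end {
        return None;
    }
    Some((start..=end).collect())
}

fn main() {
    assert_eq!(parse_shard_range("0-3"), Some(vec![0, 1, 2, 3]));
}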

Behavior when --shards is set:

  • The node only advertises the specified shard indices
  • Auto-manage prioritizes downloading missing shards in the range (100x scoring bonus)
  • Smart pruning never removes shards in the configured range

Multi-Node Example

Run a 7B model across two machines:

# Machine A (shards 0-3, layers 0-13):
./swarmllm run --shards "0-3" --bootstrap "/ip4/MACHINE_B_IP/udp/8800/quic-v1/p2p/PEER_ID"

# Machine B (shards 4-7, layers 14-27):
./swarmllm run --shards "4-7" --bootstrap "/ip4/MACHINE_A_IP/udp/8800/quic-v1/p2p/PEER_ID"

Both nodes discover each other, assemble a distributed pipeline, and forward hidden-state activations between them. The pipeline is assembled automatically by the InferenceRouter.

Without --shards

If you don't specify --shards, the node auto-detects and advertises all local shards. This is the normal mode for most users — --shards is only needed when you want explicit control over which layers a node handles.

CLI Flags & Environment Variables

CLI Flags

| Flag | Short | Description |
|---|---|---|
| --port <PORT> | -p | Listen port |
| --data-dir <PATH> | -d | Data directory |
| --config <PATH> | -c | Config file path |
| --model <PATH> | -m | Path to a GGUF model file |
| --gpu-layers <N> | | Layers to offload to GPU |
| --bootstrap <ADDR> | | Bootstrap peer address (repeatable) |
| --shards <RANGE> | | Shard range for split inference (e.g., "0-4") |
| --verbose | -v | Increase log verbosity (-v, -vv, -vvv) |

Subcommands

| Command | Description |
|---|---|
| run | Start the daemon (default if no subcommand) |
| status | Query running daemon status |
| chat | Interactive CLI chat with streaming |
| bench | Benchmark inference (tokens/sec, TTFT) |
| peers | List connected peers |
| pool | Device pool management (link your machines) |
| test-split | Test split inference locally (diagnostic) |
| version | Print version |

chat Options

| Flag | Default | Description |
|---|---|---|
| --model-name <NAME> | auto-detect | Model to chat with |
| --system <TEXT> | none | System prompt |
| --max-tokens <N> | 2048 | Max tokens per response |
| --temperature <F> | 0.7 | Sampling temperature |

bench Options

| Flag | Default | Description |
|---|---|---|
| --model-name <NAME> | auto-detect | Model to benchmark |
| --prompt <TEXT> | "Write a short essay..." | Benchmark prompt |
| --max-tokens <N> | 128 | Tokens to generate |
| --iterations <N> | 1 | Number of benchmark runs |

pool Subcommands

Link your personal devices so credits are combined on one main machine.

| Command | Description |
|---|---|
| pool create --name "My Devices" | Create a device group (this machine becomes the main device) |
| pool invite-code | Generate an 8-character invite code to share |
| pool join <CODE> | Link this device using a code from your main machine |
| pool status | Show linked devices, credits, and online status |
| pool leave | Unlink this device from the group |

Example flow:

# Main device:
swarmllm pool create --name "My Devices"
swarmllm pool invite-code   # → A3F7K2M9

# On each other device:
swarmllm pool join A3F7K2M9

Note: This links YOUR own devices. It's different from connecting to the SwarmLLM network (which uses swarm:// peer addresses).

Environment Variables

Every config option can be set via SWARMLLM_ prefix:

| Config Path | Environment Variable |
|---|---|
| node.listen_port | SWARMLLM_NODE_LISTEN_PORT |
| node.data_dir | SWARMLLM_NODE_DATA_DIR |
| logging.level | SWARMLLM_LOGGING_LEVEL |
| inference.model_path | SWARMLLM_INFERENCE_MODEL_PATH |
| inference.gpu_layers | SWARMLLM_INFERENCE_GPU_LAYERS |

Example:

SWARMLLM_NODE_LISTEN_PORT=9000 SWARMLLM_LOGGING_LEVEL=debug ./swarmllm run

Provider API Keys via Environment

Cloud provider API keys use standard environment variable names:

| Provider | Environment Variable |
|---|---|
| OpenAI | OPENAI_API_KEY |
| Anthropic | ANTHROPIC_API_KEY |
| DeepSeek | DEEPSEEK_API_KEY |
| Mistral | MISTRAL_API_KEY |
| Groq | GROQ_API_KEY |
| NVIDIA NIM | NVIDIA_NIM_API_KEY |
| Cerebras | CEREBRAS_API_KEY |
| SambaNova | SAMBANOVA_API_KEY |
| Fireworks | FIREWORKS_API_KEY |
| Together | TOGETHER_API_KEY |
| DeepInfra | DEEPINFRA_API_KEY |
| Moonshot/Kimi | MOONSHOT_API_KEY |

These can also be placed in a .env file in your data directory:

# ~/.local/share/swarmllm/.env
OPENAI_API_KEY=sk-proj-...
DEEPSEEK_API_KEY=sk-...
NVIDIA_NIM_API_KEY=nvapi-...

The .env file is loaded at startup. It does not override existing environment variables or keys already configured via the dashboard/database. The dashboard settings UI shows "From .env" for keys loaded this way.

Troubleshooting

Can't Connect to Peers

Check the bootstrap address format:

/ip4/203.0.113.50/udp/8800/quic-v1/p2p/12D3KooW...

Firewall: SwarmLLM needs TCP port 8810 (P2P) and optionally UDP port 8800 (QUIC) open.

  • Linux: sudo ufw allow 8810/tcp && sudo ufw allow 8800/udp
  • Windows: Windows Defender Firewall > Inbound Rules > New > Port > TCP 8810 + UDP 8800
  • macOS: System Settings > Network > Firewall > allow SwarmLLM

Same LAN? Use local IP (e.g., 192.168.1.x). LAN peers should be found automatically via mDNS.

Model Download Stuck

  1. Check disk space — a 7B model needs ~4-5 GB free
  2. Verify internet access to https://huggingface.co
  3. Cancel and retry from the Dashboard
  4. Start with -v for verbose logs: ./swarmllm run -v
  5. Try a smaller model first (TinyLlama, ~700 MB)

GPU Not Detected

  1. Verify GPU works: nvidia-smi
  2. Install NVIDIA drivers if needed
  3. Enable GPU offloading: ./swarmllm run --gpu-layers 99

WSL2 users: The CUDA driver comes from your Windows NVIDIA driver. Check that /usr/lib/wsl/lib/libcuda.so.1 exists and add to your ~/.bashrc:

export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/wsl/lib:$LD_LIBRARY_PATH

Port Already in Use

./swarmllm run --port 9000    # Use a different port
lsof -i :8800                 # Find what's using 8800
./swarmllm status             # Check if another instance is running

Slow First Request

If the first inference request to a model takes noticeably longer than subsequent ones, this is expected. SwarmLLM uses on-demand model loading — models whose shards are on disk but not loaded into VRAM are loaded when first requested. If VRAM is full, an LRU eviction occurs first. Subsequent requests to the same model will be fast.

Slow Inference

  1. GPU vs CPU: CPU is 5-20x slower. Check Dashboard for GPU status.
  2. Model too large: Use Q4 quantization, match model size to VRAM.
  3. Enable batching: Set max_batch_size = 4 in config.

Database Corrupted

# Back up first
cp -r ~/.local/share/swarmllm ~/.local/share/swarmllm-backup
# Delete database (models and config are preserved)
rm ~/.local/share/swarmllm/db.redb
# Restart
./swarmllm run

GPU Out of Memory

If a model exceeds your GPU's VRAM, SwarmLLM automatically falls back to CPU inference. You'll see this in the logs:

WARN GPU OOM detected, retrying on CPU

CPU inference is 5-20x slower but works for any model size. To avoid OOM:

  • Use smaller quantizations (Q4 instead of Q8)
  • Use a model that fits in VRAM (check model size vs available VRAM in the dashboard)
  • For models too large for one GPU, use distributed inference across multiple nodes

Distributed Inference Issues

Peers visible but inference fails:

  1. Ensure both nodes have the required shards loaded (check Dashboard > Models)
  2. Verify P2P TCP connectivity: port <base_port> + 10 must be reachable
  3. Run with -vv and filter: ./swarmllm run -vv 2>&1 | grep "DIAG:"
  4. Check for DIAG: segment TIMED OUT — indicates network or compute bottleneck

High latency per token:

  • Distributed inference adds ~20-130ms per token for network round-trips
  • Use TCP bootstrap addresses (not QUIC) for lowest latency
  • Ensure nodes are on the same LAN for tensor parallelism

Pipeline assembly fails:

  • The scheduler needs enough shard coverage to build a complete pipeline
  • Check DIAG: assemble_pipeline_for for candidate counts

Inference fails with "peer never acknowledged" or "silent drop":

  • A SendDirectMessage was issued but neither a Response nor an OutboundFailure event arrived from libp2p within 10s (RR_ACK_TIMEOUT_SECS). Treated as a transient failure: the router automatically retries once with a fresh pipeline assembly that filters out the unreachable peer. If retry also fails, the user sees the error within ~20s (vs the 120s FIRST_TOKEN_TIMEOUT).
  • Most common cause: the target peer was killed or partitioned and the local libp2p connection state hasn't yet caught up.
  • Look for DIAG: rr ACK timeout — closing streaming caller in the logs to confirm the fast-fail path engaged.

Concurrent requests stall when only some get dispatched:

  • Per-tier concurrency caps come from inference.max_concurrent_requests (default 10): Bronze=2, Silver=5, Gold=10, Platinum=20. Excess requests queue until prior ones complete. To raise: bump the config knob or earn credits to climb tiers.
  • If queued requests don't dispatch even after others complete, check for a missed queue_notify.notify_one() after active_count.fetch_sub(1) (should never happen on main; was a real regression fixed in da6f485).

Cross-Node Prefix-KV Sharing

The cross-node prefix fetch is default-on. Expected logs on a successful first hit of a peer's cached prefix:

B: DIAG: cross-node prefix HIT — hydrated KV matched_tokens=N total_tokens=M
A: DIAG: served PrefixKvFetch ... hit=true

I never see cross-node prefix HIT:

  • Only fires on iter 1 of a prompt whose prefix your local node hasn't prefilled yet. Iter 2/3 hit the local cache (populated by iter 1).
  • Check the peer even announced the prefix: look for DIAG: PrefixCacheAnnounce indexed node_id=... blocks=N in your log. No announce → peer's gossip never reached you (check grep 'Published message to GossipSub' | grep 'swarm/models').
  • Check the peer passes the trust gate: default cross_node_prefix_trust_min = 0.5 equals DEFAULT_TRUST, so a freshly-seen peer should just barely pass. Any misbehavior drops it below.

I see prefix-probe: fetch timed out:

  • The peer didn't return a snapshot inside the worker-probe window (3000 ms by default). On a large model (7B+) with cold CPU this can happen if the snapshot is >100 MB. The path degrades to local prefill — no worse than not having the feature. The current 3000/2500/2000 ms chained timeouts are sized for 7B-class snapshots; the older 500/400/500 ms values were TinyLlama-sized and forced a fallback to local prefill on larger models.

I see rejected KV snapshot — penalizing peer trust:

  • The returned snapshot failed BLAKE3 reverification or contained NaN/Inf. Three rejection reasons:
    • hash_chain_mismatch → prefix_cache_block_tokens differs between nodes (default 64, common alternatives 32/128)
    • non_finite_tensors → GPU overflow on the serving side
    • deserialize_failed → wire corruption — open an issue

Disable cross-node fetch entirely: Set inference.cross_node_prefix_trust_min = 2.0 in config.toml. The probe never fires because no peer passes the trust gate.

Running the Test Suite

SwarmLLM ships 943 lib tests + 75 integration tests + VLM E2E.

# Run all tests (release, used in CI)
cargo test --release

# Unit tests only (fastest feedback loop)
cargo test --lib

# Integration tests only
cargo test --test '*'

# A specific test by name substring
cargo test --release prefix_cache

# With CUDA features on (requires NVIDIA GPU)
cargo test --release --features candle-cuda

If a test fails, the release build shows the name + line; rerun with --nocapture to see its stderr:

cargo test failing_test_name -- --nocapture

Integration tests under tests/integration/ simulate multi-node P2P on loopback — they're the slow ones, and CI runs them with --test-threads=1 to avoid port contention.

See Benchmarking for reproducing the performance benchmarks and Performance for which knobs turn each speedup on/off.

Model Trust

Models go through trust levels: Discovered → Pinned → DemandVerified → NetworkPopular. Auto-manage only downloads shards for models at sufficient trust levels.

Model stuck at "Discovered":

  • Pin it manually from the Dashboard to promote to "Pinned"
  • Models reach "DemandVerified" after receiving inference requests
  • Models reach "NetworkPopular" when enough peers host them

Still Stuck?

  • Run with full diagnostics: ./swarmllm run -vv 2>&1 | grep "DIAG:"
  • See the Diagnostics Guide for detailed log instrumentation
  • Check GitHub Issues
  • Open a new issue with: OS, hardware, ./swarmllm version, and logs from -vv

System Overview

SwarmLLM is a single Rust binary that simultaneously functions as:

  1. A P2P network node — connects to peers over TCP (Noise+Yamux) and QUIC/UDP using libp2p
  2. An HTTP API server — serves OpenAI + Anthropic-compatible endpoints, MCP server, and cloud provider proxy via Axum
  3. A web dashboard — embedded frontend (component-based vanilla HTML/CSS/JS, 11 HTML templates, no build step)

All three share a single port (default 8800) and a common Arc<SharedState>.

┌──────────────────────────────────────────────────────────┐
│                      swarmllm binary                      │
│                                                          │
│  ┌──────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │  P2P     │  │  HTTP API    │  │  Admin UI    │       │
│  │  Node    │  │  Server      │  │  (embedded)  │       │
│  │(TCP+QUIC)│  │  (Axum)      │  │              │       │
│  └────┬─────┘  └──────┬───────┘  └──────┬───────┘       │
│       │               │                 │                │
│  ┌────┴───────────────┴─────────────────┴─────────────┐  │
│  │              Shared State (Arc)                     │  │
│  │  DashMap<NodeId, PeerInfo>      — peer registry     │  │
│  │  ModelRegistry                  — models + shards   │  │
│  │  state.events (EventBus)        — activity + dashboard│ │
│  │  state.credits (CreditPool)     — balance + pool     │  │
│  │  state.models (ModelMgmt)       — acquisition + trust │  │
│  │  state.metrics (MetricsProviders)— stats + providers │  │
│  └────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘

Key Design Decisions

  • Config priority: CLI flags > env vars (SWARMLLM_ prefix) > config.toml > defaults
  • Data directory: ~/.local/share/swarmllm/ (Linux), ~/Library/Application Support/swarmllm/ (macOS), %APPDATA%\swarmllm\ (Windows)
  • Port layout: HTTP API on TCP:port, P2P TCP on port+10 (Noise+Yamux), P2P QUIC on UDP:port
  • Shard-only: Nodes never need a full GGUF. Shards are downloaded individually.
  • No blockchain: Credit system uses dual-signed transactions, not a token or chain

Technology Stack

| Component | Library |
|---|---|
| Async runtime | Tokio (multi-threaded) |
| P2P networking | libp2p 0.56 (Kademlia, GossipSub, QUIC) |
| HTTP server | Axum 0.8 |
| Tensor compute | candle-core/candle-transformers |
| GGUF inference | llama-cpp-2 (optional backend) |
| Cryptography | ed25519-dalek, x25519-dalek, chacha20poly1305 |
| Content hashing | BLAKE3 |
| Database | redb (pure-Rust, ACID, single-file) |
| Concurrent maps | DashMap 6 |

Daemon & Subsystems

The daemon spawns 12 Tokio tasks wired together with mpsc channels:

                           ┌──────────────┐
                           │  daemon/     │
                           │  (bootstrap) │
                           └──────┬───────┘
                                  │ spawns tokio tasks
  ┌───────┬───────┬───────┬───────┼───────┬──────────┬──────────┬──────────┬──────────┬──────────┬─────┐
  ▼       ▼       ▼       ▼       ▼       ▼          ▼          ▼          ▼          ▼          ▼     ▼
Network  Infer   Credit  Health   API    Rebal-   Acquisi-   Message    Pool     AutoShrd   HfWat- Update
Manager  Router  Ledger  Monitor  Server ancer    tion Mgr   Dispatch   Manager  Manager   cher   Checker

Subsystem Responsibilities

| Subsystem | File | Role |
|---|---|---|
| NetworkManager | src/network/manager/ | libp2p swarm: Kademlia DHT + GossipSub + request/response |
| InferenceRouter | src/inference/router/ | Request queuing, pipeline assembly, execution coordination |
| MessageDispatcher | src/daemon/dispatch/mod.rs | Routes inbound network messages to appropriate subsystems |
| CreditLedger | src/credit/ledger.rs | Credit balance tracking, transaction signing, gossip |
| HealthMonitor | src/health/monitor.rs | Periodic health pings, rebalancing triggers |
| ShardRebalancer | src/health/rebalancer.rs | Shard redistribution on node join/leave |
| AcquisitionManager | src/model/acquisition.rs | BLAKE3-verified model downloads from peers and HuggingFace |
| ApiServer | src/api/server.rs | Axum HTTP: OpenAI + Anthropic APIs + MCP server + admin dashboard + WebSocket |
| PoolManager | src/pool/manager/ | Device pool management, credit forwarding |
| AutoShardManager | src/model/auto_manage/ | VRAM-aware shard acquisition + smart pruning (manager, scoring, download, prune, scan, vram, wishlist). R111: refreshes the user-visible wishlist at the end of every tick. |
| HfWatcher (R112) | src/model/huggingface/watcher.rs | Background task polling HuggingFace's trending GGUF feed once per hour. Caches the snapshot on state.models.hf_trending_cache (consumed by the wishlist scorer) and auto-promotes models above 100k downloads + 24h age from Discovered to DemandVerified. NonCritical — HF outages don't escalate to a daemon crash. Opt-out via auto_manage.hf_watcher_enabled = false. |
| UpdateChecker | src/update.rs | Periodic GitHub release polling, SHA256-verified binary download, atomic apply. Skipped entirely when auto_update = "disabled" (default until binary signing C1 lands), so the supervisor doesn't log a misleading "exited unexpectedly" warning. |

Channel Layout

| From | To | Message Types |
|---|---|---|
| NetworkManager | MessageDispatcher | All inbound SwarmMessage variants |
| MessageDispatcher | InferenceRouter | InferenceRequest, LayerForward, LayerResult |
| InferenceRouter | NetworkManager | Outgoing P2P messages |
| HealthMonitor | ShardRebalancer | RebalanceEvent |
| ApiServer | InferenceRouter | RouterCommand (from HTTP) |
| ApiServer | AcquisitionManager | AcquisitionCommand |
| AutoShardManager | AcquisitionManager | AcquisitionCommand |
| CreditLedger | NetworkManager | CreditGossip, CreditTransaction |
| MessageDispatcher | (spawned task) | VisionEncodeRequest → handler → VisionEncodeResponse |

Broadcast Channels

| Channel | Type | Subscribers | Purpose |
|---|---|---|---|
| activity_tx | broadcast::Sender<ActivityEvent> (256) | WebSocket | Unified event bus — all subsystem events (shard ops, downloads, inference, pool, config changes). Events carry toast_level for frontend toast control. History replayed to new WS clients. |
| dashboard_tx | broadcast::Sender<DashboardSignal> (32) | WebSocket | Dashboard refresh signals — PeersChanged (peer connect/disconnect), ModelsChanged (shard download/load/prune), UpdateAvailable(UpdateInfo) (new version). |
Note: Former separate channels (prune_events_tx, models_changed_tx, lan_discovery_tx, system_notify_tx, peer_list_changed_tx, update_tx) were consolidated into these two in the event system unification.

Startup Sequence

  1. Parse CLI args (clap)
  2. Initialize tracing subscriber
  3. Load/create config (TOML + env + defaults + CLI overrides)
  4. Ensure data directory exists
  5. Load/generate Ed25519 identity
  6. Open redb database
  7. Build Daemon { config, identity, db }
  8. Initialize ModelExecutor (load GGUF if --model provided)
  9. Build Arc<SharedState> (includes ModelRegistry from DB)
  10. Scan local shards, register in registries
  11. Create mpsc channels
  12. Spawn all 12 tasks
  13. Open browser if configured
  14. tokio::select! on Ctrl+C or task exit
  15. Graceful shutdown: save peer cache, flush database

Graceful Shutdown

Shutdown is triggered by Ctrl+C (SIGINT/SIGTERM) or any task exiting:

  • A watch channel signals all subsystems
  • Peer cache is saved to redb
  • Database is flushed
  • Open connections are drained

Networking & Discovery

Transport Stack

libp2p Swarm
├── Kademlia (DHT) — distributed hash table for peer/shard/model lookup
├── GossipSub — pub/sub for shard/health/credits/identity/pools/regions
├── request_response — unified protocol (/swarmllm/1.0.0, 600s timeout)
├── mDNS — optional LAN peer discovery
├── connection_limits — max 1/peer (>1 causes rr round-robin to dead connections), 500 total
├── Identify — protocol identification
├── AutoNAT — NAT detection
├── DCUtR — hole punching
└── relay::client — circuit relay

Protocol Format

The unified protocol uses a type-tag byte on every frame (src/network/protocol/mod.rs):

| Tag | Constant | Use |
|---|---|---|
| 0x00 | WIRE_TAG_JSON | JSON control message (SwarmMessage, ShardRequest/ShardResponse) |
| 0x01 | WIRE_TAG_TENSOR | Binary tensor payload (LayerForward, LayerResult), f16 |
| 0x02 | WIRE_TAG_TENSOR_COMPRESSED | Q8_0 activation frame (flag-gated activation_compression) — ~3.76× smaller than 0x01 |
| 0x03 | WIRE_TAG_SHARD | Raw shard bytes (ShardResponse payload, 32 MB max — bypasses the 4 MB JSON cap) |
| 0x04 | WIRE_TAG_PREFIX_KV | Cross-node prefix-KV snapshot. Frame body's flag byte: 0 = miss, 1 = raw f32, 2 = zstd-compressed f32 (gated on NetworkConfig::prefix_kv_compression, default off). Receivers always decompress regardless of the send-side flag. |

Receivers auto-dispatch on the leading byte; senders choose based on config + request kind. Only the 0x00 frame carries a JSON body; the rest use binary framing with length prefixes.
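As a rough illustration of the receiving side, the sketch below classifies a frame by its leading tag byte using the values from the table; the enum and function are illustrative stand-ins, not SwarmLLM's actual types.

// Illustrative receiver-side classification of a unified-protocol frame.
#[derive(Debug)]
enum Frame<'a> {
    JsonControl(&'a [u8]),
    Tensor { body: &'a [u8], q8_compressed: bool },
    ShardBytes(&'a [u8]),
    PrefixKv { flag: u8, snapshot: &'a [u8] }, // flag: 0 = miss, 1 = raw f32, 2 = zstd
}

fn classify_frame(frame: &[u8]) -> Result<Frame<'_>, String> {
    let (&tag, body) = frame.split_first().ok_or("empty frame")?;
    Ok(match tag {
        0x00 => Frame::JsonControl(body),                     // WIRE_TAG_JSON
        0x01 => Frame::Tensor { body, q8_compressed: false }, // WIRE_TAG_TENSOR
        0x02 => Frame::Tensor { body, q8_compressed: true },  // WIRE_TAG_TENSOR_COMPRESSED
        0x03 => Frame::ShardBytes(body),                      // WIRE_TAG_SHARD
        0x04 => {
            let (&flag, snapshot) = body.split_first().ok_or("missing prefix-KV flag byte")?;
            Frame::PrefixKv { flag, snapshot }                // WIRE_TAG_PREFIX_KV
        }
        other => return Err(format!("unknown wire tag 0x{other:02x}")),
    })
}

fn main() {
    let frame = [0x04u8, 1, 0xde, 0xad]; // prefix-KV frame, raw f32 payload
    println!("{:?}", classify_frame(&frame).unwrap());
}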

Discovery Stack

SwarmLLM uses 5 independent discovery layers:

  1. mDNS — Discovers LAN peers in seconds. Config: enable_mdns = true
  2. Persistent Peer Cache — Saves up to 200 peers every 5 min + on shutdown. Fastest reconnect.
  3. Invite Codes — Format: swarm://<base64url(key‖nonce‖encrypted_multiaddr)>. Encrypted with ChaCha20Poly1305.
  4. Peer Exchange (PEX) — On each connection, exchanges up to 20 known peers.
  5. Kademlia DHT — Bootstrap flag + periodic re-bootstrap every 60s.

GossipSub Topics

Six topics, all subscribed at startup in discovery::subscribe_topics:

| Topic | Constant | Content |
|---|---|---|
| swarm/models | TOPIC_MODELS | ShardAnnounce, ModelManifest, PrefixCacheAnnounce (cross-node prefix-KV index) |
| swarm/health | TOPIC_HEALTH | HealthPing, NodeCapability (includes observed per-layer latencies for the Parallax scheduler), TpAllReduceResponse |
| swarm/credits | TOPIC_CREDITS | CreditGossip, CreditTransaction |
| swarm/identity | TOPIC_IDENTITY | NicknameGossip (signed) |
| swarm/pools | TOPIC_POOLS | PoolMessage (PoolState, PoolInvitation, CreditForward) |
| swarm/regions | TOPIC_REGIONS | RegionShardSummary (per-region shard availability for routing locality) |

The topic match in NetworkManager::handle_broadcast is a contract, not a default: a SwarmMessage variant with no topic arm falls through to _ => return and is silently dropped at the wire. Adding a new gossip variant therefore requires updating the match. An early multi-node test caught PrefixCacheAnnounce missing from the TOPIC_MODELS arm; every cross-node prefix-cache announce had been silently dropped at the network layer until a two-daemon run flushed it out.
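A toy sketch of that pitfall (illustrative names, not the real handle_broadcast): any variant without an explicit topic arm falls into the catch-all and is never published.

// Illustrative only: the real code publishes to GossipSub; returning None here
// stands in for "silently dropped at the wire".
#[allow(dead_code)]
enum SwarmMessage {
    ShardAnnounce,
    PrefixCacheAnnounce,
    HealthPing,
    NicknameGossip,
}

fn topic_for(msg: &SwarmMessage) -> Option<&'static str> {
    match msg {
        SwarmMessage::ShardAnnounce => Some("swarm/models"),
        SwarmMessage::PrefixCacheAnnounce => Some("swarm/models"), // the arm that was missing
        SwarmMessage::HealthPing => Some("swarm/health"),
        _ => None, // no arm: the message never reaches the network
    }
}

fn main() {
    assert_eq!(topic_for(&SwarmMessage::PrefixCacheAnnounce), Some("swarm/models"));
    assert_eq!(topic_for(&SwarmMessage::NicknameGossip), None); // would be dropped silently
}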

Messages older than 5 minutes are rejected (replay protection).

Cross-Node Prefix KV Sharing Dispatch

The cross-node prefix-cache fetch path uses the request_response protocol, not gossip. The gossip layer only broadcasts which blocks each peer holds (PrefixCacheAnnounce on swarm/models); the actual snapshot transfer is a direct bilateral exchange:

  1. Requesting daemon sends SwarmRequest::PrefixKvFetch to the peer chosen by the probe resolver (trust-gated by cross_node_prefix_trust_min, default 0.5)
  2. Serving daemon runs fetch_local_snapshot against its own worker over IPC (2000 ms timeout) and gets the serialized bytes or None
  3. Serving daemon returns SwarmResponse::PrefixKvData { present, payload } with the bytes wrapped in the WIRE_TAG_PREFIX_KV frame on the binary payload slot (not in the JSON header — serde_json inflates Vec<u8> ~5× and blows past the 64 MiB IPC cap)
  4. Requesting daemon BLAKE3-reverifies + NaN/Inf-scans, hands bytes to its worker to hydrate a KvCacheEntry

See Inference > Prefix-Cache KV Sharing for the full pipeline and measured numbers.

Anti-Gaming

  • Subnet clustering detection: >5 nodes per /24 triggers 25% spot-check rate (up from 5%)
  • SubnetClustering trust penalty (-0.03 per cycle)
  • Signed balance reports with timestamp freshness (5 min window)
  • Gossip replay rejection (5 min window)
  • cross_node_prefix_trust_min gates fetch peers at a minimum trust score (default 0.5, equal to DEFAULT_TRUST; set to 2.0 to disable cross-node fetch entirely)

Inference Pipeline

Subprocess-Per-Model Isolation

Each loaded model runs in its own swarmllm model-worker subprocess (Ollama-style). When a model is unloaded, the subprocess is killed and the OS + CUDA driver immediately reclaim all GPU memory — no daemon restart required.

Main daemon                          model-worker subprocess (one per model)
───────────────────────────────      ───────────────────────────────────────
ModelProcessPool.generate()  ─────►  loads shards from disk on first request
ModelProcessPool.forward()   ─────►  runs forward passes / full decode loop
                             ◄─────  streams WorkerMsg::Token / LayerResult
unload_model()               ─────►  kill process → OS frees all VRAM

IPC: Unix domain socket with binary framing — [4B json_len][json header][4B payload_len][raw tensor bytes]. JSON carries message metadata; the payload carries raw activation bytes to avoid base64 overhead.
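A minimal sketch of writing one such frame, assuming little-endian length prefixes (check src/inference/worker_ipc.rs for the authoritative layout):

use std::io::{self, Write};

// [4B json_len][json header][4B payload_len][raw tensor bytes]
fn write_frame<W: Write>(w: &mut W, json_header: &[u8], payload: &[u8]) -> io::Result<()> {
    w.write_all(&(json_header.len() as u32).to_le_bytes())?;
    w.write_all(json_header)?;
    w.write_all(&(payload.len() as u32).to_le_bytes())?;
    w.write_all(payload)?; // raw activation bytes, no base64 overhead
    w.flush()
}

fn main() -> io::Result<()> {
    let mut buf = Vec::new();
    write_frame(&mut buf, br#"{"type":"Forward","layer_start":0}"#, &[0u8; 8])?;
    println!("frame is {} bytes", buf.len());
    Ok(())
}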

Message types (src/inference/worker_ipc.rs):

| Message | Direction | Purpose |
|---|---|---|
| DaemonMsg::Forward | daemon → worker | Single-step LayerForward (distributed inference) |
| DaemonMsg::Generate | daemon → worker | Full prompt→tokens decode loop (API inference) |
| DaemonMsg::Unload | daemon → worker | Drop a layer range (partial memory reclaim) |
| DaemonMsg::Shutdown | daemon → worker | Graceful worker exit |
| WorkerMsg::Token | worker → daemon | Streaming decoded token |
| WorkerMsg::LayerResult | worker → daemon | Activation result for pipeline forwarding |

SplitModelEntry is metadata-only — it caches eos_tokens, vocab, chat_template, bos_token, and eos_token_str from the GGUF header without loading model weights. The weights live exclusively in the worker subprocess.

Worker granularity: one process per ModelId (not per shard). A single worker handles all layer ranges for a model and owns its own KvCacheStore. Individual shard unload uses DaemonMsg::Unload; the process exits only when all shards are released.

Split Inference Engine

The split inference engine (src/inference/split/) enables distributed inference using candle for direct tensor computation with quantized GGUF weights. Each node loads only its assigned transformer layers (in the worker subprocess), forwarding hidden-state activations between nodes. The module is split into: model.rs (SplitModel struct + accessors), loader.rs (GGUF/shard load), executor.rs (forward pass + tensor-parallel), kv_cache.rs, entry.rs, gguf_meta.rs, shard_reader.rs, rope.rs, prefix_cache.rs.

Client → API Server → InferenceRouter → Pipeline Assembly
                                              │
                      ┌───────────────────────┘
                      ▼
          ┌──────────────────────┐
          │   Pipeline Segment   │     Token IDs (prefill)
          │ Node A: Layers 0-15  │──── LayerForward ──►
          └──────────────────────┘                      │
                                        ┌───────────────┘
                                        ▼
                            ┌──────────────────────┐
                            │   Pipeline Segment   │
                            │ Node B: Layers 16-27 │── sample token ──►
                            └──────────────────────┘

Pipeline Assembly

  1. Fetch model manifest to determine layer ranges
  2. Pipeline affinity check: if multi-turn session has a previous pipeline and all nodes are still connected, reuse it (KV cache locality)
  3. Query model_registry.shard_holders for hosting nodes
  4. Liveness filter: drop holders that aren't in connected_node_ids (the libp2p truth — DHT can re-inject providers for peers that just disconnected, and peer_registry is intentionally preserved across mid-pipeline disconnects for reconnect attempts)
  5. Fetch node load/latency from peer_registry
  6. Parallax scheduler: shortest-path dynamic programming over observed per-layer latencies (EMA over recent forwards), rather than a greedy latency-only sort. Cross-gossips top-32 observed latencies via NodeCapability.observed_latencies so every node has a current view of the network's compute profile
  7. Encrypted pipeline check: if enabled for this model, force first and last segments to the local node (boomerang topology)
  8. Assignment: widest contiguous layer range per node, merging consecutive segments that land on the same node (see the sketch after this list)
  9. Identify standby nodes per segment (failover)
  10. Send PipelineAssignment, wait for ACKs, begin forwarding
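To illustrate the merging in step 8, here is a toy sketch (assumed types, not the scheduler's actual code) of collapsing consecutive segments that land on the same node into one wider contiguous layer range:

// Consecutive same-node segments become a single, wider forward pass.
#[derive(Debug, PartialEq)]
struct Segment {
    node: String,
    layers: std::ops::Range<u32>,
}

fn merge_same_node(segments: Vec<Segment>) -> Vec<Segment> {
    let mut merged: Vec<Segment> = Vec::new();
    for seg in segments {
        match merged.last_mut() {
            Some(prev) if prev.node == seg.node && prev.layers.end == seg.layers.start => {
                prev.layers.end = seg.layers.end; // widen the previous range
            }
            _ => merged.push(seg),
        }
    }
    merged
}

fn main() {
    let pipeline = vec![
        Segment { node: "A".into(), layers: 0..8 },
        Segment { node: "A".into(), layers: 8..16 },
        Segment { node: "B".into(), layers: 16..28 },
    ];
    assert_eq!(merge_same_node(pipeline).len(), 2); // A: 0..16, B: 16..28
}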

Failure Handling

The router applies a single retry on transient remote failures (silent rr drops, OutboundFailure, remote-generate timeouts). The retry passes preferred_pipeline = None so the scheduler re-runs and the dead/dropped peer is filtered out via the liveness oracle above. Failure of the second attempt propagates to the user with a "try again" hint.

Independently, streaming-tracked SendDirectMessage sends carry a delivery_request_id; if the receiver doesn't ACK within RR_ACK_TIMEOUT_SECS (10s), the daemon closes the caller's streaming channel — converting a 120s FIRST_TOKEN_TIMEOUT hang into a fast-fail in ~10–20s. This handles the rare case where libp2p request_response accepts a send_request call but never delivers it (no OutboundFailure event fires).

Concurrent Request Throttling

Per-tier concurrency caps come from max_concurrent_requests (default 10): Bronze=¼, Silver=½, Gold=1×, Platinum=2×. Requests beyond the cap queue in the router. The queue is event-driven: every active_count.fetch_sub(1) on completion is paired with queue_notify.notify_one() so drain_queue wakes immediately. Without that pairing, queued requests would sit indefinitely until the next Submit arrived (a real bug found in stress testing — fix in commit da6f485).
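A minimal sketch of that pairing with tokio primitives (illustrative names, not the router's actual types):

use std::sync::atomic::{AtomicUsize, Ordering};
use tokio::sync::Notify;

struct Throttle {
    active_count: AtomicUsize,
    queue_notify: Notify,
}

impl Throttle {
    // Called when an in-flight request finishes. The decrement must be paired
    // with a notify, otherwise queued requests sit until the next Submit arrives.
    fn finish_request(&self) {
        self.active_count.fetch_sub(1, Ordering::SeqCst);
        self.queue_notify.notify_one();
    }

    // Queue drainer: wake whenever a slot may have freed, then try to dispatch.
    async fn drain_queue(&self, cap: usize) {
        loop {
            self.queue_notify.notified().await;
            while self.active_count.load(Ordering::SeqCst) < cap && self.try_dispatch_next() {
                self.active_count.fetch_add(1, Ordering::SeqCst);
            }
        }
    }

    fn try_dispatch_next(&self) -> bool {
        // Stub: pop the next queued request and spawn it; false when the queue is empty.
        false
    }
}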

Pipeline affinity means that multi-turn conversations (with session_id) prefer to route through the same nodes, preserving KV-cache state and avoiding cold restarts on every turn.

The Parallax allocator also runs offline in AutoShardManager (Phase C.2) with a soft acquire/prune bias driven by a per-shard stability counter (≥3 consistent ticks of "this shard wants to move here" before it acts). Hard constraints (pinning, trust gates, VRAM caps) always win.

Architecture Detection

The SplitModel loader reads general.architecture from GGUF metadata and applies per-architecture handling:

| Architecture | RoPE | QKV Biases | Special Handling |
|---|---|---|---|
| Llama | Interleaved | No | Default EOS=2 |
| Llama 4 | iRoPE (NoPE every 4th) | No | MoE FFN |
| Qwen2 | Contiguous | Yes | EOS 151643+151645 |
| Qwen 3.5 | Contiguous | No | Hybrid SSM+attention (Gated Delta Networks) |
| Gemma/Gemma2 | Interleaved | No | Embedding scaling (sqrt(d)), Gemma RmsNorm (+1), EOS 107, attention + final logit softcapping, Gemma chat template fallback |
| Phi-3 | Su/YaRN | Yes | Fused QKV/FFN tensors |
| Mistral | Interleaved | No | GQA |
| DeepSeek-V2/V3 | Contiguous | No | MLA attention + MoE FFN |
| GLM-4 | Contiguous | No | Partial RoPE, extreme GQA (16:1) |
| Starcoder2 | Interleaved | Yes | Code-optimized |

KV-Cache Management

  • Per-request isolation via DashMap<(ModelKey, RequestId), Cache>
  • Multi-turn reuse: session_id tracks conversations, prefix matching skips redundant prefill
  • Configurable TTL (default 10 min)
  • VRAM-aware LRU eviction for split model cache

Prefix-Cache KV Sharing (Cross-Node)

Each worker stores a local prefix-cache keyed by BLAKE3 chained hashes over fixed-size token blocks (prefix_cache_block_tokens, default 64). Blocks are announced to peers via SwarmMessage::PrefixCacheAnnounce on the swarm/models gossipsub topic and indexed in state.models.cross_node_prefix_index.

When a local worker sees a prompt whose prefix it hasn't prefilled, it emits WorkerMsg::PrefixFetchProbe; the daemon walks the index (longest-match first), trust-gates candidate peers by cross_node_prefix_trust_min (default 0.5), and issues a SendPrefixKvFetch request-response to the best holder. The serving daemon re-issues DaemonMsg::ExportPrefixSnapshot to its worker, which narrows a stored KvSnapshot to the requested block boundary and returns the serialized bytes in the IPC binary-payload slot. Back on the requesting side, the bytes are BLAKE3-reverified against the requested hash and NaN/Inf-scanned before hydrating a new KvCacheEntry for the in-flight request, which then only has to prefill the suffix beyond the cached block boundary.

Three chained timeouts (worker probe 3000 ms, daemon network 2500 ms, serving IPC 2000 ms — sized for 7B-class f32 snapshots) guarantee that a stuck peer degrades to a clean miss rather than blocking the request. See the Performance chapter for measured TTFT numbers on TinyLlama (GPU, corner case where fetch is slightly slower than prefill) vs Qwen2.5-7B (12.9× iter-1 TTFT speedup on CPU-CPU localhost).
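A hedged sketch of the chained block hashing, assuming each block's key is BLAKE3 over the previous block's hash followed by the block's little-endian token IDs (the exact byte layout lives in prefix_cache.rs):

// Two nodes that tokenize the same prefix with the same block size derive the
// same chain of block keys, which is what makes the cross-node index possible.
fn chained_block_hashes(tokens: &[u32], block_tokens: usize) -> Vec<blake3::Hash> {
    let mut hashes = Vec::new();
    let mut prev = blake3::Hash::from([0u8; 32]); // zero seed for the chain (assumption)
    for block in tokens.chunks_exact(block_tokens) {
        let mut hasher = blake3::Hasher::new();
        hasher.update(prev.as_bytes());
        for t in block {
            hasher.update(&t.to_le_bytes());
        }
        prev = hasher.finalize();
        hashes.push(prev);
    }
    hashes // one key per complete block; a trailing partial block is not cached
}

fn main() {
    let tokens: Vec<u32> = (0..130).collect();
    let keys = chained_block_hashes(&tokens, 64);
    println!("{} cached block keys", keys.len()); // two complete 64-token blocks
}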

Advanced Features

  • Speculative Decoding — Draft model proposes K tokens, target verifies in one pass (flag-gated speculative_distributed)
  • SWIFT self-speculative — Target model acts as its own draft by skipping a layer range (flag-gated swift_self_speculative)
  • DSD (Decentralized Speculative Decoding) — Multi-segment pipeline with γ-token speculation woven in (flag-gated decentralized_spec_decoding)
  • Chunked Prefill — Sarathi-style: each Prefilling slot advances by prefill_chunk_tokens (default 128) per decode tick so a long admission can't block decode
  • Continuous Batching — default-on: concurrent Generate requests share one forward_batch per decode tick; GPU uses fused kernel, CPU falls through to sequential
  • Batched Prefill Forward — default-on: fuses concurrent same-shape Prefilling chunks into one forward_batch call
  • Remote-generate Fast Path — default-on: single-segment distributed inference runs the full decode loop on the remote worker instead of per-token coordinator round-trips (measured 1.93× decode speedup)
  • Cross-request Prefix Cache — default-on: see "Prefix-Cache KV Sharing" above for the cross-node extension; the local cache alone is a 29.4× wall-clock win on prompt re-submission
  • Activation Compression (Q8_0) — Intermediate pipeline activations wire-quantized ~3.76× (flag-gated activation_compression)
  • Flash Attention — CPU and GPU fast paths (GQA-native, no repeat_kv)
  • PagedAttention — Deferred; paged-attn feature flag reserved for future use (module removed, never wired to production)
  • Logprobs — Per-token log probabilities via sample_token_with_params_and_logprobs(). When logprobs: true in the request, the sampling layer collects top-N token probabilities and returns them in the OpenAI-compatible response. Available on split model (candle) inference paths
  • Pipeline Error Broadcast — On distributed inference failure, broadcast_pipeline_error() notifies all participants so peers can update shard availability and route around failures
  • Local Embedding Privacy — When local_embedding_privacy: true, the requesting node performs token→embedding locally (~1ms) and sends pre-embedded hidden-state activations instead of raw token IDs to the first pipeline segment. Remote nodes never see the plaintext prompt. See Security > Local Embedding Privacy
  • Encrypted Pipeline — When enabled (per-model or global), forces a "boomerang" topology: the requesting node handles both the first segment (embedding) and last segment (token sampling). Remote nodes only process intermediate activations — no remote node ever sees plaintext input or output. See Security > Encrypted Pipeline

Vision Language Models (VLM)

Distributed mmproj

The mmproj (vision encoder) is modeled as a sentinel shard (index = u32::MAX) decoupled from the text pipeline. Any node with mmproj can encode images — the router selects local → first-segment → any holder.

Image → JPEG compress → VisionEncodeRequest (remote) or encode locally
    → zstd+FP16 compressed embeddings
    → attached to first LayerForward (vision_embeddings field)
    → text pipeline processes as normal

Key types: VisionEncodeRequest, VisionEncodeResponse, LayerForward.vision_embeddings.

If no node has mmproj loaded, the API returns HTTP 503 (VisionEncoderUnavailable).

Tensor Wire Format

[4B ndim][4B×ndim shape][4B dtype_tag][f32 data]

For a 7B model (hidden_dim=3584):

  • Prefill (14 tokens): ~200 KB
  • Decode (1 token): ~14 KB
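
For intuition, a small sketch that reproduces the size arithmetic above from the wire layout; hidden_dim and token counts are the example values, and the dtype tag is assumed to denote f32.

def tensor_wire_bytes(shape):
    """[4B ndim][4B x ndim shape][4B dtype_tag][f32 data] -> total bytes on the wire."""
    ndim = len(shape)
    elems = 1
    for d in shape:
        elems *= d
    header = 4 + 4 * ndim + 4          # ndim + per-dim sizes + dtype tag
    return header + elems * 4          # f32 payload

hidden_dim = 3584                      # 7B-class model from the example above
print(tensor_wire_bytes([1, 14, hidden_dim]) / 1024)   # prefill, 14 tokens -> ~196 KiB (~200 KB)
print(tensor_wire_bytes([1, 1, hidden_dim]) / 1024)    # decode, 1 token   -> ~14 KiB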

Credit System

Credits are SwarmLLM's fairness mechanism — no blockchain, no token, just local accounting with dual-signed transactions. The system ensures contributors are rewarded and free-riders are deprioritized.

Earning & Spending

Action | Credits | Notes
Serve inference (per token) | +10 | Balanced with consume side
Host shard (per GB per hour) | +1 | Hourly tick in CreditLedger
Seed shard data (per GB transferred) | +5 | Atomic counter, periodic drain
Relay traffic (per connection hour) | +2 | Circuit open/close tracking
Consume inference (per token) | -10 | Balanced with earn side
Distributed inference failure | -50 | Automatic penalty

Balanced rates: Both earn and spend use rate × tokens — no layer multiplier. A 22-layer model serving 100 tokens earns the same as it costs to consume, preventing credit inflation.

All rates are configurable per pool via [pool.credit_rates] in config.
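
A sketch of the balanced-rate accounting, using the default rates from the table above; the function and field names are illustrative, not the CreditLedger API.

# Default per-pool rates (configurable via [pool.credit_rates]).
RATE_SERVE_PER_TOKEN = 10
RATE_CONSUME_PER_TOKEN = 10

def settle_inference(tokens, serving_balance, consuming_balance):
    """Both sides use rate * tokens: no layer multiplier, so earn == spend."""
    earned = RATE_SERVE_PER_TOKEN * tokens
    spent = RATE_CONSUME_PER_TOKEN * tokens
    return serving_balance + earned, consuming_balance - spent

server, client = settle_inference(100, serving_balance=0, consuming_balance=0)
print(server, client)  # 1000 -1000  (symmetric, regardless of how many layers were served)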

Minimum Balance Enforcement

Nodes with balance below -1000 credits have remote inference requests rejected. They receive a clear error message telling them to contribute (host shards, serve inference, seed data).

  • Local API requests (from localhost) are always allowed regardless of balance
  • This prevents free-riders from endlessly consuming without contributing
  • The floor is configurable via the MIN_BALANCE_FOR_INFERENCE constant

Priority Tiers

Tiers are calculated from your credit balance relative to the network:

Tier | Requirement | Concurrent Limit
Platinum | ≥90th percentile and balance > 0 | 2× base max
Gold | ≥70th percentile and balance > 0 | base max
Silver | Positive balance | ½ base max
Bronze | Zero or negative | ¼ base max (min 1)

How it works: On each inference request, the router computes your network percentile from peer credit gossip data (deduplicated by NodeId to prevent Sybil stuffing) and calls calculate_tier(). Higher tiers dequeue first. Bronze nodes are never fully blocked — they get deprioritized but always get at least 1 concurrent slot.
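
A hedged sketch of the tier decision, assuming the percentile has already been computed from deduplicated peer gossip; base_max is an illustrative constant, and the real calculate_tier() lives in the router.

def calculate_tier(balance, percentile, base_max=4):
    """Map (credit balance, network percentile) -> (tier, concurrent slot limit)."""
    if balance > 0 and percentile >= 90:
        return "Platinum", base_max * 2
    if balance > 0 and percentile >= 70:
        return "Gold", base_max
    if balance > 0:
        return "Silver", max(1, base_max // 2)
    return "Bronze", max(1, base_max // 4)   # never fully blocked: at least 1 slot

print(calculate_tier(balance=5000, percentile=95))  # ('Platinum', 8)
print(calculate_tier(balance=-200, percentile=40))  # ('Bronze', 1)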

Anti-Abuse Mechanisms

  • Anti-Sybil deduplication: Peer balance gossip is deduplicated by NodeId — a single peer can't stuff the percentile distribution by re-gossiping
  • Atomic accumulation: Forward participation credits use AtomicI64 accumulator, flushed every 60s — no credits lost under high concurrency
  • AntiGaming rate limiter: Max 100 credit transactions per node per 5-minute window
  • Self-dealing rejection: Transactions from/to same node are rejected
  • Signed balance reports: Ed25519 signatures with 5-minute freshness window

Failure Penalties

When distributed inference fails:

  • The requesting node is penalized (configurable penalty_serve_failure, default 50 credits)
  • A broadcast_pipeline_error() message is sent to all pipeline participants

Transaction Security

  • Every transaction requires dual Ed25519 signatures (serving node + requesting node)
  • UUID deduplication prevents replay attacks (checked against DB)
  • Balance arithmetic uses saturating_add (no overflow panics)
  • Peer balance gossip rejects implausible values (abs > 100M)

Escrow

For large requests (above configurable threshold), credits are held in escrow:

  • create_escrow() → release_escrow() (success) or refund_escrow() (failure)
  • Balance deducted BEFORE escrow persisted (crash-safe: lose credits > create free credits)
  • Refunds are persisted to DB immediately
  • Entries expire after 10 minutes with automatic refund
  • Escrow and direct charge are mutually exclusive (no double-billing)
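
A sketch of the crash-safe ordering described above (deduct before persisting the escrow entry), with an in-memory dict standing in for the redb escrow table; expiry and retry handling are simplified.

import time, uuid

balances = {"node-a": 10_000}
escrows = {}  # escrow_id -> entry, stand-in for the redb escrow table

def create_escrow(node, amount, ttl_secs=600):
    balances[node] -= amount                      # deduct BEFORE persisting:
    escrow_id = str(uuid.uuid4())                 # a crash here loses credits,
    escrows[escrow_id] = {                        # it never creates free credits
        "node": node, "amount": amount, "expires": time.time() + ttl_secs,
    }
    return escrow_id

def release_escrow(escrow_id):                    # success: credits stay spent
    escrows.pop(escrow_id, None)

def refund_escrow(escrow_id):                     # failure or expiry: refund immediately
    entry = escrows.pop(escrow_id, None)
    if entry:
        balances[entry["node"]] += entry["amount"]

eid = create_escrow("node-a", 500)
refund_escrow(eid)
print(balances["node-a"])  # 10000, refunded after the simulated failure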

Device Pool Credit Forwarding

When devices are linked in a pool, member devices forward their earnings to the owner:

  • Credit split configurable: 0-50% kept by member, rest forwarded
  • Dual-signed PoolCreditForward (member signature + owner co-signature)
  • Forwarded amount deducted from member balance before persisting
  • Owner's PoolManager validates and applies credits atomically

Security & Encryption

Three Encryption Tiers

Tier 1: Pairwise Sessions (Unicast)

For direct peer-to-peer communication:

  • Ed25519 → X25519 → ECDH → ChaCha20-Poly1305
  • Forward secrecy via ephemeral X25519 re-keying every 10 minutes
  • Nonce reuse prevented by session clearing on disconnect (remove_session())
  • Replay protection: RFC 6479 sliding window (128-bit bitmap) — allows packet reordering within window while rejecting duplicates
  • Nonce state updated only after successful decryption (prevents DoS)
  • Pending ephemeral keys expire after 60 seconds (prevents memory exhaustion from unanswered re-keys)
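
A compact sketch of an RFC 6479-style sliding window like the one used for replay protection: a 128-bit bitmap tracks recently seen counters, so reordered packets within the window are accepted while duplicates are rejected. The real implementation only updates this state after successful decryption.

WINDOW = 128  # bits

class ReplayWindow:
    def __init__(self):
        self.max_seen = 0      # highest counter accepted so far
        self.bitmap = 0        # bit i set => counter (max_seen - i) already seen

    def check_and_update(self, counter):
        """Return True and record the counter if it is fresh, False on replay or too-old."""
        if counter > self.max_seen:
            shift = counter - self.max_seen
            self.bitmap = ((self.bitmap << shift) | 1) & ((1 << WINDOW) - 1)
            self.max_seen = counter
            return True
        offset = self.max_seen - counter
        if offset >= WINDOW:
            return False                     # too old: outside the window
        if self.bitmap >> offset & 1:
            return False                     # duplicate: replay rejected
        self.bitmap |= 1 << offset           # reordered but fresh: accept
        return True

w = ReplayWindow()
print([w.check_and_update(c) for c in (1, 3, 2, 3)])  # [True, True, True, False]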

Tier 2: Pipeline Sealing (Inference)

For inference prompts and responses:

  • Per-request ephemeral key
  • Sealed prompt/response
  • Wire tag: TENSOR_TAG_ENCRYPTED = 0x10

Pipeline sealing is active: the final segment encrypts output token IDs with the requester's X25519 public key. The final-segment node can see the sampled tokens before encryption — this is inherent to the architecture since sampling happens on that node. Intermediate nodes process activation tensors (protected by Tier 1 in transit) but never see the final plaintext output. See Pipeline Privacy Model for a full breakdown of what each node can see.

Tier 3: Sealed Gossip (Broadcasts)

For GossipSub messages:

  • Epoch-based group key + mandatory Ed25519 origin signature
  • All gossip messages MUST be seal_signed() — unsigned messages are rejected
  • Verifies sender authenticity before processing
  • 1-hour rotation cycle

Transport-Authenticated Dispatch

All inbound network messages carry transport-authenticated sender identity:

  • libp2p Noise protocol authenticates peers at the transport layer
  • AuthenticatedMessage wrapper carries the verified NodeId of the sender
  • MessageDispatcher validates sender identity against message claims:
    • ShardAnnounce: sender must match announce.node_id
    • CreditTransaction: sender must be a party (from or to)
    • CreditGossip, NicknameGossip: sender must match claimed node_id
    • HealthPing/Pong: sender must match claimed node_id
    • EphemeralKeyExchange: sender must match exchange.node_id
  • Mismatched messages are logged and dropped

Signed DHT Records

Kademlia DHT records are Ed25519-signed to prevent poisoning:

  • Format: [32B pubkey][64B signature][payload]
  • start_providing_shards() signs records with node identity
  • Active verification: verify_dht_value() is called on all GetRecordOk results in NetworkManager — records with invalid or missing signatures are logged and discarded
  • Records expire after 1 hour with automatic re-publication
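
A sketch of verifying the [32B pubkey][64B signature][payload] layout with the Python cryptography package; this mirrors what verify_dht_value() does conceptually, not its exact API.

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def verify_dht_record(record: bytes):
    """Return the payload if the Ed25519 signature checks out, else None (discard)."""
    if len(record) < 32 + 64:
        return None
    pubkey, signature, payload = record[:32], record[32:96], record[96:]
    try:
        Ed25519PublicKey.from_public_bytes(pubkey).verify(signature, payload)
    except (InvalidSignature, ValueError):
        return None        # invalid or forged record: log and drop
    return payload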

Identity

  • Ed25519 keypair generated on first run, stored in identity.key
  • Private key never leaves the machine
  • Public key = Node ID (first 8 bytes hex for display)
  • Nickname system: Ed25519-signed records with timestamp-wins conflict resolution
  • Nickname registry capped at 10,000 entries (requires peer_registry membership)

Trust & Reputation

TrustManager tracks per-peer scores (0.0-1.0, default 0.5):

Event | Score Change
InferenceSuccess | +0.01
ValidTransaction | +0.02
SpotCheckFail | -0.10
InvalidGossip | -0.05
SignatureViolation | -0.20

Scores decay toward 0.5 over time (1% per health cycle, default 30 seconds). Trust factors into pipeline scheduling and credit tier weighting.
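
A sketch of the scoring arithmetic: event deltas clamped to [0.0, 1.0], plus a 1% pull toward the 0.5 neutral baseline each health cycle. The constants are taken from the table above; the function names are illustrative, not TrustManager's API.

EVENT_DELTA = {
    "InferenceSuccess":   +0.01,
    "ValidTransaction":   +0.02,
    "SpotCheckFail":      -0.10,
    "InvalidGossip":      -0.05,
    "SignatureViolation": -0.20,
}

def apply_event(score, event):
    return min(1.0, max(0.0, score + EVENT_DELTA[event]))

def decay(score, rate=0.01):
    """Each health cycle (default 30 s) the score drifts 1% back toward 0.5."""
    return score + (0.5 - score) * rate

s = 0.5
s = apply_event(s, "SpotCheckFail")     # 0.40
s = decay(s)                            # 0.401
print(round(s, 3))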

Sybil Resistance

  • Subnet clustering detection: >5 nodes per /24 → elevated spot-check rate
  • Signed-only balance reports
  • Timestamp freshness checks on gossip (5 min window, rejects >5 min old)

API Authentication

  • Auto-generated 32-byte hex Bearer token (constant-time comparison)
  • Protected: /v1/*, /api/admin/provider-models, config PUT, shutdown, HF downloads, API key endpoint
  • Exempt: /, /health, /admin (read-only dashboard), static assets
  • Request body limit: 32 MB (raised from 2 MB to support VLM image payloads)
  • Content-Security-Policy: default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline'; connect-src 'self' ws: wss:; img-src 'self' data: blob:; frame-ancestors 'none'; base-uri 'self'; form-action 'self'
  • X-Content-Type-Options: nosniff
  • X-Frame-Options: DENY
  • Referrer-Policy: no-referrer
  • WebSocket Origin validation (rejects cross-site WebSocket hijacking)

Input Validation

  • Model field length: max 256 chars in OpenAI + Anthropic handlers
  • Tools array: max 128 entries
  • Stop sequences: max 16 entries
  • HuggingFace repo_id: validated owner/repo format (alphanumeric, hyphens, dots, underscores, max 96 chars)
  • HuggingFace filename: must end in .gguf, no .., no URL metacharacters
  • Path traversal: sanitize_path_component() on all network-provided model IDs before filesystem operations
  • Update URLs: only GitHub download URLs accepted
  • Update binaries: SHA256 checksum verification mandatory
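
A hedged approximation of the HuggingFace input checks above; the authoritative rules live in the Rust validators, and the regex and metacharacter set here are simplifications.

import re

REPO_ID = re.compile(r"^[A-Za-z0-9._-]+/[A-Za-z0-9._-]+$")

def valid_repo_id(repo_id: str) -> bool:
    return len(repo_id) <= 96 and bool(REPO_ID.match(repo_id))

def valid_gguf_filename(name: str) -> bool:
    return (
        name.endswith(".gguf")
        and ".." not in name
        and not any(c in name for c in "?#&%:/\\")   # no URL metacharacters or path separators
    )

print(valid_repo_id("Qwen/Qwen2.5-Coder-7B-Instruct-GGUF"))           # True
print(valid_gguf_filename("qwen2.5-coder-7b-instruct.Q4_K_M.gguf"))   # True
print(valid_gguf_filename("../../etc/passwd"))                        # False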

Rate Limiting & DoS Protection

  • Per-IP rate limiter with periodic cleanup (5 min intervals)
  • Inference queue depth cap: 512 requests
  • HTTP timeout: 5 minutes (Slowloris protection via tower-http TimeoutLayer)
  • Credit transaction signature verification before ledger apply

Pipeline Privacy Model

Distributed inference splits a model across multiple nodes. This creates inherent privacy trade-offs — each node in the pipeline must process data to do its job. This section documents exactly what each node can see.

What each node sees during inference

Consider a 3-node pipeline: Requester → Node A (layers 0-10) → Node B (layers 11-21) → Node C (layers 22-27, final):

Data | Requester | Node A (first) | Node B (middle) | Node C (last)
Plaintext prompt | Yes (author) | See below* | No | No
Raw token IDs | Yes | See below* | No | No
Input activations | Yes | Yes | Yes
Output activations | Yes | Yes
Generated token IDs | Yes (decrypted) | No | No | Yes (samples them)
Final plaintext response | Yes (decrypted) | No | No | Yes (before sealing)

*Node A's visibility depends on the local_embedding_privacy setting — see below.

Risk: First-segment node sees raw tokens (default)

Without local_embedding_privacy (default): The first-segment node (Node A) receives the raw prompt text or token IDs to perform the embedding lookup. This means Node A can read the user's prompt in plaintext.

With local_embedding_privacy: true: The requesting node performs the embedding lookup locally and sends pre-embedded activation tensors. Node A receives floating-point vectors instead of token IDs. This is a significant privacy improvement, but not absolute — see Activation Inversion Risk below.

Risk: Final-segment node sees generated output

The final-segment node (Node C) must sample tokens from the logit distribution. This is fundamental — sampling is the act of choosing the next word, and it can only happen where the final layer's output logits exist. Node C therefore sees every generated token before encrypting them via Tier 2 pipeline sealing.

This cannot be mitigated architecturally. The node that runs the last transformer layer and samples tokens will always know what tokens were sampled. Pipeline sealing ensures the tokens are encrypted before being sent back over the network, so intermediate nodes and eavesdroppers cannot read the response — but the final-segment node itself can.

Risk: Activation inversion attacks

All intermediate nodes see hidden-state activation tensors (floating-point matrices). Research has shown that activations from early transformer layers can sometimes be partially inverted to recover input tokens, especially:

  • Embedding-layer activations (layer 0 output) — most vulnerable, essentially a lookup table that can be reversed
  • Early layers (1-4) — progressively harder to invert as information mixes across token positions
  • Deep layers (5+) — extremely difficult to invert in practice; activations encode abstract features, not token identity

Mitigations in SwarmLLM:

  1. local_embedding_privacy: true — the requesting node performs embedding locally, so the first segment never receives the trivially-invertible embedding output. It receives post-layer-0 activations at earliest.
  2. Tier 1 encryption — all inter-node tensor transfers are encrypted with ChaCha20-Poly1305, preventing network-level eavesdropping
  3. Pipeline scheduling preference — the scheduler prefers local segments for the first layers when possible

Risk: Byzantine tensor manipulation

A malicious node can send garbage activations instead of computing the actual transformer layers. This produces incorrect output without detection unless spot-checked. Mitigations: probabilistic spot-check validation (5% rate, 25% for subnet-clustered peers) with trust score reduction on failure.

Summary of privacy guarantees

Configuration | Prompt privacy | Response privacy | Activation risk
Default (no privacy flags) | First segment sees plaintext | Final segment sees plaintext | Intermediate nodes see activations
local_embedding_privacy: true | No remote node sees raw tokens | Final segment sees plaintext | Reduced — no trivial embedding inversion
encrypted_pipeline: true | No remote node sees raw tokens | No remote node sees output | Only intermediate activations visible to remote nodes
+ Tier 2 pipeline sealing | No remote node sees raw tokens | Encrypted on the wire | Reduced — no trivial embedding inversion
All protections enabled | Best available | Best available | Remote nodes only see intermediate activations; inversion theoretically possible but computationally expensive

Bottom line: With encrypted_pipeline, no remote node sees plaintext input or output — the pipeline "boomerangs" through remote nodes and returns to the requester. This is the strongest privacy mode. Without it, local_embedding_privacy still protects raw token IDs but the final-segment node sees generated output.

Local Embedding Privacy

When local_embedding_privacy: true is set in [inference] config, the requesting node performs token→embedding lookup locally before sending activations to the first pipeline segment. Remote nodes never see raw token IDs — only hidden-state activation tensors.

How it works:

  1. On startup, LocalEmbedder loads token_embd.weight from shard_000.bin (~64MB for a 7B Q4 model)
  2. The requesting node tokenizes the prompt and performs the embedding lookup locally (~1ms)
  3. The resulting hidden-state tensor ([1, seq_len, hidden_dim], FP32) is sent as LayerForward.activations with pre_embedded: true
  4. The receiving first-segment node skips its embedding lookup and processes the pre-embedded activations directly
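
A sketch of the local lookup step described above, with numpy standing in for the candle tensor ops and a randomly initialized table standing in for token_embd.weight; LocalEmbedder in src/inference/local_embedder.rs does the real work.

import numpy as np

hidden_dim = 3584
vocab_size = 32_000
embedding_table = np.random.rand(vocab_size, hidden_dim).astype(np.float32)  # stand-in for token_embd.weight

def embed_locally(token_ids):
    """Token IDs -> [1, seq_len, hidden_dim] FP32 activations; raw IDs never leave the node."""
    hidden = embedding_table[np.array(token_ids)]            # plain row lookup, ~1 ms scale
    return hidden[np.newaxis, :, :]                          # shape (1, seq_len, hidden_dim)

activations = embed_locally([1, 529, 7826, 2])
print(activations.shape, activations.nbytes)   # (1, 4, 3584) 57344 -- sent with pre_embedded: true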

Wire format: The pre_embedded flag on LayerForward is #[serde(default)], so old nodes receiving new-format messages default to false (backward compatible).

Trade-off: Pre-embedded activations are larger than raw text (e.g., 512 tokens × 4096 hidden × 4 bytes = 8MB vs ~2KB text). This matches the existing inter-segment activation sizes, so it does not change the bandwidth profile of distributed inference.

Relevant code: src/inference/local_embedder.rs, src/inference/pipeline/, src/daemon/state/mod.rs (local_embedders DashMap).

Encrypted Pipeline

When encrypted_pipeline: true is enabled (globally or per-model), the pipeline scheduler forces the requesting node to handle both the first and last segments. This creates a "boomerang" topology:

Requester (shard 0, embed) → Remote A (middle shards) → ... → Requester (final shard, decode)

No remote node ever sees plaintext — neither the raw prompt tokens nor the generated output. Remote nodes only process intermediate hidden-state activations.

Requirements:

  • The requesting node must hold shard 0 (embedding table) AND the final shard (output head)
  • local_embedding_privacy is auto-enabled when encrypted pipeline is active
  • Only useful for models with 3+ shards (2-shard models = fully local, no distribution)
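
A sketch of the boomerang constraint: the requester must be assignable to both ends of the pipeline, so the scheduler checks local shard holdings before forcing the topology. Names and the data shapes are illustrative; the real logic is greedy_assign in src/inference/scheduler/mod.rs.

def boomerang_plan(total_shards, local_shards, remote_holders):
    """Return a segment order with the requester at both ends, or None if not possible."""
    first, last = 0, total_shards - 1
    if total_shards < 3:
        return None                                  # 2-shard models: just run fully local
    if first not in local_shards or last not in local_shards:
        return None                                  # requester must hold shard 0 AND the final shard
    middle = [remote_holders[i] for i in range(1, last)]  # remote nodes take the interior shards
    return ["requester"] + middle + ["requester"]

plan = boomerang_plan(
    total_shards=4,
    local_shards={0, 3},
    remote_holders={1: "peer-a", 2: "peer-b"},
)
print(plan)  # ['requester', 'peer-a', 'peer-b', 'requester']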

Overhead:

  • Adds ~1 extra network RTT per generated token (activations must return to the requester for final decoding)
  • Latency increase depends on distance to the furthest remote segment
  • No bandwidth overhead vs normal distributed inference (activation sizes are the same)

Per-model configuration:

  • API: GET/PUT /api/admin/models/{id}/encrypted-pipeline
  • Dashboard: gear icon on model card → "Encrypted pipeline" checkbox
  • Global fallback: encrypted_pipeline = true in [inference] config
  • Per-model overrides are persisted to the database

Relevant code: src/inference/scheduler/mod.rs (greedy_assign), src/inference/pipeline/ (auto-enable local embedding), src/api/admin_models/mod.rs (API endpoints), src/daemon/state/mod.rs (encrypted_pipeline_models DashMap).

Known Limitations

These are architectural properties that cannot be fully mitigated with code changes:

  • Gossip epoch key is publicly derivable — derived from "swarmllm-mainnet-v1". Gossip encryption is defense-in-depth; Ed25519 signing is the primary security mechanism.
  • Final-segment output visibility — the node running the last transformer layers sees all generated tokens before pipeline sealing encrypts them. This is inherent to the architecture (see Pipeline Privacy Model).
  • Activation inversion — hidden-state tensors passed between nodes can theoretically be inverted to recover input, especially from early layers. local_embedding_privacy eliminates the trivial case (embedding lookup reversal). Deep-layer inversion remains an open research problem.
  • Byzantine tensor manipulation — malicious peers can send garbage activations. Mitigation: probabilistic spot-check validation (5% rate, 25% for subnet-clustered peers) with trust score reduction on failure.
  • Sybil credit farming — Ed25519 keys are free. Anti-gaming heuristics help but are not bulletproof.
  • GGUF parser vulnerabilities — llama.cpp CVEs. BLAKE3 content hash gates shard loading but parser bugs remain upstream.
  • Kademlia eclipse attacks — strategic Sybil node IDs can control DHT routing. K-bucket eviction policies help.

Storage & Data

Data Directory Layout

~/.local/share/swarmllm/
├── config.toml          # User configuration
├── identity.key         # Ed25519 keypair
├── api_key              # Bearer token (auto-generated)
├── db.redb              # redb database (migrated from sled db/ directory)
└── models/
    ├── qwen2.5-coder-7b/
    │   ├── manifest.json
    │   ├── gguf_header.bin
    │   ├── shard_000.bin
    │   └── shard_001.bin
    └── tinyllama-1.1b/
        └── ...

Database Tables (redb)

Table | Key | Value
config | "config" | Config
config | "api_key" | Bearer token string
identity | "keypair" | Encrypted Ed25519 key
credits | "balance" | CreditBalance
credit_txns | {uuid} | CreditTransaction
peer_trust | {node_id_hex} | TrustScore
peer_cache | {multiaddr} | () presence key
shard_meta | {model_id}/{index} | ShardInfo + path
model_meta | {model_id} | ModelManifest
sessions | {session_id} | KV-cache metadata
nicknames | {node_id_hex} | NicknameRecord
pool_state | "pool" | PoolState
trust_scores | {node_id_hex} | f64 trust score
escrow | {escrow_id} | EscrowEntry
hf_sources | {model_id} | HfSource metadata
locked_shards | {shard_id_json} | bool
resource_schedule | "current" | ResourceSchedule
model_trust | {model_id} | ModelTrustEntry (level, request count, last seen)

Model Acquisition Pipeline

Network Registry (GossipSub/DHT)
        │
        ▼
  Manifest Check ──► Reject if BLAKE3 mismatch
        │
        ▼
  Shard Selection ──► Rarest-first (BitTorrent-style)
        │
        ▼
  Download Loop ──► Atomic write to .tmp, rename to .bin
        │
        ▼
  Shard Verify ──► BLAKE3 vs manifest hash
        │
        ▼
  Model Ready

Integrity guarantees:

  • Manifests verified via BLAKE3 self-hash
  • Each shard verified against manifest hash
  • Failed shards renamed .bin.quarantine, serving peer penalized
  • Downloads retried (3 attempts, exponential backoff)
  • Atomic writes prevent corrupt partial files
  • Stale .tmp files cleaned on startup
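
A sketch of the verify-then-rename step using the blake3 Python package; paths, retry policy, and peer penalties are simplified relative to the real acquisition loop.

import os
import blake3

def commit_shard(tmp_path, final_path, expected_hash):
    """Verify the downloaded .tmp file against the manifest hash, then rename atomically."""
    with open(tmp_path, "rb") as f:
        actual = blake3.blake3(f.read()).hexdigest()
    if actual != expected_hash:
        os.rename(tmp_path, final_path + ".quarantine")   # keep the bad bytes for inspection
        return False                                      # caller penalizes the serving peer and retries
    os.rename(tmp_path, final_path)                       # atomic: no corrupt partial .bin files
    return True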

OpenAI-Compatible API

SwarmLLM provides a drop-in replacement for the OpenAI API. All endpoints require Bearer token authentication.

POST /v1/chat/completions

Chat completions with streaming support.

curl http://localhost:8800/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder-7b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is Rust?"}
    ],
    "stream": true,
    "max_tokens": 512,
    "temperature": 0.7
  }'

Request Body

Field | Type | Required | Default | Description
model | string | yes | - | Model name (or "auto" for first available)
messages | array | yes | - | Chat messages (role + content). Roles: system, user, assistant, tool
stream | boolean | no | false | Enable SSE streaming
max_tokens | integer | no | 2048 | Max tokens to generate (clamped to 1–32768)
temperature | float | no | 0.7 | Sampling temperature (0.0-2.0)
top_p | float | no | 1.0 | Nucleus sampling threshold
stop | string or array | no | - | Stop sequence(s), 1–256 chars each, max 16
frequency_penalty | float | no | 0.0 | Frequency penalty (-2.0 to 2.0)
presence_penalty | float | no | 0.0 | Presence penalty (-2.0 to 2.0)
tools | array | no | - | Tool/function definitions for function calling
tool_choice | string or object | no | - | "none", "auto", "required", or {"type":"function","function":{"name":"..."}}
logprobs | boolean | no | false | Return log probabilities for output tokens. Supported on split model (candle) inference paths
top_logprobs | integer | no | - | Number of top log probabilities per token (0-20, requires logprobs: true). Computed from pre-sampling (raw) logits per OpenAI spec
session_id | string | no | - | Reuse KV-cache from a previous request
lora_adapter | string | no | - | LoRA adapter ID for fine-tuned inference

Response (non-streaming)

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "qwen2.5-coder-7b",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "Rust is a systems programming language..."},
    "finish_reason": "stop",
    "logprobs": null
  }],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 42,
    "total_tokens": 57
  }
}

Response with logprobs

When logprobs: true and top_logprobs: 3:

{
  "choices": [{
    "message": {"role": "assistant", "content": "Hello"},
    "finish_reason": "stop",
    "logprobs": {
      "content": [{
        "token": "Hello",
        "logprob": -0.234,
        "bytes": null,
        "top_logprobs": [
          {"token": "Hello", "logprob": -0.234, "bytes": null},
          {"token": "Hi", "logprob": -1.456, "bytes": null},
          {"token": "Hey", "logprob": -2.012, "bytes": null}
        ]
      }]
    }
  }]
}

Response with tool_calls

When the model calls a tool, finish_reason is "tool_calls" and content is null:

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\":\"NYC\"}"
        }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}

Streaming (SSE)

When stream: true, responses arrive as Server-Sent Events:

data: {"id":"chatcmpl-abc123","choices":[{"delta":{"content":"Rust"},"index":0}]}

data: {"id":"chatcmpl-abc123","choices":[{"delta":{"content":" is"},"index":0}]}

data: [DONE]

GET /v1/models

List available models.

curl http://localhost:8800/v1/models \
  -H "Authorization: Bearer YOUR_API_KEY"
{
  "object": "list",
  "data": [
    {
      "id": "qwen2.5-coder-7b",
      "object": "model",
      "owned_by": "swarmllm"
    }
  ]
}

GET /v1/status

Node status (SwarmLLM extension).

curl http://localhost:8800/v1/status \
  -H "Authorization: Bearer YOUR_API_KEY"

Using with OpenAI Client Libraries

Python (openai)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8800/v1",
    api_key="YOUR_API_KEY"
)

# Basic streaming
response = client.chat.completions.create(
    model="qwen2.5-coder-7b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Python — Function calling

response = client.chat.completions.create(
    model="qwen2.5-coder-7b",
    messages=[{"role": "user", "content": "What's the weather in NYC?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"]
            }
        }
    }],
    tool_choice="auto"
)

if response.choices[0].finish_reason == "tool_calls":
    for tc in response.choices[0].message.tool_calls:
        print(f"Call {tc.function.name}({tc.function.arguments})")

JavaScript (openai)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8800/v1",
  apiKey: "YOUR_API_KEY",
});

const stream = await client.chat.completions.create({
  model: "qwen2.5-coder-7b",
  messages: [{ role: "user", content: "Hello!" }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

curl (streaming)

curl -N http://localhost:8800/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen2.5-coder-7b","messages":[{"role":"user","content":"Hello!"}],"stream":true}'

POST /v1/embeddings

Returns 503 Service Unavailable. Text embeddings are not supported via the subprocess inference path. Use a dedicated embedding provider or the OpenAI embeddings API directly.

GET /v1/providers

List configured cloud providers and their available models.

curl http://localhost:8800/v1/providers \
  -H "Authorization: Bearer YOUR_API_KEY"

Returns an array of { name, models: [...] } objects for each configured provider.

Responses API

OpenAI's /v1/responses is the default API for o-series and gpt-5-series models in 2026 and the replacement for the Assistants API, which sunsets on 2026-08-26. SwarmLLM exposes the full v1 surface plus follow-on features such as resumable streams, async background runs, MCP tool integration, and conversation chaining via previous_response_id.

Endpoints

Method | Path | Purpose
POST | /v1/responses | Create a response (streaming or not, foreground or background)
GET | /v1/responses/{id} | Fetch a stored response. With ?stream=true&starting_after=N, resume the SSE stream from event N (V5).
DELETE | /v1/responses/{id} | Delete a stored response.
POST | /v1/responses/{id}/cancel | Cancel a background response (M9). The cancel flag is checked at completion time; per-token interruption is deferred.
GET | /v1/responses/{id}/input_items | Paginated input-item listing (V4) for chained previous_response_id flows.
GET | /api/admin/responses | Admin: list all stored response records (used by the dashboard).

All endpoints accept the same Bearer-auth header as the rest of the API.

Routing

POST /v1/responses picks one of three execution paths in this order:

  1. Cloud proxy — when the requested model resolves to an OpenAI-routed provider, the request is serialized verbatim and forwarded to the upstream /v1/responses endpoint. Built-in tools, streaming, background, reasoning effort, text.verbosity, include[], previous_response_id, and any future field round-trip via #[serde(flatten)] extras.
  2. Anthropic-Messages bridge (V3) — when the model resolves to an Anthropic provider (or the local claude-subscription subprocess), the Responses request is translated to an Anthropic Messages request, forwarded, and translated back. This lets Claude Code clients drive /v1/responses end-to-end without losing tool-call or streaming semantics.
  3. Local inference — translates to /v1/chat/completions and runs on the local model. Function tools and tool_choice translate through; built-in tools (web_search, file_search, computer_use_preview, code_interpreter, image_generation, mcp, custom) are rejected with HTTP 400 because they require backing infrastructure SwarmLLM does not run.
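
A sketch of this routing order, with provider resolution abstracted to a lookup; the set of built-in tool types rejected on the local path is taken from step 3, and the function names are illustrative.

BUILTIN_TOOLS = {"web_search", "file_search", "computer_use_preview",
                 "code_interpreter", "image_generation", "mcp", "custom"}

def route_responses_request(model, tools, resolve_provider):
    """resolve_provider(model) -> 'openai' | 'anthropic' | 'local' (illustrative)."""
    provider = resolve_provider(model)
    if provider == "openai":
        return "cloud-proxy"                     # forwarded verbatim to upstream /v1/responses
    if provider == "anthropic":
        return "anthropic-bridge"                # translated to /v1/messages and back
    if any(t.get("type") in BUILTIN_TOOLS for t in tools):
        raise ValueError("400: built-in tools are not supported on the local path")
    return "local-chat-completions"              # translated to /v1/chat/completions

print(route_responses_request("gpt-5", [], lambda m: "openai"))
print(route_responses_request("qwen2.5-coder-7b", [{"type": "function"}], lambda m: "local"))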

Capabilities

  • Multimodal input (V2) — input_image and input_file (UTF-8 only) parts in the structured input array. Binary file payloads (PDF, docx, image bytes via file_data) are rejected with a clear hint pointing at input_image.
  • Function tools — tools definitions and tool_choice translate to OpenAI Chat Completions tool semantics; assistant tool_calls map back to function_call output items.
  • Streaming SSE (M6 + V1) — stream=true emits the full Responses event sequence (response.created → response.in_progress → response.output_item.added → response.content_part.added → per-delta response.output_text.delta → response.output_text.done → response.content_part.done → response.output_item.done → response.completed). The V1 fix shipped on 2026-04-25 cuts first-token latency by emitting created and in_progress before model warmup instead of after.
  • Persistence (M7) — store=true (the OpenAI default) writes the full response object to redb with a 30-day TTL. previous_response_id (M8) chains follow-up requests by prepending the prior turn's messages before the new input.
  • Background mode (M9 + V8) — background=true returns HTTP 202 with a Location: /v1/responses/{id} header; the client polls or, with background=true && stream=true, opens a resumable SSE connection at GET /v1/responses/{id}?stream=true that replays buffered events and then tails the live producer.

Validation (ingress)

The handler runs validate_responses_ingress BEFORE any routing decision so the cloud-proxy and Anthropic-bridge paths can't forward attacker-sized strings to upstream providers (where they'd burn quota or land in log lines). Caps:

Field | Limit
model | 1..=256 chars
previous_response_id | ≤64 ASCII alphanumeric (_ / - allowed); generation format is resp_<32-hex>
instructions | ≤2 MB
user | ≤256 chars
truncation, service_tier | ≤64 chars each
metadata | ≤64 KB total (keys + values)

Stop / temperature / top_p / max_tokens are clamped or validated at the sampling-params layer.

Dashboard

The admin dashboard exposes a Responses tab (/admin/responses) backed by GET /api/admin/responses. It shows the most-recent stored response records with status, model, input snippet, and per-record cancel/delete actions.

Deferred

  • POST /v1/responses/compact (V9) — no concrete caller has asked for it.
  • Token-level cancel for background inference — current cancel flips a flag checked at completion time; per-token interruption needs hooks in chat_completions that are out of v2 plan scope.
  • Server-side conversation resource CRUD — OpenAI's conversation parameter forwards through cloud proxy verbatim today; a local conversation type with its own endpoints is a separate design.
  • Built-in tools on the local path — see "Local inference" above.
  • custom tools with Lark / regex grammars — rejected on local, forwarded on cloud. Local grammar-constrained generation is a candle-side project.
  • Audio input on /v1/responses — input_audio returns 400; needs a Whisper-class transcription model SwarmLLM doesn't currently expose.
  • Binary file inputs in input_file (file_data) — UTF-8 only; PDF/docx/image-bytes payloads are rejected with a clear hint pointing at input_image (for images) or server-side text extraction.

Anthropic Messages API

SwarmLLM provides a full Anthropic Messages API at POST /v1/messages, enabling it to serve as a drop-in backend for Claude Code and other Anthropic-compatible clients.

Claude Code Integration

Use SwarmLLM as your Claude Code backend to access all models (local, network, and cloud) through a single endpoint:

ANTHROPIC_BASE_URL=http://localhost:8800 claude --model qwen2.5-coder-7b

Environment Variables

Variable | Description
ANTHROPIC_BASE_URL | Point to your SwarmLLM node (e.g., http://localhost:8800)
ANTHROPIC_AUTH_TOKEN | Your node's API key (from Settings or /api/admin/api-key)
ANTHROPIC_MODEL | Default model to use

POST /v1/messages

Request Body

Field | Type | Required | Description
model | string | yes | Model name (local GGUF, network model, or cloud model like gpt-4o)
messages | array | yes | Chat messages with role + content
max_tokens | integer | yes | Maximum tokens to generate (clamped to 1–32768)
system | string or array | no | System prompt (supports cache_control blocks)
stream | boolean | no | Enable SSE streaming
temperature | float | no | Sampling temperature
top_p | float | no | Nucleus sampling
stop_sequences | array | no | Stop sequences, 1–256 chars each, max 16
tools | array | no | Tool definitions for function calling
tool_choice | object | no | Tool selection strategy
metadata | object | no | Request metadata
thinking | object | no | Extended thinking configuration

Content Block Types

Messages can contain these content block types:

// Text
{"type": "text", "text": "Hello, world!"}

// Image (base64)
{"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "..."}}

// Tool use (assistant response)
{"type": "tool_use", "id": "toolu_123", "name": "get_weather", "input": {"location": "NYC"}}

// Tool result (user message)
{"type": "tool_result", "tool_use_id": "toolu_123", "content": "72F, sunny"}

// Thinking (extended thinking)
{"type": "thinking", "thinking": "Let me reason about this..."}

// Redacted thinking
{"type": "redacted_thinking", "data": "..."}

Response

{
  "id": "msg_abc123",
  "type": "message",
  "role": "assistant",
  "model": "qwen2.5-coder-7b",
  "content": [
    {"type": "text", "text": "Here's my response..."}
  ],
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 25,
    "output_tokens": 150
  }
}

Model Routing

Requests are routed based on the model name:

Model Pattern | Route | Details
Local GGUF model | Local inference | Tool calls and thinking blocks converted to text
claude-* | Anthropic API | Full pass-through (all fields preserved including tools and thinking)
gpt-*, o1-*, o3-*, o4-* | OpenAI | Anthropic→OpenAI format translation
deepseek-* | DeepSeek | Anthropic→OpenAI format translation
mistral-*, codestral-*, pixtral-* | Mistral | Anthropic→OpenAI format translation
llama-*, groq-* | Groq | Anthropic→OpenAI format translation
nim-* | NVIDIA NIM | Anthropic→OpenAI format translation
cerebras-* | Cerebras | Anthropic→OpenAI format translation
samba-* | SambaNova | Anthropic→OpenAI format translation
fireworks-*, accounts/fireworks/* | Fireworks AI | Anthropic→OpenAI format translation
together-* | Together AI | Anthropic→OpenAI format translation
deepinfra-* | DeepInfra | Anthropic→OpenAI format translation
moonshot-*, kimi-* | Moonshot/Kimi | Anthropic→OpenAI format translation
Network model | Distributed inference | Routed through swarm P2P network

All 12 cloud providers are supported. Configure API keys via the dashboard Settings page or by placing a .env file in the data directory (~/.local/share/swarmllm/.env) with standard variable names (e.g., OPENAI_API_KEY, DEEPSEEK_API_KEY).
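
A minimal sketch of the prefix matching that picks a provider from the model name; the prefix list mirrors the table above, and anything unmatched falls through to local or distributed inference. The function name is illustrative.

PREFIX_ROUTES = [
    (("claude-",), "anthropic"),
    (("gpt-", "o1-", "o3-", "o4-"), "openai"),
    (("deepseek-",), "deepseek"),
    (("mistral-", "codestral-", "pixtral-"), "mistral"),
    (("llama-", "groq-"), "groq"),
    (("nim-",), "nvidia-nim"),
    (("cerebras-",), "cerebras"),
    (("samba-",), "sambanova"),
    (("fireworks-", "accounts/fireworks/"), "fireworks"),
    (("together-",), "together"),
    (("deepinfra-",), "deepinfra"),
    (("moonshot-", "kimi-"), "moonshot"),
]

def route_model(model: str) -> str:
    for prefixes, provider in PREFIX_ROUTES:
        if model.startswith(prefixes):
            return provider
    return "local-or-swarm"     # local GGUF or distributed network model

print(route_model("claude-sonnet-4-6"))   # anthropic
print(route_model("qwen2.5-coder-7b"))    # local-or-swarm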

System Blocks with Cache Control

Anthropic-compatible prompt caching:

{
  "system": [
    {"type": "text", "text": "You are a helpful assistant.", "cache_control": {"type": "ephemeral"}}
  ]
}

Streaming (SSE)

When stream: true, responses arrive as Server-Sent Events following the Anthropic streaming format:

event: message_start
data: {"type":"message_start","message":{"id":"msg_123","type":"message","role":"assistant","model":"qwen2.5-coder-7b","content":[]}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}

event: message_stop
data: {"type":"message_stop"}

MCP Server

SwarmLLM includes a native Model Context Protocol (MCP) server at POST /mcp. This enables AI agents like Claude Code, Cursor, VS Code Copilot, and other MCP-compatible tools to use your SwarmLLM node as a tool provider.

Protocol version: 2025-11-05 (JSON-RPC 2.0 over HTTP).

Endpoint

POST /mcp
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY

All requests use JSON-RPC 2.0 format. All tools include tool annotations (readOnlyHint, destructiveHint, etc.).

Available Tools

chat

Send a message to any model available on the node (local, network, or cloud).

{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "chat",
    "arguments": {
      "model": "qwen2.5-coder-7b",
      "messages": [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Explain Rust's ownership model"}
      ],
      "temperature": 0.7,
      "max_tokens": 2048
    }
  },
  "id": 1
}

models

List all available models (local + network + cloud).

{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": { "name": "models", "arguments": {} },
  "id": 2
}

compare

Send the same prompt to multiple models concurrently and get side-by-side results. Up to 10 models per comparison.

{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "compare",
    "arguments": {
      "prompt": "Write a function to check if a number is prime",
      "models": ["qwen2.5-coder-7b", "gpt-4o", "claude-sonnet-4-20250514"],
      "system": "Write clean, efficient code.",
      "max_tokens": 1024
    }
  },
  "id": 3
}

research

Fan out a research question to multiple models in parallel. Designed for knowledge gathering — offload questions to cheap/fast models to get diverse perspectives without using expensive model tokens. If models is omitted, auto-selects available models (local first, then cloud).

{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "research",
    "arguments": {
      "question": "What are the tradeoffs between ring-allreduce and star topology for tensor parallelism?",
      "models": ["deepseek-chat", "gpt-4o-mini", "qwen2.5-coder-7b"],
      "system": "Be concise and technical.",
      "max_tokens": 2048
    }
  },
  "id": 4
}

Response:

{
  "question": "What are the tradeoffs...",
  "models_queried": 3,
  "successful_responses": 3,
  "total_tokens_used": 1847,
  "results": [
    {
      "model": "deepseek-chat",
      "response": "Ring-allreduce...",
      "input_tokens": 24,
      "output_tokens": 512,
      "latency_ms": 2100,
      "status": "ok"
    }
  ]
}

batch_prompts

Execute multiple independent prompts in parallel, each targeting a specific model. Ideal for offloading parallel subtasks — e.g., ask one model to summarize, another to translate, another to review code, all at once.

{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "batch_prompts",
    "arguments": {
      "tasks": [
        {
          "id": "summary",
          "model": "gpt-4o-mini",
          "prompt": "Summarize this error log: ...",
          "max_tokens": 512
        },
        {
          "id": "fix",
          "model": "qwen2.5-coder-7b",
          "prompt": "Write a fix for this bug: ...",
          "max_tokens": 1024
        },
        {
          "id": "translate",
          "model": "deepseek-chat",
          "prompt": "Translate to Japanese: ...",
          "max_tokens": 256
        }
      ]
    }
  },
  "id": 5
}

Response:

{
  "tasks_submitted": 3,
  "tasks_completed": 3,
  "results": [
    {
      "task_id": "summary",
      "model": "gpt-4o-mini",
      "content": "The error log shows...",
      "latency_ms": 890,
      "status": "ok"
    }
  ]
}

delegate

Offload a task to the most appropriate model based on a tier preference. Tiers: fast picks the lowest-latency local model, cheap picks a small/free model, smart picks the most capable available model (may use cloud). Saves subscription tokens by routing routine work to local/cheap models.

{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "delegate",
    "arguments": {
      "prompt": "Summarize this function in one sentence: ...",
      "tier": "fast",
      "max_tokens": 256
    }
  },
  "id": 6
}

Tiers:

  • fast — lowest-latency local model (default)
  • cheap — smallest/free model available
  • smart — most capable model (may use cloud provider)

node_info

Get detailed information about the SwarmLLM node: loaded models, connected peers, credit balance, available cloud providers, and network status.

{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": { "name": "node_info", "arguments": {} },
  "id": 6
}

Available Resources

swarmllm://status

Returns node status information (version, model loaded, peer count).

{
  "jsonrpc": "2.0",
  "method": "resources/read",
  "params": { "uri": "swarmllm://status" },
  "id": 7
}

IDE Integration

Claude Code

Option A: MCP tools — access SwarmLLM's tools (research, batch, compare) alongside your normal model:

claude mcp add --transport http swarmllm http://localhost:8800/mcp \
  --header "Authorization: Bearer YOUR_API_KEY"

Option B: Model backend — use SwarmLLM as your inference backend (routes all requests through the swarm):

ANTHROPIC_BASE_URL=http://localhost:8800 ANTHROPIC_AUTH_TOKEN=YOUR_API_KEY \
  claude --model qwen2.5-coder-7b

Option C: Both — use Claude for reasoning, SwarmLLM MCP for offloading research to cheap models:

# Add SwarmLLM as MCP server
claude mcp add --transport http swarmllm http://localhost:8800/mcp \
  --header "Authorization: Bearer YOUR_API_KEY"

# Then use Claude normally — it can call research/batch/compare tools via MCP
claude

VS Code (Copilot Chat)

Add to .vscode/mcp.json in your project:

{
  "servers": {
    "swarmllm": {
      "type": "http",
      "url": "http://localhost:8800/mcp",
      "headers": {
        "Authorization": "Bearer YOUR_API_KEY"
      }
    }
  }
}

Copilot Chat will discover SwarmLLM's tools automatically. Use them by asking Copilot to research, compare models, or batch prompts.

Cursor / Windsurf / Other MCP Clients

Any MCP-compatible client can connect via HTTP:

URL: http://localhost:8800/mcp
Transport: HTTP (Streamable HTTP)
Auth: Bearer token in Authorization header

Continue.dev (OpenAI API)

If your IDE extension supports the OpenAI API format, point it directly at SwarmLLM:

{
  "models": [{
    "title": "SwarmLLM Local",
    "provider": "openai",
    "model": "qwen2.5-coder-7b",
    "apiBase": "http://localhost:8800/v1",
    "apiKey": "YOUR_API_KEY"
  }]
}

Model Compare Dashboard

The compare functionality is also available in the web dashboard via the Compare tab. Select 2-10 models, enter a prompt, and view results side-by-side with latency, token counts, and response content.

Admin API

Admin endpoints are CORS-protected. Most read-only endpoints don't require Bearer auth; write operations do.

Node Management

GET /api/admin/stats

Node statistics and hardware info.

GET /api/admin/peers

Connected peers with latency, trust scores, and hosted models.

GET /api/admin/credits

Credit balance and tier info.

GET /api/admin/network-map

Geographic distribution of peers and shards across regions. Each entry includes the total peer count for that region, per-model shard-holder counts, per-model request demand rates, coverage gaps (models with zero holders in the region), and per-model replication targets derived from pool size and demand. Includes the local node in its auto-detected or configured region.

Response:

{
  "regions": {
    "US": {
      "total": 3,
      "models": { "tinyllama-1.1b-q4-k-m": 2 },
      "demand": { "tinyllama-1.1b-q4-k-m": 5 },
      "coverage_gaps": [],
      "replication_target": { "tinyllama-1.1b-q4-k-m": 2 }
    }
  }
}

GET/PUT /api/admin/config

Read or update daemon configuration. PUT requires Bearer auth.

POST /api/admin/config/reload

Hot-reload operational parameters without restart. Bearer auth required.

POST /api/admin/shutdown

Gracefully shut down the node. Localhost only, Bearer auth required.

Model Management

GET /api/admin/models

List models with shard status, VRAM estimates, and acquisition state. Each model includes:

  • mmproj field with available (bool), local (bool), and holders (count) for VLM vision encoder status
  • trust_level field: one of "Discovered", "Pinned", "DemandVerified", or "NetworkPopular" indicating the model's trust status (auto-manage only downloads shards for DemandVerified+ or Pinned models)

POST /api/admin/models/{id}/add

Trigger model acquisition from the network.

GET /api/admin/models/{id}/status

Check model acquisition progress.

DELETE /api/admin/models/

Remove model (shards + manifest + state).

DELETE /api/admin/models/{id}/shards/

Delete a single shard.

GET/PUT /api/admin/models/{id}/auto-manage

Per-model auto-manage policy (including prune toggle).

GET/PUT /api/admin/models/{id}/encrypted-pipeline

Per-model encrypted pipeline toggle. GET returns current status, readiness (whether local node holds first + last shard), and overhead note. PUT enables/disables with body {"enabled": true}. Requires the local node to hold shard 0 and the final shard. Returns a warning for 2-shard models (fully local, no distribution benefit). Setting is persisted to the database and survives restarts. Falls back to global encrypted_pipeline config if no per-model override is set.

PUT /api/admin/models/{id}/shards/{index}/lock

Lock/unlock a shard to prevent auto-pruning.

Storage & Shards

POST /api/admin/rescan-shards

Rescan local shard files on disk and update the model registry and network announcements without restarting the daemon. Useful after manually placing shard files in the data directory. Bearer auth required.

Response:

{ "status": "ok", "models_updated": ["model-id-1"], "count": 1 }

GET /api/admin/models/{id}/metadata

Read parsed GGUF metadata from a locally-stored model header (gguf_header.bin). Returns architecture parameters, tokenizer settings, quantization type, and all raw metadata key/value pairs (tokenizer vocabulary arrays are excluded). Returns 400 if no header file exists for the model.

Response shape:

{
  "model_id": "...",
  "general": { "name": "...", "architecture": "llama", "architecture_supported": true, "file_type": 11, "quantization": "Q4_K_M" },
  "model": { "context_length": 4096, "block_count": 32, "embedding_length": 4096, "head_count": 32, "head_count_kv": 8, "rope_dimension_count": 128, "rope_freq_base": 500000.0, "layer_norm_rms_epsilon": 1e-5, "vocab_size": 32000 },
  "tokenizer": { "model": "llama", "pre": "...", "eos_token_id": 2, "bos_token_id": 1, "padding_token_id": null },
  "tensors": { "count": 291, "data_offset": 131072 },
  "raw": [{ "key": "general.architecture", "value": "llama" }, ...]
}

POST /api/admin/models/{id}/shards/{index}/download

Trigger a P2P download of a specific shard that is not yet held locally. The daemon first checks for P2P peers that hold the shard (picking the best peer by LAN-proximity, latency, and trust), then falls back to returning HuggingFace source info if no peers are available. Bearer auth required.

Responses:

  • { "status": "already_local", ... } — shard is already on disk
  • { "status": "downloading", "source": "p2p", "peer": "...", ... } — P2P download started
  • { "status": "use_hf", "source": "huggingface", "repo_id": "...", "filename": "...", ... } — no P2P peers, use hf/download-shards instead
  • 400 if no peers and no HuggingFace source known

POST /api/admin/models/{id}/shards/{index}/unload

Unload a single shard from memory (VRAM/RAM) without deleting the file from disk. Narrows the model's shard window to exclude this shard and restarts the worker subprocess. If this is the last loaded shard, the model is fully unloaded. Bearer auth required.

Response:

{ "status": "unloaded", "model_id": "...", "shard_index": 0, "remaining_loaded": [1, 2] }

POST /api/admin/models/{id}/shards/{index}/load

Load a shard that is on disk into memory. The shard must already be present locally (use /download first if not). Expands the model's shard window to include the shard and restarts the worker subprocess. Bearer auth required.

Response:

{ "status": "loaded", "model_id": "...", "shard_index": 0, "loaded_shards": [0, 1, 2] }

POST /api/admin/models/{id}/unload

Unload an entire model from memory (VRAM/RAM) without deleting any files from disk. Evicts all split-model entries, kills the worker subprocess, clears GGUF metadata cache, and clears the loaded-model record. Bearer auth required.

Response:

{ "status": "unloaded", "model_id": "...", "model_name": "...", "segments_removed": 2, "estimated_freed_mb": 4096 }

GET /api/admin/shard-storage

Per-model storage breakdown, disk and VRAM usage.

GET /api/admin/prune-history

Recent auto-prune events.

GET/PUT /api/admin/schedule

Resource schedule management.

HuggingFace Integration

GET /api/admin/hf/search?query=...

Search HuggingFace for GGUF models. Returns results grouped by repository with quantization variants, recommended variant, and VRAM fitness indicator.

Response format:

[{
  "repo_id": "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
  "downloads": 50000,
  "likes": 120,
  "variants": [
    { "filename": "...Q4_K_M.gguf", "size_bytes": 668000000, "quant": "Q4_K_M" },
    { "filename": "...Q8_0.gguf", "size_bytes": 1100000000, "quant": "Q8_0" }
  ],
  "recommended_variant": "Q4_K_M",
  "fits_vram": true
}]

GET /api/admin/hf/probe?repo_id=...&filename=...

Probe a remote GGUF file (size, shard layout).

POST /api/admin/hf/download-shards

Download specific shard indices from HuggingFace. Bearer auth required.

Supports peer_fair_share: true for smart distribution — the backend computes a deterministic fair share of shards using BLAKE3(node_id || model_id), and peers with auto-manage enabled auto-acquire the rest.

curl -X POST http://localhost:8800/api/admin/hf/download-shards \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"repo_id": "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF", "filename": "qwen2.5-coder-7b-instruct.Q4_K_M.gguf", "peer_fair_share": true}'

GET /api/admin/hf/source/

Look up the HuggingFace source (repo + filename) for a locally-known model. First checks the in-memory source cache and the probe cache, then auto-discovers by searching HuggingFace if neither has an entry. If found via auto-discovery the result is cached to the database and hf_source.json in the model directory.

Response:

{ "model_id": "...", "repo_id": "TheBloke/TinyLlama-...-GGUF", "filename": "tinyllama-...Q4_K_M.gguf" }

GET /api/admin/downloads

List the download queue with per-shard progress, speed, and source.

POST /api/admin/downloads/{model_id}/cancel

Cancel an in-progress download.

LoRA Adapters

GET /api/admin/adapters

List all registered LoRA adapters with their metadata (id, name, base model, rank, alpha, path).

Response: { "adapters": [ { "id": "...", "name": "...", "base_model": "...", "rank": 16, "alpha": 32.0, "path": "..." } ] }

POST /api/admin/adapters

Register a LoRA adapter from a safetensors file. Bearer auth required. Path traversal is blocked. If id is omitted, a UUID is generated.

Request body:

{ "id": "my-adapter", "name": "My Adapter", "base_model": "tinyllama-...", "rank": 16, "alpha": 32.0, "path": "adapters/my-adapter.safetensors" }

path may be absolute or relative to <data_dir>/adapters/.

Response: { "status": "ok", "adapter": { ... } }

DELETE /api/admin/adapters/

Unregister a LoRA adapter. Does not delete the file from disk. Bearer auth required. Returns 400 if the id is not found.

Response: { "status": "ok", "message": "Adapter 'my-adapter' removed" }

Cloud Providers

GET /api/admin/providers

List configured cloud providers (name + configured flag, no keys exposed).

PUT /api/admin/providers

Update cloud provider API keys. Bearer auth required. Keys are encrypted at rest.

GET /api/admin/provider-models

List available models from all configured cloud providers. Results are cached for 60 seconds; stale results are returned immediately and refreshed in the background. Includes models from OpenAI, Anthropic (static list), DeepSeek, Mistral, Groq, NVIDIA NIM, Cerebras, SambaNova, Fireworks, Together AI, DeepInfra, and Moonshot/Kimi.

Response: { "models": [ { "id": "gpt-4o", "name": "GPT-4o", "provider": "openai" } ] }

GET /api/admin/provider-health

Probe each configured provider by sending a tiny max_tokens=1 inference request (using a suitable test model per provider). All probes run in parallel with a connect timeout.

Response:

{ "providers": [ { "provider": "openai", "status": "up", "latency_ms": 320, "detail": "" } ] }

Status values: up, rate_limited, overloaded, timeout, unreachable, error_<code>.

POST /api/admin/provider-model-status

Probe availability and latency for a list of specific cloud model IDs (up to 20 per request). Sends a max_tokens=1 request to each model's provider endpoint. Anthropic models are skipped (no cloud proxy probing). Bearer auth not required.

Request body: { "models": ["gpt-4o", "claude-sonnet-4-6", "deepseek-chat"] }

Response:

{ "models": [ { "model": "gpt-4o", "status": "up", "latency_ms": 210 } ] }

Status values: up, rate_limited, not_found, unavailable, timeout, error.

Claude Subscription (feature-gated)

Requires building with --features claude-subscription. When the feature is not enabled, these endpoints return {"error": "claude-subscription feature not enabled"}.

GET /api/admin/claude-subscription/status

Detect whether the claude CLI is installed and authenticated on this machine. Reads version from claude --version and subscription info from ~/.claude/.credentials.json (read-only).

Response:

{
  "cli_installed": true,
  "cli_version": "2.1.92 (Claude Code)",
  "authenticated": true,
  "subscription_type": "max",
  "rate_limit_tier": "default_claude_max_5x"
}

PUT /api/admin/providers (claude_subscription_enabled field)

Enable or disable the Claude subscription provider. Pass claude_subscription_enabled alongside other provider key updates.

{ "claude_subscription_enabled": true }

When enabled, claude-* model requests are routed through the local CLI subprocess instead of the Anthropic API key. When disabled, requests fall back to the Anthropic API key (if configured).

Updates

GET /api/admin/version

Current binary version info.

POST /api/admin/update/check

Check for available updates. Returns version info and changelog if update available.

POST /api/admin/update/apply

Download and apply an update. Bearer auth required.

Discovery

GET /api/admin/network-code

Get an encrypted shareable invite code and network phase. The code embeds the node's TCP listening address encrypted with ChaCha20Poly1305 — the IP is not visible in the code.

POST /api/admin/join-network

Join the network via encrypted invite code (swarm://...) or raw multiaddr. Immediately dials the peer and saves the address to the peer cache.

Responses API listing

GET /api/admin/responses

List stored Responses-API records (backs the dashboard's Responses tab). Optional query params: ?limit=N (cap on returned records, default 100, max 500) and ?status=... (filter by completed / in_progress / cancelled / failed / queued). See Responses API for the user-facing surface.

Authentication

GET /api/admin/api-key

Retrieve the API key. Bearer auth required.

WebSocket

GET /api/admin/ws

WebSocket for live updates. Pushes the following event types:

Event | Trigger | Data
activity_event | Any subsystem event | kind, model_id, message, timestamp, toast_level
stats_update | Every 2s | Peer count, credits, acquisitions, shard registry, swarm_capacity (R110), wishlist (R111)
peer_list | Peer connect/disconnect | Full peer snapshot
models_changed | Shard download/load/prune | (none — signals dashboard to refresh)
update_available | New version detected | Version info, changelog

Claude Subscription Provider

Use your existing Claude Pro, Max, Team, or Enterprise subscription to access Claude models through SwarmLLM — no API key or per-token charges needed.

Feature-gated: Build with --features claude-subscription to enable. This feature is isolated behind a compile-time flag for easy removal.

How It Works

When enabled, SwarmLLM spawns the claude CLI as a subprocess for each Claude model request:

Client Request (OpenAI or Anthropic format)
  → SwarmLLM API (openai.rs / anthropic/mod.rs)
    → Provider resolution: model starts with "claude-"
      → Claude subscription enabled? → Spawn subprocess
      → Else: use Anthropic API key (existing behavior)
    → claude -p --output-format stream-json --model <model> "<prompt>"
    → Parse NDJSON → Translate to API format → Return response

Both the OpenAI (/v1/chat/completions) and Anthropic (/v1/messages) endpoints are supported, with streaming and non-streaming modes.

Setup

1. Install the Claude CLI

npm install -g @anthropic-ai/claude-code

2. Log in with your subscription

claude login

This opens a browser window. Sign in with your Claude Pro/Max/Team/Enterprise account.

3. Build SwarmLLM with the feature

cargo build --no-default-features --features dev,claude-subscription

4. Enable via the dashboard

Open Settings → Cloud Providers → Claude Subscription, click "Check Status" to verify your CLI is detected, then enable the toggle.

Or via API:

curl -X PUT http://localhost:8800/api/admin/providers \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"claude_subscription_enabled": true}'

5. Send requests

# OpenAI format
curl http://localhost:8800/v1/chat/completions \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

# Anthropic format
curl http://localhost:8800/v1/messages \
  -H "x-api-key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "max_tokens": 100,
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

Multi-Turn Conversations

Multi-turn conversations work by serializing the full message history into the prompt on each request. The format uses XML tags that Claude understands natively:

  • System messages → <system>...</system>
  • Assistant messages → <previous_response>...</previous_response>
  • User messages → bare text

This is the same stateless approach used by OpenAI-compatible APIs — the client sends the full conversation every time, and the server doesn't maintain session state.
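
As an illustration (the real serializer may differ in whitespace and ordering), a short two-turn exchange would flatten to roughly:

<system>You are a concise assistant.</system>
<previous_response>Hi! How can I help?</previous_response>
What's the capital of France?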

Configuration

All configuration is in the providers.claude_subscription section, manageable via the admin API or dashboard:

Field | Default | Description
enabled | false | Route Claude requests through the CLI
claude_binary | "claude" | Path to the claude binary
default_model | (from request) | Override model for all requests
max_concurrent | 3 | Max concurrent subprocess invocations
timeout_secs | 300 | Timeout per request (seconds)
working_dir | (system temp) | Working directory for the subprocess

Working Directory

By default, the subprocess runs in the system temp directory to avoid loading project-specific CLAUDE.md files, hooks, and MCP servers. Set working_dir to a project path if you want Claude to have project context for its responses.

Routing Priority

When a claude-* model is requested:

  1. Claude subscription (if enabled and CLI detected) — subprocess path, uses subscription
  2. Anthropic API key (if configured) — direct API proxy, pay-per-token
  3. Error — no provider available

The subscription provider takes priority over the API key. Disable the subscription toggle to fall back to API key billing.

Rate Limits

Subscription rate limits are enforced per rolling 5-hour window (not per-minute request limits like API keys). The concurrency limiter (default 3) prevents spawning too many concurrent processes. Community reports suggest ~3-5 parallel Opus sessions before degradation.

Rate limit info is returned in the NDJSON output and logged. The GET /api/admin/claude-subscription/status endpoint shows the current rate limit tier.

Removal

If this feature needs to be removed:

git rm src/api/claude_sub.rs
# Remove "claude-subscription = []" from Cargo.toml
grep -rn 'claude.subscription\|claude_sub' src/ frontend/
# Remove the ~6 #[cfg] blocks found by grep

Single commit, clean removal. No deep dependencies on the rest of the codebase.

Identity & Device Pool API

Identity

GET /api/identity/nickname

Get the current node's nickname.

PUT /api/identity/nickname

Set a nickname. Body: {"nickname": "my-node"}

DELETE /api/identity/nickname

Clear the nickname.

GET /api/identity/leaderboard

Network-wide credit leaderboard.

GET /api/identity/peers

Peer identity directory (nicknames, regions, tiers).

Device Pools ("My Devices")

Link multiple devices owned by the same user. Credits earned by all linked devices are combined into one balance on the main (owner) device.

Terminology: "Linked Devices" in the UI. This is different from connecting to the SwarmLLM network — linking devices groups your own hardware, while the network connects you with other people.

Quick Start (CLI)

# On your main device:
swarmllm pool create --name "My Devices"
swarmllm pool invite-code
# → A3F7K2M9

# On each other device:
swarmllm pool join A3F7K2M9

# Check status:
swarmllm pool status

Invite Code System

Instead of exchanging raw 64-character node IDs, device pools use 8-character invite codes (e.g., A3F7K2M9):

  1. Owner generates a code → POST /api/pool/generate-code
  2. Code shared verbally, via QR, or copy-paste
  3. Member enters code → POST /api/pool/join → broadcasts join request over gossip
  4. Owner's node auto-validates code and creates invitation
  5. Member auto-accepts → pool established

Security: Codes use a 32-character alphabet (no 0/O/1/I), are one-time use, expire in 24h, and the code itself is never transmitted over the network — only its BLAKE3 hash.
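
The same flow over the raw API (Bearer auth assumed to match the rest of the HTTP surface; bodies as documented below):

# Owner: create the pool, then mint a code
curl -X POST http://localhost:8800/api/pool/create \
  -H "Authorization: Bearer <owner-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"name": "My Devices"}'

curl -X POST http://localhost:8800/api/pool/generate-code \
  -H "Authorization: Bearer <owner-api-key>"
# → {"code": "A3F7K2M9"}

# Member: join with the code
curl -X POST http://localhost:8800/api/pool/join \
  -H "Authorization: Bearer <member-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"code": "A3F7K2M9"}'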

API Endpoints

GET /api/pool/state

Current pool membership state. Returns in_pool, member list with device names, online status, per-device stats, credit split percentage.

POST /api/pool/create

Create a new device pool. Body: {"name": "My Devices"}

POST /api/pool/generate-code

Generate an invite code (owner only). Returns: {"code": "A3F7K2M9"}. Max 5 active codes.

POST /api/pool/join

Join a pool using an invite code. Body: {"code": "A3F7K2M9"}

POST /api/pool/invite

Invite a specific node by ID (advanced). Body: {"node_id": "abc123..."}

POST /api/pool/accept

Accept a pool invitation. Body: {"invitation_id": "..."}

POST /api/pool/remove

Remove a member (owner only). Body: {"node_id": "..."}

POST /api/pool/leave

Leave the current pool.

POST /api/pool/device-name

Set this device's nickname. Body: {"name": "Gaming PC"}

PUT /api/pool/credit-split

Set credit split percentage (owner only). Body: {"pct": 20} (0-50)

PUT /api/pool/contribution

Set per-member contribution level override. Body: {"node_id": "...", "level": 75} (integer 0–100)

GET /api/pool/invitations

List pending invitations for this node.

GET /api/pool/leaderboard

Pool member contribution rankings.

GET/PUT /api/admin/pools/:id/rates

Per-pool credit rate overrides.

Private Mode

Restrict inference to your device pool for maximum privacy. Your prompts never leave your devices.

GET /api/pool/private-mode

Current state + coverage summary. Returns enabled, allow_lan, offline_mode, and coverage object.

PUT /api/pool/private-mode

Toggle private mode. Body: {"enabled": true} or {"enabled": true, "offline_mode": true}. Returns coverage summary so the UI can show trade-offs immediately.
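
For example, enabling private mode together with air-gapped offline mode (Bearer auth assumed):

curl -X PUT http://localhost:8800/api/pool/private-mode \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"enabled": true, "offline_mode": true}'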

GET /api/pool/coverage

Per-model coverage breakdown: total_shards, pool_shards, coverage_pct, missing indices, est_download_mb. Also returns disk_budget_mb and disk_used_mb.

Shard Pinning

GET /api/pool/pins

List current shard pins.

POST /api/pool/pin

Pin a model to a specific device (owner only). Body: {"model_id": "...", "target_node_id": "hex..."}. Optional shard_indices array for specific shards (empty = all shards).
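
For example, pinning just shards 0 and 1 of a model to one device (IDs are placeholders; Bearer auth assumed):

curl -X POST http://localhost:8800/api/pool/pin \
  -H "Authorization: Bearer <owner-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"model_id": "<model-id>", "target_node_id": "<hex-node-id>", "shard_indices": [0, 1]}'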

DELETE /api/pool/pin

Remove a shard pin. Same body format as POST.

Pool Features

  • Device nicknames: Name each device for easy identification
  • Online/offline status: Tracked via health pings, displayed with last-seen timestamps
  • Per-device stats: VRAM, shards hosted, forwards served, uptime, models hosted
  • Combined VRAM: Aggregate GPU memory across all linked devices
  • Credit split: Owner configures what percentage (0-50%) members keep vs forward
  • Private Mode: Restrict inference to pool devices only. Toggle via UI or API
  • Shard Pinning: Assign specific models to specific devices. Auto-manage respects pins
  • Offline Mode: Air-gapped LAN operation with mDNS-only discovery
  • Coverage Dashboard: Per-model availability bars showing pool shard coverage
  • Max 10 devices per pool (configurable), 10 pool operations per hour rate limit

Pool Security

  • Invite codes: 32^8 ≈ 1.1 trillion combos, one-time use, 24h expiry
  • Join requests signed with Ed25519 (transport-layer sender authentication)
  • Credit forwarding uses dual-signed PoolCreditForward (member + owner)
  • Member removal requires Ed25519-signed removal notice with replay protection
  • Pool state gossip verifies each member's acceptance signature
  • Blinded invitation broadcast (SEC-M18): network observers can't see who's invited

Prometheus Metrics

SwarmLLM exposes a Prometheus-compatible metrics endpoint at GET /metrics. No authentication required (standard convention for metrics endpoints).
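
A quick way to eyeball the exporter output:

curl -s http://localhost:8800/metrics | grep '^swarmllm_'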

Available Metrics

Core Metrics

Metric | Type | Description
swarmllm_peers_connected | gauge | Number of connected peers
swarmllm_inference_requests_total | counter | Total inference requests processed
swarmllm_credits_balance | gauge | Current credit balance
swarmllm_shards_hosted | gauge | Number of locally hosted shards
swarmllm_inference_latency_seconds | histogram | Inference request latency

Channel Metrics

Internal channel health metrics for monitoring backpressure:

Metric | Type | Description
swarmllm_channel_capacity{channel="..."} | gauge | Channel buffer capacity
swarmllm_channel_sent_total{channel="..."} | counter | Messages sent through channel
swarmllm_channel_dropped_total{channel="..."} | counter | Messages dropped due to backpressure

Histogram Buckets

The latency histogram uses these bucket boundaries (in seconds): 0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, +Inf

Scraping Configuration

Add to your prometheus.yml:

scrape_configs:
  - job_name: "swarmllm"
    static_configs:
      - targets: ["localhost:8800"]

Example Queries

# Request rate (requests per second over 5 minutes)
rate(swarmllm_inference_requests_total[5m])

# P50 latency
histogram_quantile(0.50, rate(swarmllm_inference_latency_seconds_bucket[5m]))

# P99 latency
histogram_quantile(0.99, rate(swarmllm_inference_latency_seconds_bucket[5m]))

# Average latency
rate(swarmllm_inference_latency_seconds_sum[5m]) / rate(swarmllm_inference_latency_seconds_count[5m])

Health Check

GET /health/ready

Readiness probe returning subsystem status. Returns 200 when ready, 503 otherwise. No auth required.

{
  "ready": true,
  "subsystems": {
    "network": true,
    "inference_router": true,
    "api_server": true,
    ...
  }
}
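
Because the probe needs no auth and answers with 200/503, it slots straight into scripts and container healthchecks, e.g.:

curl -sf http://localhost:8800/health/ready > /dev/null || echo "node not ready"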

Deployment Guide

Single Node

The simplest deployment — just run the binary:

./swarmllm run

This starts the daemon on port 8800 with default settings.

Production Configuration

For production use, create a config file:

[node]
listen_port = 8800
contribution = "maximum"

[resources]
max_gpu_vram_mb = 0        # Auto-detect
max_disk_mb = 100000       # 100 GB

[inference]
gpu_layers = 99            # Offload all layers to GPU
max_concurrent_requests = 20
max_batch_size = 4
session_timeout_seconds = 600

[auto_manage]
enabled = true
max_storage_mb = 50000
max_concurrent_downloads = 5

[logging]
level = "info"
format = "json"            # Structured logs for production
file = "/var/log/swarmllm.log"

[ui]
open_browser_on_start = false

[identity]
region = "US"

Systemd Service

Create /etc/systemd/system/swarmllm.service:

[Unit]
Description=SwarmLLM P2P Inference Node
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=swarmllm
ExecStart=/usr/local/bin/swarmllm run --config /etc/swarmllm/config.toml
Restart=on-failure
RestartSec=10
LimitNOFILE=65536

# Security hardening
NoNewPrivileges=yes
ProtectSystem=strict
ReadWritePaths=/var/lib/swarmllm /var/log

[Install]
WantedBy=multi-user.target

Then enable and start the service:

sudo systemctl enable --now swarmllm

Docker

# Download compose file and env template
curl -LO https://raw.githubusercontent.com/enapt/SwarmLLM/main/docker-compose.yml
curl -LO https://raw.githubusercontent.com/enapt/SwarmLLM/main/.env.example
cp .env.example .env

# CPU
docker compose up -d

# GPU (requires NVIDIA Container Toolkit)
docker compose --profile gpu up -d

Pre-built Images

Image | Description
ghcr.io/enapt/swarmllm:latest | CPU-only (Debian bookworm-slim)
ghcr.io/enapt/swarmllm:latest-cuda | NVIDIA GPU (CUDA 12.4 runtime)

Versioned tags follow semver: 0.1.0, 0.1.0-cuda, 0.1, 0.1-cuda.

Manual Docker Run

# CPU
docker run -d \
  --name swarmllm \
  --restart unless-stopped \
  -p 8800:8800/tcp \
  -p 8810:8810/tcp \
  -p 8800:8800/udp \
  -v swarmllm-data:/data \
  -v /path/to/models:/data/models \
  --env-file .env \
  ghcr.io/enapt/swarmllm:latest

# GPU
docker run -d \
  --gpus all \
  --name swarmllm \
  --restart unless-stopped \
  -p 8800:8800/tcp \
  -p 8810:8810/tcp \
  -p 8800:8800/udp \
  -v swarmllm-data:/data \
  -v /path/to/models:/data/models \
  --env-file .env \
  ghcr.io/enapt/swarmllm:latest-cuda

Build from Source

# CPU
docker build -t swarmllm .

# CUDA
docker build -f Dockerfile.cuda -t swarmllm:cuda .

Multi-Node Dev Cluster

For development and testing, a 3-node compose file is available:

docker compose -f docker-compose.dev.yml up

Nodes are at localhost:8800, localhost:8801, localhost:8802. Add GPU support:

docker compose -f docker-compose.dev.yml -f docker-compose.cuda.dev.yml up

Multi-Node Cluster

Same LAN

Nodes on the same network discover each other automatically via mDNS. Just start multiple instances on different ports:

# Node 1
./swarmllm run -p 8800

# Node 2
./swarmllm run -p 8801 -d ~/.local/share/swarmllm-node2

Across Networks

Use bootstrap peers or invite codes:

# Node 1 (get its address from the dashboard or logs)
./swarmllm run

# Node 2 (connect to Node 1)
./swarmllm run --bootstrap "/ip4/NODE1_IP/udp/8800/quic-v1/p2p/PEER_ID"

Split Inference Cluster

For a dedicated split-inference setup across multiple machines:

# Machine A: shards 0-3
./swarmllm run --shards "0-3" --bootstrap "/ip4/MACHINE_B/udp/8800/quic-v1/p2p/..."

# Machine B: shards 4-7
./swarmllm run --shards "4-7" --bootstrap "/ip4/MACHINE_A/udp/8800/quic-v1/p2p/..."

Firewall

Open TCP port 8800 (HTTP API), TCP port 8810 (P2P), and optionally UDP port 8800 (QUIC):

# Linux (ufw)
sudo ufw allow 8800/tcp    # HTTP API
sudo ufw allow 8810/tcp    # P2P (Noise+Yamux, primary transport)
sudo ufw allow 8800/udp    # P2P (QUIC, optional)

# Linux (iptables)
sudo iptables -A INPUT -p tcp --dport 8800 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 8810 -j ACCEPT
sudo iptables -A INPUT -p udp --dport 8800 -j ACCEPT

Reverse Proxy (Optional)

If you want to put the HTTP API behind nginx:

server {
    listen 443 ssl;
    server_name swarmllm.example.com;

    location / {
        proxy_pass http://127.0.0.1:8800;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Note: The reverse proxy only handles HTTP traffic. P2P traffic (TCP port 8810 and QUIC/UDP port 8800) must still be reachable directly.

Cloud Provider API Keys

To use cloud model fallback, configure provider API keys via:

  1. Dashboard: Settings page in the web UI
  2. Environment file: Place a .env file in the data directory with standard variable names:
# ~/.local/share/swarmllm/.env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
DEEPSEEK_API_KEY=sk-...
MISTRAL_API_KEY=...
GROQ_API_KEY=gsk_...
NVIDIA_API_KEY=nvapi-...
CEREBRAS_API_KEY=...
SAMBANOVA_API_KEY=...
FIREWORKS_API_KEY=...
TOGETHER_API_KEY=...
DEEPINFRA_API_KEY=...
MOONSHOT_API_KEY=...
  3. Shell environment: Export the same variables before starting the daemon

Performance & Inference Speedups

SwarmLLM's distributed inference path ships with a stack of optimizations that are on by default — you get them without touching your config. This chapter names each one, explains what it does, and shows the measured win so you can tell which levers matter for your workload.

A few are flag-gated because the win is workload-dependent or the path is still being hardened; those are documented at the bottom so you can turn them on intentionally.

The full design notes live in docs/plans/archive/distributed_inference_speedup.md with benchmark recipes in docs/plans/benchmarks/.

The default-on stack

Continuous batching

Concurrent /v1/chat/completions requests for the same model share one forward pass per decode tick instead of running serially. GPU builds use a fused forward_batch kernel; CPU workers fall through to sequential with no regression.

  • Measured: 1.34–1.55× GPU throughput at batch 2–8 on RTX 3070 + TinyLlama Q4
  • Config: inference.continuous_batching = true (default)

Remote-generate fast path

For single-segment distributed inference (the common case: one remote node owns the whole model, requester does embedding + sampling), skip the per-token coordinator round-trips and run the decode loop end-to-end on the remote worker. Tokens stream back as they're sampled.

  • Measured: 1.93× decode speedup
  • Config: default-on — no flag, triggered automatically on single-segment pipelines

Cross-request prefix cache

Each worker keeps an LRU cache of prefill KV snapshots keyed by the prompt's token prefix. A re-submission with the same system prompt (different user turn) skips prefill for the shared prefix and only forwards the suffix.

  • Measured: 29.4× wall-clock speedup on re-submission of the same 513-token prompt (single node, TinyLlama)
  • Config: inference.prefix_cache_enabled = true (default), inference.prefix_cache_block_tokens = 64 (default — block granularity), inference.prefix_cache_max_entries = 16 (default — per model)

Batched prefill + chunked prefill

Sarathi-style chunked prefill: a long admission advances by prefill_chunk_tokens (default 128) per decode tick, so new requests don't wait behind a full prior prefill. Phase 4 adds batched_prefill_forward = true (default), which fuses concurrent same-shape prefill chunks into one forward_batch call.

  • Measured (Phases 1+2): 17–23× TTFT fairness at concurrency 2/4/8 on RTX 3070 + TinyLlama Q4 vs serial prefill
  • Measured (Phase 4): 1.57× aggregate tok/s at c=4 with uniform 180/180/180 ms TTFT (vs pre-fix 52/235/447 ms spread)
  • Config: inference.continuous_batching = true, inference.prefill_chunk_tokens = 128, inference.batched_prefill_forward = true (all default)

Cross-node prefix-KV sharing

When node B receives a prompt whose prefix was already prefilled by peer A, B fetches A's KV snapshot over the wire instead of re-prefilling locally. The pipeline is:

A prefills → inserts prefix-cache block → gossips PrefixCacheAnnounce
B receives prompt → local cache miss → probe daemon → walk index
B sends SendPrefixKvFetch to A → A's worker exports snapshot
B verifies BLAKE3 + NaN/Inf → hydrates KV → prefill suffix only

  • Measured (TinyLlama, GPU-GPU): fetched path is ~100 ms slower than local prefill — the 28 MB f32 snapshot takes ~260 ms to ship while the local prefill it replaces is only ~460 ms. TinyLlama is too small to demonstrate the win on localhost + fast GPU.
  • Measured (Qwen2.5-Coder-7B, CPU-CPU): 12.9× TTFT speedup on iter 1 — control full-prefill = 151.7 s, fetched path = 11.8 s. The 73 MB f32 snapshot transfers in ~1 s while 640-token Qwen-7B CPU prefill runs ~150 s.
  • Config: inference.cross_node_prefix_trust_min = 0.5 (default — gates peers by trust score; set to 2.0 to disable the fetch path entirely).

The fetch path uses three chained timeouts (worker probe 3000 ms, daemon network 2500 ms, serving IPC 2000 ms) sized for 7B-class f32 snapshots. Missing the window degrades to a clean miss — no worse than not having the feature. See the two-daemon loopback bench recipe for reproduction details.

Parallax scheduler

Pipeline assignment uses shortest-path dynamic programming over observed per-layer latencies (EMA over recent forwards) rather than a greedy pick-the-closest-peer heuristic. Cross-gossip of top-32 observed latencies via NodeCapability.observed_latencies lets every node keep a current view of the network's compute profile. A soft acquire/prune bias in AutoShardManager driven by a per-shard stability counter (≥3 consistent ticks before it acts) drifts shards toward where they're actually used without violating existing hard constraints.

  • Measured: 10 routing + 7 allocator + 2 scheduler integration tests passing; real-world improvements depend on network heterogeneity. The biggest impact is in asymmetric setups where a cheap peer's low observed latency should beat a high-VRAM peer's big shard slot.
  • Config: default-on. Multi-pipeline concurrency is deferred.

Flag-gated features

Turn these on when you've measured that they match your workload.

Distributed speculative decoding (speculative_distributed)

Draft model proposes γ tokens locally; target verifies all γ in one remote forward pass.

  • Status: End-to-end verified. 40–52% accept rate in a llama-cpp-draft / candle-target pairing (cross-backend numerical mismatch caps accept rate).
  • Config: inference.speculative_distributed = true, inference.draft_model_path = "path/to/draft.gguf", inference.speculative_gamma = 4 (tokens per verify round)

SWIFT self-speculative decoding (swift_self_speculative)

The target model acts as its own draft by skipping a contiguous range of layers on the proposal pass. No external draft model needed.

  • Status: Landed behind flag. Structurally slower than baseline on candle CPU until flash-attn-with-mask lands (attention kernel mismatch on multi-position verify). Shelved on CPU; may help on GPU.
  • Config: inference.swift_self_speculative = true, inference.swift_skip_ratio = 0.45 (fraction of layers to skip on the draft pass)

DSD — decentralized speculative decoding (decentralized_spec_decoding)

Multi-segment distributed inference with speculative decoding woven in. A γ-token decode on the last-segment worker plus KV truncation primitives plus a coordinator loop in pipeline/dsd.rs.

  • Status: All phases landed 2026-04-18 behind flag. End-to-end multi-segment WAN benchmark pending.
  • Config: inference.decentralized_spec_decoding = true

Activation compression Q8_0 (activation_compression)

Intermediate pipeline hidden-state activations are quantized from f16 to Q8_0 before going over the wire. Receivers auto-dispatch on the dtype tag.

  • Status: Codec verified. ~3.76× wire compression, RMS error <0.005. End-to-end multi-segment benchmark pending.
  • Config: inference.activation_compression = true

Persistent pipeline stream (persistent_pipeline_stream)

Replace per-token request/response with one long-lived libp2p bidirectional stream per pipeline session.

  • Status: Landed behind flag. Wire-level verified; no measured latency win because the bottleneck was elsewhere (solved by remote-generate + batched prefill).
  • Config: inference.persistent_pipeline_stream = true

Debugging slow inference

Default verbosity (-v) gives an INFO-level stream. Bump to -vv to see per-request DIAG: logs, which include the per-feature speedup signals:

./swarmllm run -vv 2>&1 | grep "DIAG:"

Key DIAG kinds:

  • DIAG: prefix-cache HIT — local prefix cache hit
  • DIAG: cross-node prefix HIT — cross-node prefix-KV fetch succeeded
  • DIAG: prefix-probe: fetch timed out — cross-node fetch missed the window (see Troubleshooting for timeout sizing on 7B+ models)
  • DIAG: served PrefixKvFetch ... hit=true — this node served a cross-node fetch
  • DIAG: BatchGenerate — batched-prefill slot table activity
  • DIAG: chunk fused batch_size=N — fused prefill chunks (Phase 4)
  • DIAG: Parallax — Parallax scheduler decisions

For the full DIAG taxonomy and what each line means, see docs/DIAGNOSTICS.md.

When should I turn a speedup off?

Almost never. The default-on features degrade cleanly under edge cases — the prefix cache falls through to full prefill on a miss, cross-node fetch falls through to local prefill on a timeout, batched prefill falls back to sequential when concurrency is 1. If you suspect one is the cause of a regression:

  • Prefix cache off: inference.prefix_cache_enabled = false
  • Cross-node fetch off: inference.cross_node_prefix_trust_min = 2.0 (gates every peer out)
  • Continuous batching off: inference.continuous_batching = false (also disables Phase 4 fusion)
  • Phase 4 fusion off, keep continuous batching: inference.batched_prefill_forward = false
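
As a config.toml sketch, the same switches grouped in one place (every key is documented above; the values shown are the "off" settings, not the defaults):

[inference]
prefix_cache_enabled = false
cross_node_prefix_trust_min = 2.0
continuous_batching = false
batched_prefill_forward = false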

Please open an issue if a speedup is costing you — the benchmarks above are RTX 3070 + WSL2 + a specific set of models, so real-world workloads will surface corners the benches miss.

Benchmarking

SwarmLLM ships with a built-in bench command and a set of reproducible recipes under docs/plans/benchmarks/. This chapter covers both.

Quick: swarmllm bench

The bench subcommand runs a real /v1/chat/completions workload against a daemon and reports latency + throughput.

./swarmllm bench \
    --max-tokens 100 \
    --iterations 5 \
    --concurrency 1 \
    --stream \
    --model-id tinyllama-1.1b-chat-v1.0.q4-k-m \
    --json

Key flags:

  • --max-tokens — tokens to generate per request (default 100)
  • --iterations — sequential iterations per concurrency level (default 5)
  • --concurrency — concurrent requests for throughput tests (default 1)
  • --stream — use streaming chat completions and report TTFT (time-to-first-token) per request. TTFT is the signal that captures the batched-prefill and cross-node-fetch wins; non-streaming bench rolls prefill + decode into one total time and hides the difference.
  • --prompt — custom prompt; default is a short prompt about relativity that won't stress prefix caching. Pass a longer prompt (≥500 tokens) to exercise prefix cache paths.
  • --model-id — target a specific model when several are registered; otherwise uses the first one from /v1/models.
  • --json — machine-readable output

The bench reads the API key from the daemon's data dir, so run it with the same SWARMLLM_NODE_DATA_DIR or -d as the daemon.
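
For example, against a daemon started with a custom data dir:

SWARMLLM_NODE_DATA_DIR=/tmp/swarm_a ./swarmllm bench --stream --iterations 5 --max-tokens 100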

Single-node baselines

Reference numbers on an AMD Ryzen 7 5800H + RTX 3070 Laptop 8 GB VRAM (WSL2, release build):

Model | Params | Quant | GPU | CPU
TinyLlama 1.1B Chat | 1.1B | Q4_K_M | 27.2 tok/s | 4.2 tok/s
Gemma-2 2B IT | 2.5B | Q4_K_M | 20.6 tok/s | 3.5 tok/s
Phi-3.5 Mini | 3.8B | Q4_K_M | 46.4 tok/s | 1.8 tok/s
Qwen2.5-Coder 7B | 7.6B | Q4_K_M | 29.0 tok/s | 2.4 tok/s

Single-node numbers are largely about your hardware. The interesting benchmarks are distributed.

Reproducing the performance benchmarks

Each performance optimization has a written benchmark recipe in docs/plans/benchmarks/. Most require two local daemons on loopback; a couple need three.

Batched prefill — TTFT fairness

docs/plans/benchmarks/round4.md

Measures TTFT at concurrency 2/4/8 with Phases 1+2 on vs off. The win is fairness, not aggregate throughput: Sarathi chunked prefill prevents new admits from waiting behind the full prior prefill.

Batched chunked prefill (Phase 4)

docs/plans/benchmarks/round5.md

Measures aggregate tok/s and per-request TTFT spread with batched_prefill_forward on vs off. The on-config fuses concurrent same-shape prefill chunks so TTFT lands tightly clustered instead of spreading.

Cross-node prefix-KV sharing

docs/plans/benchmarks/round6.md

Two-daemon loopback TCP. Measures iter-1 TTFT with the cross-node fetch path enabled vs gated off (via cross_node_prefix_trust_min = 2.0). Same recipe runs against TinyLlama (fast-GPU corner case: fetch is slightly slower than prefill) and Qwen-7B (12.9× TTFT speedup on CPU-CPU because 7B CPU prefill is slow enough that the ~1 s fetch + verify + hydrate buys back ~150 s of local prefill).

Sketch of the recipe:

# Node A on 8800
SWARMLLM_NODE_DATA_DIR=/tmp/swarm_a ./target/release/swarmllm run -p 8800 -v &

# Node B on 8900, bootstrapped off A
A_MADDR=$(grep -oE "peer_id=12D3KooW[A-Za-z0-9]+" /tmp/swarm_a.log | \
    head -1 | sed 's/peer_id=/\/ip4\/127.0.0.1\/tcp\/8810\/p2p\//')
SWARMLLM_NODE_DATA_DIR=/tmp/swarm_b ./target/release/swarmllm run \
    -p 8900 -v --bootstrap "$A_MADDR" &

# Copy shards into both data dirs (or download via /api/admin/hf/download-shards)
cp -r ~/.local/share/swarmllm/models/<model-id> /tmp/swarm_a/models/
cp -r ~/.local/share/swarmllm/models/<model-id> /tmp/swarm_b/models/

# Warm A with the long prompt (populates A's prefix cache, announces to B)
./swarmllm bench -p 8800 --stream --iterations 3 --max-tokens 100 \
    --prompt "$(cat long-prompt.txt)" --model-id <model-id>

# Measure B TTFT — iter 1 should fire the cross-node fetch
./swarmllm bench -p 8900 --stream --iterations 3 --max-tokens 100 \
    --prompt "$(cat long-prompt.txt)" --model-id <model-id> --json

Check B's log for DIAG: cross-node prefix HIT — hydrated KV matched_tokens=... bytes=... to confirm the fetch path fired.

Caveats

  • WSL2 localhost bandwidth is much higher than any real network — localhost benches are representative for compute-bound paths but a best case for fetch paths, whose transfer cost grows on real links. WAN numbers will be different.
  • TinyLlama is too small to show some speedups — cross-node prefix-KV sharing in particular needs a larger model (Phi-3.5, Qwen-7B) to flip the sign between fetch-cost and prefill-cost. See the round6 benchmark notes for the cross-over math.
  • VRAM fit matters — Qwen-7B Q4 weights fit in 8 GB but batched attention kernel scratch does not. CPU-mode works but the baseline numbers above change.
  • Pre-warm before measuring TTFT — iter 1 of a model includes disk read + weight load + first CUDA context init; exclude this by pre-warming with a short unrelated prompt before the real measurement.
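
A minimal pre-warm sketch using the bench flags above (the warm-up result is discarded):

# Warm: force weight load + first CUDA context init
./swarmllm bench --stream --iterations 1 --max-tokens 5 --prompt "warm up"

# Measure: the real prompt and iteration count
./swarmllm bench --stream --iterations 5 --max-tokens 100 \
    --prompt "$(cat long-prompt.txt)" --json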

Standard pre-push gate is cargo fmt && cargo clippy --all-targets -- -D warnings && cargo test. If you add a benchmark, add it under docs/plans/benchmarks/roundN.md with the recipe + results + interpretation, and link it from here.

Tailscale & WAN Access

SwarmLLM works over any IP-routable network, including VPN overlays like Tailscale, WireGuard, and ZeroTier. This guide covers how to access your node remotely and connect peers across the internet.

Use Cases

  • Remote access — Chat with your home GPU from your laptop at a coffee shop
  • Multi-site cluster — Connect nodes at home and work into one swarm
  • Team deployment — Share a private swarm across your team without exposing ports to the internet
  • Cloud + local hybrid — Connect a cloud GPU instance to your local network

Quick Setup with Tailscale

1. Install Tailscale on all machines

# Linux
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up

# macOS
brew install tailscale
tailscale up

# Windows — download from https://tailscale.com/download

Each machine gets a stable 100.x.x.x IP address on the Tailscale network.

2. Start SwarmLLM normally

# On each machine — no special flags needed
./swarmllm run

SwarmLLM binds to 0.0.0.0 by default, which includes the Tailscale interface.

3. Connect peers via bootstrap

Since mDNS doesn't work across Tailscale (it's link-local only), use one of these methods:

Option A: Invite code (easiest)

On Node A, copy the invite code from the dashboard (http://localhost:8800). On Node B, paste it into the "Join Network" field. The invite code contains the node's addresses — including the Tailscale IP if it's listening on 0.0.0.0.

Option B: Bootstrap peers in config

# ~/.local/share/swarmllm/config.toml on Node B
[network]
bootstrap_peers = [
  "/ip4/100.64.0.5/tcp/8810",    # Node A's Tailscale IP
]

Option C: CLI flag

./swarmllm run --bootstrap /ip4/100.64.0.5/tcp/8810

4. Access the dashboard remotely

Once connected via Tailscale, open the dashboard from any machine:

http://100.64.0.5:8800

The API is also accessible at that address:

curl http://100.64.0.5:8800/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "tinyllama", "messages": [{"role": "user", "content": "Hello!"}]}'
Recommended Network Config

When all peers reach each other over Tailscale, a few network options in config.toml are worth tuning:

[network]
enable_mdns = false           # mDNS is LAN-only, won't work through Tailscale
enable_autonat = false        # Tailscale handles NAT, disable noisy probes
enable_dcutr = false          # Hole punching unnecessary on Tailscale
enable_relay = true           # Keep as fallback for robustness
enable_quic = true            # QUIC works well on Tailscale (low-latency UDP)
bootstrap_peers = [
  "/ip4/100.64.0.5/tcp/8810", # Replace with your peer's Tailscale IP
]

For higher latency links (cross-continent), you may also want:

[inference]
tp_max_latency_ms = 50        # Relax tensor parallelism latency threshold (default: 10ms)

Binding to a Specific Interface

If you only want SwarmLLM accessible via Tailscale (not the local network):

[network]
listen_address = "100.64.0.5"  # Bind only to Tailscale interface

Or bind to localhost only and use Tailscale's Funnel or port forwarding:

[network]
listen_address = "127.0.0.1"

WireGuard / ZeroTier / Other VPNs

The same approach works with any VPN overlay:

  1. Install the VPN on all machines
  2. Start SwarmLLM with default config (listen_address = "0.0.0.0")
  3. Use the VPN IP as a bootstrap peer address
  4. Disable mDNS if peers aren't on the same physical LAN

Security Notes

  • API key still required — remote access to inference endpoints requires Bearer token auth, even over Tailscale
  • E2E encryption is independent of VPN — SwarmLLM encrypts all P2P traffic with X25519 + ChaCha20-Poly1305 regardless of whether you use a VPN. The VPN adds a second layer of encryption at the network level
  • Dashboard is not auth-protected — the admin dashboard at /admin doesn't require authentication. If exposing to untrusted networks, use Tailscale ACLs to restrict access or bind to 127.0.0.1 and use SSH tunneling
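
If you take the 127.0.0.1 + SSH route, a minimal tunnel looks like this (hostnames are placeholders):

ssh -L 8800:localhost:8800 user@your-node
# then browse http://localhost:8800 on the local machine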

Troubleshooting

Peers don't connect:

  • Verify Tailscale is running: tailscale status
  • Check Tailscale connectivity with tailscale ping 100.64.0.5, and make sure ports 8810 (TCP) and 8800 (UDP/QUIC) aren't blocked by a host firewall
  • Try with --bootstrap /ip4/<TAILSCALE_IP>/tcp/8810 explicitly
  • Check logs with -vv for connection errors

Slow inference across WAN:

  • Pipeline parallelism (splitting layers across nodes) works best on low-latency links (<50ms)
  • Tensor parallelism requires LAN-like latency (<10ms) — increase tp_max_latency_ms or let SwarmLLM use pipeline mode instead
  • Consider having each site run its own models for local inference, with the swarm as fallback

Stale peer cache after IP change:

  • If your Tailscale IP changes, old cached addresses will fail. Delete the database to clear the cache:
    rm ~/.local/share/swarmllm/db.redb
    

Monitoring with Grafana

SwarmLLM ships with a pre-built Grafana dashboard and Prometheus configuration in the monitoring/ directory.

Quick Start

cd monitoring/
docker compose up -d

This starts:

  • Prometheus at http://localhost:9090 — scrapes SwarmLLM metrics
  • Grafana at http://localhost:3000 — visualizes metrics (login: admin/admin)

The SwarmLLM dashboard is auto-provisioned on first start.

Dashboard Panels

The Grafana dashboard includes:

Node Overview

  • Connected Peers (stat)
  • Total Inference Requests (stat)
  • Credit Balance (stat)
  • Shards Hosted (stat)

Inference

  • Request Rate (req/s over time)
  • Latency Percentiles (p50, p90, p99)
  • Latency Distribution (histogram)
  • Average Inference Latency (gauge)

Network & Peers

  • Connected Peers Over Time

Storage & Shards

  • Hosted Shards Over Time

Credits

  • Credit Balance Over Time

Manual Setup

If you already have Prometheus and Grafana running:

1. Configure Prometheus

Add to prometheus.yml:

scrape_configs:
  - job_name: "swarmllm"
    static_configs:
      - targets: ["localhost:8800"]

2. Import Dashboard

  1. Open Grafana → Dashboards → Import
  2. Upload monitoring/grafana-dashboard.json
  3. Select your Prometheus data source
  4. Click Import

Multi-Node Monitoring

For monitoring multiple SwarmLLM nodes, add all targets:

scrape_configs:
  - job_name: "swarmllm"
    static_configs:
      - targets:
          - "node1:8800"
          - "node2:8800"
          - "node3:8800"

Or use file-based service discovery:

scrape_configs:
  - job_name: "swarmllm"
    file_sd_configs:
      - files: ["swarmllm-targets.json"]
        refresh_interval: 30s

Alerting

Example alert rules for Prometheus:

groups:
  - name: swarmllm
    rules:
      - alert: NoPeersConnected
        expr: swarmllm_peers_connected == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "SwarmLLM node has no connected peers"

      - alert: HighInferenceLatency
        expr: histogram_quantile(0.99, rate(swarmllm_inference_latency_seconds_bucket[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 inference latency exceeds 10 seconds"

      - alert: NegativeCreditBalance
        expr: swarmllm_credits_balance < 0
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Node has negative credit balance (Bronze tier)"