# Troubleshooting

## Can't Connect to Peers

Check the bootstrap address format:

```
/ip4/203.0.113.50/udp/8800/quic-v1/p2p/12D3KooW...
```
**Firewall:** SwarmLLM needs TCP port 8810 (P2P) and optionally UDP port 8800 (QUIC) open.

- Linux: `sudo ufw allow 8810/tcp && sudo ufw allow 8800/udp`
- Windows: Windows Defender Firewall > Inbound Rules > New > Port > TCP 8810 + UDP 8800
- macOS: System Settings > Network > Firewall > allow SwarmLLM

Same LAN? Use the local IP (e.g., 192.168.1.x). LAN peers should be found automatically via mDNS.
## Model Download Stuck

- Check disk space — a 7B model needs ~4-5 GB free
- Verify internet access to https://huggingface.co
- Cancel and retry from the Dashboard
- Start with `-v` for verbose logs: `./swarmllm run -v`
- Try a smaller model first (TinyLlama, ~700 MB)
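As a rule of thumb, a quantized model's on-disk size is roughly parameters × effective-bits-per-weight / 8, plus some metadata overhead. A sketch of the arithmetic, assuming illustrative 7B / Q4_K_M-style figures (not SwarmLLM specifics):

```shell
# Rough on-disk size of a quantized model:
# params * effective-bits-per-weight / 8, plus ~10% for tokenizer/metadata.
params=7000000000   # 7B parameters (illustrative)
bits=5              # Q4_K_M-style quants average close to ~5 effective bits/weight
size_bytes=$(( params * bits / 8 ))
size_with_overhead=$(( size_bytes + size_bytes / 10 ))
echo "$(( size_with_overhead / 1024 / 1024 / 1024 )) GiB"   # prints "4 GiB"
```

That lands in the ~4-5 GB range quoted above; pure 4-bit quants come out smaller.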
## GPU Not Detected

- Verify the GPU works: `nvidia-smi`
- Install NVIDIA drivers if needed
- Enable GPU offloading: `./swarmllm run --gpu-layers 99`

WSL2 users: the CUDA driver comes from your Windows NVIDIA driver. Check that `/usr/lib/wsl/lib/libcuda.so.1` exists and add to your `~/.bashrc`:

```
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/wsl/lib:$LD_LIBRARY_PATH
```
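A quick existence check for that WSL2 driver shim (path taken from the note above); "missing" usually means the Windows-side NVIDIA driver is absent or outdated:

```shell
# Confirm the WSL2 CUDA shim is visible from inside the distro.
if [ -e /usr/lib/wsl/lib/libcuda.so.1 ]; then
    echo present
else
    echo missing
fi
```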
## Port Already in Use

```
./swarmllm run --port 9000   # Use a different port
lsof -i :8800                # Find what's using 8800
./swarmllm status            # Check whether another instance is running
```
## Slow First Request
If the first inference request to a model takes noticeably longer than subsequent ones, this is expected. SwarmLLM uses on-demand model loading — models whose shards are on disk but not loaded into VRAM are loaded when first requested. If VRAM is full, an LRU eviction occurs first. Subsequent requests to the same model will be fast.
## Slow Inference

- GPU vs CPU: CPU is 5-20x slower. Check the Dashboard for GPU status.
- Model too large: use Q4 quantization and match model size to VRAM.
- Enable batching: set `max_batch_size = 4` in the config.
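The batching knob from the last bullet, as it might appear in `config.toml`; whether it lives under the `[inference]` table is an assumption based on the other `inference.*` keys in this guide:

```toml
[inference]
# Assumed placement: batch up to 4 concurrent requests per forward pass.
max_batch_size = 4
```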
## Database Corrupted

```
# Back up first
cp -r ~/.local/share/swarmllm ~/.local/share/swarmllm-backup

# Delete the database (models and config are preserved)
rm ~/.local/share/swarmllm/db.redb

# Restart
./swarmllm run
```
## GPU Out of Memory

If a model exceeds your GPU's VRAM, SwarmLLM automatically falls back to CPU inference. You'll see this in the logs:

```
WARN GPU OOM detected, retrying on CPU
```

CPU inference is 5-20x slower but works for any model size. To avoid OOM:

- Use smaller quantizations (Q4 instead of Q8)
- Use a model that fits in VRAM (check model size vs available VRAM in the Dashboard)
- For models too large for one GPU, use distributed inference across multiple nodes
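When checking whether a model fits, remember that VRAM must hold the KV cache on top of the weights. A back-of-the-envelope sketch, assuming illustrative Llama-2-7B-shaped figures (not SwarmLLM internals):

```shell
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem.
layers=32 kv_heads=32 head_dim=128 bytes=2   # fp16 cache, 7B-class shape
per_token=$(( 2 * layers * kv_heads * head_dim * bytes ))
ctx=4096                                     # full context window
echo "$(( per_token * ctx / 1024 / 1024 )) MiB"   # prints "2048 MiB"
```

So a full 4K context can add ~2 GiB beyond the weight footprint on a model of this shape.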
## Distributed Inference Issues

**Peers visible but inference fails:**

- Ensure both nodes have the required shards loaded (check Dashboard > Models)
- Verify P2P TCP connectivity: port `<base_port> + 10` must be reachable
- Run with `-vv` and filter: `./swarmllm run -vv 2>&1 | grep "DIAG:"`
- Check for `DIAG: segment TIMED OUT` — indicates a network or compute bottleneck
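A concrete instance of the port rule above, assuming the default base port 8800 used elsewhere in this guide (the `nc` reachability check is an optional extra, run from another machine):

```shell
# The P2P TCP listener lives at base_port + 10.
base_port=8800
p2p_port=$(( base_port + 10 ))
echo "$p2p_port"   # prints "8810", matching the firewall rule above
# From another machine (substitute the node's LAN IP):
# nc -zv 192.168.1.x "$p2p_port"
```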
**High latency per token:**

- Distributed inference adds ~20-130 ms per token for network round-trips
- Use TCP bootstrap addresses (not QUIC) for the lowest latency
- Ensure nodes are on the same LAN for tensor parallelism
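Those per-token round-trip figures put a hard ceiling on decode throughput regardless of GPU speed. A quick sanity calculation (awk is only used for the floating-point division):

```shell
# Throughput ceiling from network overhead alone: 1000 ms / overhead-per-token.
awk 'BEGIN {
    printf "20 ms overhead  -> at most %.1f tok/s\n", 1000 / 20
    printf "130 ms overhead -> at most %.1f tok/s\n", 1000 / 130
}'
```

If you measure well under these ceilings, the bottleneck is compute rather than the network.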
**Pipeline assembly fails:**

- The scheduler needs enough shard coverage to build a complete pipeline
- Check `DIAG: assemble_pipeline_for` for candidate counts
**Inference fails with "peer never acknowledged" or "silent drop":**

- A `SendDirectMessage` was issued but neither a Response nor an `OutboundFailure` event arrived from libp2p within 10 s (`RR_ACK_TIMEOUT_SECS`). This is treated as a transient failure: the router automatically retries once with a fresh pipeline assembly that filters out the unreachable peer. If the retry also fails, the user sees the error within ~20 s (vs the 120 s `FIRST_TOKEN_TIMEOUT`).
- Most common cause: the target peer was killed or partitioned and the local libp2p connection state hasn't yet caught up.
- Look for `DIAG: rr ACK timeout — closing streaming caller` in the logs to confirm the fast-fail path engaged.
**Concurrent requests stall when only some get dispatched:**

- Per-tier concurrency caps come from `inference.max_concurrent_requests` (default 10): Bronze=2, Silver=5, Gold=10, Platinum=20. Excess requests queue until prior ones complete. To raise the cap, bump the config knob or earn credits to climb tiers.
- If queued requests don't dispatch even after others complete, check for a missed `queue_notify.notify_one()` after `active_count.fetch_sub(1)` (should never happen on `main`; it was a real regression fixed in `da6f485`).
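Raising the cap from the first bullet in `config.toml` (table name taken from the dotted key above):

```toml
[inference]
# Base concurrency limit; the per-tier caps quoted in this guide
# (Bronze=2 ... Platinum=20 at the default of 10) derive from this value.
max_concurrent_requests = 20
```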
## Cross-Node Prefix-KV Sharing

The cross-node prefix fetch is default-on. Expected logs on a successful first hit of a peer's cached prefix:

```
B: DIAG: cross-node prefix HIT — hydrated KV matched_tokens=N total_tokens=M
A: DIAG: served PrefixKvFetch ... hit=true
```
**I never see `cross-node prefix HIT`:**

- It only fires on iter 1 of a prompt whose prefix your local node hasn't prefilled yet. Iters 2/3 hit the local cache (populated by iter 1).
- Check that the peer even announced the prefix: look for `DIAG: PrefixCacheAnnounce indexed node_id=... blocks=N` in your log. No announce → the peer's gossip never reached you (check `grep 'Published message to GossipSub' | grep 'swarm/models'`).
- Check that the peer passes the trust gate: the default `cross_node_prefix_trust_min = 0.5` equals `DEFAULT_TRUST`, so a freshly seen peer should just barely pass. Any misbehavior drops it below.
**I see `prefix-probe: fetch timed out`:**

- The peer didn't return a snapshot inside the worker-probe window (3000 ms by default). On a large model (7B+) with a cold CPU this can happen if the snapshot is >100 MB. The path degrades to local prefill — no worse than not having the feature. The current 3000/2500/2000 ms chained timeouts are sized for 7B-class snapshots; the older 500/400/500 ms values were TinyLlama-sized and forced a fallback to local prefill on larger models.
**I see `rejected KV snapshot — penalizing peer trust`:**

- The returned snapshot failed BLAKE3 reverification or contained NaN/Inf. Three rejection reasons:
  - `hash_chain_mismatch` → `prefix_cache_block_tokens` differs between nodes (default 64; common alternatives 32/128)
  - `non_finite_tensors` → GPU overflow on the serving side
  - `deserialize_failed` → wire corruption — open an issue
**Disable cross-node fetch entirely:**

Set `inference.cross_node_prefix_trust_min = 2.0` in `config.toml`. The probe never fires because no peer passes the trust gate.
## Running the Test Suite

SwarmLLM ships 943 lib tests + 75 integration tests + VLM E2E.

```
# Run all tests (release, used in CI)
cargo test --release

# Unit tests only (fastest feedback loop)
cargo test --lib

# Integration tests only
cargo test --test '*'

# A specific test by name substring
cargo test --release prefix_cache

# With CUDA features on (requires an NVIDIA GPU)
cargo test --release --features candle-cuda
```

If a test fails, the release build shows the name and line; rerun with `--nocapture` to see its stderr:

```
cargo test failing_test_name -- --nocapture
```

Integration tests under `tests/integration/` simulate multi-node P2P on loopback — they're the slow ones, and CI runs them with `--test-threads=1` to avoid port contention.
See Benchmarking for reproducing the performance benchmarks and Performance for which knobs turn each speedup on/off.
## Model Trust

Models go through trust levels: Discovered → Pinned → DemandVerified → NetworkPopular. Auto-manage only downloads shards for models at sufficient trust levels.

**Model stuck at "Discovered":**

- Pin it manually from the Dashboard to promote it to "Pinned"
- Models reach "DemandVerified" after receiving inference requests
- Models reach "NetworkPopular" when enough peers host them
## Still Stuck?

- Run with full diagnostics: `./swarmllm run -vv 2>&1 | grep "DIAG:"`
- See the Diagnostics Guide for detailed log instrumentation
- Check GitHub Issues
- Open a new issue with: OS, hardware, `./swarmllm version`, and logs from `-vv`