Performance & Inference Speedups
SwarmLLM's distributed inference path ships with a stack of optimisations that are on by default — you get them without touching a config. This chapter names each one, explains what it does, and shows the measured win so you can tell which levers matter for your workload.
A few are flag-gated because the win is workload-dependent or the path is still being hardened; those are documented at the bottom so you can turn them on intentionally.
The full design notes live in docs/plans/archive/distributed_inference_speedup.md, with benchmark recipes in docs/plans/benchmarks/.
The default-on stack
Continuous batching
Concurrent /v1/chat/completions requests for the same model share one
forward pass per decode tick instead of running serially. GPU builds use a
fused forward_batch kernel; CPU workers fall through to sequential with
no regression.
- Measured: 1.34–1.55× GPU throughput at batch 2–8 on RTX 3070 + TinyLlama Q4
- Config: inference.continuous_batching = true (default)
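The core idea, sketched below with illustrative `Sequence` and `forward_batch` stand-ins rather than SwarmLLM's real types: each decode tick collects every live sequence and issues one batched forward instead of one forward per request.

```rust
struct Sequence {
    id: u64,
    tokens: Vec<u32>, // prompt + tokens generated so far
    done: bool,
}

/// Stand-in for the fused GPU kernel: one forward pass over every active
/// sequence's last position, returning one sampled token per sequence.
fn forward_batch(batch: &[&Sequence]) -> Vec<u32> {
    batch.iter().map(|s| s.tokens.len() as u32).collect() // dummy "sampling"
}

/// One decode tick: all live requests for the same model share a single
/// forward_batch call instead of running serially.
fn decode_tick(sequences: &mut Vec<Sequence>) {
    let active: Vec<usize> = sequences
        .iter()
        .enumerate()
        .filter(|(_, s)| !s.done)
        .map(|(i, _)| i)
        .collect();
    if active.is_empty() {
        return;
    }
    let batch: Vec<&Sequence> = active.iter().map(|&i| &sequences[i]).collect();
    let next_tokens = forward_batch(&batch);
    drop(batch);
    for (&i, tok) in active.iter().zip(next_tokens) {
        sequences[i].tokens.push(tok);
        if tok == 2 {
            sequences[i].done = true; // hypothetical EOS id
        }
    }
}

fn main() {
    let mut seqs = vec![
        Sequence { id: 1, tokens: vec![1, 5, 7], done: false },
        Sequence { id: 2, tokens: vec![1, 9], done: false },
    ];
    decode_tick(&mut seqs);
    println!("seq {}: {:?}", seqs[0].id, seqs[0].tokens);
    println!("seq {}: {:?}", seqs[1].id, seqs[1].tokens);
}
```

CPU workers effectively run this loop with a batch of one, which is why there is no regression on the sequential fallback.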
Remote-generate fast path
For single-segment distributed inference (the common case: one remote node owns the whole model, requester does embedding + sampling), skip the per-token coordinator round-trips and run the decode loop end-to-end on the remote worker. Tokens stream back as they're sampled.
- Measured: 1.93× decode speedup
- Config: default-on — no flag, triggered automatically on single-segment pipelines
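The trigger is purely structural. A hypothetical sketch of the dispatch decision (the type and field names below are ours, not the coordinator's):

```rust
/// Hypothetical decode-plan type; the real coordinator types differ.
enum DecodePlan {
    /// Single-segment pipeline: hand the whole decode loop to the remote
    /// worker and stream tokens back as they are sampled, with no
    /// per-token coordinator round trip.
    RemoteGenerate { worker: String },
    /// Multi-segment pipeline: the coordinator drives one hop per token.
    PerTokenPipeline { segments: Vec<String> },
}

fn plan_decode(segments: Vec<String>) -> DecodePlan {
    if segments.len() == 1 {
        DecodePlan::RemoteGenerate { worker: segments.into_iter().next().unwrap() }
    } else {
        DecodePlan::PerTokenPipeline { segments }
    }
}

fn main() {
    match plan_decode(vec!["node-a".into()]) {
        DecodePlan::RemoteGenerate { worker } => println!("fast path on {worker}"),
        DecodePlan::PerTokenPipeline { segments } => println!("{}-segment pipeline", segments.len()),
    }
}
```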
Cross-request prefix cache
Each worker keeps an LRU cache of prefill KV snapshots keyed by the prompt's token prefix. A re-submission with the same system prompt (different user turn) skips prefill for the shared prefix and only forwards the suffix.
- Measured: 29.4× wall-clock speedup on re-submission of the same 513-token prompt (single node, TinyLlama)
- Config: inference.prefix_cache_enabled = true (default), inference.prefix_cache_block_tokens = 64 (default, block granularity), inference.prefix_cache_max_entries = 16 (default, per model)
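A minimal sketch of the idea, assuming the default 64-token block granularity and 16-entry LRU; `KvSnapshot` and the struct below are placeholders, not the worker's actual cache:

```rust
use std::collections::HashMap;

const BLOCK_TOKENS: usize = 64; // inference.prefix_cache_block_tokens
const MAX_ENTRIES: usize = 16;  // inference.prefix_cache_max_entries

#[derive(Clone)]
struct KvSnapshot; // stands in for the serialized prefill KV state

struct PrefixCache {
    // key: the prompt's leading tokens, rounded down to whole blocks
    // (a real implementation would hash block-wise rather than clone prefixes)
    entries: HashMap<Vec<u32>, (KvSnapshot, u64)>, // value carries last-use tick
    tick: u64,
}

impl PrefixCache {
    fn new() -> Self {
        Self { entries: HashMap::new(), tick: 0 }
    }

    /// Longest cached prefix (in whole blocks) of `prompt`, if any.
    fn lookup(&mut self, prompt: &[u32]) -> Option<(usize, KvSnapshot)> {
        self.tick += 1;
        let mut blocks = prompt.len() / BLOCK_TOKENS;
        while blocks > 0 {
            let key = prompt[..blocks * BLOCK_TOKENS].to_vec();
            if let Some((snap, last_used)) = self.entries.get_mut(&key) {
                *last_used = self.tick;
                return Some((blocks * BLOCK_TOKENS, snap.clone()));
            }
            blocks -= 1;
        }
        None
    }

    /// Insert a snapshot for the prompt's whole-block prefix, evicting the LRU entry.
    fn insert(&mut self, prompt: &[u32], snap: KvSnapshot) {
        let blocks = prompt.len() / BLOCK_TOKENS;
        if blocks == 0 {
            return;
        }
        self.tick += 1;
        if self.entries.len() >= MAX_ENTRIES {
            if let Some(oldest) = self.entries.iter().min_by_key(|(_, (_, t))| *t).map(|(k, _)| k.clone()) {
                self.entries.remove(&oldest);
            }
        }
        self.entries.insert(prompt[..blocks * BLOCK_TOKENS].to_vec(), (snap, self.tick));
    }
}

fn main() {
    let mut cache = PrefixCache::new();
    let prompt: Vec<u32> = (0..513).collect(); // 513-token prompt, as in the bench
    cache.insert(&prompt, KvSnapshot);
    // Same system prompt, different user turn: the shared whole-block prefix hits.
    let resubmission: Vec<u32> = (0..520).collect();
    if let Some((prefix_len, _snap)) = cache.lookup(&resubmission) {
        println!("prefill skips the first {prefix_len} tokens");
    }
}
```

On a hit, prefill resumes from the cached block boundary and only the remaining suffix is forwarded.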
Batched prefill + chunked prefill
Sarathi-style chunked prefill: a long prompt's prefill advances by
prefill_chunk_tokens (default 128) per decode tick, so new requests
don't wait behind a full prior prefill. Phase 4 adds
batched_prefill_forward = true (default), which fuses concurrent
same-shape prefill chunks into one forward_batch call.
- Measured (Phases 1+2): 17–23× TTFT fairness at concurrency 2/4/8 on RTX 3070 + TinyLlama Q4 vs serial prefill
- Measured (Phase 4): 1.57× aggregate tok/s at c=4 with uniform 180/180/180 ms TTFT (vs pre-fix 52/235/447 ms spread)
- Config: inference.continuous_batching = true, inference.prefill_chunk_tokens = 128, inference.batched_prefill_forward = true (all default)
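A rough sketch of the per-tick chunking policy; the `Admission` type and `prefill_tick` helper are illustrative only:

```rust
const PREFILL_CHUNK_TOKENS: usize = 128; // inference.prefill_chunk_tokens

struct Admission {
    prompt_len: usize,
    prefilled: usize, // tokens already prefilled
}

/// One decode tick: every admission advances by at most one chunk, so a long
/// prompt cannot monopolize the tick. Chunks of identical length are what
/// Phase 4 additionally fuses into a single forward_batch call.
fn prefill_tick(admissions: &mut [Admission]) -> Vec<(usize, usize)> {
    let mut chunks = Vec::new(); // (admission index, chunk length)
    for (i, a) in admissions.iter_mut().enumerate() {
        let remaining = a.prompt_len - a.prefilled;
        if remaining == 0 {
            continue;
        }
        let chunk = remaining.min(PREFILL_CHUNK_TOKENS);
        a.prefilled += chunk;
        chunks.push((i, chunk));
    }
    chunks
}

fn main() {
    let mut admissions = vec![
        Admission { prompt_len: 640, prefilled: 512 }, // long prompt mid-prefill
        Admission { prompt_len: 200, prefilled: 0 },   // newly admitted request
    ];
    for (i, len) in prefill_tick(&mut admissions) {
        println!("admission {i} advances by {len} tokens this tick");
    }
}
```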
Cross-node prefix-KV sharing
When node B receives a prompt whose prefix was already prefilled by peer A, B fetches A's KV snapshot over the wire instead of re-prefilling locally. The pipeline is:
1. A prefills → inserts prefix-cache block → gossips PrefixCacheAnnounce
2. B receives prompt → local cache miss → probe daemon → walk index
3. B sends SendPrefixKvFetch to A → A's worker exports snapshot
4. B verifies BLAKE3 + NaN/Inf → hydrates KV → prefill suffix only
- Measured (TinyLlama, GPU-GPU): fetched path is ~100 ms slower than local prefill — the 28 MB f32 snapshot takes ~260 ms to ship while the local prefill it replaces is only ~460 ms. TinyLlama is too small to demonstrate the win on localhost + fast GPU.
- Measured (Qwen2.5-Coder-7B, CPU-CPU): 12.9× TTFT speedup on iter 1 — control full-prefill = 151.7 s, fetched path = 11.8 s. The 73 MB f32 snapshot transfers in ~1 s while 640-token Qwen-7B CPU prefill runs ~150 s.
- Config: inference.cross_node_prefix_trust_min = 0.5 (default; gates peers by trust score; set to 2.0 to disable the fetch path entirely).
The fetch path uses three chained timeouts (worker probe 3000 ms, daemon network 2500 ms, serving IPC 2000 ms) sized for 7B-class f32 snapshots. Missing the window degrades to a clean miss — no worse than not having the feature. See the two-daemon loopback bench recipe for reproduction details.
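The requester-side control flow looks roughly like the sketch below, with placeholder types and a single illustrative timeout standing in for the three chained ones:

```rust
use std::time::Duration;

/// Placeholder for the exported KV snapshot plus its BLAKE3 digest.
struct KvSnapshot;

enum PrefillPlan {
    /// Remote snapshot verified and loaded; only the suffix needs prefill.
    HydratedPrefix { snapshot: KvSnapshot, prefix_tokens: usize },
    /// Every failure mode (low trust, timeout, bad digest) degrades to this.
    FullLocalPrefill,
}

fn fetch_remote_prefix(
    peer_trust: f32,
    trust_min: f32,                                      // inference.cross_node_prefix_trust_min
    fetch: impl FnOnce(Duration) -> Option<KvSnapshot>,  // wire fetch from the owning peer
    verify: impl Fn(&KvSnapshot) -> bool,                // BLAKE3 digest + NaN/Inf scan
    prefix_tokens: usize,
) -> PrefillPlan {
    // Gate by trust score before touching the network.
    if peer_trust < trust_min {
        return PrefillPlan::FullLocalPrefill;
    }
    // Ask the peer to export its snapshot; missing the timeout window is a clean miss.
    match fetch(Duration::from_millis(2500)) {
        Some(snapshot) if verify(&snapshot) => PrefillPlan::HydratedPrefix { snapshot, prefix_tokens },
        _ => PrefillPlan::FullLocalPrefill,
    }
}

fn main() {
    let plan = fetch_remote_prefix(0.8, 0.5, |_budget| Some(KvSnapshot), |_snap| true, 512);
    match plan {
        PrefillPlan::HydratedPrefix { prefix_tokens, .. } => {
            println!("hydrated {prefix_tokens}-token prefix; prefill only the suffix")
        }
        PrefillPlan::FullLocalPrefill => println!("clean miss: full local prefill"),
    }
}
```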
Parallax scheduler
Pipeline assignment uses shortest-path dynamic programming over observed
per-layer latencies (EMA over recent forwards) rather than a greedy
pick-the-closest-peer heuristic. Cross-gossip of top-32 observed
latencies via NodeCapability.observed_latencies lets every node keep
a current view of the network's compute profile. A soft acquire/prune
bias in AutoShardManager driven by a per-shard stability counter
(≥3 consistent ticks before it acts) drifts shards toward where they're
actually used without violating existing hard constraints.
- Measured: 10 routing + 7 allocator + 2 scheduler integration tests passing; real-world improvements depend on network heterogeneity. The biggest impact is in asymmetric setups where a cheap peer's low observed latency should beat a high-VRAM peer's big shard slot.
- Config: default-on. Multi-pipeline concurrency is deferred.
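A toy version of the underlying dynamic program, ignoring shard placement, VRAM limits, and asymmetric hop costs (all names are ours, and a peer may appear in more than one segment here):

```rust
struct Peer {
    name: &'static str,
    per_layer_ms: f64, // EMA of observed per-layer latency
    max_layers: usize, // how many layers this peer can host
}

/// dp[k] = cheapest way to place the first k layers; each step extends the
/// plan with one contiguous segment on one peer plus a fixed hop cost.
fn assign(total_layers: usize, peers: &[Peer], hop_ms: f64) -> (f64, Vec<(usize, usize)>) {
    let mut dp = vec![f64::INFINITY; total_layers + 1];
    let mut back: Vec<Option<(usize, usize)>> = vec![None; total_layers + 1]; // (peer idx, segment len)
    dp[0] = 0.0;
    for k in 1..=total_layers {
        for (p, peer) in peers.iter().enumerate() {
            for len in 1..=peer.max_layers.min(k) {
                let cost = dp[k - len] + len as f64 * peer.per_layer_ms + hop_ms;
                if cost < dp[k] {
                    dp[k] = cost;
                    back[k] = Some((p, len));
                }
            }
        }
    }
    // Reconstruct the segment list from the back-pointers.
    let mut segs = Vec::new();
    let mut k = total_layers;
    while k > 0 {
        let (p, len) = back[k].expect("no feasible assignment");
        segs.push((p, len));
        k -= len;
    }
    segs.reverse();
    (dp[total_layers], segs)
}

fn main() {
    let peers = [
        Peer { name: "gpu-big", per_layer_ms: 2.0, max_layers: 32 },
        Peer { name: "cpu-cheap", per_layer_ms: 1.2, max_layers: 8 },
    ];
    let (total_ms, segs) = assign(22, &peers, 15.0);
    for (p, len) in &segs {
        println!("{} takes {} layers", peers[*p].name, len);
    }
    println!("predicted pipeline cost: {total_ms:.1} ms");
}
```

A greedy pick-the-closest-peer heuristic cannot make the hop-cost-versus-per-layer-latency trade-off that this kind of DP resolves globally, which is exactly the asymmetric-network case called out above.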
Flag-gated features
Turn these on when you've measured that they match your workload.
Distributed speculative decoding (speculative_distributed)
Draft model proposes γ tokens locally; target verifies all γ in one remote forward pass.
- Status: End-to-end verified. 40–52% accept rate in a llama-cpp-draft / candle-target pairing (cross-backend numerical mismatch caps accept rate).
- Config: inference.speculative_distributed = true, inference.draft_model_path = "path/to/draft.gguf", inference.speculative_gamma = 4 (tokens per verify round)
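For intuition, here is a greedy-acceptance sketch of one round, with placeholder function hooks standing in for the draft and target models; it is not the rejection-sampling variant needed for non-greedy sampling:

```rust
/// One speculative round: the draft proposes `gamma` tokens, the target
/// scores all of them in a single batched forward, and we keep the longest
/// matching prefix plus the target's own next token. Returns the accept count.
fn speculative_round(
    context: &mut Vec<u32>,
    gamma: usize,
    draft_next: impl Fn(&[u32]) -> u32,                   // cheap local draft model
    target_verify: impl Fn(&[u32], &[u32]) -> Vec<u32>,   // target argmax at each of the gamma positions + one bonus token
) -> usize {
    // 1. Draft gamma tokens locally, one at a time.
    let mut proposed = Vec::with_capacity(gamma);
    let mut scratch = context.clone();
    for _ in 0..gamma {
        let tok = draft_next(&scratch);
        proposed.push(tok);
        scratch.push(tok);
    }
    // 2. One remote forward verifies all gamma positions at once.
    let target_tokens = target_verify(context, &proposed);
    // 3. Accept while draft and target agree, then take the target's token at
    //    the first disagreement (or its bonus token if everything matched).
    let mut accepted = 0;
    while accepted < gamma && proposed[accepted] == target_tokens[accepted] {
        context.push(proposed[accepted]);
        accepted += 1;
    }
    if accepted < target_tokens.len() {
        context.push(target_tokens[accepted]);
    }
    accepted
}

fn main() {
    let mut context = vec![1u32, 2, 3];
    // Toy draft/target that both emit (last token + 1), so every proposal is accepted.
    let accepted = speculative_round(
        &mut context,
        4,
        |ctx| ctx.last().unwrap() + 1,
        |ctx, proposed| {
            let mut out = proposed.to_vec();
            out.push(proposed.last().copied().unwrap_or(*ctx.last().unwrap()) + 1); // bonus token
            out
        },
    );
    println!("accepted {accepted}/4 drafted tokens; context = {context:?}");
}
```

The cross-backend numerical mismatch mentioned above lowers how often step 3 agrees, which is why the accept rate caps out at 40–52% in the llama-cpp/candle pairing.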
SWIFT self-speculative decoding (swift_self_speculative)
The target model acts as its own draft by skipping a contiguous range of layers on the proposal pass. No external draft model needed.
- Status: Landed behind flag. Structurally slower than baseline on candle CPU until flash-attn-with-mask lands (attention kernel mismatch on multi-position verify). Shelved on CPU; may help on GPU.
- Config: inference.swift_self_speculative = true, inference.swift_skip_ratio = 0.45 (fraction of layers to skip on the draft pass)
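A sketch of how a skip range could be derived from the skip ratio; the real layer selection is the project's own, this only shows the shape of the idea:

```rust
/// Returns a per-layer mask for the draft pass: true = run this layer.
fn draft_layer_mask(total_layers: usize, skip_ratio: f64) -> Vec<bool> {
    let skip = (total_layers as f64 * skip_ratio).round() as usize;
    // Skip a contiguous block out of the middle of the stack; the earliest and
    // latest layers are kept, where next-token prediction is most sensitive.
    let start = (total_layers - skip) / 2;
    (0..total_layers)
        .map(|i| !(i >= start && i < start + skip))
        .collect()
}

fn main() {
    let mask = draft_layer_mask(22, 0.45); // TinyLlama-sized stack, default skip ratio
    let kept = mask.iter().filter(|&&keep| keep).count();
    println!("draft pass runs {kept}/22 layers: {mask:?}");
}
```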
DSD — decentralized speculative decoding (decentralized_spec_decoding)
Multi-segment distributed inference with speculative decoding woven in.
A γ-token decode on the last-segment worker plus KV truncation primitives
plus a coordinator loop in pipeline/dsd.rs.
- Status: All phases landed 2026-04-18 behind flag. End-to-end multi-segment WAN benchmark pending.
- Config: inference.decentralized_spec_decoding = true
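The shape of one coordinator round, with placeholder hooks; the real loop in pipeline/dsd.rs drives remote workers over the network:

```rust
/// One DSD round: draft gamma tokens on the last-segment worker, verify them
/// through the whole pipeline, keep the accepted prefix, and roll every
/// segment's KV cache back past the rejected tail.
fn dsd_round(
    generated: &mut Vec<u32>,
    gamma: usize,
    draft_on_last_segment: impl Fn(usize) -> Vec<u32>, // gamma drafted tokens
    verify_through_pipeline: impl Fn(&[u32]) -> usize, // how many drafted tokens survive
    truncate_all_segments: impl Fn(usize),             // KV truncation primitive
) {
    let drafted = draft_on_last_segment(gamma);
    let accepted = verify_through_pipeline(&drafted);
    generated.extend_from_slice(&drafted[..accepted]);
    // Verification speculatively wrote all gamma positions into every
    // segment's KV cache; truncate before the next round.
    truncate_all_segments(generated.len());
}

fn main() {
    let mut generated = vec![1u32, 2, 3];
    dsd_round(
        &mut generated,
        4,
        |g| (0..g as u32).collect(), // toy draft
        |drafted| drafted.len() - 1, // toy verifier rejects the last drafted token
        |keep| println!("truncate every segment's KV to {keep} tokens"),
    );
    println!("generated = {generated:?}");
}
```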
Activation compression Q8_0 (activation_compression)
Intermediate pipeline hidden-state activations are quantized from f16 to Q8_0 before going over the wire. Receivers auto-dispatch on the dtype tag.
- Status: Codec verified. ~3.76× wire compression, RMS error <0.005. End-to-end multi-segment benchmark pending.
- Config: inference.activation_compression = true
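For reference, a minimal Q8_0-style block quantizer: 32 values per block, one scale plus 32 signed bytes. The real codec stores the scale as f16 (34 bytes per 32-value block); a plain f32 scale is used here to stay std-only, and the block layout below is illustrative rather than the project's wire format.

```rust
const QK8_0: usize = 32; // values per Q8_0 block

struct Q8Block {
    scale: f32,        // d = max(|x|) / 127 over the block
    quants: [i8; QK8_0],
}

/// Quantize activations block-by-block (assumes a length that is a multiple of 32).
fn quantize_q8_0(values: &[f32]) -> Vec<Q8Block> {
    values
        .chunks(QK8_0)
        .map(|chunk| {
            let amax = chunk.iter().fold(0f32, |m, v| m.max(v.abs()));
            let scale = amax / 127.0;
            let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
            let mut quants = [0i8; QK8_0];
            for (q, v) in quants.iter_mut().zip(chunk) {
                *q = (v * inv).round() as i8;
            }
            Q8Block { scale, quants }
        })
        .collect()
}

/// Receiver side: expand each block back to f32 before the next segment's forward.
fn dequantize_q8_0(blocks: &[Q8Block]) -> Vec<f32> {
    blocks
        .iter()
        .flat_map(|b| b.quants.iter().map(move |&q| q as f32 * b.scale))
        .collect()
}

fn main() {
    let activations: Vec<f32> = (0..64).map(|i| (i as f32 * 0.1).sin()).collect();
    let blocks = quantize_q8_0(&activations);
    let restored = dequantize_q8_0(&blocks);
    let rms = (activations
        .iter()
        .zip(&restored)
        .map(|(a, b)| (a - b).powi(2))
        .sum::<f32>()
        / activations.len() as f32)
        .sqrt();
    println!("{} f32 values -> {} blocks, RMS error {rms:.5}", activations.len(), blocks.len());
}
```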
Persistent pipeline stream (persistent_pipeline_stream)
Replace per-token request/response with one long-lived libp2p bidirectional stream per pipeline session.
- Status: Landed behind flag. Wire-level verified; no measured latency win because the bottleneck was elsewhere (solved by remote-generate + batched prefill).
- Config: inference.persistent_pipeline_stream = true
Debugging slow inference
Default verbosity (-v) gives an INFO-level stream. Bump to -vv to
see per-request DIAG: logs, which include the per-feature speedup
signals:
./swarmllm run -vv 2>&1 | grep "DIAG:"
Key DIAG kinds:
- DIAG: prefix-cache HIT — local prefix cache hit
- DIAG: cross-node prefix HIT — cross-node prefix-KV fetch succeeded
- DIAG: prefix-probe: fetch timed out — cross-node fetch missed the window (see Troubleshooting for timeout sizing on 7B+ models)
- DIAG: served PrefixKvFetch ... hit=true — this node served a cross-node fetch
- DIAG: BatchGenerate — batched-prefill slot table activity
- DIAG: chunk fused batch_size=N — fused prefill chunks (Phase 4)
- DIAG: Parallax — Parallax scheduler decisions
For the full DIAG taxonomy and what each line means, see docs/DIAGNOSTICS.md.
When should I turn a speedup off?
Almost never. The default-on features degrade cleanly under edge cases — the prefix cache falls through to full prefill on a miss, cross-node fetch falls through to local prefill on a timeout, batched prefill falls back to sequential when concurrency is 1. If you suspect one is the cause of a regression:
- Prefix cache off: inference.prefix_cache_enabled = false
- Cross-node fetch off: inference.cross_node_prefix_trust_min = 2.0 (gates every peer out)
- Continuous batching off: inference.continuous_batching = false (also disables Phase 4 fusion)
- Phase 4 fusion off, keep continuous batching: inference.batched_prefill_forward = false
Please open an issue if a speedup is costing you — the benchmarks above are RTX 3070 + WSL2 + a specific set of models, so real-world workloads will surface corners the benches miss.