KwaaiNet: Distributed AI Inference on a P2P Fabric (SCaLE 23x)
The first talk I caught on Thursday at SCaLE 23x was Brian Ragazzi's workshop on KwaaiNet, and the central idea clicked immediately: what if you ran AI inference the way BitTorrent distributes files?
KwaaiNet is an open-source distributed inference network. The idea is that no single machine needs enough RAM to hold an entire model. Instead, different nodes hold different transformer layers, and a prompt routes through each node in sequence until it reaches the prediction head. The DHT (distributed hash table) tracks which nodes have which layers, the same way BitTorrent tracks which peers have which chunks. If I need layer 0, the DHT finds the node advertising it. If I need layer 20, it finds that one too.
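The layer-to-node mapping can be pictured with a toy sketch. This is not KwaaiNet's actual code, and the class and node names are made up; a real DHT shards this table across the peers themselves, but a plain dict shows the shape of the lookup:

```python
# Toy sketch of DHT-style layer lookup (hypothetical names, not KwaaiNet's API).
# A real DHT distributes this table across peers; a dict stands in here.

class LayerDHT:
    def __init__(self):
        self._table = {}  # layer index -> set of node ids advertising it

    def announce(self, node_id, layers):
        """A node advertises which transformer layers it holds."""
        for layer in layers:
            self._table.setdefault(layer, set()).add(node_id)

    def find_providers(self, layer):
        """Return the nodes that can serve a given layer."""
        return sorted(self._table.get(layer, set()))

dht = LayerDHT()
dht.announce("node-a", range(0, 16))   # node-a holds layers 0-15
dht.announce("node-b", range(16, 32))  # node-b holds layers 16-31
dht.announce("node-c", range(12, 20))  # overlapping ranges give redundancy

print(dht.find_providers(0))   # ['node-a']
print(dht.find_providers(16))  # ['node-b', 'node-c']
```

Overlapping announcements are the BitTorrent-style redundancy: if node-b drops off, layers 16-19 are still reachable via node-c.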
The reason this is non-trivial comes down to how LLM inference actually works. Transformer inference is fundamentally serial: generating token N requires the output of token N-1. You cannot parallelize the autoregressive loop across machines the way you can parallelize embarrassingly parallel workloads. What you can do is pipeline the transformer layers. During the prefill phase, each node processes the full prompt context in parallel within its layers, then forwards the resulting activations to the node holding the next layers; the key-value cache each layer produces stays resident on its own node. During the decode phase, each new token moves through the full layer chain, and every node's KV-cache blocks must remain in RAM for the whole generation. That is what Brian meant when he said "you need all the blocks in RAM." The parallelism is spatial across nodes, not temporal across tokens.
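A stripped-down pipeline makes the spatial-vs-temporal distinction concrete. Everything here is illustrative (the "layer math" is a placeholder and the node names are invented); the point is that prefill pushes the whole context through each stage, decode pushes one token at a time through all stages, and each stage's cache only ever grows:

```python
# Toy pipeline: stages live on different "nodes"; tokens are generated
# serially, but each token's activations flow through every stage in order.

def make_stage(name):
    cache = []  # stands in for this node's resident KV cache
    def stage(activations):
        cache.extend(activations)            # KV cache grows with context
        return [a + 1 for a in activations]  # placeholder "layer math"
    stage.cache = cache
    stage.name = name
    return stage

pipeline = [make_stage("node-a"), make_stage("node-b"), make_stage("node-c")]

def prefill(prompt):
    acts = list(prompt)
    for stage in pipeline:   # whole context per stage, parallel within it
        acts = stage(acts)
    return acts

def decode_step(token):
    acts = [token]
    for stage in pipeline:   # one token walks the full chain, serially
        acts = stage(acts)
    return acts[0]

prefill([1, 2, 3])
next_tok = decode_step(7)
# every node's cache now covers the full context plus the new token
```

If any stage evicted its cache mid-generation, the context would have to be reprocessed from scratch, which is why every block on the path has to stay in RAM.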
KwaaiNet builds on Petals, the framework that handles this layer distribution and routing. Nodes can appear and disappear; Hivemind, the underlying coordination layer, absorbs that churn through the DHT. It is the same basic resilience model as BitTorrent or Bitcoin's node discovery, where DNS seeds supply the initial peer list and the DHT handles routing from there.
The practical target is the home user who cannot afford to run a 70B model locally but has a few machines. Brian described this as the reason the project exists: running larger models across hardware that individually cannot handle them. From an infrastructure perspective it is compelling. Instead of one machine with 140GB of RAM or a GPU cluster, you potentially spread the model across a small homelab of modest machines.
The trust layer is more sophisticated than I expected. KwaaiNet implements a multi-layer trust graph using W3C Verifiable Credentials. Each node has a self-certifying did:peer: identity from an Ed25519 keypair. Nodes can earn Verifiable Credentials (summit badges, uptime proofs, fiduciary pledges from the GliaNet Foundation) that raise their trust score. The scoring applies time-decay so stale credentials matter less. Nodes with higher trust scores get priority routing and premium task allocation. This is how the network incentivizes reliable participants without requiring a central authority to vouch for nodes.
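The time-decay idea is easy to sketch. The weights and half-life below are my own invented numbers, not KwaaiNet's actual scoring parameters, but the shape is the standard exponential decay: a credential's contribution halves every fixed interval since it was issued:

```python
import time

# Hypothetical sketch of time-decayed credential scoring; the half-life
# and weights are invented, not KwaaiNet's documented values.

HALF_LIFE_DAYS = 90.0

def decayed_score(credentials, now=None):
    """Sum credential weights, halving each every HALF_LIFE_DAYS of age."""
    now = now or time.time()
    total = 0.0
    for issued_at, weight in credentials:
        age_days = (now - issued_at) / 86400
        total += weight * 0.5 ** (age_days / HALF_LIFE_DAYS)
    return total

now = time.time()
creds = [
    (now, 10.0),                # fresh summit badge: full weight
    (now - 90 * 86400, 10.0),   # 90-day-old uptime proof: worth half
    (now - 360 * 86400, 10.0),  # year-old pledge: worth ~1/16
]
score = decayed_score(creds, now)
```

A scheduler could then sort candidate nodes by this score when assigning priority routes, which is the incentive loop the talk described: stay credentialed and reliable, get better work.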
Brian mentioned a cryptocurrency token as a business incentive mechanism. Contributing compute earns tokens. The logic is similar to Filecoin for storage: make the economic incentive explicit so nodes have a reason to stay up and behave honestly.
The current implementation is in Go with a Rust/WASM runtime being developed to support browser, mobile, and embedded platforms. Performance on Apple Silicon with GGUF Q4_K_M quantization is around 33 tokens per second on an M4 Pro. The API is OpenAI-compatible, which means existing tools that speak the OpenAI protocol can point at KwaaiNet without modification.
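"OpenAI-compatible" means clients send the same JSON body they would send to OpenAI, just to a different base URL. The endpoint below is a placeholder I made up for illustration, not a documented KwaaiNet address; the payload shape is the standard chat-completions format any OpenAI-protocol tool emits:

```python
import json

# Sketch of the OpenAI-style chat request an existing client would send.
# The KwaaiNet endpoint URL is a hypothetical placeholder.

KWAAINET_ENDPOINT = "http://localhost:8000/v1/chat/completions"  # assumed

def build_request(model, prompt):
    """Build the JSON body any OpenAI-protocol client produces."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })

body = build_request("llama-70b", "Summarize the SCaLE 23x keynote.")
parsed = json.loads(body)
print(parsed["messages"][0]["role"])  # → user
```

Because only the base URL changes, existing SDKs, chat frontends, and agent frameworks need no code changes to target the network.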
The thing I keep thinking about for homelab use is the RAG angle. If each machine in a cluster holds both model layers and a piece of a shared vector database, you could have a system where a prompt routes through the inference pipeline and retrieval happens on each segment in parallel. Not quite how KwaaiNet works today, but the P2P coordination primitives are already there.
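The retrieval half of that idea is just scatter-gather over shards. This is speculative, as the paragraph above says, and the scores here are stand-ins rather than real embedding similarities, but the merge logic is the whole trick:

```python
from concurrent.futures import ThreadPoolExecutor

# Speculative sketch of sharded retrieval: each node scores only its local
# slice of the vector store, queried in parallel, results merged by score.
# Node names and scores are invented stand-ins.

shards = {  # node id -> list of (doc, similarity-score stand-in)
    "node-a": [("doc1", 0.2), ("doc2", 0.9)],
    "node-b": [("doc3", 0.7)],
    "node-c": [("doc4", 0.4), ("doc5", 0.8)],
}

def search_shard(node_id, k=2):
    """A node returns its local top-k; no shard sees the others' data."""
    docs = shards[node_id]
    return sorted(docs, key=lambda d: d[1], reverse=True)[:k]

def parallel_search(k=2):
    with ThreadPoolExecutor() as pool:
        partials = pool.map(search_shard, shards)   # fan out to all nodes
    merged = [d for part in partials for d in part]  # gather and re-rank
    return sorted(merged, key=lambda d: d[1], reverse=True)[:k]

top = parallel_search()
```

Unlike the inference pipeline, retrieval really is embarrassingly parallel, which is what makes pairing the two on the same homelab nodes appealing.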
Brian noted that distributed inference like this is probably not suitable for private data. When your prompt passes through nodes you do not control, the confidentiality guarantees are limited. The trust graph helps, but it is not end-to-end encryption of the inference path. For personal or sensitive workloads, the right answer is still a local model. KwaaiNet's target is community-scale inference on public or non-sensitive workloads: access to larger models you cannot self-host, and community infrastructure built outside of corporate control.
GitHub: github.com/Kwaai-AI-Lab/KwaaiNet
Related: Training a Small LLM · Agentic Workloads on Linux: Btrfs + Service Accounts · What I Learned at the Docling Workshop