The work is in the open at gonka-ai/vllm#24, opened on April 4 by gmorgachev and co-authored with tamazgadaev, baychak, clanster and qdanik. The PR is still marked [Draft] — intentionally, because three pieces of integration plumbing are not wired up yet. But the core port is in, the cross-validation experiments passed, and there is enough here to build working images and run inference end-to-end.
Why the port
Gonka's custom vLLM fork has been running on the 0.9.1 V0 engine, where PoC v2 and Enforced Sampling were built. vLLM upstream moved on — the V1 engine in 0.15.1 is a different beast architecturally, and the compatibility experiments in issue #730 back in March (covered separately in the March 24 post) only established that 0.15.1 could host Qwen3-235B-A22B at all. That was step one. Step two — porting the consensus-critical bits of the fork onto the new engine — is what PR #24 is about.
PoC v2 is the mechanism that lets a Gonka node prove it actually ran the inference it was paid for. Without it, the trustless model breaks. Enforced Sampling is the other half: it lets a validator replay the exact token sequence a worker produced, by pinning the sampler to a specific set of enforced_token_ids. Both need hooks deep inside the inference loop — layer hooks, GPU-side randomness, callback queues, a separate generate queue — and all of that had to be re-implemented against the V1 engine's abstractions.
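To make the replay half concrete, here is a minimal sketch of the pinning idea in plain PyTorch. This is not the PR's code: the class name, the callable shape (previously generated token ids plus a logits vector, in the style of V0-era logits processors), and the masking strategy are all illustrative.

```python
import torch

class EnforcedSampler:
    """Illustrative sketch: force each decode step to a validator-supplied
    token id. Not the PR's implementation; it shows the replay idea only."""

    def __init__(self, enforced_token_ids: list[int]):
        self.enforced_token_ids = enforced_token_ids

    def __call__(self, generated_token_ids: list[int],
                 logits: torch.Tensor) -> torch.Tensor:
        step = len(generated_token_ids)
        if step < len(self.enforced_token_ids):
            forced = self.enforced_token_ids[step]
            # Mask every logit except the enforced token, so any sampling
            # strategy (greedy or stochastic) reproduces the worker's token.
            mask = torch.full_like(logits, float("-inf"))
            mask[forced] = 0.0
            logits = logits + mask
        return logits
```

Pinning at the logits level rather than bypassing the sampler matters: the validator still sees the logprobs the model assigned to the enforced tokens, which is the kind of signal the cross-validation experiments compare.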
What's in the PR
The diff is 34 commits across 47 files, +5462 lines and -62. That's mostly additive: the V1 engine was built for a different world, so most of the PoC v2 machinery lands as new files rather than edits to existing ones.
Specifically, the PR ports:
- Layer hooks — the instrumentation that lets PoC v2 observe intermediate activations inside the model. This is the bit most sensitive to engine changes; a sketch of the idea follows this list.
- GPU random generation — the on-device RNG that makes PoC v2 reproducible across workers.
- Callback queue and generate queue — the validation-aware request scheduling.
- Enforced Sampling — replay-by-token-ids sampling for validators.
- Cross-validation experiments — extensive side-by-side comparisons between PoC v2 output and real inference output, to verify that the two paths agree. This is the hardest thing to check and the reason the PR took 34 commits to get right.
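For a flavor of what "layer hooks" means, here is a hypothetical sketch using vanilla PyTorch forward hooks: fingerprint the activations flowing out of selected submodules. The module filter, the fingerprint function, and the record format are all assumptions; the PR's actual instrumentation is engine-specific.

```python
import hashlib
import torch.nn as nn

def attach_activation_hooks(model: nn.Module, records: list):
    """Hypothetical sketch: record a fingerprint of each hooked layer's
    output. The filter, digest, and record shape are illustrative."""
    handles = []

    def make_hook(name: str):
        def hook(module, inputs, output):
            tensor = output[0] if isinstance(output, tuple) else output
            # Cheap, deterministic fingerprint of the activation tensor.
            digest = hashlib.sha256(
                tensor.detach().float().cpu().numpy().tobytes()
            ).hexdigest()
            records.append((name, digest))
        return hook

    for name, module in model.named_modules():
        if name.endswith(".mlp"):  # illustrative choice of observation points
            handles.append(module.register_forward_hook(make_hook(name)))
    return handles  # caller removes with h.remove() when PoC is done
```

Even this toy version shows why the hooks are fragile: module names and output shapes are exactly the things that change between engine generations, which is why the instrumentation had to be re-implemented for V1.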
What's still rough
The Draft status is honest. Three pieces are flagged as not-yet-done:
- Chat priority gating (commit 4ea4882) — when PoC is running, regular inference requests have to be rejected to avoid NCCL deadlocks. The hook exists but is not wired into the request dispatch path yet.
- Grammar graceful degradation (commit 134609f) — if a validator's enforced_token_ids conflict with a grammar-constrained decoding FSM, the current code fails repeatedly. The fix is to detect the conflict and disable grammar decoding for the affected request instead of retrying; a sketch of that guard follows this list.
- Logprobs mode auto-detection — the validation experiments showed that raw_logprobs are significantly more stable than processed_logprobs for cross-validation, but vLLM 0.9.1 defaulted to processed. The port needs logic to classify the logprobs type at runtime and switch modes automatically.
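The intended fix for the grammar item can be sketched as a guard in the decode step. Everything below is hypothetical: the attribute and method names are invented for illustration, and the real change lands inside the V1 structured-output path.

```python
import logging

log = logging.getLogger(__name__)

def apply_enforced_token(request, grammar_fsm, enforced_token_id: int) -> int:
    """Hypothetical sketch of graceful degradation: if the validator's
    enforced token falls outside the grammar FSM's allowed set, disable
    grammar for this request instead of retrying and failing repeatedly."""
    if request.grammar_enabled:
        allowed = grammar_fsm.allowed_token_ids(request.fsm_state)
        if enforced_token_id not in allowed:
            # Replay fidelity wins over the grammar constraint.
            request.grammar_enabled = False
            log.warning(
                "enforced token %d not accepted by grammar; disabling "
                "grammar decoding for request %s",
                enforced_token_id, request.request_id,
            )
    return enforced_token_id
```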
The A100 issue, said plainly
Cross-GPU validation between Ampere (A100) and Hopper/Blackwell can produce logprob divergence above the match threshold, especially on long-context prompts. Translated: nodes running on A100s have a higher probability of getting their inference invalidated than nodes on newer cards. The expected fix is the raw_logprobs switch from the third TODO above — that should close the gap. Until the switch lands, A100 operators should expect more validation misses.
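For intuition about what "divergence above the match threshold" means mechanically, here is a toy per-token logprob comparison. Both the statistic and the threshold value are assumptions; the PR defines the real criterion.

```python
def logprobs_agree(worker_lp: list[float], validator_lp: list[float],
                   threshold: float = 0.05) -> tuple[bool, float]:
    """Toy comparison: mean absolute per-token logprob difference against a
    match threshold. The aggregation and 0.05 are illustrative only."""
    assert len(worker_lp) == len(validator_lp) and worker_lp
    diffs = [abs(w - v) for w, v in zip(worker_lp, validator_lp)]
    mean_diff = sum(diffs) / len(diffs)
    return mean_diff <= threshold, mean_diff
```

In these terms, the A100 problem is that kernel differences between Ampere and Hopper/Blackwell shift the logprobs themselves; the experiments suggest processed_logprobs amplify that shift where raw_logprobs mostly do not, which is why the switch is expected to close the gap.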
Trying it early
The branch builds a Docker image via Dockerfile.quick in the repo root. Prebuilt vLLM and MLNode images are available (see the PR description for links). Inside the MLNode container, the model server is launched with the Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 weights, FLASHINFER attention backend, tensor-parallel-size 4, and max-model-len 240000. The PR author's note on the deployment setup: "being cleaned up and will be simplified soon."
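If you want to poke at the engine directly rather than through MLNode, an equivalent offline invocation looks roughly like this. This is a generic vLLM launch with the parameters listed above, not the container's actual entrypoint, and it assumes a 4-GPU host:

```python
import os

# Select the FlashInfer attention backend before vLLM is imported.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
    tensor_parallel_size=4,   # four GPUs, as in the MLNode setup
    max_model_len=240000,     # long-context config from the PR
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```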
What's next
Three named TODOs, one known issue, and then the PR moves out of Draft. After that, the MLNode image that ships with v0.2.12 (currently on testnet) can start running on the V1 engine instead of V0 — catching up several upstream releases' worth of vLLM improvements, and, more importantly, opening the door to features that only exist in the newer engine.