Bear the Tokens: FP16 and Suffix Decoding to 9,000+ Tokens a Second on a Tesla T4

⚠ TL;DR

H2Loop ran a contest called Bear the Tokens: serve Qwen2.5-0.5B on a free Colab Tesla T4 and push output throughput as high as it will go, under hard latency limits. I started where everyone starts, quantizing the weights to INT4, and hit a wall around 4,000 tok/s. The thing that actually broke the ceiling was the opposite of shrinking weights: keep them in FP16 and bolt on model-free suffix speculative decoding from Arctic Inference. That hit 9,052 tok/s measured on Colab and 7,800 tok/s on H2Loop's verified re-run, with the best raw latency in the field and a podium finish. This is the whole journey, the theory, and the exact config. And because I could not let it go, there is an epilogue where I rebuild the suffix drafter from scratch and it lands at parity with the production library.

The contest

The brief was refreshingly narrow. One model, one GPU, one number to maximize.

Model: must be served under the name Qwen/Qwen2.5-0.5B. The actual weights can be anything you like as long as the tokenizer matches.
Hardware: a Google Colab Tesla T4. 16 GB, Turing (sm_75), about 320 GB/s of memory bandwidth, about 65 TFLOPS of peak FP16 tensor-core throughput. Fixed. No swapping in a better card.
Harness: vLLM 0.15.1's vllm bench serve, pinned. Your server can be any framework that speaks the OpenAI HTTP API.
Score: output_throughput in tokens per second, nothing else.
Gates: P99 TPOT (time per output token) under 50 ms, P99 TTFT (time to first token) under 2000 ms, and the outputs have to stay correct.
Budget: about 4 hours on a free T4 before Colab gets bored of you.

The benchmark itself is worth understanding, because every decision flows from it. It fires 200 prompts at the server with a concurrency of 50, each prompt exactly 512 input tokens, each completion exactly 512 output tokens, greedy sampling, and ignore_eos set so the model is forced to emit all 512 tokens whether it wants to or not.

So the score is just arithmetic: 200 times 512 is 102,400 output tokens, divided by however many seconds of wall clock it takes to produce them all. Make the wall clock smaller, win. Everything below is a different idea about how to make that wall clock smaller without tripping a latency gate.
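The arithmetic is small enough to sketch in Python. The function names here are illustrative and not part of the vllm bench serve harness; the token counts come straight from the benchmark description.

```python
# Illustrative helpers; the names are assumptions, not part of the harness.
NUM_PROMPTS = 200
OUTPUT_TOKENS_PER_PROMPT = 512  # forced by ignore_eos


def output_throughput(wall_clock_s: float) -> float:
    """Output tokens per second for one full benchmark run."""
    total_tokens = NUM_PROMPTS * OUTPUT_TOKENS_PER_PROMPT  # 102,400
    return total_tokens / wall_clock_s


def wall_clock_for(tok_per_s: float) -> float:
    """Invert the score: seconds of wall clock implied by a throughput."""
    return NUM_PROMPTS * OUTPUT_TOKENS_PER_PROMPT / tok_per_s


if __name__ == "__main__":
    print(f"{wall_clock_for(4084):.1f} s at 4,084 tok/s")
    print(f"{wall_clock_for(9052):.1f} s at 9,052 tok/s")
```

Read that way, moving from roughly 4,000 to roughly 9,000 tok/s means the whole run shrinks from about 25 seconds of wall clock to about 11.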

The T4 is the whole problem

Before any optimization, you have to internalize what this specific GPU can and cannot do, because the T4 is a 2018 Turing card and half the modern inference toolbox simply does not run on it. The contest is as much about the hardware floor as it is about cleverness.

Capability	On the T4 (sm_75)?
FlashInfer attention	Yes, the vLLM 0.15.1 default
FP16 inference	Yes
AWQ GEMM INT4 weights	Yes
CUDA graphs (non-piecewise)	Yes, and they matter a lot
FP8 KV cache (e5m2)	Yes
Marlin INT4 kernel	No, needs sm_80+ (Ampere)
FlashAttention 2	No, needs sm_80+
FP8 compute	No, needs sm_89+ (Ada)
BF16	No, the T4 has no BF16 tensor cores
SGLang piecewise CUDA graphs	No, the RMSNorm kernel is not compiled for sm_75

That Marlin line is the painful one. Marlin is the fast INT4 kernel, often several times quicker than the older AWQ GEMM path, and it is exactly the thing you would reach for to make a quantized model fly. It needs Ampere. On Turing you are stuck with the slower kernels, which quietly sets the ceiling for the entire "just quantize it" strategy before you have written a line of code.

The model is small enough to keep the whole picture in your head: Qwen2.5-0.5B is about 494M parameters, 24 layers, hidden size 896, grouped-query attention with 14 query heads over 2 KV heads, head dim 64. In FP16 the weights are about 1 GB. In INT4 they are about 200 MB. Hold onto those two numbers, because the gap between them is the entire premise of the first approach, and the reason that approach eventually stops mattering.

First instinct: shrink the weights

Decoding one token at a time is memory-bound. For every single token, the GPU streams the full set of weights out of VRAM, does a relatively tiny amount of math, and writes one token back. The compute units spend most of their life waiting on memory. The textbook fix is to make the weights smaller so each token reads fewer bytes, and the textbook tool is INT4 quantization.
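A back-of-envelope version of that argument, using only the parameter count and bandwidth quoted earlier. It counts weight reads alone and ignores KV-cache traffic, activations, launch overhead, and compute, so treat it as an idealized floor rather than a prediction of measured numbers.

```python
# Back-of-envelope only: weight reads at peak bandwidth, nothing else.
PARAMS = 494e6            # Qwen2.5-0.5B parameter count quoted above
BANDWIDTH_BPS = 320e9     # T4 memory bandwidth, about 320 GB/s


def weight_read_ms(bytes_per_param: float) -> float:
    """Milliseconds to stream every weight once at peak bandwidth."""
    return PARAMS * bytes_per_param / BANDWIDTH_BPS * 1e3


fp16_ms = weight_read_ms(2.0)   # 16 bits per weight
int4_ms = weight_read_ms(0.5)   # nominal 4 bits per weight, scales and zero points ignored
```

Under these assumptions one forward pass costs about 3.1 ms of weight reads in FP16 and about 0.8 ms in INT4. That fourfold gap is what the quantization approach is trying to cash in.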

I went with AWQ, activation-aware weight quantization, which picks per-channel scales so the 4-bit rounding lands where it hurts the least.

● ● ●PYTHON

AWQ_CONFIG = {
    "zero_point":   True,
    "q_group_size": 128,
    "w_bit":        4,
    "version":      "GEMM",
}
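To make the w_bit, q_group_size, and zero_point fields concrete, here is a toy sketch of group-wise asymmetric 4-bit rounding in plain Python. It is only the rounding step: it does not include AWQ's activation-aware scale search, and the function names are hypothetical rather than anything from a quantization library.

```python
# Toy group-wise asymmetric quantization (w_bit=4, zero_point=True).
# This is NOT AWQ itself: the activation-aware scaling is left out.
from typing import List, Tuple


def quantize_group(weights: List[float], w_bit: int = 4) -> Tuple[List[int], float, int]:
    """Quantize one group to unsigned integer codes with a scale and zero point."""
    qmax = (1 << w_bit) - 1                      # 15 for 4 bits
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0              # fall back to 1.0 for a flat group
    zero = round(-lo / scale)
    codes = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def dequantize_group(codes: List[int], scale: float, zero: int) -> List[float]:
    """Recover approximate weights from the integer codes."""
    return [(c - zero) * scale for c in codes]


def quantize_row(row: List[float], group_size: int = 128, w_bit: int = 4):
    """Split a weight row into groups of group_size and quantize each one separately."""
    return [quantize_group(row[i:i + group_size], w_bit)
            for i in range(0, len(row), group_size)]
```

Each group of 128 weights carries its own scale and zero point, which is why the on-disk size ends up a little above a flat 4 bits per weight.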

I then served the quantized model through vLLM with FlashInfer attention and CUDA graphs on. After cross-validating more than 35 flag combinations, the best self-test landed around 4,084 tok/s (about 3,569 tok/s verified), and the top twenty configurations all sat within about 5 percent of each other. That clustering is the tell: I was not leaving much on the table, I was sitting against a wall.

A few things the sweep taught me, which are useful no matter what you serve on a T4:

CUDA graphs are not optional. Turning on enforce-eager, which disables them, roughly halved throughput. The per-kernel launch overhead on this many tiny ops is enormous, and graph capture is what hides it.
You cannot be greedy with VRAM. A gpu-memory-utilization of 0.95 or higher OOMs on Colab. 0.90 to 0.92 is the safe shelf.
You cannot be greedy with concurrency either. Pushing max-num-seqs past about 256 exhausts the KV cache and the server falls over.
A 2 GB CPU swap-space consistently beat 0 by a couple of percent, basically free.

The ceiling was real and it was about 4,100 tok/s. Quantization had done its job and then run out of room. The Marlin kernel that would have unlocked the next tier does not exist for Turing. So the bandwidth story was over, and I needed a different story.

Dead ends, told honestly

Before the good idea, two detours that did not pan out, because a build log that only shows the wins is lying to you.

GPTQ with the ExLlamaV2 kernel. Another INT4 path, and a good one, but ExLlamaV2 is tuned for single-user, low-batch, interactive decoding. At a batch of 50 concurrent requests it loses to AWQ's GEMM kernel. Best I saw was about 3,243 tok/s. Wrong tool for a throughput contest.

SGLang. On paper SGLang is a serious vLLM competitor with aggressive scheduling. On a T4 it is death by a thousand workarounds. Its AWQ path crashes with a CUDA device-side assert. Its FP16 path runs, but the moment the benchmark sends ignore_eos the server dies. Its piecewise CUDA graphs do not compile because the sgl_kernel RMSNorm is not built for sm_75. Every fix for a T4 problem disabled one of the speed features that made SGLang attractive in the first place. With all the workarounds stacked up, throughput limped in at 700 to 941 tok/s. I stopped.

The real lever: stop taking so many forward passes

Here is the reframe that changed everything. The cost of generating a sequence is dominated by the number of forward passes through the model, because each pass pays that full weight-read tax. Quantization makes each pass cheaper. But there is another axis entirely: produce more than one token per pass.

That is speculative decoding. You cheaply guess several future tokens, run them through the model in a single batched forward pass to check them all at once, and keep the longest correct prefix. When the guesses are good, you get several tokens for the price of one pass. Under greedy sampling it is exactly lossless: every accepted token is the one the model would have produced anyway, and the verification step just confirms it in bulk.

vLLM ships an n-gram drafter, also called prompt-lookup decoding. There is no draft model at all: it searches recent context for a matching n-gram and proposes whatever followed it last time. On a workload as repetitive as this benchmark, that is shockingly effective. Bolting it onto the AWQ server jumped throughput to about 5,070 tok/s. The first real break past the quantization ceiling, and it came from doing fewer passes, not cheaper ones.
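A minimal sketch of the idea, assuming a toy next_token callable that stands in for the model's greedy choice. This is not vLLM's implementation: a real engine scores every draft position in one batched forward pass, whereas the loop below calls the stand-in model position by position purely for readability.

```python
from typing import Callable, List, Sequence


def ngram_draft(context: Sequence[int], n: int = 3, k: int = 5) -> List[int]:
    """Propose up to k tokens that followed the most recent earlier match of the last n tokens."""
    if len(context) <= n:
        return []
    tail = list(context[-n:])
    # Scan backwards, skipping the trailing n-gram itself.
    for start in range(len(context) - n - 1, -1, -1):
        if list(context[start:start + n]) == tail:
            return list(context[start + n:start + n + k])
    return []


def verify_greedy(context: Sequence[int], draft: Sequence[int],
                  next_token: Callable[[List[int]], int]) -> List[int]:
    """Keep the longest draft prefix that matches greedy decoding, plus one model token."""
    seq = list(context)
    accepted: List[int] = []
    for tok in draft:
        target = next_token(seq)
        if tok != target:
            accepted.append(target)    # the model's own token replaces the bad guess
            return accepted
        accepted.append(tok)
        seq.append(tok)
    accepted.append(next_token(seq))   # bonus token when the whole draft matched
    return accepted
```

With a perfectly periodic toy model, a four-token draft yields five tokens from one verification step. With a wrong guess in the second position, the step still yields two correct tokens, which is why a bad draft costs time but never correctness.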

Then I added chunked prefill. With 50 requests in flight, the 512-token prefills and the decode steps fight for the same GPU. Chunked prefill slices each prefill into small pieces and interleaves them with decode, so a burst of prefills cannot stall the decode-and-verify loop. That pushed the same INT4 setup to about 5,955 tok/s verified, with P99 TTFT dropping from a scary 1357 ms to a comfortable 354 ms. This was the best INT4 configuration I found.
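The interleaving can be pictured with a toy step planner. This is a deliberately simplified model of the idea, not vLLM's scheduler: the function name is hypothetical, each decoding request is counted as one token (with speculation it would also carry its draft tokens), and the token budget plays the role of max-num-batched-tokens.

```python
from typing import Dict, List


def schedule_step(decode_reqs: int, prefill_remaining: List[int],
                  token_budget: int) -> Dict[str, object]:
    """Plan one engine step: decode tokens first, then prefill chunks with what is left."""
    budget = token_budget - decode_reqs        # one token per decoding request
    chunks: List[int] = []
    for remaining in prefill_remaining:        # tokens still to prefill, per waiting prompt
        if budget <= 0:
            break
        take = min(remaining, budget)          # a prompt may be split across steps
        chunks.append(take)
        budget -= take
    return {"decode_tokens": decode_reqs, "prefill_chunks": chunks}
```

For example, with 40 requests decoding, three 512-token prompts waiting, and a budget of 1,024 tokens, the step carries all 40 decode tokens plus chunks of 512 and 472 prefill tokens. The decode requests are never pushed out of the step by the prefill burst.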

But I was still dragging INT4 around, and that was about to stop making sense.

The pivot: drop quantization, go FP16 plus suffix decoding

Walk through the arithmetic of what speculation does to the bandwidth argument.

The whole case for INT4 was that decoding is memory-bound, so smaller weights mean less to stream per token. But once a good speculator is accepting around 12 tokens per forward pass, you are doing roughly 12 times fewer passes per output token, which means the per-token weight-read bandwidth has already dropped by about 12 times. The thing INT4 was saving you from is now a twelfth of what it was. Meanwhile INT4 still costs you a dequantization step on every pass, and on Turing it is stuck on the slow kernels because Marlin will not run. The savings shrank to almost nothing and the overhead stayed.
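The same argument as a few lines of arithmetic. The byte counts are nominal (494M parameters at 2 bytes or half a byte each, quantization metadata ignored), and the helper name is illustrative.

```python
def weight_bytes_per_token(weight_bytes: float, accepted_per_pass: float) -> float:
    """Weight bytes streamed per output token, amortized over the tokens one pass yields.

    Batching divides the FP16 and INT4 cases by the same factor, so it is left out.
    """
    return weight_bytes / accepted_per_pass


FP16_BYTES = 494e6 * 2      # about 1 GB
INT4_BYTES = 494e6 * 0.5    # nominal 4 bits per weight

no_spec_saving = weight_bytes_per_token(FP16_BYTES, 1) - weight_bytes_per_token(INT4_BYTES, 1)
spec_saving = weight_bytes_per_token(FP16_BYTES, 12) - weight_bytes_per_token(INT4_BYTES, 12)
```

Under these assumptions INT4 saves about 741 MB of weight traffic per token without speculation, but only about 62 MB per token once 12 tokens are accepted per pass. The dequantization cost per pass does not shrink with it.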

So the move is to stop quantizing. Serve the weights in plain FP16, which on a 0.5B model is only about 1 GB and fits the T4 with room to spare, and spend the entire optimization budget on a better speculator instead.

The better speculator is suffix decoding from Snowflake's Arctic Inference. Instead of a tiny draft model or a simple n-gram lookup, it maintains a global suffix tree over tokens it has recently seen. To draft, it finds the longest suffix of the current context that exists in the tree, then walks the most frequent continuation, keeping tokens while their cumulative probability stays above a floor. The model verifies the whole draft in one pass and keeps the matching prefix. It is model-free, it needs only CPU memory for the tree, and under greedy decoding it is lossless by construction.
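A minimal, dictionary-based sketch of that drafting rule, written for clarity rather than speed. It is not Arctic Inference's implementation: the class and method names are hypothetical, and it keeps a single depth-limited trie without the per-request bookkeeping a production drafter needs.

```python
from typing import Dict, List, Optional, Sequence


class Node:
    __slots__ = ("count", "children")

    def __init__(self) -> None:
        self.count = 0
        self.children: Dict[int, "Node"] = {}


class SuffixTrie:
    """Counts every token window of up to max_depth tokens seen so far."""

    def __init__(self, max_depth: int = 24) -> None:
        self.root = Node()
        self.max_depth = max_depth

    def add(self, tokens: Sequence[int]) -> None:
        for start in range(len(tokens)):
            node = self.root
            node.count += 1
            for tok in tokens[start:start + self.max_depth]:
                node = node.children.setdefault(tok, Node())
                node.count += 1

    def _find(self, suffix: Sequence[int]) -> Optional[Node]:
        node: Optional[Node] = self.root
        for tok in suffix:
            node = node.children.get(tok)
            if node is None:
                return None
        return node

    def draft(self, context: Sequence[int], max_tokens: int = 32,
              min_prob: float = 0.1) -> List[int]:
        # Longest suffix of the context that exists in the trie and has a continuation.
        node = None
        for length in range(min(len(context), self.max_depth), 0, -1):
            cand = self._find(context[-length:])
            if cand is not None and cand.children:
                node = cand
                break
        if node is None:
            return []
        out: List[int] = []
        prob = 1.0
        while len(out) < max_tokens and node.children:
            tok, child = max(node.children.items(), key=lambda kv: kv[1].count)
            prob *= child.count / node.count   # cumulative probability of the walk
            if prob < min_prob:
                break
            out.append(tok)
            node = child
        return out
```

After adding a sequence that repeats 1, 2, 3, 4 three times, a context ending in 1, 2 drafts 3, 4, 1, 2 when capped at four tokens. Raising min_prob to 0.9 cuts the draft to 3, 4, because the continuation past the first full repetition was only seen two times out of three.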

Why does it work so absurdly well here? Be honest about it: the benchmark is the perfect customer. Greedy sampling plus ignore_eos on a tiny 0.5B model produces highly repetitive output, the same phrases and structures over and over, and a suffix tree eats repetition for breakfast. On this workload the mean accepted draft length came out to 12.27 tokens per forward pass. That is the whole ballgame. About a twelvefold reduction in forward passes, on a model small enough that FP16 was never the bottleneck. The number jumped to 9,052 tok/s.

The winning configuration

Here is the part you can actually reproduce. The install order matters, because Arctic Inference ships the suffix-tree implementation but pins an older vLLM, and the contest requires 0.15.1. Install Arctic first, then re-pin vLLM on top of it. vLLM 0.15.1 exposes the suffix speculative method and uses Arctic's tree underneath.

● ● ●BASH

pip install -q "arctic-inference[vllm]" h2loop_bench
pip install -q vllm==0.15.1

One environment variable, to keep the CUDA allocator from fragmenting under the mix of KV blocks, graph captures, and verification buffers:

● ● ●PYTHON

import os
# Set this before the server starts so the vLLM processes inherit it.
os.environ["PYTORCH_ALLOC_CONF"] = "expandable_segments:True"

And the server command itself:

● ● ●BASH

vllm serve Qwen/Qwen2.5-0.5B \
  --dtype float16 \
  --max-model-len 1152 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 16384 \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.85 \
  --stream-interval 8 \
  --generation-config vllm \
  --override-generation-config '{"temperature": 0.0, "top_p": 1.0, "repetition_penalty": 1.0}' \
  --speculative-config '{"method": "suffix", "num_speculative_tokens": 32, "suffix_decoding_max_tree_depth": 24, "suffix_decoding_max_cached_requests": 10000, "suffix_decoding_max_spec_factor": 1.0, "suffix_decoding_min_token_prob": 0.1}' \
  --disable-log-requests \
  --disable-log-stats \
  --uvicorn-log-level warning \
  --port 8000

The speculative config broken out, since it is the heart of it:

● ● ●JSON

{
  "method": "suffix",
  "num_speculative_tokens": 32,
  "suffix_decoding_max_tree_depth": 24,
  "suffix_decoding_max_cached_requests": 10000,
  "suffix_decoding_max_spec_factor": 1.0,
  "suffix_decoding_min_token_prob": 0.1
}
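Hand-typing JSON inside shell quotes is an easy place to slip, so one option is to build the flag from a Python dict. This is a convenience sketch using only the standard library; the keys and values are the ones shown above.

```python
import json
import shlex

spec_config = {
    "method": "suffix",
    "num_speculative_tokens": 32,
    "suffix_decoding_max_tree_depth": 24,
    "suffix_decoding_max_cached_requests": 10000,
    "suffix_decoding_max_spec_factor": 1.0,
    "suffix_decoding_min_token_prob": 0.1,
}

# shlex.quote wraps the JSON so the shell passes it through as a single argument.
flag = "--speculative-config " + shlex.quote(json.dumps(spec_config))
print(flag)
```

The printed string can be pasted into the vllm serve command in place of the hand-written flag.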

Three flags deserve a note because they are not obvious and each one bought real throughput:

--stream-interval 8. This is the sleeper. Once you are generating around 9,000 tokens a second, the per-token work of detokenizing and flushing each token over HTTP becomes its own bottleneck: the client cannot drink from the firehose one drop at a time. Flushing in batches of 8 removes that overhead. The text the client receives is identical, only the framing changes.
--enable-chunked-prefill. Same reason as before. It keeps a wave of 512-token prefills from starving the decode-and-verify batch.
--max-num-seqs 64. A deliberate breakpoint, not a round number. It keeps the speculator engaged for the contest's batch of 50 without letting the scheduler balloon the verification batch into instability.
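The effect of the stream interval can be pictured with a small generator. This is a toy model of batched flushing with a hypothetical function name, not vLLM's streaming code.

```python
from typing import Iterable, Iterator, List


def batched_flush(pieces: Iterable[str], interval: int = 8) -> Iterator[str]:
    """Yield concatenated text every `interval` pieces, then flush any remainder."""
    buf: List[str] = []
    for piece in pieces:
        buf.append(piece)
        if len(buf) == interval:
            yield "".join(buf)
            buf.clear()
    if buf:
        yield "".join(buf)
```

For a 512-token completion, an interval of 8 turns 512 separate flushes into 64, and joining the chunks gives back exactly the same text.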

What did not help, and where the ceiling actually is

Once suffix decoding was in, the temptation was to crank every knob higher. Almost none of it helped, and the reason is instructive.

The mean accepted length is about 12 tokens because that is how far ahead the model itself is predictable on this workload, not because the drafter is holding back. So raising suffix_decoding_max_tree_depth past 24, or pushing suffix_decoding_max_spec_factor above 1.0, just drafts longer guesses that get rejected past the twelfth token. You pay to build and verify a bigger draft and accept the same number of tokens. Worse, at 50 concurrent requests those oversized drafts blow up the verification batch, and the engine core OOMs and crashes outright; every aggressive sweep config came back at 0.0 tok/s. Dropping suffix_decoding_min_token_prob to 0.0 let noise tokens into the draft, which is its own kind of slower.

The takeaway from all that knob-twisting was that the workload and the silicon set the ceiling, not the configuration. Which raised a question I could not resist answering after the contest was over: if the drafter is doing all the work, could I just build it myself? That is the epilogue.

A note on honesty: 9,052 versus 7,800

Two numbers show up in this post and they deserve an explanation, because the gap is real and it is not a rounding error.

The 9,052 tok/s is my own measurement, run on the free Colab T4 inside my notebook. The 7,800 tok/s is H2Loop's verified score, the same config re-run on their own T4 for ranking. Same code, same flags, different silicon temperature.

A free Colab T4 is a shared, thermally throttled card. During my runs I watched the SM clock sag into the 600 to 800 MHz range against a 1590 MHz boost ceiling, which on a compute-bound run like this costs roughly a thousand tokens a second on its own. A cool, dedicated T4 holds its clocks and clears 9k comfortably. So the verified number is the one that counts for the contest, and the self-test number is what the same configuration does when the card is not cooking itself. Both are true, and the methodology is the difference.

Metric	Verified (H2Loop T4)	Self-test (Colab free T4)
Output throughput	7,800 tok/s	9,052 tok/s
Mean TPOT	-	4.77 ms
P99 TPOT	6.54 ms (limit 50)	6.54 ms
Mean TTFT	-	193.91 ms
P99 TTFT	353.73 ms (limit 2000)	353.73 ms
Completed	200 / 200	200 / 200
Overall score	35.0	-

The whole journey, on one line

Every stage was a different theory of where the time was going, and the throughput at each step tells the story better than I can.

Approach	Throughput	P99 TPOT	Why it landed there
AWQ INT4, best of 35+ configs	~4,084 tok/s	~12 ms	bandwidth play, hard ceiling near 4,100
GPTQ ExLlamaV2	~3,243 tok/s	13.2 ms	single-user kernel, loses at batch 50
SGLang with T4 workarounds	700-941 tok/s	-	speed features disabled on sm_75
AWQ + n-gram spec	~5,070 tok/s	13.66 ms	first jump past the ceiling
AWQ + n-gram + chunked prefill	~5,955 tok/s	11.24 ms	best INT4 config
FP16 + suffix decoding	9,052 tok/s	6.54 ms	the winner

That FP16 result posted the best raw throughput and the lowest P99 TPOT of anything I tried, and it earned a podium finish in the contest. H2Loop announced the results on LinkedIn. Genuinely a fun problem to chase for four hours at a time, and a clean example of a result you only reach by being willing to throw out the approach you started with.

Epilogue: building the drafter from scratch

The contest was over, but I could not leave it alone. Suffix decoding had done all the heavy lifting and I had used it as a library, a black box that made the number go up. I wanted to know whether I actually understood it, and there is only one honest way to find out: throw away the library and rebuild it yourself.

So I wrote my own suffix-decoding drafter, a from-scratch suffix tree and all, and dropped it into stock vLLM in place of Arctic's. The one rule I set was that I would not fork vLLM or edit a single line of it on disk. The patch had to inject itself the clean way.

It turns out vLLM hands you exactly the hook for this: a plugin entry point called vllm.general_plugins that it loads inside every process it spawns, the API server, the engine core, and the workers. A register() function in my package runs in all of them, and it does two small, surgical things: it flips the internal flag that gates the native method="suffix" path, and it rebinds vLLM's SuffixDecodingProposer to my own class.

● ● ●PYTHON

def register():
    # vLLM loads this in every process via the vllm.general_plugins entry point.
    import vllm.utils.import_utils as import_utils
    import_utils.has_arctic_inference = lambda: True        # un-gate method="suffix"

    import vllm.v1.worker.gpu_model_runner as gmr
    from t4_spec_patch.proposer import T4SuffixProposer
    gmr.SuffixDecodingProposer = T4SuffixProposer            # swap in my drafter

Behind that swap is the real work: the suffix tree. I wrote it as a Numba-JIT structure-of-arrays, a single open-addressing hash table mapping each (node, token) edge to a child node, with per-node visit counts and a cached most-frequent child. Building it that way, flat arrays and a hand-rolled hash instead of Python objects, is what keeps the append-and-draft bookkeeping off the critical path while the GPU is busy. Drafting itself is a greedy walk: find the longest suffix of the current context that exists in the tree, then follow the most-frequent child as long as the running probability stays above a floor.
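The edge table can be sketched in plain Python with lists standing in for flat arrays. This is an illustration of the open-addressing layout only, not the Numba code described above: the class name and hash constants are assumptions, it does not resize, and it leaves out the cached most-frequent child.

```python
from typing import List

EMPTY = -1


class EdgeTable:
    """Open-addressing map from a (node, token) edge to a child node id."""

    def __init__(self, capacity: int = 1024) -> None:
        # Capacity must be a power of two so that masking acts as a modulo.
        assert capacity > 0 and capacity & (capacity - 1) == 0
        self.mask = capacity - 1
        self.key_node: List[int] = [EMPTY] * capacity
        self.key_token: List[int] = [EMPTY] * capacity
        self.child: List[int] = [EMPTY] * capacity
        self.counts: List[int] = [0]       # per-node visit counts; node 0 is the root
        self.size = 0

    def _slot(self, node: int, token: int) -> int:
        i = (node * 1000003 ^ token * 998244353) & self.mask
        while self.key_node[i] != EMPTY and (
            self.key_node[i] != node or self.key_token[i] != token
        ):
            i = (i + 1) & self.mask        # linear probing
        return i

    def get(self, node: int, token: int) -> int:
        """Child id for the edge, or EMPTY if the edge has never been seen."""
        return self.child[self._slot(node, token)]

    def get_or_add(self, node: int, token: int) -> int:
        i = self._slot(node, token)
        if self.key_node[i] == EMPTY:
            if self.size >= self.mask:     # always keep one free slot so probing terminates
                raise MemoryError("edge table full; a real implementation would resize")
            self.key_node[i], self.key_token[i] = node, token
            self.child[i] = len(self.counts)
            self.counts.append(0)
            self.size += 1
        return self.child[i]
```

Appending a token is then a get_or_add followed by a count increment on the returned child, with no Python objects allocated per node.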

● ● ●TEXT

t4_spec_patch/
  pyproject.toml         # registers the vllm.general_plugins entry point
  t4_spec_patch/
    __init__.py          # register(): un-gate suffix, swap in my proposer
    suffix_tree.py       # numba-JIT global suffix tree (struct-of-arrays hash)
    proposer.py          # T4SuffixProposer, drop-in for SuffixDecodingProposer

● ● ●PYTHON

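# From the matched suffix node, follow the most-frequent child,
# keeping tokens while the cumulative probability stays above the floor.
# `node` is the matched suffix node; `best_child`, `counts`, and `token_of`
# are the tree's flat arrays; `tmp` is the output buffer for drafted tokens.
prob = 1.0
cur = node
d = 0  # tokens drafted so far (initialization assumed; not shown in the excerpt)
while d < max_tokens:
    child = best_child[cur]
    if child == -1:          # no continuation recorded for this node
        break
    prob *= counts[child] / counts[cur]
    if prob < min_token_prob:
        break
    tmp[d] = token_of[child]
    d += 1
    cur = child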

Every drafted token is still verified by the model's argmax, so correctness is never on the line; a wrong guess just gets rejected. And the headline of the whole exercise: my from-scratch drafter landed level with Arctic's production C++ implementation, 7,826 tok/s against 7,772, both holding the same roughly 12 accepted tokens per pass (and measured at slightly different clocks, so honestly call it a tie). I understood it.

The wall that was not my code

The first time I swept this, every single arm came back at about 6.3k tok/s. Every one. Even the exact reference drafter that had posted 9,052 in an earlier session. When every knob you turn produces the identical number, that is not a configuration you are tuning, it is a ceiling you are hitting, and the ceiling was the card. A free Colab T4 is passively cooled and power-capped around 70 W, and under a sustained run it throttles hard.

So I stopped trusting bare throughput and instrumented the benchmark. Every run now cooled the GPU to idle first, sampled the SM clock, temperature, and the throttle bits once a second during the run, and printed the clock right next to the score. The rule was simple: if two arms were measured at different clocks, you are not allowed to compare them. A slow result was never again blamed on a patch when it was really the silicon sitting at 670 MHz instead of its 1590 MHz boost.
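A minimal sketch of that kind of sampling, built on nvidia-smi's CSV query mode. It is not the instrumentation from the actual runs: the helper names and the 25 MHz tolerance are assumptions, and the query field names are as listed by nvidia-smi --help-query-gpu on the drivers I am aware of, so check them against your driver version.

```python
import subprocess
from typing import Dict

# Field names assumed from nvidia-smi --help-query-gpu; verify on your driver.
QUERY = "clocks.sm,temperature.gpu,clocks_throttle_reasons.active"


def parse_sample(line: str) -> Dict[str, int]:
    """Parse one CSV line such as '670, 74, 0x0000000000000004'."""
    clock, temp, reasons = [field.strip() for field in line.split(",")]
    return {"sm_mhz": int(clock), "temp_c": int(temp), "throttle_bits": int(reasons, 16)}


def sample_gpu() -> Dict[str, int]:
    """Query the first GPU once via nvidia-smi (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.splitlines()[0])


def comparable(clock_a_mhz: int, clock_b_mhz: int, tolerance_mhz: int = 25) -> bool:
    """Treat two arms as comparable only if their SM clocks were close."""
    return abs(clock_a_mhz - clock_b_mhz) <= tolerance_mhz
```

Calling sample_gpu once a second in a background thread during the benchmark, and printing the mean clock next to the score, is enough to apply the comparison rule.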

What the instrumented sweep actually showed

With the clock finally visible, the ranking held up even though throttling suppressed every absolute number:

Arm	What it changes	tok/s	Accept	SM clock
s_2048	max-num-batched-tokens 2048	8,735	12.05	670 MHz
c_capture	CUDA-graph the verify step	8,473	12.21	672 MHz
p_primary	my from-scratch drafter	7,826	12.09	762 MHz
combo	depth 8 + capture + 2048	7,810	-	712 MHz
p_arctic	reference Arctic drafter	7,772	12.17	667 MHz
d_minprob03	min_token_prob 0.3	7,504	12.06	695 MHz
d_depth8	cap draft to depth 8	7,334	4.89	675 MHz
p_fullgraph	Triton full-graph, pad to K	7,241	7.77	767 MHz
combo_fullgraph	stacked full-graph	6,251	-	785 MHz

Two levers genuinely helped. Dropping max-num-batched-tokens to 2048 was the best of the lot: smaller prefill chunks mean fewer mixed prefill-and-decode steps and so fewer decode stalls. CUDA-graphing the verify step was close behind: that forward normally runs eager because the batch is far past vLLM's 128-token capture limit, so capturing it removes a pile of per-launch overhead.

And two clever ideas backfired, for the same underlying reason. Capping the draft to depth 8 cut acceptance from about 12 down to 4.89 and lost throughput. The full-graph patch, which padded every draft to a uniform length so the whole verify step could replay one CUDA graph, also dragged acceptance down and finished near the bottom. On this workload the model genuinely repeats about 12 tokens, so anything that shortens or pads the draft is throwing away real, free, already-accepted tokens. The boring full-length draft wins.

That was the satisfying part. The from-scratch build matched the library, the fancy engine patches lost to a one-line batch-size change, and the true ceiling turned out to be a thermal limit on a free GPU rather than anything in the code. Sometimes the most first-principles thing you can do is build the entire thing yourself, just to earn the right to say the simple version was correct all along.

◆ KEY TAKEAWAY

The instinct on a memory-bound model is to shrink the weights. But on a tiny model under a repetitive greedy workload, the bigger lever by far is taking fewer forward passes, and model-free speculative decoding does exactly that with no draft model and no training. Once the speculator is accepting a dozen tokens per pass, the bandwidth argument for INT4 mostly evaporates and plain FP16 becomes the better base, especially on a Turing card with no Marlin kernel. Profile the regime you are actually in before you reach for the obvious knob.
