Speculative decoding

a method to speed up LLM decoding.

Idea: “draft-and-verify”, using smaller models to generate lookahead tokens (quick explanation from Karpathy)

Intuitively:

we generate a small set of lookahead tokens, typically 2-5, with a smaller speculator (the draft model)
the larger model then “verifies” the input sequence plus the draft tokens, and the rejection sampler replaces any draft tokens that are not accepted

In a sense, we verify the draft tokens in parallel, in a single forward pass of the larger model, instead of decoding them autoregressively one at a time.
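The verify step can be sketched with a toy rejection sampler. This is a minimal illustration, not vLLM's implementation: `speculative_step` and its arguments are hypothetical names, and the per-position probability lists stand in for what the draft and target models would return. Each draft token is accepted with probability min(1, p/q), where p and q are the target and draft probabilities of that token; on the first rejection a replacement is sampled from the normalized residual max(0, p - q), and if every draft token is accepted one extra token is sampled from the target.

```python
import random


def speculative_step(target_probs, draft_probs, draft_tokens, rng):
    """Verify draft tokens against the target model (toy sketch).

    target_probs: len(draft_tokens) + 1 distributions over the vocab,
                  assumed to come from one forward pass of the target model.
    draft_probs:  one distribution per draft token, from the draft model.
    draft_tokens: token ids that were sampled from draft_probs.
    Returns the tokens to append to the sequence.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p = target_probs[i][tok]
        q = draft_probs[i][tok]  # > 0, since tok was sampled from the draft
        if rng.random() < min(1.0, p / q):
            out.append(tok)  # accepted, move on to the next draft token
            continue
        # Rejected: resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pt - qt)
                    for pt, qt in zip(target_probs[i], draft_probs[i])]
        out.append(rng.choices(range(len(residual)), weights=residual)[0])
        return out
    # Every draft token was accepted: take one bonus token from the target.
    bonus = target_probs[-1]
    out.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return out
```

When the draft and target distributions agree, every draft token is kept and the step yields one more token than was drafted. When the target gives a draft token zero probability, the step returns a single replacement token, which is no worse than ordinary decoding.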

A few techniques, such as n-gram matching and EAGLE, are supported in vLLM

via combined tokens/embedding speculators
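For the n-gram technique, no draft model is needed: draft tokens are copied from earlier in the sequence wherever the most recent tokens have appeared before. The sketch below shows that idea only. `ngram_propose`, `ngram_size`, and `num_draft` are hypothetical names, and the matching policy (most recent earlier match wins) is an assumption, not a description of vLLM's code.

```python
def ngram_propose(token_ids, ngram_size=2, num_draft=3):
    """Propose draft tokens by n-gram lookup in the existing sequence.

    Finds the most recent earlier occurrence of the last `ngram_size`
    tokens and returns up to `num_draft` tokens that followed it.
    Returns an empty list when there is no earlier match.
    """
    if len(token_ids) <= ngram_size:
        return []
    suffix = token_ids[-ngram_size:]
    # Scan right to left, skipping the suffix's own position.
    for start in range(len(token_ids) - ngram_size - 1, -1, -1):
        if token_ids[start:start + ngram_size] == suffix:
            end = start + ngram_size
            return token_ids[end:end + num_draft]
    return []


# The sequence ends in (1, 2), which also appears at the start,
# so the tokens that followed it there become the draft.
draft = ngram_propose([1, 2, 3, 4, 1, 2])
```

The proposed tokens would then go through the same verification step as tokens from a draft model, so a bad guess costs little beyond the wasted lookahead.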
