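The idea is for an inference engine to run prefill and decode on separate nodes, so that each pool, and the ratio of prefill to decode nodes, can be scaled independently. Prefill is compute-bound while decode is memory-bandwidth-bound, so the two rarely want the same node count. Think of DeepSeek R1.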
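A rough way to see why the ratio matters is a back-of-the-envelope sizing function. This is only a sketch: `Workload`, `size_pools`, and every throughput number below are hypothetical and not taken from any real engine or from DeepSeek R1. It ignores batching effects, KV-cache transfer cost, and latency targets.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    requests_per_s: float  # arrival rate
    prompt_tokens: int     # mean prompt length (prefill work per request)
    output_tokens: int     # mean generated length (decode work per request)


def size_pools(w, prefill_tok_per_s, decode_tok_per_s):
    """Return (prefill_nodes, decode_nodes) needed to keep up with the workload.

    The per-node throughputs are assumed inputs, measured separately
    for a prefill-only node and a decode-only node.
    """
    prefill_demand = w.requests_per_s * w.prompt_tokens  # prompt tokens/s
    decode_demand = w.requests_per_s * w.output_tokens   # generated tokens/s
    prefill_nodes = max(1, math.ceil(prefill_demand / prefill_tok_per_s))
    decode_nodes = max(1, math.ceil(decode_demand / decode_tok_per_s))
    return prefill_nodes, decode_nodes


# Made-up per-node throughputs: prefill handles many more tokens/s than decode.
chat = Workload(requests_per_s=10, prompt_tokens=1000, output_tokens=500)
summarize = Workload(requests_per_s=10, prompt_tokens=8000, output_tokens=100)

print(size_pools(chat, 20000, 1000))       # (1, 5)
print(size_pools(summarize, 20000, 1000))  # (4, 1)
```

Same request rate, opposite ratios: a chat-style load wants mostly decode nodes, a long-prompt load wants mostly prefill nodes. With colocated prefill and decode, both would be forced onto one node count.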
See also: distributed inference for LLMs, and scaling in hyperscalers.