Apple has partnered with Nvidia to accelerate large language model inference using Apple's open-source Recurrent Drafter (ReDrafter) technique. The collaboration addresses the computational cost of auto-regressive token generation, which is crucial for improving efficiency in real-time applications. ReDrafter has been shown to generate up to 2.7x more tokens per second than a conventional auto-regressive baseline, potentially reducing user-perceived latency while requiring fewer GPUs.
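ReDrafter builds on speculative decoding: a small recurrent draft head proposes several candidate tokens, the full model verifies them, and every accepted token comes "for free" relative to one ordinary forward step. The toy Python sketch below illustrates only that draft-and-verify loop; the models, acceptance rule, and function names are hypothetical stand-ins, not Apple's or Nvidia's implementation.

```python
import random

# Toy vocabulary and a deterministic stand-in for the expensive target model.
# Everything here is a hypothetical illustration of draft-and-verify speculative
# decoding, not ReDrafter's actual code.
VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def target_next_token(context):
    # Placeholder for one expensive LLM forward pass.
    return VOCAB[(len(context) * 3) % len(VOCAB)]

def draft_tokens(context, k=4):
    # Placeholder for a small, fast draft head (an RNN in ReDrafter) proposing k tokens.
    guesses, ctx = [], list(context)
    for _ in range(k):
        # Imperfect drafter: usually agrees with the target model, sometimes not.
        tok = target_next_token(ctx) if random.random() < 0.8 else random.choice(VOCAB)
        guesses.append(tok)
        ctx.append(tok)
    return guesses

def speculative_decode(prompt, steps=8):
    context = list(prompt)
    for _ in range(steps):
        proposal = draft_tokens(context)
        # Verify the proposal; in real systems this is a single batched forward pass
        # over all drafted positions rather than the sequential calls used here.
        accepted, ctx = [], list(context)
        for tok in proposal:
            expected = target_next_token(ctx)
            if tok == expected:
                accepted.append(tok)
                ctx.append(tok)
            else:
                accepted.append(expected)  # keep the target model's correction, stop accepting
                break
        else:
            # Entire draft accepted: take one extra token from the target model.
            accepted.append(target_next_token(ctx))
        context.extend(accepted)
    return context

print(" ".join(speculative_decode(["the"])))
```

Because several drafted tokens can be accepted per verification step, the target model runs far fewer sequential passes per generated token, which is the source of the reported throughput gain.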
Integrated into Nvidia’s TensorRT-LLM framework, ReDrafter now enables faster LLM inference in production environments. While the work currently targets Nvidia hardware, future extensions to GPUs from rival vendors such as AMD or Intel have not been ruled out. The collaboration improves machine-learning inference efficiency and opens the door to more advanced models and further performance gains for LLM workloads on Nvidia GPUs.
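For context, recent TensorRT-LLM releases expose a high-level Python LLM API for running inference; the sketch below shows only that generic entry point. The model name is a placeholder, and the ReDrafter-specific engine build and speculative-decoding configuration are intentionally omitted, since those options are release-dependent and covered in Nvidia's blog.

```python
from tensorrt_llm import LLM, SamplingParams

# Generic TensorRT-LLM inference through the high-level LLM API.
# Placeholder model name; ReDrafter-specific build and speculative-decoding
# settings are not shown here.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

outputs = llm.generate(
    ["Summarize speculative decoding in one sentence."],
    sampling_params,
)

for output in outputs:
    print(output.outputs[0].text)
```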
For further details on this collaboration, check out the Nvidia Developer Technical Blog.