1. What makes these people think this is worth writing up as a paper? 2. I'd guess the number of transactions this will involve is enormous
Aditya Tomar
14:07 Aug 20
Can we break the memory wall for LLM inference via KV cache rematerialization? 🚨

Introducing XQuant, which leverages underutilized compute units to eliminate the memory bottleneck for LLM inference!

• 10–12.5x memory savings vs. FP16
• Near-zero accuracy loss
• Beats state-of-the-art KV quantization 🔥

Key insights:
1. KV cache = bottleneck → grows linearly with context length + batch size.
2. Compute >> memory → GPUs offer FLOPs orders of magnitude faster than memory bandwidth.
3. Key idea → don't store KV, just recompute it. 🧠

Since LLM inference is typically memory-bandwidth bound, compute units are often idle and underutilized, so we can put this spare compute to work without extra overhead. GPU hardware trends show that compute capability is scaling much faster than memory bandwidth, so trading additional computation for fewer memory operations can speed up LLM inference. The KV cache grows linearly with sequence length and batch size and accounts for the majority of memory operations during inference. If we can spend extra computation to avoid loading and storing the KV cache, we can accelerate inference! XQuant exploits this hardware trend. 🧵 [1/7]

Paper:

Joint work with: @coleman_hooper1, @mjlee_official from @FuriosaAI, @HaochengXiUCB, @rish2k1, Wonjun Kang from @FuriosaAI, @lucamanolache0, Michael Mahoney, @KurtKeutzer, @amir__gholami
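The core trade described above (cache less, recompute more) can be sketched in a few lines of numpy. This is a minimal illustration under my own assumptions, not the paper's actual method: I store a crudely int8-quantized copy of the layer input X and rematerialize K and V through the projection weights at attention time, instead of caching K and V in FP16. All names (`W_K`, `W_V`, `quantize_int8`) are hypothetical, and XQuant's real quantization scheme is more sophisticated than this per-tensor toy.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_tokens = 64, 16

# Hypothetical per-layer projection weights (illustrative shapes, not from the paper).
W_K = rng.standard_normal((d_model, d_model)).astype(np.float32)
W_V = rng.standard_normal((d_model, d_model)).astype(np.float32)

def quantize_int8(x):
    """Toy per-tensor symmetric int8 quantization (for illustration only)."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Layer input activations for the tokens generated so far.
X = rng.standard_normal((n_tokens, d_model)).astype(np.float32)

# Standard KV cache: store both K and V (two FP16-sized tensors per layer).
K_cached, V_cached = X @ W_K, X @ W_V

# Rematerialization: store only quantized X (one int8 tensor), then spend
# spare FLOPs to recompute K and V when attention needs them.
X_q, s = quantize_int8(X)
X_hat = dequantize(X_q, s)
K_remat, V_remat = X_hat @ W_K, X_hat @ W_V

# Memory footprint: one int8 tensor vs. two fp16 tensors -> 4x saving in this
# toy setup, before any of the further optimizations the thread alludes to.
bytes_kv_fp16 = (K_cached.size + V_cached.size) * 2  # 2 bytes per fp16 value
bytes_x_int8 = X_q.size                              # 1 byte per int8 value
print(bytes_kv_fp16 / bytes_x_int8)  # 4.0
```

Even this naive version shows the shape of the win: halve the cache by storing one tensor instead of two, then quantize it, at the cost of two extra matmuls per layer that the otherwise-idle compute units can absorb.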