Yuandong Tian | Efficient Inference of LLMs with Long Context Support

Published on Dec 8, 2023

Sponsored by Evolution AI: https://www.evolution.ai

Abstract: While Large Language Models (LLMs) demonstrate impressive performance across many applications, how to perform inference with long contexts remains an open problem. There are two issues. First, current pre-trained LLMs may experience perplexity blow-up when the input length goes beyond the pre-trained window; second, inference with long contexts is both time-consuming and memory-intensive due to the growing KV cache. In this talk, we introduce a series of works that address both issues by interpolating positional encodings and by understanding and leveraging the sparse attention patterns of the underlying Transformer architecture. Our works can extend context windows by up to 8x with fewer than 1,000 fine-tuning samples (Positional Interpolation) and can substantially speed up inference with context lengths of up to 4 million tokens (StreamingLLM and H2O).
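To make the first idea concrete, here is a minimal sketch of Positional Interpolation, assuming a RoPE-based model whose pre-trained window of 2048 tokens is extended 4x to 8192; the function and variable names are illustrative, not taken from the paper's code.

```python
import torch

def rope_frequencies(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE inverse frequencies for a given head dimension."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def rope_angles(seq_len: int, head_dim: int, scale: float = 1.0) -> torch.Tensor:
    """Rotation angles for positions 0..seq_len-1.

    With scale < 1, positions are linearly interpolated so that the
    extended sequence still maps into the pre-trained position range,
    which is the core idea of Positional Interpolation.
    """
    inv_freq = rope_frequencies(head_dim)
    positions = torch.arange(seq_len).float() * scale  # interpolate positions
    return torch.outer(positions, inv_freq)            # (seq_len, head_dim/2)

pretrained_window, extended_window = 2048, 8192
scale = pretrained_window / extended_window            # 0.25 for a 4x extension
angles = rope_angles(extended_window, head_dim=128, scale=scale)
# Scaled positions stay inside the range the model saw during pre-training,
# avoiding the out-of-distribution positions that cause perplexity blow-up.
assert angles[:, 0].max() < pretrained_window
```

For the second idea, here is a minimal sketch of StreamingLLM-style KV cache eviction, assuming the "attention sink" observation that the first few tokens should be retained alongside a sliding window of recent tokens; the class and parameter names are hypothetical.

```python
from collections import deque

class SinkCache:
    """Bounded KV cache: keep the first num_sinks entries plus a recent window."""

    def __init__(self, num_sinks: int = 4, window: int = 1024):
        self.num_sinks = num_sinks
        self.sinks = []                      # KV entries for the first tokens
        self.recent = deque(maxlen=window)   # sliding window of recent KVs

    def append(self, kv):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv)            # always retain the sink tokens
        else:
            self.recent.append(kv)           # deque evicts the oldest entry

    def keys_values(self):
        return self.sinks + list(self.recent)
```

Because the cache never holds more than num_sinks + window entries, memory stays constant no matter how many tokens have streamed through, which is what makes the multi-million-token context lengths mentioned above feasible.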

Bio: Yuandong Tian is a Research Scientist and Senior Manager at Meta AI Research (FAIR), working on long context, fast inference, and the understanding of Large Language Models (LLMs), as well as optimization and reinforcement learning. He was the project lead for story generation (2023) and the OpenGo project (2018). He is the first-author recipient of a 2021 ICML Outstanding Paper Honorable Mention and a 2013 ICCV Marr Prize Honorable Mention, and also received the 2022 CGO Distinguished Paper Award. Prior to that, he worked on the Google self-driving car team from 2013 to 2014 and received his Ph.D. from the Robotics Institute at Carnegie Mellon University in 2013. He has served as an area chair for NeurIPS, ICML, AAAI, and AISTATS.

