Jing Yu Koh | Grounding Language Models to Images for Multimodal Generation

Published on Mar 23, 2023

Sponsored by Evolution AI: https://www.evolution.ai

Abstract: Can we leverage the abilities of text-only language models for processing and generating interleaved image-and-text data? In this talk, I present an efficient approach for adapting pretrained language models to multimodal tasks. By keeping the language model frozen and finetuning only input and output linear layers for cross-modality interaction, we are able to leverage the abilities language models learn from large-scale pretraining, such as in-context learning and free-form text generation. Experimental results and qualitative examples show the capabilities of our model for generating compelling multimodal discourse, as well as several zero-shot and few-shot abilities. Our approach works with any off-the-shelf language model and paves the way towards an effective, general solution for leveraging pretrained language models in visually grounded settings.
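To make the adaptation idea concrete, the sketch below shows one way a frozen causal language model could be paired with small trainable linear layers on the input side (mapping visual features into the LM embedding space) and the output side (mapping LM hidden states into a retrieval space). This is a minimal illustration under assumptions; the model name, feature dimensions, and class names are placeholders, not the exact configuration used in the talk.

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class FrozenLMWithVisualAdapters(nn.Module):
    # Illustrative sketch: a frozen text-only LM with trainable linear
    # projections for cross-modality interaction. Names and dimensions
    # are assumptions for the example, not the talk's implementation.
    def __init__(self, lm_name="facebook/opt-1.3b", visual_dim=768, retrieval_dim=256):
        super().__init__()
        # Pretrained language model, kept entirely frozen.
        self.lm = AutoModelForCausalLM.from_pretrained(lm_name)
        for p in self.lm.parameters():
            p.requires_grad = False

        hidden = self.lm.config.hidden_size
        # Trainable input projection: maps visual-encoder features
        # (e.g. from an image encoder) into the LM's embedding space.
        self.visual_to_lm = nn.Linear(visual_dim, hidden)
        # Trainable output projection: maps LM hidden states into a
        # joint space usable for image-text retrieval / grounding.
        self.lm_to_retrieval = nn.Linear(hidden, retrieval_dim)

    def forward(self, input_ids, visual_features):
        # Embed text tokens with the frozen LM's embedding table.
        text_embeds = self.lm.get_input_embeddings()(input_ids)
        # Project image features and prepend them as "visual tokens".
        visual_embeds = self.visual_to_lm(visual_features)
        inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)
        out = self.lm(inputs_embeds=inputs_embeds, output_hidden_states=True)
        # Final-position hidden state, projected for retrieval.
        retrieval_embed = self.lm_to_retrieval(out.hidden_states[-1][:, -1])
        return out.logits, retrieval_embed

During training, only visual_to_lm and lm_to_retrieval receive gradients, which is what keeps the adaptation lightweight relative to finetuning the full language model.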

Bio: Jing Yu Koh is a first-year machine learning PhD student at Carnegie Mellon University, where he is advised by Daniel Fried and Ruslan Salakhutdinov. He works on grounded language understanding, and his research aims to build machine learning models that can fuse different modalities (text, images, videos, and more) to achieve strong performance on complex reasoning and generation tasks. Prior to joining CMU, he was a research engineer at Google Research, where he worked on text-to-image generation and multimodal learning.
