The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we consider an alternative approach, creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this proposed dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
A high-quality dataset for video instruction tuning is crucial for developing effective video-language models. We identify a key factor in building such datasets: ensuring richness and diversity in both the video content and its language annotations. We conduct a comprehensive survey of existing video benchmarks, spanning various public video captioning and question-answering datasets, and identify ten unique video sources that contribute to over 40 video-language benchmarks. From each source, we select videos that exhibit significant temporal dynamics. To maintain diversity in the annotations, we establish a pipeline capable of generating detailed captions for videos of any length. Additionally, we define 16 types of questions that guide GPT-4o in creating question-answer pairs to assess the perceptual and reasoning skills of video-language models.
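To make the QA-generation step concrete, the sketch below shows how a detailed caption could be turned into a question-answer pair with GPT-4o. The prompt wording, the two example question types, and the helper name `generate_qa_pairs` are illustrative assumptions, not the released pipeline (which defines 16 question types and operates on our detailed video captions).

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Two example question types used to steer generation; the actual pipeline
# defines 16 types. The taxonomy and prompts here are illustrative assumptions.
QUESTION_TYPES = ["temporal order of events", "causal reasoning about actions"]

def generate_qa_pairs(detailed_caption: str, question_type: str) -> str:
    """Ask GPT-4o to write one QA pair of the given type from a video caption."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You write question-answer pairs about a video, "
                        "using only information stated in its caption."},
            {"role": "user",
             "content": f"Question type: {question_type}\n"
                        f"Video caption:\n{detailed_caption}\n\n"
                        "Return one question and its answer."},
        ],
    )
    return response.choices[0].message.content

# Example usage with a toy caption.
caption = "A person fills a kettle, boils water, and pours it over coffee grounds."
for qtype in QUESTION_TYPES:
    print(generate_qa_pairs(caption, qtype))
```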
We provide a comparison of high-quality instruction-following video-language datasets, with a focus on synthetic data created with strong AI models, as shown in Table 1.
| Dataset | Text Annotated by | Video Source | #Video | Total Video Length | Average FPS | #Caption | #OE QA | #MC QA |
|---|---|---|---|---|---|---|---|---|
| LLaVA-Hound | GPT-4V | ★ | 900K | 3K hr | 0.008 | 900K | 900K | 0 |
| ShareGPT4Video | GPT-4V | ◾ | 40K | 0.2K hr | 0.15 | 40K | 0 | 0 |
| LLaVA-Video-178K | GPT-4o | ✪ | 178K | 2K hr | 1 | 178K | 960K | 196K |
Following the classic SlowFast idea in video representations, we develop \(\text{LLaVA-Video}_{~\mathtt{SlowFast}}\) to balance the number of frames against the number of visual tokens, within the budget imposed by the LLM's limited context window and the GPU memory available for video representation.
Specifically, we split the frames into two groups based on a stride \(s\): every \(s\)-th frame is uniformly selected to form the slow frame group, and the remaining frames form the fast frame group. Note that the special case \(s=1\) yields a single group, reducing the SlowFast representation to the original simple representation. We apply a different pooling rate to each group using the PyTorch function \(\mathtt{avg\_pool2d}()\): \(p \times p\) pooling for slow frames and \(2p \times 2p\) pooling for fast frames.
To summarize, we parameterize the video representation configuration as \(\mathcal{V} = (T, M, s, p)\).
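The minimal PyTorch sketch below illustrates the SlowFast grouping and pooling described above; it is not the released implementation. The tensor layout, the default values \(s=3\) and \(p=2\), and the function name `slowfast_pool` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def slowfast_pool(frame_tokens: torch.Tensor, s: int = 3, p: int = 2) -> torch.Tensor:
    """Sketch of SlowFast token pooling.

    frame_tokens: (T, H, W, C) visual tokens for T frames on an H x W grid.
    s: stride -- every s-th frame joins the slow group, the rest the fast group.
    p: pooling rate -- slow frames get p x p average pooling, fast frames 2p x 2p.
    Returns a flat (N, C) tensor of pooled tokens in the original frame order.
    """
    T, H, W, C = frame_tokens.shape
    pooled = []
    for t in range(T):
        # Rearrange to the (1, C, H, W) layout expected by avg_pool2d.
        x = frame_tokens[t].permute(2, 0, 1).unsqueeze(0)
        k = p if t % s == 0 else 2 * p          # slow vs. fast pooling rate
        x = F.avg_pool2d(x, kernel_size=k, stride=k)
        pooled.append(x.squeeze(0).flatten(1).T)  # (H' * W', C)
    return torch.cat(pooled, dim=0)

# Example: 32 frames of 24x24 tokens with hidden size 1024, s=3, p=2.
tokens = torch.randn(32, 24, 24, 1024)
print(slowfast_pool(tokens, s=3, p=2).shape)
```

With \(s=1\), every frame falls into the slow group and the sketch reduces to uniform \(p \times p\) pooling, matching the special case noted above.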
We fine-tune LLaVA-OneVision (SI) on a joint dataset of video and image data. Specifically, we add video data from the LLaVA-Video-178K dataset and four public datasets: ActivityNet-QA, NExT-QA, PerceptionTest, and LLaVA-Hound-255K, focusing on videos shorter than three minutes. Together, these datasets contribute 1.6 million video-language samples, comprising 193,510 video descriptions, 1,240,801 open-ended questions, and 215,625 multiple-choice questions. Remarkably, 92.2% of the video descriptions, 77.4% of the open-ended questions, and 90.9% of the multiple-choice questions are newly annotated. Additionally, we use the 1.1 million image-language pairs from LLaVA-OneVision.
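For reference, the video-language sample counts quoted above can be tallied as follows; this is a bookkeeping sketch that simply restates the numbers from this paragraph, not a training configuration.

```python
# Video-language samples in the fine-tuning mixture, as listed above.
video_samples = {
    "captions": 193_510,
    "open_ended_qa": 1_240_801,
    "multiple_choice_qa": 215_625,
}
image_language_pairs = 1_100_000  # image data reused from LLaVA-OneVision

total_video = sum(video_samples.values())
print(f"video-language samples: {total_video:,}")  # 1,649,936, i.e. the "1.6 million" above
print(f"image-language pairs:   {image_language_pairs:,}")
```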
VideoDC and Dream-1K are captioning benchmarks; ActNet-QA and VideoChatGPT are open-ended Q&A; EgoSchema, MLVU, MVBench, NExT-QA, PerceptionTest, LongVideoBench, and VideoMME are multiple-choice Q&A.

| Model | VideoDC (test) | Dream-1K (test) | ActNet-QA (test) | VideoChatGPT (test) | EgoSchema (test) | MLVU (m-avg) | MVBench (test) | NExT-QA (mc) | PerceptionTest (val) | LongVideoBench (val) | VideoMME (wo/w-subs) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| *Proprietary models* | | | | | | | | | | | |
| GPT-4V | 4.06 | 34.4 | 57.0 | 4.00 | - | 49.2 | 43.5 | - | - | 61.3 | 59.9/63.3 |
| GPT-4o | - | 39.2 | - | - | - | 64.6 | - | - | - | 66.7 | 71.9/77.2 |
| Gemini-1.5-Flash | - | 34.8 | 55.3 | - | 65.7 | - | - | - | - | 61.6 | 70.3/75.0 |
| Gemini-1.5-Pro | - | 36.2 | 57.5 | - | 72.2 | - | - | - | - | 64.0 | 75.0/81.3 |
| *Open-source models* | | | | | | | | | | | |
| VILA-40B | 3.37 | 33.2 | 58.0 | 3.36 | 58.0 | - | - | 67.9 | 54.0 | - | 60.1/61.1 |
| PLLaVA-34B | - | 28.2 | 60.9 | 3.48 | - | - | 58.1 | - | - | 53.2 | - |
| LongVA-7B | 3.14 | - | 50.0 | 3.20 | - | 56.3 | - | 68.3 | - | - | 52.6/54.3 |
| IXC-2.5-7B | - | - | 52.8 | 3.46 | - | 37.3 | 69.1 | 71.0 | 34.4 | - | 55.8/58.8 |
| LLaVA-OV-7B | 3.75 | 31.7 | 56.6 | 3.51 | 60.1 | 64.7 | 56.7 | 79.4* | 57.1 | 56.5 | 58.2/61.5 |
| VideoLLaMA2-72B | - | 27.1 | 55.2 | 3.16 | 63.9 | 61.2 | 62.0 | - | - | - | 61.4/63.1 |
| LLaVA-OV-72B | 3.60 | 33.2 | 62.3 | 3.62 | 62.0 | 68.0 | 59.4 | 80.2* | 66.9 | 61.3 | 66.2/69.5 |
| LLaVA-Video-7B | 3.66 | 32.5 | 56.5* | 3.52 | 57.3 | 70.8 | 58.6 | 83.2* | 67.9* | 58.2 | 63.3/69.7 |
| LLaVA-Video-72B | 3.73 | 34.0 | 63.4* | 3.62 | 65.6 | 74.4 | 64.1 | 85.4* | 74.3* | 61.9 | 70.5/76.9 |
This study introduces LLaVA-Video-178K, a high-quality synthetic dataset for video-language instruction-following. It stands out for its dense frame sampling of longer, untrimmed videos and its coverage of diverse tasks, including detailed captioning, open-ended QA, and multiple-choice QA. By training on LLaVA-Video-178K jointly with existing visual instruction tuning data, we develop a new model family, LLaVA-Video, whose video representation is designed to use GPU resources effectively, allowing more frames to be included during training. The experimental results demonstrate the effectiveness of the proposed synthetic dataset, and LLaVA-Video models achieve strong performance across a wide range of video benchmarks.
We provide interactive demos to showcase the capabilities of LLaVA-Video for realistic multimodal interactions.
LLaVA-Video teaches me how to download "TikTok" on my iPhone, step by step.
LLaVA-Video helps me find the healthy drink in the living room and describes the living room.
@misc{zhang2024videoinstructiontuningsynthetic,
title={Video Instruction Tuning With Synthetic Data},
author={Yuanhan Zhang and Jinming Wu and Wei Li and Bo Li and Zejun Ma and Ziwei Liu and Chunyuan Li},
year={2024},
eprint={2410.02713},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.02713},
}