Video Instruction Tuning with Synthetic Data


Yuanhan Zhang♡, Jinming Wu♡, Wei Li, Bo Li♡, Zejun Ma
Ziwei Liu*, Chunyuan Li*
ByteDance, NTU S-Lab, BUPT
♡ Work done in collaboration with ByteDance; * Co-senior authors

Abstract

The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we consider an alternative approach, creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this proposed dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.

Click on the sections below to learn more about this project:

  1. §Video Instruction-Following Data Synthesis
  2. §Video Representation
  3. §Benchmark Performance

Video Instruction-Following Data Synthesis

A high-quality dataset for video instruction tuning is crucial for developing effective video-language models. We identify a key factor in building such datasets: ensuring richness and diversity in both video content and its language annotations. We conduct a comprehensive survey of existing video benchmarks, covering a wide range of public video captioning and question-answering datasets, and identify ten unique video sources that contribute to over 40 video-language benchmarks. From each source, we select videos that exhibit significant temporal dynamics. To maintain diversity in the annotations, we establish a pipeline capable of generating detailed captions for videos of any length. Additionally, we define 16 types of questions that guide GPT-4o in creating question-answer pairs to assess the perceptual and reasoning skills of video-language models.

Video Sources

We observe that although different video-language datasets focus on various video understanding tasks, most are sourced from ten main video sources, which offer a wide range of video data from different websites, viewpoints, and domains. The relationship between these ten selected video sources and other existing datasets is shown in the figure below. We select dynamic videos from these sources; the selection logic is detailed in the paper.

Figure 1: The relationship between 10 video sources we have utilized and other existing video-language datasets.

Automated Generation for Video Detail Description

For selected videos, we use GPT-4o to systematically describe their content. We start by sampling video frames at one frame per second (fps). However, due to the input size constraints of GPT-4o, we cannot use all sampled frames at once. Instead, we describe the videos sequentially, as shown in the figure below. We create descriptions at three distinct levels, detailed below.

Figure 2: A three-level caption generation pipeline, with each level produced via a recurrent approach. Note that t is the index of the time interval at its own level, and T is the index of the last time interval. (a) To generate the caption for time interval t at level-1, we condition on the frames in this interval, the caption for time interval t-1, and the most recent summary at level-2, if available. (b) To generate the summary for time interval t at level-2, we condition on the previous summary at level-2 and the captions from the three most recent time intervals at level-1. (c) To generate the overall caption at the last time interval T at level-3, we condition on the most recent summary at level-2 and the current caption from level-1.
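To make the recurrent structure concrete, below is a minimal sketch of the three-level loop in Figure 2. The helper describe_with_gpt4o is only a stub standing in for the actual GPT-4o call, and the three-interval summary cadence is an assumption for illustration; this is not the released pipeline.

```python
# Minimal sketch of the three-level recurrent captioning loop (illustration only).
# `describe_with_gpt4o` is a stub standing in for the real GPT-4o request.
def describe_with_gpt4o(**context) -> str:
    """Placeholder for a GPT-4o call built from frames and/or text context."""
    return "caption(" + ", ".join(k for k, v in context.items() if v) + ")"

def caption_video(intervals: list) -> str:
    """intervals: one batch of sampled frames per fixed-length time interval."""
    level1, level2 = [], []
    for t, frames in enumerate(intervals):
        # Level 1: caption the current interval, conditioned on the previous
        # level-1 caption and the most recent level-2 summary (if any).
        level1.append(describe_with_gpt4o(
            frames=frames,
            prev_caption=level1[-1] if level1 else None,
            recent_summary=level2[-1] if level2 else None,
        ))
        # Level 2: after every third interval (assumed cadence), summarize the
        # three most recent level-1 captions with the previous level-2 summary.
        if (t + 1) % 3 == 0:
            level2.append(describe_with_gpt4o(
                recent_captions=level1[-3:],
                prev_summary=level2[-1] if level2 else None,
            ))
    # Level 3: one overall description from the most recent level-2 summary
    # and the current (last) level-1 caption.
    return describe_with_gpt4o(
        recent_summary=level2[-1] if level2 else None,
        current_caption=level1[-1],
    )

print(caption_video([["f0", "f1"], ["f2", "f3"], ["f4"], ["f5"]]))
```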

Automated Generation for Video Question Answering

In addition to detailed video descriptions, our dataset includes a variety of question-answer pairs designed for complex interactions. This setup improves the video understanding model's ability to handle real-life queries. We refer to public video question-answering benchmarks to organize these questions into 16 specific categories, as shown in Figure 3. Given a detailed video description, we use GPT-4o to generate at most one question-answer pair for each type of question. Please refer to the paper for more details of the question types and the generation process.
Figure 3: Question types for video question answering in data creation. For each type, we provide its name and an example question.
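As a rough illustration of this step, the sketch below prompts GPT-4o once per question type using the OpenAI Python SDK. The QUESTION_TYPES entries and the prompt wording are placeholders, not the exact 16 categories or prompts defined in the paper.

```python
# Hedged sketch of per-type QA generation with GPT-4o (OpenAI Python SDK >= 1.0).
# QUESTION_TYPES and the prompt text are illustrative placeholders only.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

QUESTION_TYPES = [
    "temporal order of events",
    "causal reasoning",
    "object and attribute recognition",
    # ... the full taxonomy in the paper contains 16 types
]

def generate_qa(detailed_caption: str, question_type: str) -> str:
    """Ask GPT-4o for at most one question-answer pair of the given type."""
    prompt = (
        "Here is a detailed description of a video:\n"
        f"{detailed_caption}\n\n"
        f"Write at most one question-answer pair that tests '{question_type}'. "
        "If the description cannot support this type, reply with 'N/A'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

qa_pairs = {qt: generate_qa("A chef dices onions, then sautés them ...", qt)
            for qt in QUESTION_TYPES}
```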

Dataset Statistics

We carefully select from our collected data sources to form a balanced and comprehensive collection, resulting in a total of 178K videos and 1.3M instruction-following samples. This includes 178K captions, 960K open-ended QAs, and 196K multiple-choice QAs.
Figure 4: Distribution of data across different datasets and question types (Caption, Open-ended, and Multi-Choice).


Figure 5: One example to illustrate the video instruction-following data.

Dataset Comparison

We provide a comparison of high-quality instruction-following video-language datasets, with a focus on synthetic data created with strong AI models, as shown in Table 1.

  1. A broad collection of dynamic videos. In terms of video sources, although LLaVA-Hound contains the largest number of videos, 44% of its video data are sourced from WebVid, where most videos are static. ShareGPT4Video includes 30% of its videos from Pexels, Pixabay, and Mixkit, which are aesthetically pleasing but also mostly static. Additionally, the majority of its videos come from Panda-70M, which are short clips cut from longer videos, suggesting simpler plots. In contrast, we carefully select video sources that offer dynamic, untrimmed videos with complex plots, which are crucial for developing a powerful video understanding model.
  2. High frames per second. Regarding frame sampling for language annotations, the proposed dataset uses 1 FPS, while other datasets use much lower FPS (see the sampling sketch after this list). LLaVA-Hound uniformly samples 10 frames from videos of any length; its average FPS is 0.008, which may miss fine details. ShareGPT4Video picks key frames using CLIP based on frame uniqueness; this method may also miss subtle changes in the video because CLIP embeddings do not capture fine-grained dynamics well. Our method samples at FPS=1 without key-frame selection algorithms, ensuring that detailed temporal information can be expressed in annotations with high coverage.
  3. Diverse tasks. The proposed dataset covers three common task types, including captioning, free-form QA, and closed-form QA, while existing datasets only cover a subset. Meanwhile, the quality and number of samples in our dataset are higher.
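To make the frame-sampling difference in point 2 concrete, here is a small sketch (my own illustration, not released code) contrasting uniform 10-frame sampling with 1-fps sampling:

```python
# Contrast uniform 10-frame sampling with 1-fps sampling (illustration only).
def uniform_indices(num_frames: int, k: int = 10) -> list[int]:
    """Pick k frame indices spread evenly across the whole video."""
    if num_frames <= k:
        return list(range(num_frames))
    step = num_frames / k
    return [int(i * step) for i in range(k)]

def one_fps_indices(num_frames: int, native_fps: float) -> list[int]:
    """Pick roughly one frame per second of video."""
    stride = max(1, round(native_fps))
    return list(range(0, num_frames, stride))

# A 3-minute clip recorded at 30 fps has 5400 frames:
print(len(uniform_indices(5400)))        # 10 frames total (~0.06 fps of annotation context)
print(len(one_fps_indices(5400, 30.0)))  # 180 frames (~1 fps)
```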
| Dataset | Text | Video Source | #Video | Total Video Length | Average FPS | #Caption | #OE QA | #MC QA |
|---|---|---|---|---|---|---|---|---|
| LLaVA-Hound | GPT-4V | (a) | 900K | 3Khr | 0.008 | 900K | 900K | 0 |
| ShareGPT4Video | GPT-4V | (b) | 40K | 0.2Khr | 0.15 | 40K | 0 | 0 |
| LLaVA-Video-178K | GPT-4o | (c) | 178K | 2Khr | 1 | 178K | 960K | 196K |

Table 1: Comparison of LLaVA-Video-178K and other video-language datasets. Average FPS represents the average number of frames per second used to prompt GPT-4o/GPT-4V for annotation. Video sources: (a) VIDAL, WebVid, ActivityNet; (b) Panda-70M, Pexels, Pixabay, Mixkit, BDD100K, Ego4d; (c) HD-VILA-100M, Kinetics-700M, Ego4D, VidOR, InternVid, YouCook2, ActivityNet, Sth-sthv2, VIDAL, Charades.

Video Representation

Following the classic SlowFast idea in video representations, we develop \(\text{LLaVA-Video}_{~\mathtt{SlowFast}}\) to balance the number of frames against the number of visual tokens per frame, within the budget of the LLM's limited context window and the available GPU memory.

Specifically, we categorize the frames into two groups based on a stride \(s\): every \(s\)-th frame is uniformly selected to form the slow frame group, and the remaining frames form the fast frame group. Note that the special case \(s=1\) yields only one group, reducing the SlowFast representation to the original simple representation. For each group, we apply a different pooling rate using the PyTorch function \(\mathtt{avg\_pool2d}()\): \(p \times p\) pooling for slow frames and \(2p \times 2p\) pooling for fast frames.

To summarize, we parameterize the video representation configuration as \(\mathcal{V} = (T, M, s, p)\), where \(T\) is the number of sampled frames and \(M\) is the number of visual tokens per frame before pooling.
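A minimal sketch of this grouping-and-pooling scheme is given below; the tensor shapes, the SigLIP-like 24×24 token grid, and the exact pooling call are assumptions for illustration rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def slowfast_pool(frame_feats: torch.Tensor, s: int = 3, p: int = 2) -> torch.Tensor:
    """SlowFast-style pooling over per-frame visual tokens (illustration only).

    frame_feats: (T, C, H, W) grid of visual tokens per frame.
    s: every s-th frame forms the slow group; the rest are fast frames.
    p: slow frames get p x p average pooling, fast frames 2p x 2p.
    Returns a flat (num_tokens, C) sequence of visual tokens.
    """
    T = frame_feats.shape[0]
    slow_mask = torch.zeros(T, dtype=torch.bool)
    slow_mask[::s] = True  # special case s=1: every frame is "slow", one group

    tokens = []
    for t in range(T):
        rate = p if slow_mask[t] else 2 * p
        pooled = F.avg_pool2d(frame_feats[t:t + 1], kernel_size=rate)  # (1, C, H/rate, W/rate)
        tokens.append(pooled.flatten(2).transpose(1, 2).squeeze(0))    # (H*W/rate^2, C)
    return torch.cat(tokens, dim=0)

# Toy usage: 8 frames of 1152-dim features on a 24x24 token grid; with s=3, p=2
# slow frames keep 12x12 = 144 tokens and fast frames keep 6x6 = 36 tokens.
feats = torch.randn(8, 1152, 24, 24)
print(slowfast_pool(feats, s=3, p=2).shape)  # torch.Size([612, 1152])
```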



Figure 6: Video representations. Different numbers of tokens are used to represent each frame.

Benchmark Performance

We fine-tune LLaVA-OneVision (SI) on a joint dataset of video and image data. Specifically, we add video data from the LLaVA-Video-178K dataset and four public datasets: ActivityNet-QA, NExT-QA, PerceptionTest, and LLaVA-Hound-255K, focusing on videos shorter than three minutes. These datasets contribute a total of 1.6 million video-language samples, including 193,510 video descriptions, 1,240,801 open-ended questions, and 215,625 multiple-choice questions. Remarkably, 92.2% of the video descriptions, 77.4% of the open-ended questions, and 90.9% of the multiple-choice questions are newly annotated. Additionally, we use 1.1 million image-language pairs from LLaVA-OneVision.

Benchmarks span captioning (VideoDC, Dream-1K), open-ended Q&A (ActNet-QA, VideoChatGPT), and multiple-choice Q&A (the remaining columns).

| Model | VideoDC (test) | Dream-1K (test) | ActNet-QA (test) | VideoChatGPT (test) | EgoSchema (test) | MLVU (m-avg) | MVBench (test) | NExT-QA (mc) | PerceptionTest (val) | LongVideoBench (val) | VideoMME (wo/w-subs) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary models | | | | | | | | | | | |
| GPT-4V | 4.06 | 34.4 | 57.0 | 4.00 | - | 49.2 | 43.5 | - | - | 61.3 | 59.9/63.3 |
| GPT-4o | - | 39.2 | - | - | - | 64.6 | - | - | - | 66.7 | 71.9/77.2 |
| Gemini-1.5-Flash | - | 34.8 | 55.3 | - | 65.7 | - | - | - | - | 61.6 | 70.3/75.0 |
| Gemini-1.5-Pro | - | 36.2 | 57.5 | - | 72.2 | - | - | - | - | 64.0 | 75.0/81.3 |
| Open-source models | | | | | | | | | | | |
| VILA-40B | 3.37 | 33.2 | 58.0 | 3.36 | 58.0 | - | - | 67.9 | 54.0 | - | 60.1/61.1 |
| PLLaVA-34B | - | 28.2 | 60.9 | 3.48 | - | - | 58.1 | - | - | 53.2 | - |
| LongVA-7B | 3.14 | - | 50.0 | 3.20 | - | 56.3 | - | 68.3 | - | - | 52.6/54.3 |
| IXC-2.5-7B | - | - | 52.8 | 3.46 | - | 37.3 | 69.1 | 71.0 | 34.4 | - | 55.8/58.8 |
| LLaVA-OV-7B | 3.75 | 31.7 | 56.6 | 3.51 | 60.1 | 64.7 | 56.7 | 79.4* | 57.1 | 56.5 | 58.2/61.5 |
| VideoLLaMA2-72B | - | 27.1 | 55.2 | 3.16 | 63.9 | 61.2 | 62.0 | - | - | - | 61.4/63.1 |
| LLaVA-OV-72B | 3.60 | 33.2 | 62.3 | 3.62 | 62.0 | 68.0 | 59.4 | 80.2* | 66.9 | 61.3 | 66.2/69.5 |
| LLaVA-Video-7B | 3.66 | 32.5 | 56.5* | 3.52 | 57.3 | 70.8 | 58.6 | 83.2* | 67.9* | 58.2 | 63.3/69.7 |
| LLaVA-Video-72B | 3.73 | 34.0 | 63.4* | 3.62 | 65.6 | 74.4 | 64.1 | 85.4* | 74.3* | 61.9 | 70.5/76.9 |
Table 2: LLaVA-Video performance on video benchmarks. VideoDC and VideoChatGPT are scored out of 5; all other results are reported as zero-shot accuracy. * indicates that the corresponding training set is included in our data mixture.

Conclusion

This study introduces LLaVA-Video-178K, a high-quality synthetic dataset for video-language instruction-following. It stands out for its dense frame sampling of longer, untrimmed videos and its coverage of diverse tasks, including captioning, open-ended QA, and multiple-choice QA. By training on LLaVA-Video-178K together with existing visual instruction tuning data, we develop a new model family, LLaVA-Video, which also adopts an efficient video representation to make better use of GPU resources, allowing more frames to be included during training. The experimental results demonstrate the effectiveness of the proposed synthetic dataset, and LLaVA-Video models achieve strong performance on a wide range of video benchmarks.

Citation

@misc{zhang2024videoinstructiontuningsynthetic,
    title={Video Instruction Tuning With Synthetic Data}, 
    author={Yuanhan Zhang and Jinming Wu and Wei Li and Bo Li and Zejun Ma and Ziwei Liu and Chunyuan Li},
    year={2024},
    eprint={2410.02713},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2410.02713}, 
}