LLaVA-NeXT: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Recent advancements in Large Multimodal Models (LMMs) have showcased impressive capabilities in multimodal understanding and reasoning. However, most existing open-source LMMs, such as LLaVA-NeXT, have primarily focused on pushing the performance limits of single-image tasks, leaving the potential of multi-image scenarios largely unexplored. Considering the diverse range of computer vision scenarios and data formats, including single- and multi-image inputs, videos, and 3D data, there is a pressing need to develop methodologies for open LMMs that operate effectively across these varied contexts. We observe that the image-text interleaved format can naturally serve as a general data template to unify different scenarios: single-image and multi-image as special cases, video as multiple frames, and 3D as multiple views. Therefore, we present LLaVA-NeXT-Interleave, an all-around LMM that extends the model's capabilities to new real-world settings: Multi-image, Multi-frame (video), and Multi-view (3D), while maintaining the performance of the Multi-patch (single-image) scenario. We denote these four settings as M4.
Highlights:
- Interleave data format unifies different tasks. We represent multi-image, video, 3D, and single-image data all into an interleaved training format, which unifies different tasks in a single LLM.
- New datasets: (1) Training Data: M4-Instruct. We compile a high-quality dataset with 1177.6K samples, spanning 4 primary domains (multi-image, video, 3D, and single-image). (2) Evaluation Data: LLaVA-Interleave Bench. We curate a diverse set of tasks to evaluate multi-image capabilities in 3 scenarios, including 9 newly collected and 13 existing in/out-domain benchmarks.
- SoTA Performance. (1) With a single model, LLaVA-NeXT-Interleave achieves leading results on various multi-image benchmarks compared to the previous SoTA. (2) With a proper data mix across different scenarios, performance on previous individual tasks can be improved or maintained. For example, we maintain the single-image performance of LLaVA-NeXT and improve performance on video tasks.
- Emerging capabilities with cross-task transfer. By jointly training on a diverse set of visual data modalities, the model shows emerging capabilities to transfer tasks between different scenarios.
Section 1 - Emerging Capabilities
Task Transfer between Single-Image and Multi-Image
- Multi-image coding: Our multi-image training data does not include coding-related problems. However, by jointly training on single-image data with coding problems, the model learns to write code based on multiple images. In this example, the model writes HTML code to simulate the movement of the square from the first image to the second image.
- What is fun about the images: The model learns to reason over multiple images and explain what is funny about them, even though this task appears only in the single-image training data, not in the multi-image data.
Task Transfer between Image and Video
- Multi-video difference: All the video training data involves single-video QA or captioning, and tell-the-difference data appears only in the multi-image data, where the difference is between two single images. Through joint training, the model generalizes to describing the difference between two videos. In this example, viewing two similar videos in different styles, the model can state the overall difference and list the detailed differences.
- Twitter post for video: The model is trained to write a Twitter post on multi-image data. Through joint training, it can also write a Twitter post about a video.
Real-World Applications
- PPT summarization and QA: The model can summarize and retrieve information from multiple images.
- Painting styles/Category difference: The model generalizes to recognizing artists' painting styles and distinguishing between different categories.
- Image editing: The model can be used to create image editing prompts for image generation.
- Multi-doc QA: The model can summarize the information of multiple documents and provide comparisons for different documents.
Section 2 - Interleaved Visual Instruction Tuning
Task Overview. We utilize the interleaved format to unify the data input of different tasks, covering the following four settings (a minimal sketch of this unified template follows the list):
- Multi-image scenarios include instructions with single or multiple images, and interleaved text-image input. This setting covers 12 challenging real-world tasks included in our training data, such as spotting the difference, visual storytelling, image editing instruction generation, interleaved multi-image dialogue, multi-image puzzle, low-level multi-image assessment, etc.
- Multi-frame scenarios refer to video inputs sampled into multiple frames, preserving temporal cues across the multi-image sequence. The video-based tasks mainly focus on 2 tasks: video detailed captioning and video VQA.
- Multi-view scenarios represent 3D environments by multi-view images from different perspectives, where the visual correspondence and disparity can indicate spatial information in the 3D world. For 3D perception, we include 2 tasks: embodied VQA (dialogue and planning), and 3D scene VQA (captioning and grounding).
- Multi-patch represents the single-image scenario. The AnyRes design in LLaVA-NeXT divides a single image into multiple patches, making it compatible with the interleaved format. This setting is used to maintain single-image performance and accommodate cross-task transfer.
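To make the unified template concrete, here is a minimal sketch of how the four settings can be packed into one interleaved sample. The field names mirror LLaVA-style conversation JSON but are illustrative assumptions, not the exact M4-Instruct schema.

```python
# Illustrative sketch: pack different visual scenarios into one interleaved
# template (a list of images plus <image> placeholders in the text).
from typing import Dict, List


def make_interleaved_sample(images: List[str], question: str, answer: str) -> Dict:
    """Pack N images and a QA pair into one interleaved training sample."""
    placeholders = "\n".join("<image>" for _ in images)  # one placeholder per image
    return {
        "images": images,  # file paths; order defines the visual sequence
        "conversations": [
            {"from": "human", "value": f"{placeholders}\n{question}"},
            {"from": "gpt", "value": answer},
        ],
    }


# Multi-image: two photos to compare.
spot_diff = make_interleaved_sample(
    ["before.jpg", "after.jpg"], "What changed between the two images?", "..."
)

# Multi-frame: a video becomes uniformly sampled frames.
frames = [f"video_frame_{i:02d}.jpg" for i in range(10)]
video_qa = make_interleaved_sample(frames, "Describe the video in detail.", "...")

# Multi-view: a 3D scene becomes images rendered from different viewpoints.
views = [f"scene_view_{i}.jpg" for i in range(4)]
scene_qa = make_interleaved_sample(views, "How many chairs are in the room?", "...")

# Multi-patch: AnyRes splits one high-resolution image into patches, so the
# single-image case reduces to the same multi-image template.
patches = [f"img_patch_{i}.jpg" for i in range(4)]
single_image = make_interleaved_sample(patches, "What is written on the sign?", "...")
```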
M4-Instruct: Training Data
To empower all-around multi-image capabilities, we curate M4-Instruct, a comprehensive dataset of 1177.6K training samples spanning multi-image, multi-frame, and multi-view scenarios with 14 tasks and 41 datasets, along with single-image data to preserve basic instruction-following capabilities. Detailed data statistics are shown in the table below.
Most multi-image datasets are collected from previous public efforts and rigorously converted into our data format, following DEMON and Mantis. On top of that, we also utilize GPT-4V to annotate 3 new tasks to enable more diverse capabilities, i.e., Real-world Difference, Synthetic Difference, and Twitter Post.
For video data, we utilize a subset of 255k from LLaVA-Hound, including 240k video QA and 15k video captions. We also include NExT-QA and STAR in our training.
For 3D data, we utilize 3D data from Nuscenes QA, ALFRED, ScanQA, and 3D-LLM.
For single-image data, we use 40% of the original LLaVA-NeXT single-image data.
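As a rough illustration of this mixture, the sketch below merges the three interleaved subsets with a 40% random subsample of the single-image SFT data; the file names and loading code are placeholders, not the released data pipeline.

```python
# Minimal sketch of the M4-Instruct data mix: keep the multi-image, video, and
# 3D subsets in full and randomly subsample ~40% of the LLaVA-NeXT single-image
# SFT data (about 307K samples). File names are hypothetical placeholders.
import json
import random


def load_samples(path: str) -> list:
    with open(path) as f:
        return json.load(f)


random.seed(0)

multi_image = load_samples("m4_multi_image.json")
video = load_samples("m4_video.json")
three_d = load_samples("m4_3d.json")
single_image = load_samples("llava_next_sft.json")

# Randomly keep 40% of the original single-image SFT data.
single_subset = random.sample(single_image, k=int(0.4 * len(single_image)))

m4_instruct = multi_image + video + three_d + single_subset
random.shuffle(m4_instruct)

with open("m4_instruct_mix.json", "w") as f:
    json.dump(m4_instruct, f)
```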
[Fold / Unfold to See the Details of Data Statistics]
Task | Dataset | Scenario | # Samples |
---|---|---|---|
Multi-image Scenarios | |||
Spot the Difference(42.6K) | Real-world Difference | Realistic | 6.7K |
Synthetic Difference | Synthetic | 7.0K | |
Spot-the-Diff | Surveillance | 10.8K | |
Birds-to-Words | Birds | 14.2K | |
CLEVR-Change | Solids | 3.9K | |
Image Edit Instruction(67.7K) | HQ-Edit | Synthetic | 50K |
MagicBrush | Realistic | 14.2K | |
IEdit | Realistic | 3.5K | |
Visual Story Telling(67.5K) | AESOP | Cartoon | 6.9K |
FlintstonesSV | Cartoon | 22.3K | |
PororoSV | Cartoon | 12.3K | |
VIST | Realistic | 26K | |
Text-rich VQA(21.3K) | WebQA | Webpage | 9.3K |
TQA | Textbook | 8.2K | |
OCR-VQA | OCR | 1.9K | |
DocVQA | Document | 1.9K | |
Multi-image VQA(153.3K) | NLVR2 | Realistic | 86.4K |
MIT-States_StateCoherence | General | 1.9K | |
MIT-States_PropertyCoherence | General | 1.9K | |
RecipeQA_ImageCoherence | Recipe | 8.7K | |
VISION | Industrial | 9.9K | |
Multi-VQA | General | 5K | |
IconQA | General | 34.6K | |
VizWiz | General | 4.9K | |
Low-level Comparison(65.9K) | Coinstruct | Low-level | 50K |
Dreamsim | Low-level | 15.9K | |
Image-caption Comprehension (41.8K) | ImageCoDe | General | 16.6K |
Contrast-Caption | General | 25.2K | |
Daily Scenarios (5.7K) | MMChat_Twitter_Post | General | 5.7K |
Multi-image Puzzle (35K) | Raven | Abstract | 35K |
Multi-frame (Video) Scenarios | |||
Video QA(246.9K) | NExT-QA | General | 3.9K |
STAR | General | 3K | |
ShareGPTVideo-VQA | General | 240K | |
Video Detailed Captioning (15K) | ShareGPTVideo-Caption | General | 15K |
Multi-view (3D) Scenarios | |||
Scene VQA(45.4K) | Nuscenes | Outdoor | 9.8K |
ScanQA | Indoor Realistic | 25.6K | |
3D-LLM-Scene | Indoor Realistic | 10K | |
Embodied VQA(62.5K) | ALFRED | Indoor Synthetic | 22.6K |
3D-LLM-Dialogue | Indoor Realistic | 20K | |
3D-LLM-Planning | Indoor Realistic | 19.9K | |
Single-image Scenarios | |||
Single-image Tasks(307K) | Randomly sampling 40% SFT data of LLaVA-NeXT | General | 307K |
Task Examples. We provide examples to illustrate the multi-image/frame/view tasks of the M4-Instruct dataset.
[Fold / Unfold to see task overview]
LLaVA-Interleave Bench
We categorize multi-image reasoning tasks into two classes:
(1) In-domain Evaluation includes tasks that have been "seen" during training, designed to verify the model performance within familiar scenarios. We adopt 7 newly curated multi-image tasks corresponding to training datasets, and 2 existing benchmarks, Q-Bench and NLVR2, with 12.9K samples in total.
(2) Out-domain Evaluation involves tasks that do not overlap with the training scenarios, aiming to reveal the generalization capacity of LMMs. We construct 2 new tasks for multi-image mathematical and scientific comprehension, and utilize 3 existing benchmarks, Mantis-Eval, BLINK, and MMMU, with 4.1K samples in total.
[Fold / Unfold to See Table for detailed multi-image benchmark statistics]
Task | Dataset | Scenario | # Samples |
---|---|---|---|
In-domain Evaluation - Newly Curated Benchmarks | |||
Spot the Difference(0.3K) | Spot-the-Diff | Surveillance | 0.1K |
Birds-to-Words | Birds | 0.1K | |
CLEVR-Change | Solids | 0.1K | |
Image Edit Instruction(2K) | HQ-Edit | Synthetic | 1K |
MagicBrush | Realistic | 0.9K | |
IEdit | Realistic | 0.1K | |
Visual Story Telling(0.4K) | AESOP | Cartoon | 0.1K |
FlintstonesSV | Cartoon | 0.1K | |
PororoSV | Cartoon | 0.1K | |
VIST | Realistic | 0.1K | |
Text-rich VQA(0.4K) | WebQA | Webpage | 0.1K |
TQA | Textbook | 0.1K | |
OCR-VQA | OCR | 0.1K | |
DocVQA | Document | 0.1K | |
Multi-image VQA(0.4K) | MIT-States_StateCoherence | General | 0.1K |
MIT-States_PropertyCoherence | General | 0.1K | |
RecipeQA_ImageCoherence | Recipe | 0.1K | |
VISION | Industrial | 0.1K | |
Puzzle (1.4K) | Raven | Abstract | 1.4K |
In-domain Evaluation - Existing Benchmarks | |||
NLVR2 (7K) | NLVR2 | Realistic | 7K |
Q-Bench (1K) | Q-Bench | Low-level | 1K |
Out-domain Evaluation - Newly Curated Benchmarks | |||
MathVerse-mv (0.8K) | MathVerse | Math Diagram | 0.8K |
SciVerse-mv (0.4K) | SciVerse | Scientific Diagram | 0.4K |
Out-domain Evaluation - Existing Benchmarks | |||
Mantis-Eval (0.2K) | Mantis-Eval | General | 0.2K |
BLINK (1.9K) | BLINK | General | 1.9K |
MMMU-mv (test) (0.8K) | MMMU | Scientific Diagram | 0.8K |
Section 3 - Evaluation Results
We conduct extensive evaluation against existing open-source models to demonstrate the multimodal capabilities of LLaVA-NeXT-Interleave, covering a variety of existing and newly curated in/out-domain benchmarks. In the tables below, we exhibit the results across 3 primary scenarios, where our model achieves consistently leading performance for multi-image/frame/view reasoning. For our final solution, LLaVA-NeXT-Interleave (0.5/7/14B) adopts Qwen1.5-0.5B, -7B, and -14B as the base LLMs, SigLIP-400M at 384x384 resolution as the vision encoder, and a two-layer MLP as the projection layer.
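To make the composition concrete, below is a schematic sketch of the two-layer MLP projector that maps SigLIP visual tokens into the LLM embedding space. The hidden sizes, token count, and module stubs are illustrative assumptions, not the released configuration.

```python
# Schematic sketch of the vision-to-LLM projector described above: SigLIP
# features -> two-layer MLP -> LLM embedding space. Dimensions are assumptions.
import torch
import torch.nn as nn


class TwoLayerProjector(nn.Module):
    """Maps vision-encoder features into the LLM embedding space."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (num_images, num_patches, vision_dim)
        return self.mlp(vision_tokens)


# In LLaVA-style models, every image (or frame/view/patch) is encoded
# independently, projected, and spliced into the text token sequence at the
# position of its <image> placeholder before being fed to the LLM.
projector = TwoLayerProjector()
dummy_vision_tokens = torch.randn(4, 729, 1152)  # 4 images, 729 tokens each
llm_visual_embeds = projector(dummy_vision_tokens)
print(llm_visual_embeds.shape)  # torch.Size([4, 729, 4096])
```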
Multi-image Evaluation
As reported in the table, the average multi-image performance of LLaVA-NeXT-Interleave surpasses previous open-source models in both in- and out-domain benchmarks. For in-domain evaluation, our model demonstrates significant advantages across various tasks as expected, due to the multi-image instruction tuning with M4-Instruct. For out-domain evaluation, LLaVA-NeXT-Interleave also showcases superior generalization capacity within novel scenarios, e.g., comparable to GPT-4V on Mantis-Eval and BLINK.
Model | In-domain Avg | Spot the Difference | Image Edit Instruction | Visual Story Telling | Text-rich VQA | Multi-image VQA | Multi-image Puzzle | Q-Bench | NLVR2 | Out-domain Avg | MathVerse-mv | SciVerse-mv | Mantis-Eval | BLINK | MMMU-mv (test) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4V | 39.2 | 12.5 | 11.0 | 10.9 | 54.5 | 52.0 | 17.1 | 76.5 | 88.8 | 57.78 | 60.3 | 66.9 | 62.7 | 51.1 | 47.9 |
Open-source LMMs | | | | | | | | | | | | | | | |
 | 32.4 | 12.9 | 13.2 | 10.1 | 59.6 | 39.4 | 9.0 | 51.0 | 68.0 | 29.4 | 13.5 | 12.2 | 46.1 | 41.8 | 33.5 |
VPG-C (7B) | 35.8 | 27.8 | 15.2 | 21.5 | 38.9 | 46.8 | 2.4 | 57.6 | 73.2 | 34.5 | 24.3 | 23.1 | 52.4 | 43.1 | 29.4 |
Mantis (7B) | 39.6 | 17.6 | 11.2 | 12.5 | 45.2 | 52.5 | 25.7 | 69.9 | 87.4 | 39.3 | 27.2 | 29.3 | 59.5 | 46.4 | 34.1 |
Our Models: LLaVA-NeXT-Interleave | | | | | | | | | | | | | | | |
0.5B | 43.9 | 34.3 | 21.6 | 29.7 | 63.9 | 54.8 | 35.4 | 52.0 | 67.8 | 33.1 | 13.3 | 12.2 | 45.6 | 39.2 | 28.6 |
7B | 58.6 | 37.1 | 24.3 | 33.1 | 76.1 | 87.5 | 48.7 | 74.2 | 88.8 | 42.8 | 32.8 | 31.6 | 62.7 | 52.6 | 34.5 |
14B | 62.3 | 40.5 | 24.5 | 33.3 | 78.6 | 95.0 | 59.9 | 76.7 | 91.1 | 44.3 | 33.4 | 32.7 | 66.4 | 52.1 | 37.1 |
Multi-frame Evaluation
For video understanding, we also evaluate our models on NExT-QA, MVBench, Video Detailed Description (VDD), ActivityNet-QA, and VideoChat-GPT. For the tasks evaluated by GPT, we follow VideoChat-GPT and use the same GPT version.
Compared with previous video-based LMMs of similar model size, LLaVA-NeXT-Interleave achieves superior results on many benchmarks, even though it is not specifically designed for video tasks. We also follow LLaVA-Hound to add DPO training after our M4-Instruct training. With DPO, our 7B model achieves SoTA performance on the VDD and VideoChat-GPT benchmarks, surpassing the previous SoTA model LLaVA-NeXT-Video (34B). This demonstrates the effective temporal understanding and reasoning capabilities of our model across sequential frames.
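For reference, the sketch below shows the standard DPO objective that such a preference stage optimizes; it is a generic implementation of the loss, not the exact LLaVA-Hound training code.

```python
# Hedged sketch of the Direct Preference Optimization (DPO) loss. Inputs are
# summed log-probabilities of the chosen and rejected responses under the
# trained policy and a frozen reference model.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Encourage the policy to rank the chosen response above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.9]),
                torch.tensor([-13.0]), torch.tensor([-14.8]))
print(loss)
```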
Model | NExT-QA (Acc) | MVBench | ActivityNet-QA (Acc/Score) | Video Detailed Description | VideoChat-GPT: Correctness | Detail | Context | Temporal | Consistency | Avg |
---|---|---|---|---|---|---|---|---|---|---|
Closed-source LMMs | | | | | | | | | | |
GPT-4V | - | - | - | 4.00 | 4.09 | 3.88 | 4.37 | 3.94 | 4.02 | 4.06 |
Open-source LMMs | | | | | | | | | | |
VideoChatGPT (7B) | - | - | 35.2/2.70 | - | 2.40 | 2.52 | 2.62 | 1.98 | 2.37 | 2.38 |
Video-LLaVA (7B) | - | - | 45.3/3.30 | - | 2.87 | 2.94 | 3.44 | 2.45 | 2.51 | 2.84 |
VISTA-LLAMA (7B) | - | - | 48.3/3.30 | - | 2.44 | 2.31 | 2.64 | 3.18 | 2.26 | 2.57 |
VideoChat2 (7B) | 68.6 | 51.9 | 49.1/3.30 | - | 3.02 | 2.88 | 3.51 | 2.66 | 2.81 | 2.98 |
LLaMA-VID (7B) | - | 50.2 | 47.4/3.30 | 2.84 | 3.01 | 2.97 | 3.54 | 2.53 | 2.6 | 2.93 |
LLaVA-NeXT-Video (7B) | - | - | 53.5/3.20 | 3.32 | 3.39 | 3.29 | 3.92 | 2.6 | 3.12 | 3.26 |
 | - | - | 60.2/3.50 | 3.72 | 3.64 | 3.45 | 4.17 | 2.95 | 4.08 | 3.66 |
 | - | - | 64.4/3.60 | 3.84 | 3.81 | 3.55 | 4.24 | 3.14 | 4.12 | 3.77 |
Our Models: LLaVA-NeXT-Interleave | | | | | | | | | | |
(0.5B) | 59.5 | 45.6 | 48.0/2.84 | 3.25 | 3.12 | 2.97 | 3.62 | 2.36 | 3.27 | 3.07 |
(7B) | 78.2 | 53.1 | 55.3/3.13 | 3.57 | 3.51 | 3.28 | 3.89 | 2.77 | 3.68 | 3.43 |
(14B) | 79.1 | 54.9 | 56.2/3.19 | 3.59 | 3.65 | 3.37 | 3.98 | 2.74 | 3.67 | 3.48 |
DPO (7B) | 77.9 | 52.3 | 55.0/3.13 | 3.90 | 3.99 | 3.61 | 4.24 | 3.19 | 4.12 | 3.83 |
Multi-view Evaluation (3D)
For 3D perception, we adopt 5 in-domain benchmarks to evaluate the 3D spatial understanding of our model: 3D-assisted Dialogue, Task Decomposition, ScanQA-val, ALFRED, and nuScenes VQA. The first two tasks are constructed by 3D-LLM.
Compared to 3D-LLM and Point-LLM, which take additional point clouds as input, LLaVA-NeXT-Interleave only accepts multi-view images to interpret the 3D world, yet attains significantly higher scores in both indoor and outdoor scenarios.
Model | Avg | 3D-assisted Dialogue | Task Decomposition | ScanQA (val) | ALFRED | nuScenes VQA |
---|---|---|---|---|---|---|
Closed-source LMMs | | | | | | |
Flamingo | 20.5 | 27.9 | 33.2 | 31.1 | 5.3 | 4.9 |
GPT-4V | 34.6 | 31.2 | 35.4 | 32.6 | 10.3 | 63.7 |
Open-source LMMs | | | | | | |
ImageBind-LLM | 20.8 | 31.4 | 32.3 | 28.6 | 4.7 | 6.8 |
Point-Bind & Point-LLM | 22.5 | 38.3 | 35.8 | 34.6 | 0.6 | 3.3 |
3D-LLM | 22.9 | 39.3 | 37.8 | 35.7 | 1.4 | 0.4 |
Mantis (7B) | 18.7 | 2.6 | 14.7 | 16.1 | 14.0 | 46.2 |
Our Models: LLaVA-NeXT-Interleave | | | | | | |
0.5B | 53.0 | 67.2 | 48.5 | 29.3 | 57.0 | 62.8 |
7B | 58.2 | 69.3 | 51.4 | 32.2 | 61.6 | 76.5 |
14B | 59.2 | 70.6 | 52.2 | 34.5 | 62.0 | 76.7 |
[Fold / Unfold to See Single-image Evaluation (multi-patch)]
Single-image Evaluation (multi-patch)
Model | LLM | AI2D | ChartQA | DocVQA | MME* | ScienceQA | POPE | Avg |
---|---|---|---|---|---|---|---|---|
LLaVA-NeXT | 0.5B | 51.65 | 50.16 | 59.13 | 52.83 | 59.96 | 85.36 | 59.80 |
LLaVA-NeXT-Interleave | 0.5B | 52.20 | 52.20 | 59.20 | 52.00 | 60.60 | 86.78 | 60.5 |
LLaVA-NeXT | 7B | 72.73 | 66.28 | 75.57 | 60.96 | 71.10 | 86.90 | 72.30 |
LLaVA-NeXT-Interleave | 7B | 73.90 | 67.16 | 75.70 | 63.53 | 72.63 | 86.75 | 73.3 |
LLaVA-NeXT | 14B | 77.49 | 72.12 | 79.96 | 67.68 | 78.88 | 87.32 | 77.2 |
LLaVA-NeXT-Interleave | 14B | 76.52 | 71.24 | 78.91 | 66.21 | 77.44 | 87.90 | 76.40 |
We also add 307K of the original LLaVA-NeXT single-image data, which keeps our model capable of single-image tasks. We use AnyRes training for single-image data, which divides an image into multiple patches and thus forms another multi-image setting. By adding this 307K of single-image data (40% of the original LLaVA-NeXT data), we maintain the single-image performance of LLaVA-NeXT. As the single-image data is of high quality and contains diverse instructions, adding it also improves instruction-following ability and enables task transfer from single-image to multi-image, as demonstrated in our demo and the ablation study.
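For intuition, here is a rough sketch of the AnyRes-style patching that turns one high-resolution image into a multi-image sequence. The grid selection and padding in LLaVA-NeXT are more involved, so treat the resolution and grid choices here as assumptions for illustration.

```python
# Rough sketch of AnyRes-style patching: resize a high-resolution image onto a
# grid of base-resolution tiles, treat the tiles as a multi-image sequence, and
# append a downscaled global view. Grid choice here is fixed for simplicity.
from PIL import Image


def anyres_patches(image: Image.Image, base: int = 384, grid: tuple = (2, 2)):
    """Split an image into grid[0] x grid[1] tiles of size base x base."""
    cols, rows = grid
    resized = image.resize((base * cols, base * rows))
    tiles = [
        resized.crop((c * base, r * base, (c + 1) * base, (r + 1) * base))
        for r in range(rows)
        for c in range(cols)
    ]
    global_view = image.resize((base, base))  # low-res overview of the full image
    return tiles + [global_view]


img = Image.new("RGB", (1024, 768))  # stand-in for a real photo
patches = anyres_patches(img)
print(len(patches))  # 4 grid tiles + 1 global view = 5 images
```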
Section 4 - Training Techniques
We illustrate two key techniques for training LLaVA-NeXT-Interleave on the M4-Instruct dataset, and provide ablation studies for analysis.
Continue training from single-image model
To better leverage the pre-trained vision-language alignment, we adopt an off-the-shelf LLaVA-NeXT-Image as the base model, which has gone through stage-1 558K image-caption pre-training and stage-2 760K single-image fine-tuning. On top of this checkpoint, we perform the multi-image instruction tuning with the M4-Instruct dataset.
As shown by the ablation below, compared to training from the stage-1 pre-trained model, the 'continue training from stage 2' scheme performs better. In this way, we inherit the instruction-following capabilities learned on single-image tasks and better extend the scope to multi-image, video, and 3D scenarios. In addition, single-image performance cannot be maintained when training directly from stage 1.
Continue training | Mantis-Eval | BLINK | Q-Bench | NLVR2 | ScanQA | AI2D | ChartQA | DocVQA | MME* | POPE | ScienceQA-IMG | ActivityNet-QA (Acc/Score) | MVBench | Video Detailed Description | VideoChat-GPT: Correctness | Detail | Context | Temporal | Consistency |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
stage-1 | 41.0 | 37.6 | 47 | 54.0 | 27.7 | 46.3 | 38.3 | 47.5 | 47.1 | 85.4 | 59.4 | 44.7/2.17 | 43.0 | 2.96 | 2.97 | 2.87 | 3.49 | 2.42 | 3.14 |
stage-2 | 45.6 | 39.2 | 52 | 67.8 | 29.3 | 52.2 | 52.2 | 59.2 | 52.0 | 86.8 | 60.6 | 48.0/2.84 | 45.6 | 3.25 | 3.12 | 2.97 | 3.62 | 2.36 | 3.27 |
Mix training for in-the-front and interleaved formats
For interleaved multi-image input, we have two format choices for the positions of image tokens during training. The first places all image tokens in front of the text and refers to each image in the sentence with a special token, aligning with the format of the single-image model. The second preserves the interleaved instruction, keeping image tokens at their original positions, which extends the model to real interleaved data formats.
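The toy sketch below illustrates the two placements. The literal `<image>` placeholder and the "image 1 / image 2" references are assumptions for illustration, not the exact special tokens used in training.

```python
# Illustrative sketch: convert an interleaved prompt into the "in-the-front"
# format, where all image tokens are moved before the text and each image is
# referred to by index in the sentence.
import re


def to_in_the_front(interleaved_prompt: str) -> str:
    """Move all <image> tokens to the front and reference them by index in the text."""
    num_images = interleaved_prompt.count("<image>")
    counter = iter(range(1, num_images + 1))
    # Replace each inline <image> with a textual reference such as "image 1".
    body = re.sub(r"<image>", lambda _: f"image {next(counter)}", interleaved_prompt)
    prefix = "".join("<image>" for _ in range(num_images))
    return f"{prefix}\n{body}"


interleaved = "Compare <image> with <image> and describe what changed."
print(to_in_the_front(interleaved))
# <image><image>
# Compare image 1 with image 2 and describe what changed.
```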
In the ablation below, we train on the multi-image data with different formats. The results indicate that mixing the two strategies during training leads to higher performance under both inference schemes, which gives users more flexibility in the multi-image input format.
Training Setting | Inference Setting | Avg | Spot the Difference | Visual Story Telling | Text-rich VQA | Q-Bench |
---|---|---|---|---|---|---|
In-the-front format | Interleaved format | 52.88 | 36.8 | 30.5 | 70.1 | 74.0 |
In-the-front format | In-the-front format | 54.27 | 36.6 | 32.8 | 74.7 | 75.3 |
Interleaved format | Interleaved format | 55.38 | 37.8 | 32.9 | 76.2 | 76.0 |
Interleaved format | In-the-front format | 52.35 | 36.1 | 29.0 | 72.9 | 71.8 |
Mix format | Interleaved format | 56.96 | 38.3 | 32.5 | 78.1 | 76.9 |
Mix format | In-the-front format | 56.62 | 37.9 | 32.5 | 78.4 | 76.3 |
[Fold / Unfold to see more ablations on pooling and data]
Training strategy comparison on video (Pooling vs not pooling)
In this ablation, we study the impact of image token pooling. We train and evaluate our model under two settings, pooling to 1/4 and not pooling, using the ShareGPTVideo-Caption+QA (255K) data. The pooling-to-1/4 setting is similar to LLaVA-NeXT-Video, which uses pooling to trade off between the number of image tokens and the number of frames. In our experiments, we find that not pooling yields better performance with a similar number of image tokens.
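As a minimal illustration of the pooling baseline, the sketch below average-pools each frame's vision tokens 2x2 on the patch grid, keeping roughly a quarter of the tokens per frame; the grid size and feature dimension are assumptions for illustration.

```python
# Minimal sketch of spatial token pooling: reshape each frame's vision tokens
# back onto their patch grid and average-pool 2x2, which keeps roughly 1/4 of
# the tokens per frame.
import torch
import torch.nn.functional as F


def pool_frame_tokens(tokens: torch.Tensor, grid: int) -> torch.Tensor:
    """tokens: (num_frames, grid*grid, dim) -> (num_frames, ~grid*grid/4, dim)."""
    frames, _, dim = tokens.shape
    x = tokens.view(frames, grid, grid, dim).permute(0, 3, 1, 2)  # to NCHW
    x = F.avg_pool2d(x, kernel_size=2, ceil_mode=True)            # 2x2 pooling
    return x.flatten(2).permute(0, 2, 1)                          # back to tokens


frames = torch.randn(16, 27 * 27, 1152)  # 16 frames, 729 tokens each
pooled = pool_frame_tokens(frames, grid=27)
print(pooled.shape)                      # torch.Size([16, 196, 1152])
```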
During training, we sample 10 frames per video. In this table, we also observe that adding more frames (from 10 to 16) during inference improves performance.
Train Setting | Inference Setting | Inference #Frames | # Image Tokens | ActivityNet-QA (Acc/Score) | Avg | Video Detailed Description | VideoChat-GPT: Correctness | Detail | Context | Temporal | Consistency |
---|---|---|---|---|---|---|---|---|---|---|---|
Pooling 1/4 | Pooling 1/4 | 40 | 40x729x1/4 = 10x729 | 52.75/3.53 | 3.35 | 3.38 | 3.46 | 3.25 | 3.87 | 2.59 | 3.57 |
Pooling 1/4 | Pooling 1/4 | 64 | 64x729x1/4 = 16x729 | 52.7/3.53 | 3.33 | 3.38 | 3.45 | 3.23 | 3.86 | 2.49 | 3.55 |
Not Pooling | Not Pooling | 10 | 10x729 | 52.9/3.48 | 3.38 | 3.46 | 3.43 | 3.26 | 3.85 | 2.64 | 3.61 |
Not Pooling | Not Pooling | 16 | 16x729 | 54.4/3.51 | 3.41 | 3.46 | 3.48 | 3.28 | 3.87 | 2.74 | 3.62 |
Effects of adding more tasks and data
To study the impact of adding more data, we conduct experiments under different data settings and evaluate on video benchmarks. As we gradually add more data, performance consistently improves. VDD denotes Video Detailed Description.
Data | NExT-QA | Avg | VDD | VideoChat-GPT: Correctness | Detail | Context | Temporal | Consistency |
---|---|---|---|---|---|---|---|---|
video | 42.60 | 3.40 | 3.46 | 3.47 | 3.27 | 3.87 | 2.74 | 3.61 |
video + single-image | 67.70 | 3.40 | 3.49 | 3.46 | 3.30 | 3.85 | 2.71 | 3.60 |
video + multi-image | 77.70 | 3.42 | 3.50 | 3.50 | 3.31 | 3.90 | 2.70 | 3.63 |
video + both | 78.20 | 3.45 | 3.58 | 3.50 | 3.27 | 3.87 | 2.77 | 3.68 |
Team
- Feng Li*†: Hong Kong University of Science and Technology (work done in collaboration with ByteDance)
- Renrui Zhang*: The Chinese University of Hong Kong (work done in collaboration with ByteDance)
- Hao Zhang*: Hong Kong University of Science and Technology (work done in collaboration with ByteDance)
- Yuanhan Zhang: Nanyang Technological University (work done in collaboration with ByteDance)
- Bo Li: Nanyang Technological University (work done in collaboration with ByteDance)
- Wei Li: ByteDance
- Zejun Ma: ByteDance
- Chunyuan Li: ByteDance
- *Core contributors, †Project lead
Related Blogs
- LLaVA-NeXT: What Else Influences Visual Instruction Tuning Beyond Data?
- LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild
- LLaVA-NeXT: A Strong Zero-shot Video Understanding Model
- LLaVA-NeXT: Improved reasoning, OCR, and world knowledge
- Accelerating the Development of Large Multimodal Models with LMMs-Eval
Citation
@misc{li2024llavanext-interleave,
title={LLaVA-NeXT: Tackling Multi-image, Video, and 3D in Large Multimodal Models},
url={https://llava-vl.github.io/blog/2024-06-16-llava-next-interleave/},
author={Li, Feng and Zhang, Renrui and Zhang, Hao and Zhang, Yuanhan and Li, Bo and Li, Wei and Ma, Zejun and Li, Chunyuan},
month={June},
year={2024}
}
@misc{li2024llavanext-ablations,
title={LLaVA-NeXT: What Else Influences Visual Instruction Tuning Beyond Data?},
url={https://llava-vl.github.io/blog/2024-05-25-llava-next-ablations/},
author={Li, Bo and Zhang, Hao and Zhang, Kaichen and Guo, Dong and Zhang, Yuanhan and Zhang, Renrui and Li, Feng and Liu, Ziwei and Li, Chunyuan},
month={May},
year={2024}
}
@misc{li2024llavanext-strong,
title={LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild},
url={https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/},
author={Li, Bo and Zhang, Kaichen and Zhang, Hao and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Yuanhan and Liu, Ziwei and Li, Chunyuan},
month={May},
year={2024}
}
@misc{zhang2024llavanextvideo,
title={LLaVA-NeXT: A Strong Zero-shot Video Understanding Model},
url={https://llava-vl.github.io/blog/2024-04-30-llava-next-video/},
author={Zhang, Yuanhan and Li, Bo and Liu, Haotian and Lee, Yong Jae and Gui, Liangke and Fu, Di and Feng, Jiashi and Liu, Ziwei and Li, Chunyuan},
month={April},
year={2024}
}
@misc{liu2024llavanext,
title={LLaVA-NeXT: Improved reasoning, OCR, and world knowledge},
url={https://llava-vl.github.io/blog/2024-01-30-llava-next/},
author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae},
month={January},
year={2024}
}
@misc{liu2023improvedllava,
title={Improved Baselines with Visual Instruction Tuning},
author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae},
publisher={arXiv:2310.03744},
year={2023},
}
@misc{liu2023llava,
title={Visual Instruction Tuning},
author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
publisher={NeurIPS},
year={2023},
}