Recent advancements in Large Multimodal Models (LMMs) have showcased impressive capabilities in multimodal understanding and reasoning. However, most existing open-source LMMs, such as LLaVA-NeXT, have primarily focused on pushing the performance limit of single-image tasks, leaving the potential of multi-image scenarios largely unexplored. Considering the diverse range of computer vision scenarios and data formats (single- and multi-image inputs, videos, and 3D data), there is a pressing need to develop methodologies for open LMMs that can operate effectively across these varied contexts. We observe that the image-text interleaved format can naturally serve as a general data template to unify different scenarios: single-image and multi-image inputs as special cases, video as multiple frames, and 3D as multiple views. Therefore, we present LLaVA-NeXT-Interleave, an all-around LMM that extends model capabilities to new real-world settings, namely Multi-image, Multi-frame (video), and Multi-view (3D), while maintaining performance in the Multi-patch (single-image) scenario. We denote the four settings as M4.


Highlights:

  1. Interleaved data format unifies different tasks. We represent multi-image, video, 3D, and single-image data all in an interleaved training format, which unifies different tasks in a single LLM.
  2. New datasets. (1) Training data: M4-Instruct. We compile a high-quality dataset with 1269K samples, spanning 4 primary domains (multi-image, video, 3D, and single-image). (2) LLaVA-Interleave Bench. We curate a diverse set of tasks to evaluate multi-image capabilities, including 9 newly collected and 13 existing in/out-domain benchmarks.
  3. SoTA performance. (1) With a single model, LLaVA-NeXT-Interleave achieves leading results on different multi-image benchmarks compared to the previous SoTA. (2) With a proper data mix across scenarios, performance on previously individual tasks can be improved or maintained. For example, we maintain the single-image performance of LLaVA-NeXT and improve performance on video tasks.
  4. Emerging capabilities with cross-task transfer. By jointly training on a diverse set of visual data modalities, the model shows emerging capabilities in transferring tasks across different scenarios.


Section 1 - Interleaved Visual Instruction Tuning

Task Overview. We utilize the interleaved format to unify data input across different tasks, covering the following four settings (a sketch of the unified data format follows this list):

  • Multi-image scenarios include instructions with single or multiple images, and interleaved text-image input. This setting covers 12 challenging real-world tasks included in our training data, such as spotting the difference, visual storytelling, image editing instruction generation, interleaved multi-image dialogue, multi-image puzzle, low-level multi-image assessment, etc.
  • Multi-frame scenarios take video data as input by sampling it into multiple frames, preserving temporal cues across the image sequence. The video-based training mainly covers 2 tasks: video detailed captioning and video VQA.
  • Multi-view scenarios represent 3D environments by multi-view images from different perspectives, where the visual correspondence and disparity can indicate spatial information in the 3D world. For 3D perception, we include 2 tasks: embodied VQA (dialogue and planning), and 3D scene VQA (captioning and grounding).
  • Multi-patch represents the single-image scenario: the AnyRes design of LLaVA-NeXT divides a single image into multiple patches, which makes it compatible with the interleaved format. This setting is used to maintain single-image performance and to accommodate cross-task transfer.
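
To make the unified template concrete, the sketch below shows how all four settings could be serialized into one interleaved image-text record. This is a minimal sketch under assumed conventions: the field names and the "<image>" placeholder token are illustrative, not the exact M4-Instruct schema.

```python
# Minimal sketch of one interleaved training sample. The JSON-style fields
# and the "<image>" placeholder are illustrative assumptions, not the exact
# M4-Instruct schema.

def make_sample(image_paths, question, answer):
    """Pack any list of images (multiple images, sampled video frames,
    3D views, or AnyRes patches of a single image) into one interleaved
    conversation record."""
    return {
        "images": list(image_paths),                       # N visual inputs
        "conversations": [
            {"role": "user",
             "content": "<image>" * len(image_paths) + question},
            {"role": "assistant", "content": answer},
        ],
    }

# Multi-image, multi-frame (video), and multi-view (3D) inputs all reduce to
# the same structure; only the source of the image list differs.
video_frames = [f"frame_{i:02d}.jpg" for i in range(10)]   # 10 sampled frames
sample = make_sample(video_frames,
                     "Please provide a detailed description of the video.",
                     "The video shows ...")
```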

M4-Instruct: Training Data

To empower all-around multi-image capabilities, we curate M4-Instruct, a comprehensive dataset of 1269K training samples spanning multi-image, multi-frame, and multi-view scenarios with 14 tasks and 43 datasets, along with single-image data to preserve basic instruction-following capabilities. The detailed data statistics are shown in the table below.

Most multi-image datasets are collected from previous public efforts and rigorously converted into our data format, referring to DEMON and Mantis. On top of that, we also utilize GPT-4V to annotate 5 new tasks to enable more diverse capabilities, i.e., Real-world Difference, Synthetic Difference, COCO Difference, Twitter Post, and Interleaved Multi-image Dialogue.

For video data, we utilize a 255K subset from LLaVA-Hound, including 240K video QA samples and 15K video captions. We also include NExT-QA and STAR in our training.

For 3D data, we utilize nuScenes QA, ALFRED, ScanQA, and 3D-LLM.

For single-image data, we use 40% of the original LLaVA-NeXT single-image data.

Data Statistics Figure

Detailed data statistics of M4-Instruct:
Task Dataset Scenario # Samples
Multi-image Scenarios
Spot the Difference(63.9K) Real-world Difference Realistic 7.5K
Synthetic Difference Synthetic 7.5K
COCO Difference Realistic 20K
Spot-the-Diff Surveillance 11K
Birds-to-Words Birds 14K
CLEVR-Change Solids 3.9K
Image Edit Instruction(67.7K) HQ-Edit Synthetic 50K
MagicBrush Realistic 14.2K
IEdit Realistic 3.5K
Visual Story Telling(66.9K) AESOP Cartoon 6.9K
FlintstonesSV Cartoon 22K
PororoSV Cartoon 12K
VIST Realistic 26K
Visual Cloze(14.6K) COMICS_Dialogue Cartoon 5.9K
RecipeQA_VisualCloze Realistic 8.7K
Text-rich VQA(21.4K) WebQA Webpage 9.3K
TQA Textbook 8.3K
OCR-VQA OCR 1.9K
DocVQA Document 1.9K
Multi-image VQA(174.5K) NLVR2 Realistic 83K
MIT-States_StateCoherence General 1.9K
MIT-States_PropertyCoherence General 1.9K
RecipeQA_ImageCoherence Recipe 8.7K
VISION Industrial 10K
Multi-VQA General 5K
IconQA General 64K
Low-level Comparison(66K) Coinstruct Low-level 50K
Dreamsim Low-level 16K
Image-caption Comprehension (73K) ImageCoDe General 17K
Contrast-Caption General 36K
Multi-image Caption General 10K
Daily Scenarios (11K) MMChat_Twitter_Post General 10K
Interleaved Multi-image Dialogue General 1K
Multi-image Puzzle (35K) Raven Abstract 35K
Multi-frame (Video) Scenarios
Video VQA(247K) NExT-QA General 4K
STAR General 3K
ShareGPTVideo-VQA General 240K
Video Detailed Captioning (15K) ShareGPTVideo-Caption General 15K
Multi-view (3D) Scenarios
Scene VQA(51K) Nuscenes Outdoor 10K
ScanQA Indoor Realistic 25K
3D-LLM-Scene Indoor Realistic 16K
Embodied VQA(48.5K) ALFRED Indoor Synthetic 23K
3D-LLM-Dialogue Indoor Realistic 8.5K
3D-LLM-Planning Indoor Realistic 17K
Single-image Scenarios
Single-image Tasks(307K) Randomly sampled 40% of LLaVA-NeXT SFT data General 307K

Task Examples. We provide examples to illustrate the multi-image/frame/view tasks of the M4-Instruct dataset.

Multi-Image Scenarios

Image 1
Image 2
Image 3
Image 4

Social Media Post

Given an idea and a collection of photos, the assistant produces captivating social media content, such as a Twitter post, featuring well-crafted copy.

User Help me write a Twitter post considering the following images.

Assistant Wandering through the enchanting streets of Milan, where the vibrant pulse of the city is as captivating as its historical architecture. From the mesmerizing ... the gentle glow of candles offers a moment of serenity in this bustling metropolis. #MilanDiaries #TravelMoments #CityLights

Multi-Image Scenarios

Image 1
Image 2

Image Edit Instruction Generation

For a pair of images, the model generates an accurate image editing instruction. Such a capability can potentially benefit research on text-to-image synthesis and diffusion models.

User Please provide the image edit instruction that can transform the source image to the target image.

Assistant In Image 1, replace the black and white soccer ball in the puppy's mouth with a sunflower similar to the one in Image 2.

Multi-Image Scenarios

Image 1
Image 2

Spotting the Difference

When presented with two or more visually similar images, this task requires identifying and detailing their differences with fine-grained visual analysis.

User Given two similar images, please provide a list of their differences.

Assistant The differences between the two images are:\n1. In the first image, the dogs are biting a red frisbee, while in the second image, they are biting a stuffed toy.\n2. In the second image, the dog on the left is wearing a blue jacket with a decorative pattern around the collar.

Multi-Image Scenarios

Image 1
Image 2

Low-level Comparison

Provided with images subjected to low-level transformations, the model conducts quality comparison, assessment, and question answering concerning distortions, deblurring, and denoising.

User Compared to the first image, how is the sharpness of the second image?\n (A) Sharper\n (B) Blurrier\n (C) About the same

Assistant A

Multi-Frame Scenarios


Video Detailed Description

User Please provide a detailed description of the video.

Assistant The video opens with a close-up of a person wearing a white space suit with red detailing, including a helmet with the number "33" on it. The individual is standing in a vast, snow-covered landscape under a clear blue sky. The scene transitions to show the person walking away from the camera towards a large, metallic UFO parked on the snow. The person then enters the UFO through a hatch, and the interior is briefly visible, showing control panels and screens. The final scenes depict the person adjusting controls within the UFO, with a focus on the gloved hands manipulating the controls.

Multi-Frame Scenarios


Video Detailed Description

User Please provide a detailed description of the video.

Assistant The video shows a bustling street scene in Tokyo during winter. People are walking along a busy market street covered with snow. The street is lined with various stalls and shops under red and white striped awnings, and large, snow-covered trees add a beautiful, wintery charm to the scene. Everyone is dressed warmly, and most individuals are wearing masks, suggesting the scene occurs during a period of health precautions, possibly during the COVID-19 pandemic. This setting likely depicts a popular area for shopping or commuting, emphasized by the vibrant activity and dense foot traffic.

Multi-View Scenarios

Image 1
Image 2
Image 3

Embodied VQA

This task requires the model to interact with a 3D environment, navigating and manipulating its surroundings to solve problems or answer questions, a capability essential for robotics and augmented reality applications.

User Your Main Goal: Place a glass with a fork in it on the table in front of a glass pitcher. Step Details: <image>Step#1: Move forward, turn right at the refrigerator, turn left at the stove and face the sink. <image>Step#2: Pick up the fork from in the sink. <image> Current Step:

Assistant Turn to the left, move forward, turn to the right and face the refrigerator.

Multi-View Scenarios

Image 1

Embodied VQA

User What is on the left of the TV?

Assistant A bicycle on the floor.

User How many trash cans can be seen on your right when you enter?

Assistant A bicycle on the floor.



Examples of the M4-Instruct dataset.

LLaVA-Interleave Bench

We categorize the multi-image reasoning tasks into two classes:

(1) In-domain Evaluation includes tasks that have been "seen" during training, designed to verify the model performance within familiar scenarios. We adopt 7 newly curated multi-image tasks corresponding to training datasets, and 2 existing benchmarks, Q-Bench and NLVR2, with 13.1K samples in total.

(2) Out-domain Evaluation involves tasks that do not overlap with the training scenarios, aiming to reveal the generalization capacity of LMMs. We construct 2 new tasks for multi-image mathematical and scientific comprehension, and utilize 3 existing benchmarks, Mantis-Eval, BLINK, and MMMU, with 4.2K samples in total.

Data Statistics Figure
Detailed multi-image benchmark statistics:
Task Dataset Scenario # Samples
In-domain Evaluation - Newly Curated Benchmarks
Spot the Difference(0.3K) Spot-the-Diff Surveillance 0.1K
Birds-to-Words Birds 0.1K
CLEVR-Change Solids 0.1K
Image Edit Instruction(2K) HQ-Edit Synthetic 1K
MagicBrush Realistic 0.9K
IEdit Realistic 0.1K
Visual Story Telling(0.4K) AESOP Cartoon 0.1K
FlintstonesSV Cartoon 0.1K
PororoSV Cartoon 0.1K
VIST Realistic 0.1K
Visual Cloze(0.1K) COMICS_Dialogue Cartoon 0.1K
Text-rich VQA(0.4K) WebQA Webpage 0.1K
TQA Textbook 0.1K
OCR-VQA OCR 0.1K
DocVQA Document 0.1K
Multi-image VQA(0.4K) MIT-States_StateCoherence General 0.1K
MIT-States_PropertyCoherence General 0.1K
RecipeQA_ImageCoherence Recipe 0.1K
VISION Industrial 0.1K
Puzzle (1.4K) Raven Abstract 1.4K
In-domain Evaluation - Existing Benchmarks
NLVR2 (7K) NLVR2 Realistic 7K
Q-Bench (1K) Q-Bench Low-level 1K
Out-domain Evaluation - Newly Curated Benchmarks
MathVerse-mv (0.8K) MathVerse Math Diagram 0.8K
SciVerse-mv (0.5K) SciVerse Scientific Diagram 0.5K
Out-domain Evaluation - Existing Benchmarks
Mantis-Eval (0.2K) Mantis-Eval General 0.2K
BLINK (1.9K) BLINK General 1.9K
MMMU-mv (test) (0.8K) MMMU Scientific Diagram 0.8K



Section 2 - Evaluation Results

We conduct extensive evaluations against existing open-source models to demonstrate the multimodal capabilities of LLaVA-NeXT-Interleave, covering a variety of existing and newly curated in/out-domain benchmarks. In the tables below, we report results across the 3 primary scenarios, where our model consistently achieves leading performance in multi-image/frame/view reasoning. For our final solution, LLaVA-NeXT-Interleave (0.5B/7B/14B) adopts Qwen 1.5 (0.5B, 7B, and 14B) as the base LLM, SigLIP-400M at 384x384 resolution as the vision encoder, and a two-layer MLP as the projection layer.
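
As a reference for how these components fit together, here is a minimal sketch of the architecture wiring. It is an illustration under assumed dimensions (1152-dim SigLIP features, a 4096-dim LLM hidden size) with stand-in modules, not the released implementation.

```python
import torch
import torch.nn as nn

class InterleaveLMMSketch(nn.Module):
    """Toy wiring of the described components: a SigLIP-style vision encoder,
    a two-layer MLP projector, and an LLM backbone. The dimensions and the
    stand-in modules below are assumptions for illustration."""

    def __init__(self, vision_encoder, language_model, vision_dim=1152, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. SigLIP-400M at 384x384
        self.projector = nn.Sequential(           # two-layer MLP projection
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.language_model = language_model      # e.g. a Qwen 1.5 backbone

    def forward(self, images, text_embeds):
        # Encode every image/frame/view/patch, project its tokens into the
        # LLM embedding space, then let the LLM attend over the joint sequence.
        vis_tokens = self.projector(self.vision_encoder(images))
        joint = torch.cat([vis_tokens, text_embeds], dim=1)
        return self.language_model(joint)

# Stand-in callables so the sketch runs end to end; real models replace these.
def fake_encoder(imgs):            # pretend SigLIP: 27x27 = 729 tokens per image
    return torch.randn(imgs.shape[0], 729, 1152)

def fake_llm(embeds):              # placeholder "LLM" head
    return embeds.mean()

model = InterleaveLMMSketch(fake_encoder, fake_llm)
out = model(torch.randn(1, 3, 384, 384), torch.randn(1, 32, 4096))
```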

Multi-image Evaluation

As reported in the table, the average multi-image performance of LLaVA-NeXT-Interleave surpasses previous open-source models in both in- and out-domain benchmarks. For in-domain evaluation, our model demonstrates significant advantages across various tasks as expected, due to the multi-image instruction tuning with M4-Instruct. For out-domain evaluation, LLaVA-NeXT-Interleave also showcases superior generalization capacity within novel scenarios, e.g., comparable to GPT-4V on Mantis-Eval and BLINK.

In-domain columns: Avg, the 7 newly curated benchmarks, and the 2 existing benchmarks (Q-Bench, NLVR2). Out-domain columns: Avg, the 2 newly curated benchmarks (MathVerse-mv, SciVerse-mv), and the 3 existing benchmarks (Mantis-Eval, BLINK, MMMU-mv).

| Model | In-domain Avg | Spot the Difference | Image Edit Instruction | Visual Story Telling | Visual Cloze | Text-rich VQA | Multi-image VQA | Multi-image Puzzle | Q-Bench | NLVR2 | Out-domain Avg | MathVerse-mv | SciVerse-mv | Mantis-Eval | BLINK | MMMU-mv (test) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4V | 39.2 | 12.5 | 11.0 | 10.9 | 29.5 | 54.5 | 52.0 | 17.1 | 76.5 | 88.8 | 57.78 | 60.3 | 66.9 | 62.7 | 51.1 | 47.9 |
| Open-source LMMs | | | | | | | | | | | | | | | | |
| LLaVA-NeXT-Image (7B) | 32.4 | 12.9 | 13.2 | 10.1 | 28.0 | 59.6 | 39.4 | 9.0 | 51.0 | 68.0 | 29.4 | 13.5 | 12.2 | 46.1 | 41.8 | 33.5 |
| VPG-C (7B) | 35.8 | 27.8 | 15.2 | 21.5 | 38.6 | 38.9 | 46.8 | 2.4 | 57.6 | 73.2 | 34.5 | 24.3 | 23.1 | 52.4 | 43.1 | 29.4 |
| Mantis (7B) | 39.6 | 17.6 | 11.2 | 12.5 | 34.0 | 45.2 | 52.5 | 25.7 | 69.9 | 87.4 | 39.3 | 27.2 | 29.3 | 59.5 | 46.4 | 34.1 |
| Our Models: LLaVA-NeXT-Interleave | | | | | | | | | | | | | | | | |
| 0.5B | 43.9 | 34.3 | 21.6 | 29.7 | 36.0 | 63.9 | 54.8 | 35.4 | 52.0 | 67.8 | 33.1 | 24.7 | 27.6 | 45.6 | 39.2 | 28.6 |
| 7B | 58.6 | 37.1 | 24.3 | 33.1 | 58.0 | 76.1 | 87.5 | 48.7 | 74.2 | 88.8 | 42.8 | 32.8 | 31.6 | 62.7 | 52.6 | 34.5 |
| 14B | 62.3 | 40.5 | 24.5 | 33.3 | 61.0 | 78.6 | 95.0 | 59.9 | 76.7 | 91.1 | 44.3 | 33.4 | 32.7 | 66.4 | 52.1 | 37.1 |

Multi-frame Evaluation

For video understanding, we also evaluate our models on NExT-QA, MVBench, Video Detailed Description (VDD), ActivityNet-QA, and VideoChat-GPT. For tasks evaluated by GPT, we follow VideoChat-GPT and use the same GPT version.

Compared with previous video-based LMMs of similar model size, LLaVA-NeXT-Interleave achieves superior results on many benchmarks, even though it is not specifically designed for video tasks. We also follow LLaVA-Hound to add DPO training after our M4-Instruct training (a sketch of the standard DPO objective follows the results table). After adding DPO, our 7B model achieves SoTA performance on the VDD and VideoChat-GPT benchmarks, surpassing the previous SoTA model LLaVA-NeXT-Video (34B). This demonstrates the effective temporal understanding and reasoning capabilities of our model across sequential frames.

| Model | NExT-QA (Acc) | MVBench | ActivityNet-QA (Acc/Score) | Video Detailed Description | VCG Correctness | VCG Detail | VCG Context | VCG Temporal | VCG Consistency | VCG Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Closed-source LMMs | | | | | | | | | | |
| GPT-4V | - | - | - | 4.00 | 4.09 | 3.88 | 4.37 | 3.94 | 4.02 | 4.06 |
| Open-source LMMs | | | | | | | | | | |
| VideoChatGPT (7B) | - | - | 35.2/2.70 | - | 2.40 | 2.52 | 2.62 | 1.98 | 2.37 | 2.38 |
| Video-LLaVA (7B) | - | - | 45.3/3.30 | - | 2.87 | 2.94 | 3.44 | 2.45 | 2.51 | 2.84 |
| VISTA-LLAMA (7B) | - | - | 48.3/3.30 | - | 2.44 | 2.31 | 2.64 | 3.18 | 2.26 | 2.57 |
| VideoChat2 (7B) | 68.6 | 51.9 | 49.1/3.30 | - | 3.02 | 2.88 | 3.51 | 2.66 | 2.81 | 2.98 |
| LLaMA-VID (7B) | - | 50.2 | 47.4/3.30 | 2.84 | 3.01 | 2.97 | 3.54 | 2.53 | 2.6 | 2.93 |
| LLaVA-NeXT-Video (7B) | - | - | 53.5/3.20 | 3.32 | 3.39 | 3.29 | 3.92 | 2.6 | 3.12 | 3.26 |
| LLaVA-NeXT-Video-DPO (7B) | - | - | 60.2/3.50 | 3.72 | 3.64 | 3.45 | 4.17 | 2.95 | 4.08 | 3.66 |
| LLaVA-NeXT-Video-DPO (34B) | - | - | 64.4/3.60 | 3.84 | 3.81 | 3.55 | 4.24 | 3.14 | 4.12 | 3.77 |
| Our Models: LLaVA-NeXT-Interleave | | | | | | | | | | |
| 0.5B | 59.5 | 45.6 | 48.0/2.84 | 3.25 | 3.12 | 2.97 | 3.62 | 2.36 | 3.27 | 3.07 |
| 7B | 78.2 | 53.1 | 55.3/3.13 | 3.57 | 3.51 | 3.28 | 3.89 | 2.77 | 3.68 | 3.43 |
| 14B | 79.1 | 54.9 | 56.2/3.19 | 3.59 | 3.65 | 3.37 | 3.98 | 2.74 | 3.67 | 3.48 |
| DPO (7B) | 77.9 | 52.3 | 55.0/3.13 | 3.90 | 3.99 | 3.61 | 4.24 | 3.19 | 4.12 | 3.83 |

VCG = VideoChat-GPT.
Note: when computing average scores, the Video Detailed Description and VideoChat-GPT scores (on a 1-5 scale) are multiplied by a weight of 10.
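
Since the DPO stage follows LLaVA-Hound, we do not restate its exact recipe here; as a reference, the standard DPO objective it builds on can be sketched as follows (the beta value and batch layout are assumptions for illustration).

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective over summed log-probs of chosen/rejected
    responses under the trained policy and a frozen reference model.
    beta=0.1 is a common default, not necessarily the value used here."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Dummy log-probabilities for a batch of two video preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-15.0, -11.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-14.0, -10.5]))
```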

Multi-view Evaluation (3D)

For 3D perception, we adopt 5 in-domain benchmarks to evaluate the 3D spatial understanding of our model: 3D-assisted Dialogue, Task Decomposition, ScanQA (val), ALFRED, and nuScenes VQA. The first two tasks are constructed by 3D-LLM.

Compared to 3D-LLM and Point-LLM, which take additional point clouds as input, LLaVA-NeXT-Interleave only accepts multi-view images to interpret the 3D world, attaining significantly higher scores in both indoor and outdoor scenarios.

All columns are in-domain evaluation.

| Model | Avg | 3D-assisted Dialogue | Task Decomposition | ScanQA (val) | ALFRED | nuScenes VQA |
| --- | --- | --- | --- | --- | --- | --- |
| Closed-source LMMs | | | | | | |
| Flamingo | 20.5 | 27.9 | 33.2 | 31.1 | 5.3 | 4.9 |
| GPT-4V | 34.6 | 31.2 | 35.4 | 32.6 | 10.3 | 63.7 |
| Open-source LMMs | | | | | | |
| ImageBind-LLM | 20.8 | 31.4 | 32.3 | 28.6 | 4.7 | 6.8 |
| Point-Bind & Point-LLM | 22.5 | 38.3 | 35.8 | 34.6 | 0.6 | 3.3 |
| 3D-LLM | 22.9 | 39.3 | 37.8 | 35.7 | 1.4 | 0.4 |
| Mantis (7B) | 18.7 | 2.6 | 14.7 | 16.1 | 14.0 | 46.2 |
| Our Models: LLaVA-NeXT-Interleave | | | | | | |
| 0.5B | 53.0 | 67.2 | 48.5 | 29.3 | 57.0 | 62.8 |
| 7B | 58.2 | 69.3 | 51.4 | 32.2 | 61.6 | 76.5 |
| 14B | 59.2 | 70.6 | 52.2 | 34.5 | 62.0 | 76.7 |


Single-image Evaluation (multi-patch)

| Model | LLM | AI2D | ChartQA | DocVQA | MME* | ScienceQA | POPE | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-NeXT | 0.5B | 51.65 | 50.16 | 59.13 | 52.83 | 59.96 | 85.36 | 59.80 |
| LLaVA-NeXT-Interleave | 0.5B | 52.20 | 52.20 | 59.20 | 52.00 | 60.60 | 86.78 | 60.5 |
| LLaVA-NeXT | 7B | 72.73 | 66.28 | 75.57 | 60.96 | 71.10 | 86.90 | 72.30 |
| LLaVA-NeXT-Interleave | 7B | 73.90 | 67.16 | 75.70 | 63.53 | 72.63 | 86.75 | 73.3 |
| LLaVA-NeXT | 14B | 77.49 | 72.12 | 79.96 | 67.68 | 78.88 | 87.32 | 77.2 |
| LLaVA-NeXT-Interleave | 14B | 76.52 | 71.24 | 78.91 | 66.21 | 77.44 | 87.90 | 76.40 |

We also add 307K samples of the original LLaVA-NeXT single-image data, which keeps our model capable of single-image tasks. We use AnyRes training for single-image data, which divides an image into multiple patches, forming another multi-image setting. By adding the 307K single-image samples (40% of the original LLaVA-NeXT data), we maintain the single-image performance of LLaVA-NeXT. As the single-image data is of high quality and contains diverse instructions, adding it also improves instruction-following ability and enables task transfer from single-image to multi-image, as demonstrated in our demo and the ablation study.
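
Below is a minimal sketch of the AnyRes-style patching idea, assuming a fixed 2x2 grid plus a resized global view; the grid configuration and 384px tile size are illustrative, not the exact LLaVA-NeXT settings.

```python
from PIL import Image

def anyres_patches(path, tile=384, grid=(2, 2)):
    """Split one image into grid tiles plus a low-resolution global view,
    turning a single image into a short 'multi-image' sequence. The 2x2
    grid and 384px tiles are assumptions for illustration."""
    img = Image.open(path).convert("RGB")
    cols, rows = grid
    img = img.resize((tile * cols, tile * rows))
    patches = [
        img.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
        for r in range(rows) for c in range(cols)
    ]
    global_view = img.resize((tile, tile))   # overview of the whole image
    return [global_view] + patches           # 1 + rows * cols image inputs

# Each returned crop is then encoded like any other image in the
# interleaved sequence.
```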



Section 3 - Emerging Capabilities

Task Transfer between Single-Image and Multi-Image

  • Multi-image coding: Our multi-image training data does not include coding-related problems. However, by jointly training on single-image data that contains coding problems, the model learns to write code based on multiple images. In this example, the model writes HTML code to simulate the movement of the square from the first image to the second image.
  • What is fun about the images: The model learns to reason over multiple images and point out what is funny about them. This type of task appears only in the single-image training data, not in the multi-image data.

Multi-image Coding

Image 1
Image 2

Conversation

User Please pay attention to the movement of the object from the first image to the second image, then write a HTML code to show this movement.

LLaVA-NeXT-Interleave Certainly! To create an animation that shows the object moving from one position to another, you can use HTML with CSS for styling and JavaScript for the animation. Below is a simple example of how you might code this:

Image 1
The rendered video

This code will create an animation where the red object moves horizontally from left to right. The animation is set to last 2 seconds with an ease-in-out timing function. The object's width and height are set to 100 pixels, and the background color is red. The animation is applied using the @keyframes rule, which defines the keyframes for the animation. The animation property on the object sets the animation to use the move keyframe defined earlier. In the JavaScript part, we maintain a variable currentPos that keeps track of the current position of the object. We use setInterval to animate the object's movement, updating its position by 1 pixel every 5 milliseconds. After 2 seconds, we clear the interval id to stop the animation. Remember to adjust the keyframes to match the exact movement you want to achieve.

Fun Memes

Image 1
Image 2
Image 3

Conversation

User what is fun about the images?

LLaVA-NeXT-Interleave The humor in these images comes from the anthropomorphized birds engaging in human-like activities and conversations. In the first image, one bird is asking about going to get some fries on the pier, which is a human activity. The second image shows a bird asking about the "culminating result of consciousness," which is a philosophical question rather than a literal one. The third image has a bird expressing its desire to do something with the fries on the pier, again referencing a human activity. The anthropomorphism of the birds adds a layer of humor by attributing human thoughts and desires to animals.


Task Transfer between Image and Video

  • Multi-video difference: All the video training data involves single-video QA or captioning, while tell-the-difference data appears only in the multi-image data, where the differences are between two single images. Through joint training, the model generalizes to telling the difference between two videos. In this example, given two similar videos in different styles, the model describes the general difference and lists the detailed differences.
  • Twitter post for video: The model is trained to write a Twitter post only on multi-image data. Through joint training, it can also write a Twitter post for a video.

Multi-video difference

Conversation

User The input contains two videos, the first half is the first video and the second half is the second video. What is the difference between the two videos?

LLaVA-NeXT-Interleave The first video shows a lion with a fiery mane, while the second video shows a lion with a bright yellow mane.

User List the detailed difference

LLaVA-NeXT-Interleave The differences between the two videos are: 1. The mane of the lion in the first video is a fiery orange-red color, while in the second video, it is a vibrant yellow. 2. The background in the first video is dark and indistinct, whereas in the second video, the background is green, suggesting a natural setting. 3. The lion's expression changes from a neutral look to a more aggressive one with its mouth open in the second video.

Twitter post for video


Conversation

User Help me write a twitter post to describe this video.

LLaVA-NeXT-Interleave "Feeling the chill in the air, but the cherry blossoms are a sight to behold! 🌸❄️ Walking down the street, each person bundled up against the winter chill. The snow-covered trees and market stalls add to the serene beauty of this winter wonderland. #CherryBlossoms #WinterWalk #CityScenes".


Real-World Applications

  • PPT summarization and QA: The model can summarize and retrieve information from multiple images.
  • Painting styles/Category difference: The model generalizes to recognizing artists' painting styles and distinguishing different categories.
  • Image editing: The model can be used to create image editing prompts for image generation.
  • Multi-doc QA: The model can summarize the information of multiple documents and provide comparisons for different documents.

Multi-doc VQA

Image 1
Image 2

Conversation

User Hi, there are two candidates, can you provide a brief description for each of them for me?

LLaVA-NeXT-Interleave Certainly! Here's a brief description of each candidate: 1. Oprah Winfrey: - Professional Experience: Known for her talk show, "The Oprah Winfrey Show," and her work on "COVID-19" as a presenter on Apple TV. - Skills: Emphasized on emotional intelligence, interpersonal skills, and creativity. - Honors & Awards: Notable achievements include the Golden Globe Award for Best Television Series and the Golden Globe Award for Best Television Special. 2. Steve Jobs: - Professional Experience: Co-founder of Apple Inc., where he led the development of iconic products like the iMac, iPod, and iPhone. - Skills: Highlighted as a thinker with a focus on innovation and problem-solving. - Honors & Awards: Recognized with numerous accolades, including the MacArthur Fellowship and the NeXT Computer Science Award. Please note that these descriptions are based on the information provided in the images and do not include any personal details or specific accomplishments mentioned in the images.

User Which of them better suits the role of the Congressman of the US? Why?

LLaVA-NeXT-Interleave The first resume, featuring Oprah Winfrey, would better suit the role of a Congressman because it highlights skills and experiences that are relevant to political office, such as leadership, public speaking, and media experience. These qualities are essential for a politician to effectively represent their constituents and engage in legislative processes. The second resume, while it could be suitable for a creative or entrepreneurial role, does not specifically mention skills or experiences that align with the responsibilities of a Congressman.

PPT Summarization and QA

Image 1
Image 2
Image 3
Image 4
Image 5

Conversation

User The images are the PPT of iPhone 15 review. can you summarize the main information?

LLaVA-NeXT-Interleave The PPT features a sleek design with a black background and includes text and images related to the iPhone 15. It highlights the phone's innovative features such as a custom dual glass back cover, a dynamic island display for delivery tracking, and a USB-C compatible fast charging cable. The color options are showcased in a gradient of pink, green, gray, and onyx. The presentation also emphasizes the phone's powerful processors, including the A16 Bionic chip and A17 Pro chip.

User What is the price of iPhone 15 plus?

LLaVA-NeXT-Interleave The price of iPhone 15 plus is Rs. 89,000 as mentioned in the PPT slide.

User Why the dynamic island useful?

LLaVA-NeXT-Interleave The dynamic island is useful for delivery tracking, as it allows the user to know when their order will arrive and if there are any delays. It also provides a new way to interact with the phone's interface, making it easier to navigate and access various features.

Painting Styles

Image 1
Image 2
Image 3
Image 4

Conversation

User Do you know who drew these paintings?

LLaVA-NeXT-Interleave Yes, these paintings are by the renowned artist Vincent van Gogh. The first painting is his famous work "The Starry Night," which depicts a swirling night sky with a bright crescent moon and stars. The second painting is his self-portrait, which shows him smoking a pipe and wearing a hat. The third painting is also a self-portrait, but it features a different expression and hairstyle. The fourth painting is by another artist, not Van Gogh, and depicts a person seated with flowers in the background.

Image Editing

Image 1
Image 2

Conversation

User Please generate a brief sentence to edit image1 to make it look like image2.

LLaVA-NeXT-Interleave To edit image1 to make it look like image2, you would need to add a patterned cushion or seat cover to the bench.

Multi-page Doc VQA (information retrieval)

Image 1
Image 2

Conversation

User What is his email?

LLaVA-NeXT-Interleave The email address provided in the image is "herman.walton@gmail.com".

User What language can he speak?

LLaVA-NeXT-Interleave The language skills section of the resume mentions that the individual is fluent in English and Dutch.

Twitter blog

Image 1
Image 2
Image 3

Conversation

User Please write a twitter blog post with the images.

LLaVA-NeXT-Interleave 🚀✨ Launch! 🚀✨ The sky is alight with the brilliance of a rocket's ascent. The rocket, a magnificent arc against the night sky, leaves behind a trail of light that cuts through the darkness. It's a moment frozen in time, where human ingenuity and nature's grandeur converge. The rocket's trajectory is a testament to our progress, pushing the boundaries of what's possible. This is not just a launch; it's a celebration of human aspiration and the relentless pursuit of knowledge. #SpaceExploration #RocketLaunch


Section 4 - Training Techniques

We illustrate two key techniques for training LLaVA-NeXT-Interleave on the M4-Instruct dataset and provide ablation studies for analysis.

Continue training from single-image model

To better leverage the pre-trained vision-language alignment, we adopt an off-the-shelf LLaVA-NeXT-Image as the base model, which has gone through stage-1 558K image-caption pre-training and stage-2 760K single-image fine-tuning. On top of this checkpoint, we perform the multi-image instruction tuning with the M4-Instruct dataset.

As shown by the ablation below, the 'continue training from stage 2' scheme performs better than training from the stage-1 pre-training alone. In this way, we inherit the instruction-following capabilities learned on single-image tasks and better extend the scope to multi-image, video, and 3D scenarios. In addition, single-image performance cannot be maintained when training directly from stage 1.

Multi-image benchmarks: Mantis-Eval through ScanQA. Single-image benchmarks: AI2D through ScienceQA-IMG. Video benchmarks: ActivityNet-QA, MVBench, VDD (Video Detailed Description), and the VideoChat-GPT (VCG) sub-scores.

| Continue training | Mantis-Eval | BLINK | Q-Bench | NLVR2 | ScanQA | AI2D | ChartQA | DocVQA | MME* | POPE | ScienceQA-IMG | ActivityNet-QA (Acc/Score) | MVBench | VDD | VCG Correctness | VCG Detail | VCG Context | VCG Temporal | VCG Consistency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| stage-1 | 41.0 | 37.6 | 47 | 54.0 | 27.7 | 46.3 | 38.3 | 47.5 | 47.1 | 85.4 | 59.4 | 44.7/2.17 | 43.0 | 2.96 | 2.97 | 2.87 | 3.49 | 2.42 | 3.14 |
| stage-2 | 45.6 | 39.2 | 52 | 67.8 | 29.3 | 52.2 | 52.2 | 59.2 | 52.0 | 86.8 | 60.6 | 48.0/2.84 | 45.6 | 3.25 | 3.12 | 2.97 | 3.62 | 2.36 | 3.27 |

Mix training for in-the-front and interleaved formats

For interleaved multi-image input, there are two format choices for the positions of image tokens during training. The first places all image tokens in front of the text and refers to each image in the sentence with a special token, aligning with the format of the single-image model. The second preserves the interleaved instruction, keeping image tokens in their original positions, which extends the model to the real interleaved data format. A sketch of the two placements follows.
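
The toy example below illustrates the two placements on the same two-image question; the "<image>" placeholder and the "[Image i]" reference tags are assumptions for illustration, not the exact special tokens used in training.

```python
# The same two-image question under the two token placements.
interleaved = ("Compare the two photos. <image> was taken in summer and "
               "<image> was taken in winter. What changed?")

def to_in_the_front(prompt, n_images):
    """Move all <image> tokens to the front and refer back to each image
    with a textual tag; the tag format is an illustrative assumption."""
    body = prompt
    for i in range(1, n_images + 1):
        body = body.replace("<image>", f"[Image {i}]", 1)
    return "<image>" * n_images + " " + body

in_the_front = to_in_the_front(interleaved, n_images=2)
# Mix training samples between the two layouts, so either format can be
# used at inference time.
```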

In the ablation below, we train on the multi-image data with different formats. The results indicate that mixing the two strategies during training leads to higher performance under both inference schemes, which gives users more flexibility in the multi-image input format.

In-domain evaluation:

| Training Setting | Inference Setting | Avg | Spot the Difference | Visual Story Telling | Visual Cloze | Text-rich VQA | Q-Bench |
| --- | --- | --- | --- | --- | --- | --- | --- |
| In-the-front format | Interleaved format | 52.88 | 36.8 | 30.5 | 53 | 70.1 | 74.0 |
| In-the-front format | In-the-front format | 54.27 | 36.6 | 32.8 | 52 | 74.7 | 75.3 |
| Interleaved format | Interleaved format | 55.38 | 37.8 | 32.9 | 54 | 76.2 | 76.0 |
| Interleaved format | In-the-front format | 52.35 | 36.1 | 29.0 | 52 | 72.9 | 71.8 |
| Mix format | Interleaved format | 56.96 | 38.3 | 32.5 | 59 | 78.1 | 76.9 |
| Mix format | In-the-front format | 56.62 | 37.9 | 32.5 | 58 | 78.4 | 76.3 |


Training strategy comparison on video (Pooling vs not pooling)

In this ablation, we study the impact of image token pooling. We train and evaluate our model under two settings, pooling to 1/4 and not pooling, using the ShareGPTVideo-Caption+QA (255K) data. The pooling-to-1/4 setting is similar to LLaVA-NeXT-Video, which uses pooling to trade off the number of image tokens against the number of frames. In our experiments, we find that not pooling yields better performance under a similar number of image tokens (a sketch of the pooling operation follows the table).

During training, we sample 10 frames per video. The table also shows that adding more frames (from 10 to 16) during inference improves performance.

| Train Setting | Inference Setting | Inference #Frames | # Image Tokens | ActivityNet-QA (Acc/Score) | Avg | VDD | VCG Correctness | VCG Detail | VCG Context | VCG Temporal | VCG Consistency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Pooling 1/4 | Pooling 1/4 | 40 | 40x729x1/4 = 10x729 | 52.75/3.53 | 3.35 | 3.38 | 3.46 | 3.25 | 3.87 | 2.59 | 3.57 |
| Pooling 1/4 | Pooling 1/4 | 64 | 64x729x1/4 = 16x729 | 52.7/3.53 | 3.33 | 3.38 | 3.45 | 3.23 | 3.86 | 2.49 | 3.55 |
| Not Pooling | Not Pooling | 10 | 10x729 | 52.9/3.48 | 3.38 | 3.46 | 3.43 | 3.26 | 3.85 | 2.64 | 3.61 |
| Not Pooling | Not Pooling | 16 | 16x729 | 54.4/3.51 | 3.41 | 3.46 | 3.48 | 3.28 | 3.87 | 2.74 | 3.62 |
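
Below is a minimal sketch of the frame sampling and 1/4 token pooling described above, assuming each frame yields a 27x27 grid of 729 SigLIP tokens and that 2x2 average pooling is applied on the spatial grid; the exact pooled resolution in LLaVA-NeXT-Video may differ.

```python
import torch
import torch.nn.functional as F

def sample_frame_indices(num_video_frames, num_samples=10):
    """Uniformly sample frame indices (10 frames per video during training)."""
    step = num_video_frames / num_samples
    return [int(i * step) for i in range(num_samples)]

def pool_frame_tokens(frame_tokens, grid=27):
    """Reduce per-frame visual tokens roughly 4x via 2x2 average pooling on
    the spatial grid (729 = 27x27 tokens -> 13x13 = 169 tokens here); the
    exact pooled size is an assumption."""
    b, n, d = frame_tokens.shape                      # (batch, 729, dim)
    x = frame_tokens.transpose(1, 2).reshape(b, d, grid, grid)
    x = F.avg_pool2d(x, kernel_size=2, stride=2)      # (b, dim, 13, 13)
    return x.flatten(2).transpose(1, 2)               # (b, 169, dim)

tokens = torch.randn(1, 729, 1152)                    # one frame of visual tokens
print(sample_frame_indices(300))                      # e.g. a 300-frame clip
print(pool_frame_tokens(tokens).shape)                # torch.Size([1, 169, 1152])
```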

Effects of adding more tasks and data

To study the impact of adding more data, we conduct experiments under different data settings and evaluate on video benchmarks. As we gradually add more data, performance consistently improves. VDD denotes Video Detailed Description.

| Data | NExT-QA | Avg | VDD | VCG Correctness | VCG Detail | VCG Context | VCG Temporal | VCG Consistency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| video | 42.60 | 3.40 | 3.46 | 3.47 | 3.27 | 3.87 | 2.74 | 3.61 |
| video + single-image | 67.70 | 3.40 | 3.49 | 3.46 | 3.30 | 3.85 | 2.71 | 3.60 |
| video + multi-image | 77.70 | 3.42 | 3.50 | 3.50 | 3.31 | 3.90 | 2.70 | 3.63 |
| video + both | 78.20 | 3.45 | 3.58 | 3.50 | 3.27 | 3.87 | 2.77 | 3.68 |

Team

  • Feng Li*†: Hong Kong University of Science and Technology (work collaborated with ByteDance)
  • Renrui Zhang*: The Chinese University of Hong Kong (work collaborated with ByteDance)
  • Hao Zhang*: Hong Kong University of Science and Technology (work collaborated with ByteDance)
  • Yuanhan Zhang: Nanyang Technological University (work collaborated with ByteDance)
  • Bo Li: Nanyang Technological University (work collaborated with ByteDance)
  • Wei Li: ByteDance
  • Zejun Ma: ByteDance
  • Chunyuan Li: ByteDance
  • *Core contributors, †Project lead


Citation



@misc{li2024llavanext-interleave,
	title={LLaVA-NeXT: Tackling Multi-image, Video, and 3D in Large Multimodal Models},
	url={https://llava-vl.github.io/blog/2024-06-16-llava-next-interleave/},
	author={Li, Feng and Zhang, Renrui and Zhang, Hao and Zhang, Yuanhan and Li, Bo and Li, Wei and Ma, Zejun and Li, Chunyuan},
	month={June},
	year={2024}
}

@misc{li2024llavanext-ablations,
	title={LLaVA-NeXT: What Else Influences Visual Instruction Tuning Beyond Data?},
	url={https://llava-vl.github.io/blog/2024-05-25-llava-next-ablations/},
	author={Li, Bo and Zhang, Hao and Zhang, Kaichen and Guo, Dong and Zhang, Yuanhan and Zhang, Renrui and Li, Feng and Liu, Ziwei and Li, Chunyuan},
	month={May},
	year={2024}
}

@misc{li2024llavanext-strong,
    title={LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild},
    url={https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/},
    author={Li, Bo and Zhang, Kaichen and Zhang, Hao and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Yuanhan and Liu, Ziwei and Li, Chunyuan},
    month={May},
    year={2024}
}

@misc{zhang2024llavanextvideo,
    title={LLaVA-NeXT: A Strong Zero-shot Video Understanding Model},
    url={https://llava-vl.github.io/blog/2024-04-30-llava-next-video/},
    author={Zhang, Yuanhan and Li, Bo and Liu, Haotian and Lee, Yong Jae and Gui, Liangke and Fu, Di and Feng, Jiashi and Liu, Ziwei and Li, Chunyuan},
    month={April},
    year={2024}
}
    
@misc{liu2024llavanext,
    title={LLaVA-NeXT: Improved reasoning, OCR, and world knowledge},
    url={https://llava-vl.github.io/blog/2024-01-30-llava-next/},
    author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae},
    month={January},
    year={2024}
}

@misc{liu2023improvedllava,
      title={Improved Baselines with Visual Instruction Tuning}, 
      author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae},
      publisher={arXiv:2310.03744},
      year={2023},
}

@misc{liu2023llava,
      title={Visual Instruction Tuning}, 
      author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
      publisher={NeurIPS},
      year={2023},
}