On January 30, 2024, we released LLaVA-NeXT, an open-source Large Multimodal Model (LMM) trained exclusively on text-image data. With the proposed AnyRes technique, it boosts capabilities in reasoning, OCR, and world knowledge, demonstrating remarkable performance across a spectrum of image-based multimodal understanding tasks and even exceeding Gemini Pro on several image benchmarks, e.g., MMMU and MathVista.

In today’s exploration, we delve into the performance of LLaVA-NeXT on video understanding tasks. We find that LLaVA-NeXT is surprisingly strong at understanding video content. The current version of LLaVA-NeXT for videos has several improvements:

  1. Zero-shot video representation capabilities with AnyRes: The AnyRes technique naturally represents a high-resolution image as multiple sub-images that a pre-trained ViT can digest, and forms them into a concatenated sequence. This technique generalizes naturally to representing videos (consisting of multiple frames), allowing the image-only-trained LLaVA-NeXT model to perform surprisingly well on video tasks. Notably, this is the first time that LMMs show strong zero-shot modality transfer ability.
  2. Inference with length generalization improves on longer videos. The linear scaling technique enables length generalization, allowing LLaVA-NeXT to effectively handle long videos beyond the "max_token_length" limitation of the LLM.
  3. Strong video understanding ability. (1) LLaVA-NeXT-Image, which combines the above two techniques, yields superior zero-shot performance compared to open-source LMMs tuned on videos. (2) LLaVA-NeXT-Video, obtained by further supervised fine-tuning (SFT) of LLaVA-NeXT-Image on video data, achieves better video understanding than LLaVA-NeXT-Image. (3) LLaVA-NeXT-Video-DPO, which aligns the model's responses with AI feedback using direct preference optimization (DPO), shows a significant further performance boost.
  4. Efficient deployment and inference with SGLang. SGLang enables 5x faster inference on video tasks, allowing more scalable serving such as million-scale video re-captioning. See instructions in our repo.

Open-Source Release

  • Code (Training code is coming soon)
  • Model
  • Demo (coming soon)

Results

| Pre-training Data | Post-training Data | Max Seq. Len. (Training) | Max Seq. Len. (Inference) | Model | Throughput (seconds/video) | NextQA (WUPS@All) | ActivityNet-QA (Acc/Score) | Video Detailed Description (Score) | Correctness | Detail | Context | Temporal | Consistency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | | | | | | |
| N/A | N/A | - | - | GPT-4V (1106) | - | - | - | 4.00 | 4.09 | 3.88 | 4.37 | 3.94 | 4.02 |
| N/A | N/A | - | - | Flamingo | - | 26.7 | 45.3 | - | - | - | - | - | - |
| N/A | N/A | - | - | Gemini Pro | - | 28.0 | 49.8 | - | - | - | - | - | - |
| N/A | N/A | - | - | Gemini Ultra | - | 29.9 | 52.2 | - | - | - | - | - | - |
| **Open-Source SoTA** | | | | | | | | | | | | | |
| 558K | 765K | 4096 | 4096 | VideoChatGPT (7B) | - | - | 35.2/2.7 | - | 2.40 | 2.52 | 2.62 | 1.98 | 2.37 |
| 1260K | 765K | 4096 | 4096 | Video-LLaVA (7B) | - | - | 45.3/3.3 | - | 2.87 | 2.94 | 3.44 | 2.45 | 2.51 |
| 558K | 765K | 4096 | 4096 | VISTA-LLAMA (7B) | - | - | 48.3/3.3 | - | 2.44 | 2.31 | 2.64 | 3.18 | 2.26 |
| 35M | 1.9M | 4096 | 4096 | VideoChat2 (7B) | - | - | 49.1/3.3 | - | 3.02 | 2.88 | 3.51 | 2.66 | 2.81 |
| 790K | 765K | 4096 | 4096 | LLaMA-VID (7B) | 20 | 21.03 | 47.4/3.3 | 2.84 | 3.01 | 2.97 | 3.54 | 2.53 | 2.60 |
| **LLaVA-NeXT** | | | | | | | | | | | | | |
| 558K | 760K | 4096 | 4096 | LLaVA-NeXT-Image (7B)* | 4 | 26.00 | 41.6/2.8 | 2.64 | 2.29 | 2.38 | 2.83 | 2.41 | 2.47 |
| 558K | 760K | 4096 | 4096 | LLaVA-NeXT-Image (7B) | 16 | 26.88 | 53.8/3.2 | 2.76 | 2.94 | 2.91 | 3.43 | 2.23 | 3.00 |
| 558K | 760K | 4096 | 8192 | LLaVA-NeXT-Image (7B) | 33 | 27.33 | 53.5/3.2 | 3.12 | 3.05 | 3.12 | 3.68 | 2.37 | 3.16 |
| 558K | 860K | 4096 | 8192 | LLaVA-NeXT-Video (7B) | 37 | 26.90 | 53.5/3.2 | 3.32 | 3.39 | 3.29 | 3.92 | 2.60 | 3.12 |
| 558K | 860K+17K | 4096 | 8192 | LLaVA-NeXT-Video-DPO (7B) | 37 | 26.92 | 60.2/3.5 | 3.72 | 3.64 | 3.45 | 4.17 | 2.95 | 4.08 |
| 558K | 760K | 4096 | 4096 | LLaVA-NeXT-Image (34B) | 28 | 28.20 | 55.4/3.3 | 3.00 | 3.21 | 3.06 | 3.49 | 2.53 | 3.31 |
| 558K | 760K | 4096 | 8192 | LLaVA-NeXT-Image (34B) | 45 | 28.33 | 55.6/3.3 | 3.20 | 3.29 | 3.23 | 3.83 | 2.51 | 3.47 |
| 558K | 860K | 8192 | 8192 | LLaVA-NeXT-Video (34B) | 45 | 27.34 | 58.8/3.4 | 3.48 | 3.48 | 3.37 | 3.95 | 2.64 | 3.28 |
| 558K | 860K+17K | 8192 | 8192 | LLaVA-NeXT-Video-DPO (34B) | 45 | 27.30 | 64.42/3.6 | 3.84 | 3.81 | 3.55 | 4.24 | 3.14 | 4.12 |

The five rightmost columns (Correctness, Detail, Context, Temporal, Consistency) are VideoChat-GPT scores.
*The video input is represented as only one frame.
 

✨ Highlights:

  • SoTA Performance! Without seeing any video data, LLaVA-NeXT demonstrates strong zero-shot modality transfer ability, outperforming all existing open-source LMMs (e.g., LLaMA-VID) that have been specifically trained for videos. Compared with proprietary models, it achieves performance comparable to Gemini Pro on NextQA and ActivityNet-QA.
  • Strong length generalization ability. Despite being trained with a maximum sequence length of 4096 tokens, LLaVA-NeXT demonstrates a remarkable ability to generalize to longer sequences. This ensures robust performance even when processing videos whose frame tokens exceed the original token-length limitation.
  • DPO pushes performance. DPO with AI feedback on videos yields significant performance gains.

Qualitative Results

Video Detailed Caption Tasks

User: Please provide a detailed description of the video, focusing on the main subjects, their actions, and the background scenes.

Prompt for T2V: A realistic image of clouds forming the word "SORA"

LLaVA-NeXT-Video

The video you've provided appears to be a digital artwork that features a large, fluffy white cloud in the sky with the word "SORA" written across it. The cloud is the main subject of the video, and it is positioned against a clear sky with a few wispy clouds in the background. The cloud itself is dense and appears to be a cumulus cloud, which is characterized by its fluffy, rounded shape and white color, suggesting that it could be a cloud that is forming or about to rain. The background is a gradient of blue, indicating that the sky is clear and possibly sunny, with a few wispy clouds scattered around, which adds depth to the scene. The overall atmosphere of the video is serene and somewhat dreamlike, with the cloud's formation and the clear sky creating a peaceful and somewhat surreal visual effect.


Video QA Tasks

User: Which part of the video goes against common sense?

Prompt for T2V: Step-printing scene of a person running, cinematic film shot in 35mm.

LLaVA-NeXT-Video

The video you've provided appears to show a person running on a treadmill, but the unusual aspect is that the person's legs are not aligned with the treadmill's belt. This defies common sense because when running on a treadmill, one would expect the person's legs to be in sync with the belt's motion, either moving forward or backward in a coordinated manner. In this video, the person's legs are positioned as if they are running in a different direction or at a different pace than the belt, which is not possible in reality and thus goes against common sense. This suggests that the video may have been manipulated or is a composite of two different scenes to create an illusion.


Technical Insights

(1) AnyRes: From multi-patch to multi-frame

The AnyRes algorithm, introduced in LLaVA-NeXT, achieves an optimal balance between performance and operational cost when dealing with images of arbitrary high resolution. It segments the image into a grid of sub-images with various configurations, such as {2x2, 1x{2,3,4}, {2,3,4}x1}.
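As a rough illustration of the idea (the helper below is a simplified stand-in, not the exact routine from the LLaVA-NeXT codebase), grid selection can be viewed as matching the image's aspect ratio against the allowed configurations, with each grid cell corresponding to one ViT-sized sub-image:

```python
# A minimal sketch of AnyRes-style grid selection; the function name and the
# candidate grid list are illustrative assumptions, not the released code.
def select_grid(width, height,
                grids=((2, 2), (1, 2), (1, 3), (1, 4), (2, 1), (3, 1), (4, 1))):
    """Pick the (cols, rows) grid whose aspect ratio best matches the image."""
    target = width / height
    return min(grids, key=lambda g: abs(g[0] / g[1] - target))

print(select_grid(1344, 672))  # a wide 2:1 image -> (2, 1)
print(select_grid(672, 672))   # a square image   -> (2, 2)
```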


Illustration of how AnyRes digests a set of images as a sequence of concatenated visual tokens, allowing unified image and video input and naturally supporting the evolution from multi-image to multi-frame.

With minor code adjustments, LLaVA-NeXT can process N video frames arranged in a {1xN} grid. Assuming each frame comprises 24x24 tokens, the total token count for a video would be 24x24xN. However, considering the "max_token_length" limit of 4096 for the LLM, it is crucial to ensure that 24x24xN + the number of text tokens < 4096 to avoid nonsensical outputs. This necessitates a careful balance between the tokens per frame and the total number of frames. For example, applying spatial pooling with a stride of 2 reduces the tokens per frame from 24x24 to 12x12, accommodating up to 16 frames. Our findings indicate that a configuration of 12x12 tokens across 16 frames yields optimal performance. However, 16 frames may not sufficiently capture the essence of most videos. Next, we focus on enabling LLaVA-NeXT to handle more frames.
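As a quick sanity check of this budget, the arithmetic can be written out explicitly (the 100-token text allowance below is an arbitrary placeholder):

```python
# Frame budget under the LLM's "max_token_length": each frame contributes a
# (24 // stride) x (24 // stride) grid of visual tokens, and the visual tokens
# plus the text tokens must stay under the limit.
def max_frames(max_token_length=4096, pooling_stride=2, text_tokens=100):
    tokens_per_frame = (24 // pooling_stride) ** 2  # stride 2 -> 12x12 = 144 tokens
    return (max_token_length - text_tokens) // tokens_per_frame

print(max_frames())                  # 27 frames fit in principle; 16 are used in practice
print(max_frames(pooling_stride=1))  # only 6 frames at the full 24x24 resolution
```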

(2) Length generalization: From multi-frame to long-video.

Inspired by recent advances in handling long sequences in LLMs, such as linear scaling in rotary position embeddings (RoPE), we apply a similar scaling approach in LLaVA-NeXT. By introducing a scaling factor of 2, for instance, we effectively double the "max_token_length" capacity of the model, enabling it to process sequences of up to 8192 tokens. With this technique, LLaVA-NeXT can process up to 56 frames at a per-frame token count of 12x12, significantly expanding the model's applicability to longer video sequences.
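The mechanism itself is small: positions are divided by the scaling factor before the rotary angles are computed, so positions beyond the training range fall back into the range the model has seen. Below is a minimal sketch of linearly scaled RoPE angles (not the exact LLaVA-NeXT implementation; in Hugging Face transformers the equivalent switch is typically the `rope_scaling` config field, e.g. `{"type": "linear", "factor": 2.0}`):

```python
import torch

# Rotary angles with linear position scaling: dividing positions by the factor
# lets a model trained on 4096 positions address 8192 positions at inference.
def rope_angles(seq_len, dim, base=10000.0, scaling_factor=2.0):
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    positions = torch.arange(seq_len).float() / scaling_factor  # linear scaling
    return torch.outer(positions, inv_freq)  # (seq_len, dim // 2) angles for cos/sin

angles = rope_angles(seq_len=8192, dim=128)
print(angles.shape)  # torch.Size([8192, 64])
```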


Illustration of length generalization to deal with a long sequence containing many more frames, where the sequence length at inference can exceed the sequence length used in training.

This adaptation of linear scaling not only facilitates the processing of a greater number of frames but also opens new avenues for LLaVA-NeXT's application in comprehensive video analysis, setting a precedent for future research in the field.

(3) Direct Preference Optimization from AI Feedback

Recent advancements in reinforcement learning from human feedback (RLHF) have been effective in guiding LLMs towards generating more honest, helpful, and harmless content, but their effectiveness in multimodal contexts remains limited. The critical obstacles are (1) developing a robust reward system capable of distinguishing preferred responses from less preferred ones, and (2) acquiring high-quality preference data from humans, which is both costly and labor-intensive, especially for video inputs with multiple frames. As a cost-efficient alternative, AI feedback using an LLM reward is considered in our recent study LLaVA-Hound: DPO is used to train the LMM on preference data generated by an LLM, where videos are represented by their detailed captions as supporting evidence. Following the same recipe, we apply DPO on top of the stronger SFT model LLaVA-NeXT-Video, leading to the final model LLaVA-NeXT-Video-DPO.
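For reference, the DPO objective itself is compact. The sketch below is a minimal, generic PyTorch version; the hyperparameters and the exact log-probability bookkeeping in LLaVA-Hound and our training code may differ:

```python
import torch
import torch.nn.functional as F

# Minimal DPO loss: inputs are summed log-probabilities of the chosen and
# rejected responses under the policy being trained and under the frozen
# reference (SFT) model; beta controls the strength of the implicit KL penalty.
def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-10.0, -12.0]), torch.tensor([-14.0, -13.0]),
                torch.tensor([-11.0, -12.5]), torch.tensor([-13.0, -12.8]))
print(loss)
```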

The results are presented in the main table at the beginning of the blog, showcasing the significant performance gains from DPO, which sometimes even surpass the gains from supervised fine-tuning with video data. This highlights the promising future of training methods that utilize AI or human feedback.

Examples with DPO

We illustrate two examples to demonstrate the superiority of DPO. Texts of interest are highlighted in blue, while parts that might contain hallucinations are marked in red. LLaVA-NeXT-Video-DPO reduces hallucinations and follows the instructions very well, responding clearly to different parts of the instructions in separate sections. This resonates with the objective of alignment research: to better follow human intent. Note that we utilize Gemini Pro, which supports video input, to assess the quality of the model's responses with and without DPO in the last column.

User: Please provide a detailed description of the video, focusing on the main subjects, their actions, the background scenes, and the temporal transitions.

LLaVA-NeXT-Video-DPO

The video depicts a dynamic and surreal scene where the action of surfing is juxtaposed with an indoor setting. In the foreground, there are two surfers riding waves, one slightly ahead of the other, maintaining their balance on surfboards. The surfer in the foreground is skillfully navigating the crest of a wave, while the one behind is positioned on the face of the wave, both displaying athletic postures typical of surfing. The waves are large and crashing, with the wave in the background creating a dramatic arc that suggests motion and power.

The background is an opulent interior space with high vaulted ceilings, ornate detailing, and classical architecture, including arches and what appears to be marble or stone work. The lighting is warm, with a golden hue that contrasts with the cool tones of the water. The architecture suggests a grand hall or a space designed for public gatherings or performances, possibly inspired by classical or neoclassical styles.

The temporal transition in the video is interesting as it captures a moment of action (the surfers riding the wave) within a stationary, non-moving environment (the interior space). This creates a surreal effect, blending the dynamic motion of the ocean with the static elegance of a building interior. The image is a blend of two disparate subjects—the exhilarating sport of surfing and the serene grandeur of architectural design—into a single, captivating composition.

LLaVA-NeXT-Video

The video depicts a large wave crashing inside a building, with a surfer riding the wave. The surfer is positioned in the center of the frame, skillfully navigating the wave's crest. The wave is massive, filling the majority of the space from the top to the bottom of the frame, with its white foam and water spray creating a dynamic and powerful visual effect. The background of the scene is a grand hall with high ceilings and ornate architecture, featuring arches and columns that add a sense of grandeur to the setting. The lighting is dim, with the focus on the surfer and the wave, casting shadows and highlighting the contours of the wave's movement. The camera captures the surfer from a low angle, emphasizing the height of the wave and the surfer's skill in maintaining balance. As the wave begins to break, the surfer is seen riding it towards the right side of the frame, creating a sense of motion and direction. The video transitions from a close-up of the surfer to a wider shot, showing the full scale of the wave and the grand hall. The sound of the crashing wave is prominent, adding to the immersive experience. The video concludes with a shot of the surfer disappearing from view as the wave subsides, leaving the grand hall empty and still.

Evaluation and Comments

Description A: LLaVA-NeXT-Video-DPO

Description B: LLaVA-NeXT-Video

Comparison:

Description A provides a more comprehensive and evocative description of the video, capturing the surreal juxtaposition of surfing with the opulent indoor setting. It highlights the dynamic motion of the surfers and the contrasting serenity of the architecture, creating a unique and captivating image.

Description B, on the other hand, focuses more on the surfer riding the wave within the grand hall. While it describes the visual elements and motion of the surfer, it lacks the depth and nuance of Description A in capturing the surreal and immersive nature of the scene.

Conclusion:

Description A is the better description for this video as it provides a more complete and engaging portrayal of the surreal scene, capturing the blend of dynamic action and serene grandeur, while also highlighting the unique juxtaposition of surfing with the opulent indoor setting.

 

More Empirical Explorations

(1) How to represent videos? Configurations: (# Tokens/Frame, # Frames)

As discussed in the "From Multi-Patch to Multi-Frame" section, navigating the "max_token_length" constraint of 4096 for the LLM necessitates a strategic balance between the number of tokens allocated per frame and the total frame count included in the analysis. Our exploration led to an optimal setup: configuring each frame to contain 12x12 tokens and sampling 16 frames per video.

Linear scaling further amplifies the model's capabilities, serving as a pivotal tool for exploring configurations that accommodate more image tokens and thus longer inference sequences. The table below summarizes our findings, showing that the configurations enabled by linear scaling significantly enhance performance.

| Model | Max Inference Tokens | Pooling Stride | # Tokens/Frame | # Frames | Video Detailed Description (Score) |
|---|---|---|---|---|---|
| LLaVA-NeXT-Image-7B | 4096 | 4 | 6x6 | 32 | 2.73 |
| LLaVA-NeXT-Image-7B | 4096 | 4 | 6x6 | 64 | 2.24 |
| LLaVA-NeXT-Image-7B | 4096 | 2 | 12x12 | 16 | 2.76 |
| LLaVA-NeXT-Image-7B | 4096 | 1 | 24x24 | 4 | 2.71 |
| LLaVA-NeXT-Image-7B | 8192 (linear scale factor = 2) | 2 | 12x12 | 32 | 3.12 |
| LLaVA-NeXT-Image-7B | 8192 (linear scale factor = 2) | 1 | 24x24 | 8 | 3.08 |
| LLaVA-NeXT-Image-7B | 16384 (linear scale factor = 4) | 2 | 12x12 | 64 | 3.12 |
| LLaVA-NeXT-Image-7B | 16384 (linear scale factor = 4) | 1 | 24x24 | 16 | 3.16 |

(2) How to fine-tune on videos?

It is natural to further tune the model on video data for a performance boost. Our analysis reveals that a mixed training regimen of video and image data is essential for optimizing the performance of LLaVA-NeXT-Video. Specifically, we consider different strategies: (1) continual fine-tuning of the LLaVA-NeXT stage-2 checkpoint on video data only; (2) starting from an LLaVA-NeXT stage-1 checkpoint, tuning the model on the joint image and video data in stage 2, where the data types in each batch can be split (each batch contains only one type) or mixed (each batch contains both types). Training with batches that mix image and video data performs the best, while the other strategies perform even worse than the LLaVA-NeXT-Image models; a toy sketch of the mixed-batch recipe follows the table below. This outcome emphasizes the importance of mixing video and image data in the training process to enhance the model's proficiency in video-related tasks.

| Model | Training Recipe | Pooling Stride | # Tokens/Frame | # Frames | Video Detailed Description (Score) |
|---|---|---|---|---|---|
| LLaVA-NeXT-Image-7B | - | 2 | 12x12 | 32 | 3.14 |
| LLaVA-NeXT-Video-7B | Continual fine-tuning on videos | | | | 3.02 |
| LLaVA-NeXT-Video-7B | Joint video-image: batch data split | | | | 3.00 |
| LLaVA-NeXT-Video-7B | Joint video-image: batch data mixed | | | | 3.32 |
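The mixed-batch recipe mentioned above can be sketched as a simple sampler that draws from both data sources for every batch; the dataset contents, batch size, and mixing ratio below are placeholders rather than the settings used in training:

```python
import random

# Toy "batch data mixed" sampler: every batch contains both video and image
# instruction samples, instead of dedicating whole batches to one modality.
def mixed_batches(image_samples, video_samples, batch_size=8, video_ratio=0.5):
    n_video = int(batch_size * video_ratio)
    while True:
        batch = random.sample(video_samples, n_video) + \
                random.sample(image_samples, batch_size - n_video)
        random.shuffle(batch)
        yield batch

# Example: 50/50 image-video batches of size 8 over dummy sample ids.
batches = mixed_batches(list(range(100)), list(range(100, 200)))
print(next(batches))
```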
 

More Examples

User: Please provide a detailed description of the video, focusing on the main subjects, their actions, and the background scenes.

Prompt for T2V: The story of a robot’s life in a cyberpunk setting.

LLaVA-NeXT-Video

The video you've provided appears to be a still from a science fiction or cyberpunk-themed movie or video game. It features a humanoid robot standing in a bustling urban environment. The robot has a sleek, futuristic design with a dark color scheme and a prominent head that seems to have a visor or helmet-like feature. It is standing upright and facing the camera, with its arms at its sides, suggesting a calm or neutral stance. In the background, there are various elements that contribute to the dystopian atmosphere:
1. A large, imposing structure that resembles a futuristic building or part of a cityscape, with a metallic sheen and angular architecture.
2. A flying vehicle, which could be a drone or a futuristic aircraft, adding to the advanced technology theme.
3. Other robots or machines that are less humanoid, possibly indicating a variety of robotic forms in this setting.
4. A street scene with blurred figures that suggest a busy, possibly crowded environment, possibly indicating a market or a public area.
5. A neon sign that adds to the cyberpunk aesthetic, with its bright, artificial glow against the darker tones of the surroundings.
The overall scene conveys a sense of a technologically advanced world where robots and humans coexist, possibly in a future where robots have become commonplace and integrated into daily life. The setting is dense with detail, suggesting a complex and possibly overpopulated urban environment. The lighting and color palette are moody, with a mix of dark and bright elements that create a contrast that is typical of the cyberpunk genre.

User: Please provide a detailed description of the video, focusing on the main subjects, their actions, and the background scenes.

Prompt for T2V: A young man at his 20s is sitting on a piece of cloud in the sky, reading a book.

LLaVA-NeXT-Video

In the video, we see a young man with curly hair sitting on a bed of fluffy white clouds. He is wearing a white t-shirt and appears to be engrossed in reading a book. The book is open in his hands, and he seems to be focused on the pages, suggesting that he is deeply engaged in the content. The background is a clear blue sky with a few scattered clouds, which contrasts with the white of the clouds he is sitting on, creating a serene and peaceful atmosphere. The man's position on the clouds gives the impression that he is floating or levitating, adding a whimsical element to the scene. The overall mood of the video is calm and contemplative, with the man's concentration on his book indicating a moment of quiet reflection or intellectual pursuit.

User: Please provide a detailed description of the video, focusing on the main subjects, their actions, and the background scenes.

Prompt for T2V: A petri dish with a bamboo forest growing within it that has tiny red pandas running around.

LLaVA-NeXT-Video

The video depicts a laboratory setting with a focus on scientific equipment and plant growth. In the foreground, there is a clear plastic container with a small plant growing inside, which appears to be a test tube or a petri dish. The plant is green and has multiple leaves, suggesting it is a small shrub or a type of succulent. The container is placed on a white surface, which could be a lab bench or a table. In the background, there are various pieces of scientific equipment commonly found in a laboratory, including:
- A microscope with a long, slender body and a round, darkened head, which is likely the eyepiece and objective lens assembly.
- A beaker, a glass container with a wide mouth and a narrow base, used for holding and measuring chemical substances.
- A flask, a glass container with a narrow neck and a wider mouth, often used for holding or measuring small amounts of liquid.
- A test tube, a narrow glass tube used for holding and measuring small volumes of liquid.
- A pipette, a device used to transfer small volumes of liquid into or out of containers.
- A graduated cylinder, a glass tube with markings to measure the volume of liquid.
- A flask with a long neck and a rounded bottom, which is often used for holding and measuring gases or liquids.
The scene suggests a controlled environment for scientific research or experimentation, possibly related to botany, plant growth, or biological studies. The focus is on the plant in the container, indicating that it may be the subject of observation or experimentation. The background equipment implies that this is a space where scientific work is conducted, and the overall setting is clean and organized.

User: Please provide a detailed description of the video, focusing on the main subjects, their actions, and the background scenes.

Prompt for T2V: Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes.

LLaVA-NeXT-Video

The video depicts a vibrant scene of a snowy street in what appears to be a Japanese city, possibly during the winter season. The street is lined with cherry blossom trees that are dusted with snow, creating a picturesque contrast between the white snow and the delicate pink blossoms. The ground is covered in a layer of snow, suggesting recent snowfall or cold weather. In the foreground, there are two individuals walking hand in hand, likely a couple, dressed in winter clothing, with one wearing a light blue jacket and the other in a dark jacket. They are walking along the snow-covered street, enjoying the wintry atmosphere. Their actions suggest they are engaged in a leisurely stroll, possibly enjoying each other's company and the beauty of the snow-covered surroundings. In the background, there are various shops and buildings with traditional Japanese architecture, including what looks like a temple or shrine with a distinctive roof design. The buildings are adorned with lanterns and signs, indicating a commercial or cultural area. The presence of cherry blossoms suggests this might be a popular spot for tourists or locals to enjoy the beauty of the season. The overall atmosphere is serene and peaceful, with the snow adding a sense of tranquility to the scene. The video captures a moment of shared experience and enjoyment of nature's beauty amidst the urban setting.

User: Please provide a detailed description of the video, focusing on the main subjects, their actions, and the background scenes.

Prompt for T2V: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

LLaVA-NeXT-Video

In the video, we see a woman walking down a street at night. She is dressed in a black leather jacket, a red dress, and sunglasses, and is carrying a black handbag. Her hair is styled in loose waves and she is wearing high heels. The woman is the main subject of the video, and she is walking with confidence, her gaze directed straight ahead. The background of the video is a bustling city street with neon lights from various signs and advertisements illuminating the scene. There are blurred figures of other pedestrians and vehicles in the background, suggesting a busy urban environment. The lighting and the reflections on the wet pavement suggest that it has recently rained. The woman's attire and the overall atmosphere suggest a stylish, possibly fashionable setting, and her stride and posture convey a sense of purposefulness.


Team

  • Yuanhan Zhang: Nanyang Technological University (work done in collaboration with ByteDance/TikTok)
  • Bo Li: Nanyang Technological University (work done in collaboration with ByteDance/TikTok)
  • Haotian Liu: University of Wisconsin-Madison
  • Yong Jae Lee: University of Wisconsin-Madison
  • Liangke Gui: ByteDance/TikTok
  • Di Fu: ByteDance/TikTok
  • Jiashi Feng: ByteDance/TikTok
  • Ziwei Liu: Nanyang Technological University
  • Chunyuan Li: ByteDance/TikTok
 

Acknowledgement

  • We thank Kaichen Zhang, Kairui Hu, and Fanyi Pu for building the video benchmark.
  • We thank Lianmin Zheng for integrating LLaVA-NeXT-Video into SGLang.
 

Related Blogs

Citation


@misc{zhang2024llavanextvideo,
  title={LLaVA-NeXT: A Strong Zero-shot Video Understanding Model},
  url={https://llava-vl.github.io/blog/2024-04-30-llava-next-video/},
  author={Zhang, Yuanhan and Li, Bo and Liu, Haotian and Lee, Yong Jae and Gui, Liangke and Fu, Di and Feng, Jiashi and Liu, Ziwei and Li, Chunyuan},
  month={April},
  year={2024}
}

@misc{liu2024llavanext,
    title={LLaVA-NeXT: Improved reasoning, OCR, and world knowledge},
    url={https://llava-vl.github.io/blog/2024-01-30-llava-next/},
    author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae},
    month={January},
    year={2024}
}

@misc{liu2023improvedllava,
      title={Improved Baselines with Visual Instruction Tuning}, 
      author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae},
      publisher={arXiv:2310.03744},
      year={2023},
}

@misc{liu2023llava,
      title={Visual Instruction Tuning}, 
      author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
      publisher={NeurIPS},
      year={2023},
}