We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations from the LLaVA-NeXT blog series. Our experimental results show that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities and scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.
LLaVA-OneVision Network Architecture.
Left: Current model instantiation; Right: The general form of LLaVA extended to more visual signals.
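As a rough illustration of this general form, the sketch below wires a vision encoder, a projector, and a language model in the LLaVA style: visual features are projected into the LLM embedding space and concatenated with text tokens before decoding. The module names, dimensions, and the two-layer MLP projector here are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class LLaVAStyleLMM(nn.Module):
    """Minimal sketch of the LLaVA-style architecture: encode visual inputs,
    project them into the LLM token space, and decode them alongside text tokens.
    Dimensions and components are placeholders, not the released model."""

    def __init__(self, vision_encoder, language_model, vision_dim=1152, llm_dim=3584):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. a ViT returning patch features
        self.projector = nn.Sequential(           # 2-layer MLP projector (assumed)
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.language_model = language_model      # decoder-only LLM consuming embeddings

    def forward(self, pixel_values, text_embeds):
        # pixel_values: (batch, num_views, 3, H, W); views may be image crops,
        # multiple images, or video frames, depending on the scenario.
        b, n = pixel_values.shape[:2]
        patches = self.vision_encoder(pixel_values.flatten(0, 1))  # (b*n, t, vision_dim)
        tokens = self.projector(patches)                           # (b*n, t, llm_dim)
        visual_tokens = tokens.view(b, -1, tokens.shape[-1])       # (b, n*t, llm_dim)
        # Concatenate visual tokens with text embeddings and decode with the LLM.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs)
```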
Visual Representation Strategy in LLaVA-OneVision.
The maximum number of visual tokens across different scenarios is designed to be similar,
ensuring balanced representations to accommodate cross-scenario capability transfer.
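As a back-of-the-envelope illustration of this balancing, the snippet below computes approximate visual-token budgets per scenario under assumed settings (729 tokens per crop, a cap on high-resolution crops for single images, base resolution per image in multi-image inputs, and spatially pooled frames for video). The concrete numbers are assumptions for illustration; the released model's settings may differ.

```python
# Illustrative token-budget arithmetic; per-crop counts and caps are assumptions.
TOKENS_PER_CROP = 729          # e.g. a 384-px crop -> 27 x 27 patch tokens

def single_image_budget(max_crops=9):
    # AnyRes: one base view plus up to `max_crops` high-resolution crops
    return (1 + max_crops) * TOKENS_PER_CROP

def multi_image_budget(num_images=8):
    # Each image kept at base resolution (one crop per image)
    return num_images * TOKENS_PER_CROP

def video_budget(num_frames=32, pool=2):
    # Each frame spatially pooled (e.g. 2x2) before projection
    per_frame = TOKENS_PER_CROP // (pool * pool)
    return num_frames * per_frame

print(single_image_budget())   # 7290
print(multi_image_budget())    # 5832
print(video_budget())          # 5824 -- all budgets land in a similar range
```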
We open-source LLaVA-OneVision to facilitate future development of LMMs in the community.
Training Code: Cook a SOTA model with our released training code
🤗 Checkpoints: Access pre-trained model checkpoints (0.5B, 7B, 72B)
🤗 LLaVA-OneVision Data: Explore training datasets for Single-Image and OneVision stages
🎨 Live Demo: Try it out yourself!
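For reference, below is a minimal inference sketch using the Hugging Face transformers integration. It assumes a recent transformers release that ships `LlavaOnevisionForConditionalGeneration`, and the converted checkpoint ID is an assumption that may differ from the released checkpoint names above.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

# Assumed converted checkpoint ID; substitute the checkpoint you intend to use.
model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is shown in this image?"},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```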
In addition to reporting LLaVA-OneVision's capabilities across various benchmarks, we also observe emerging behaviors of the proposed model arising from task transfer and composition, suggesting a promising path toward generalizing to real-world computer vision tasks in the wild. We illustrate several emerging capabilities with the examples below.
LLaVA-OneVision transfers its ability to understand diagrams and tables to multi-image scenarios, interpreting multiple images in a coherent manner.
LLaVA-OneVision plays the role of an agent. It recognizes multiple iPhone screenshots and takes actions to interact with the device, providing operation instructions for automating tasks.
LLaVA-OneVision exhibits excellent set-of-mark prompting capabilities, i.e., referring to marks when answering questions. In this example, describing specific objects based on their numerical labels within an image highlights its comprehension of fine-grained visual content.
LLaVA-OneVision learns to generate detailed video creation prompts from a static image. This capability is generalized to videos from the image-to-image language editing generation task.
LLaVA-OneVision learns to analyze differences between videos with the same starting frame but different endings.
LLaVA-OneVision learns to analyze differences between videos with similar backgrounds but different foreground objects.
LLaVA-OneVision analyzes and interprets multi-camera video footage in self-driving contexts.
LLaVA-OneVision learns to understand and describe composed sub-videos in detail.
LLaVA-OneVision learns to provide detailed descriptions of highlighted subjects in video content.
LLaVA-OneVision's capability in referring image and video understanding. In the first instance, it accurately identifies the same individual across two images. In the second instance, it identifies the same individual in both the image and the video, and in the third instance it correctly concludes that the individual is absent, indicating its ability to relate a visual query across image and video understanding.
@article{li2024llava,
title={LLaVA-OneVision: Easy Visual Task Transfer},
author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
journal={arXiv preprint arXiv:2408.03326},
year={2024}
}