LLaVA-OneVision

Easy Visual Task Transfer

1ByteDance, 2NTU, 3CUHK, 4HKUST
Work done in collaboration with ByteDance

Introduction

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations from the LLaVA-NeXT blog series. Our experimental results show that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities and scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.


LLaVA-OneVision Network Architecture.
Left: Current model instantiation; Right: The general form of LLaVA extended to more visual signals.


Visual Representation Strategy in LLaVA-OneVision.
The maximum number of visual tokens across different scenarios is designed to be similar,
ensuring balanced representations to accommodate cross-scenario capability transfer.
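The balanced-budget idea in the caption above can be sketched as simple token arithmetic. The numbers below are illustrative assumptions, not confirmed configuration: a SigLIP-style encoder producing 729 tokens per 384×384 image (27×27 patches), AnyRes crops for the single-image case, and video frames pooled down to 196 tokens (14×14) each.

```python
# Sketch of a balanced visual-token budget across the three scenarios.
# All constants below are illustrative assumptions.

TOKENS_PER_IMAGE = 729   # assumed encoder output: 27 x 27 patches
TOKENS_PER_FRAME = 196   # assumed after 2D pooling: 14 x 14

def single_image_budget(num_crops: int) -> int:
    """One base view plus AnyRes crops of the high-resolution image."""
    return (1 + num_crops) * TOKENS_PER_IMAGE

def multi_image_budget(num_images: int) -> int:
    """Each image encoded independently at base resolution."""
    return num_images * TOKENS_PER_IMAGE

def video_budget(num_frames: int) -> int:
    """Many frames, each aggressively pooled to fewer tokens."""
    return num_frames * TOKENS_PER_FRAME

# Under these assumed settings, the maxima land in the same ballpark,
# which is the property the caption describes:
print(single_image_budget(9))   # 7290
print(multi_image_budget(10))   # 7290
print(video_budget(32))         # 6272
```

The point of keeping the three maxima comparable is that a sequence of video frames "looks like" a high-resolution single image to the LLM, which is what makes image-to-video task transfer plausible.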

Open-source Release

    We open-source LLaVA-OneVision to facilitate future development of LMMs in the community.

  • Training Code: Cook a SOTA model with our released training code

  • 🤗 Checkpoints: Access pre-trained model checkpoints (0.5B, 7B, 72B)

  • 🤗 LLaVA-OneVision Data: Explore training datasets for Single-Image and OneVision stages

  • 🎨 Live Demo: Try it out yourself!

Emerging Capabilities

In addition to reporting LLaVA-OneVision's performance across various benchmarks, we also observe emerging behaviors of the proposed model under task transfer and composition, paving a promising way toward tackling real-world computer vision tasks in the wild. We illustrate several emerging capabilities with the examples below.

Citation

@article{li2024llava,
  title={LLaVA-OneVision: Easy Visual Task Transfer},
  author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
  journal={arXiv preprint arXiv:2408.03326},
  year={2024}
}