LLaVA-Critic: Learning to Evaluate Multimodal Models

1ByteDance, 2University of Maryland, College Park
Work done in collaboration with ByteDance

Abstract

We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios. Our experiments demonstrate the model's effectiveness in two key areas: (i) LMM-as-a-Judge, where LLaVA-Critic provides reliable evaluation scores, performing on par with or surpassing GPT models on multiple evaluation benchmarks; and (ii) Preference Learning, where it generates reward signals for preference learning, enhancing model alignment capabilities. This work underscores the potential of open-source LMMs in self-critique and evaluation, setting the stage for future research into scalable, superhuman alignment feedback mechanisms for LMMs.

Explore the sections below to learn more about the project:

Open-source Release

    We open-source LLaVA-Critic to facilitate future development of LMM evaluators in the community.

  • 🤗 LLaVA-Critic Data: Explore the 113k critic instruction-following data across various evaluation scenarios

  • Training Code: Build LLaVA-Critic with standard LLaVA-OneVision's training code

  • 🤗 LLaVA-Critic Checkpoints: Access pre-trained model checkpoints (7B, 72B)

  • 🤗 LLaVA-OneVision-Chat [7B]/[72B]: Enjoy enhanced visual chat through preference alignment with LLaVA-Critic


Curation of Critic Instruction-Following Dataset

To develop a generalist evaluator for LMM responses, similar to GPT-4/4V, we curate LLaVA-Critic-113k, a high-quality dataset tailored for instruction following in complex evaluation settings, providing quantitative judgments along with the corresponding reasoning. It consists of 46k images with 113k evaluation instruction samples, covering two primary evaluation settings:

  1. Pointwise Scoring: Assign a score to an individual candidate response. We collect instruction-response pairs across 8 multimodal datasets and 13 response models, gather evaluation prompts from 7 open-ended benchmarks, and use GPT-4o to produce judgment scores and reasons.
  2. Pairwise Ranking: Compare two candidate responses to determine their relative quality. We gather pairwise responses with known preferences, design a set of 30 pairwise evaluation prompt templates, and ask GPT-4o to generate justifications for the preference.
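To make the two formats concrete, the records below sketch what a pointwise and a pairwise training sample might look like. The field names, prompt wording, and values are illustrative assumptions for this write-up, not the exact schema of the released LLaVA-Critic-113k data.

```python
# Illustrative structure only; field names and contents are hypothetical,
# not the literal schema of the released LLaVA-Critic-113k data.

pointwise_sample = {
    "image": "images/example_0001.jpg",  # hypothetical image path
    "question": "Describe the image in detail.",
    "response": "A brown dog is running across a grassy field.",
    "evaluation_prompt": (
        "Given the image, the question, and the model response, rate the response "
        "on a scale of 1 to 10 for accuracy and helpfulness, then explain your rating."
    ),
    # GPT-4o supplies the score and the reasoning during data curation.
    "judgment": {"score": 7, "reason": "Mostly accurate, but omits the lake in the background."},
}

pairwise_sample = {
    "image": "images/example_0002.jpg",  # hypothetical image path
    "question": "What is unusual about this scene?",
    "response_1": "A man is ironing clothes on a board attached to a moving taxi.",
    "response_2": "A man is standing next to a taxi.",
    "evaluation_prompt": (
        "Compare the two responses to the question about the image and state which "
        "one is better, with a justification."
    ),
    # GPT-4o justifies the known preference between the two responses.
    "judgment": {"preference": "response_1", "reason": "Response 1 identifies the unusual activity; response 2 is generic."},
}
```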

Figure 1: Example of training data. LLaVA-Critic learns to predict both quantitative judgements and the corresponding reasons.

Scenario 1: LMM-as-a-Judge

LLaVA-Critic serves as a general evaluator for LMM responses, reducing labor costs by automating the evaluation process. It consistently provides reliable judgments and justifications aligned with GPT-4o or human evaluations across a range of widely used multimodal benchmarks. This consistency holds true for both instance-level scoring and model-level ranking.

Figure 2: (Top): Overall distribution of evaluation scores across 4 benchmarks. (Bottom): Calculated average evaluation score for each response model on each benchmark. Leveraging high-quality critic training data, LLaVA-Critic closely aligns with GPT-4o in delivering balanced evaluation scores and accurately ranking response LMMs.

Compared to LLaVA-OneVision, LLaVA-Critic delivers more accurate judgments and provides more concrete, image-grounded justifications. This is crucial for building reliable AI: offering well-supported reasons for its evaluations establishes LLaVA-Critic as a transparent evaluator of LMM-generated responses.
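For readers who want to try the judge setting themselves, the snippet below is a minimal sketch of asking an LMM for a pointwise judgment with Hugging Face transformers and parsing the score out of its free-text answer. The checkpoint id is a placeholder and the evaluation prompt is paraphrased; whether the released LLaVA-Critic weights load directly through the transformers LLaVA-OneVision classes is an assumption, and the official LLaVA-OneVision codebase may be required instead.

```python
import re

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

# Placeholder checkpoint id; swap in a transformers-compatible LLaVA-Critic export if available.
MODEL_ID = "llava-hf/llava-onevision-qwen2-7b-ov-hf"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")
question = "Describe the image in detail."
candidate = "A brown dog is running across a grassy field."

# Pointwise evaluation prompt in the spirit of the critic training data
# (the exact wording used during training is not reproduced here).
eval_prompt = (
    f"Question: {question}\nResponse: {candidate}\n"
    "Rate the response on a scale of 1 to 10 for accuracy and helpfulness, "
    "then explain your rating. Begin your answer with 'Score:'."
)

conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": eval_prompt}]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
judgment = processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Extract the numeric score; the rest of the text is the justification.
match = re.search(r"Score:\s*(\d+)", judgment)
score = int(match.group(1)) if match else None
print(score, judgment)
```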

Scenario 2: Preference Learning

LLaVA-Critic produces AI-generated feedback datasets, improving the visual chat performance of supervised fine-tuned LMMs through preference alignment. Notably, the reward signals generated by our critic can be used in any preference learning algorithm, including RLHF and DPO. Here, we focus on incorporating LLaVA-Critic into the iterative DPO training process:

  • Step 1: Response generation. The iterative DPO process begins with a pretrained LMM \(\pi_0\) as the initial checkpoint and a set of multimodal instructions \(\{(x_k, v_k)\}_{k=1}^N\). For each question-image pair \((x_k, v_k)\), the pretrained LMM \(\pi_0\) independently samples \(K\) responses \(\{y_1, y_2, \ldots, y_K\}\) from its distribution.
  • Step 2: Scoring. To mitigate order-related variance in LLaVA-Critic's preferences, we form all possible ordered pairs from these responses, resulting in \(K \times (K-1)\) pairs. For each response pair \((y_i, y_j)\), we apply LLaVA-Critic with an evaluation prompt to generate a relative score \(a_{ij}\), which normalizes the score of \(y_j\) based on \(y_i\).
  • Step 3: Reward Preference. The overall reward score \(r_i\) for each response \(y_i\) is calculated by aggregating these preference scores:
    \( r_i = \sum_{k \ne i} a_{ki} - \sum_{l \ne i} a_{il} \)
    We then select the responses with the highest and lowest reward scores as the best and worst responses, denoted as \(y^+\) and \(y^-\), respectively. These form the pairwise feedback data \((y^+, y^-)\) used for DPO training (see the code sketch after this list).
  • Iterative Improvement. After each round of DPO training, the updated LMM becomes the new starting checkpoint. The process is then repeated iteratively for another \(M-1\) rounds.
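As a reference for Steps 2-3, here is a minimal sketch of the pairwise scoring and reward aggregation. The `pairwise_score` callable is a hypothetical stand-in for one LLaVA-Critic call per ordered pair, returning the relative score \(a_{ij}\) of \(y_j\) against \(y_i\).

```python
import itertools
from typing import Callable, List, Sequence


def aggregate_rewards(
    responses: Sequence[str],
    pairwise_score: Callable[[str, str], float],  # hypothetical wrapper around one critic call
) -> List[float]:
    """Turn pairwise critic scores into one reward per response (Steps 2-3)."""
    K = len(responses)
    a = [[0.0] * K for _ in range(K)]
    # Step 2: score all K*(K-1) ordered pairs to average out order-related bias.
    for i, j in itertools.permutations(range(K), 2):
        a[i][j] = pairwise_score(responses[i], responses[j])
    # Step 3: r_i = sum_{k != i} a_{ki} - sum_{l != i} a_{il}
    return [
        sum(a[k][i] for k in range(K) if k != i) - sum(a[i][l] for l in range(K) if l != i)
        for i in range(K)
    ]


def select_preference_pair(responses: Sequence[str], rewards: Sequence[float]):
    """Pick the best/worst responses (y+, y-) to form one DPO preference pair."""
    best = max(range(len(responses)), key=lambda i: rewards[i])
    worst = min(range(len(responses)), key=lambda i: rewards[i])
    return responses[best], responses[worst]
```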

LLaVA-OneVision-Chat. In our experiments, we take LLaVA-OneVision as the base policy model and use the question-image pairs from LLaVA-RLHF as multimodal instructions. We conduct iterative DPO training for \(M=3\) rounds to obtain the final LMM checkpoint, referred to as LLaVA-OneVision-Chat. For both the LLaVA-OV-7B and LLaVA-OV-72B base models, feedback from LLaVA-Critic progressively improves their performance based on their own self-generated responses, yielding consistent gains across 6 open-ended multimodal benchmarks. The gains from learning with LLaVA-Critic's AI feedback are more pronounced than those obtained with the LLaVA-RLHF reward model trained on human preferences, indicating a promising path toward learning from superhuman feedback for self-improving AI.
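For completeness, the selected \((y^+, y^-)\) pairs enter the standard DPO objective. The function below is a generic PyTorch sketch of that loss; the \(\beta\) value and other training details are assumptions rather than the exact configuration used here, and in practice training would go through the LLaVA-OneVision training code or an off-the-shelf DPO trainer.

```python
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # sum of log-probs of y+ under the current policy
    policy_rejected_logps: torch.Tensor,  # sum of log-probs of y- under the current policy
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # assumed value; not specified on this page
) -> torch.Tensor:
    """Standard DPO loss on the critic-selected (y+, y-) preference pairs."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (log-ratio of y+ minus log-ratio of y-)), averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```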

Figure 3: Performance gain from preference learning with LLaVA-Critic. The delta numbers above the bars indicate the improvement of the iterative DPO-trained variant (7B/72B) over its base model LLaVA-OneVision.

Citation

@article{xiong2024llavacritic,
  title={LLaVA-Critic: Learning to Evaluate Multimodal Models},
  author={Xiong, Tianyi and Wang, Xiyao and Guo, Dong and Ye, Qinghao and Fan, Haoqi and Gu, Quanquan and Huang, Heng and Li, Chunyuan},
  year={2024},
  eprint={2410.02712},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.02712},
}