LLaVA-Critic: Learning to Evaluate Multimodal Models

1ByteDance, 2University of Maryland, College Park
Work done in collaboration with ByteDance

Abstract

We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios. Our experiments demonstrate the model's effectiveness in two key areas: (i) LMM-as-a-Judge, where LLaVA-Critic provides reliable evaluation scores, performing on par with or surpassing GPT models on multiple evaluation benchmarks; and (ii) Preference Learning, where it generates reward signals for preference learning, enhancing model alignment capabilities. This work underscores the potential of open-source LMMs in self-critique and evaluation, setting the stage for future research into scalable, superhuman alignment feedback mechanisms for LMMs.

Explore the sections below to learn more about the project:

Open-source Release

    We open-source LLaVA-Critic to facilitate future development of LMM evaluators in the community.

  • 🤗 LLaVA-Critic Data: Explore the 113k critic instruction-following data across various evaluation scenarios

  • Training Code: Build LLaVA-Critic with standard LLaVA-OneVision's training code

  • 🤗 LLaVA-Critic Checkpoints: Access pre-trained model checkpoints (7B, 72B)

  • 🤗 LLaVA-OneVision-Chat [7B]/[72B]: Enjoy enhanced visual chat through preference alignment with LLaVA-Critic


Curation of Critic Instruction-Following Dataset

To develop a generalist evaluator for LMM responses, similar to GPT-4/4V, we curate LLaVA-Critic-113k, a high-quality dataset tailored for instruction following in complex evaluation settings, providing quantitative judgments along with the corresponding reasoning. It consists of 46k images with 113k evaluation instruction samples, covering two primary evaluation settings:

  1. Pointwise Scoring: Assign a score to an individual candidate response. We collect instruction-response pairs across 8 multimodal datasets and 13 response models, gather evaluation prompts from 7 open-ended benchmarks, and use GPT-4o to produce judgment scores and reasons.
  2. Pairwise Ranking: Compare two candidate responses to determine their relative quality. We gather pairwise responses with known preferences, design a set of 30 pairwise evaluation prompt templates, and ask GPT-4o to generate justifications for the preference.
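To make the two formats concrete, the records below sketch what a pointwise and a pairwise training sample might look like. The field names, prompt wording, and values are illustrative assumptions for this write-up, not the exact schema of the released LLaVA-Critic-113k data.

```python
# Illustrative structure only; field names and contents are hypothetical,
# not the literal schema of the released LLaVA-Critic-113k data.

pointwise_sample = {
    "image": "images/example_0001.jpg",  # hypothetical image path
    "question": "Describe the image in detail.",
    "response": "A brown dog is running across a grassy field.",
    "evaluation_prompt": (
        "Given the image, the question, and the model response, rate the response "
        "on a scale of 1 to 10 for accuracy and helpfulness, then explain your rating."
    ),
    # GPT-4o supplies the score and the reasoning during data curation.
    "judgment": {"score": 7, "reason": "Mostly accurate, but omits the lake in the background."},
}

pairwise_sample = {
    "image": "images/example_0002.jpg",  # hypothetical image path
    "question": "What is unusual about this scene?",
    "response_1": "A man is ironing clothes on a board attached to a moving taxi.",
    "response_2": "A man is standing next to a taxi.",
    "evaluation_prompt": (
        "Compare the two responses to the question about the image and state which "
        "one is better, with a justification."
    ),
    # GPT-4o justifies the known preference between the two responses.
    "judgment": {"preference": "response_1", "reason": "Response 1 identifies the unusual activity; response 2 is generic."},
}
```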

Figure 1: Example of training data. LLaVA-Critic learns to predict both quantitative judgements and the corresponding reasons.

Scenario 1: LMM-as-a-Judge

LLaVA-Critic serves as a general evaluator for LMM responses, reducing labor costs by automating the evaluation process. It consistently provides reliable judgments and justifications aligned with GPT-4o or human evaluations across a range of widely used multimodal benchmarks. This consistency holds true for both instance-level scoring and model-level ranking.

Figure 2: (Top): Overall distribution of evaluation scores across 4 benchmarks. (Bottom): Calculated average evaluation score for each response model on each benchmark. Leveraging high-quality critic training data, LLaVA-Critic closely aligns with GPT-4o in delivering balanced evaluation scores and accurately ranking response LMMs.

Compared to LLaVA-OneVision, LLaVA-Critic delivers more accurate judgments and provides more concrete, image-grounded justifications. This is crucial for building reliable AI: offering well-supported reasons for its evaluations establishes LLaVA-Critic as a transparent evaluator of LMM-generated responses.
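For readers who want to try the judge setting themselves, the snippet below is a minimal sketch of asking an LMM for a pointwise judgment with Hugging Face transformers and parsing the score out of its free-text answer. The checkpoint id is a placeholder and the evaluation prompt is paraphrased; whether the released LLaVA-Critic weights load directly through the transformers LLaVA-OneVision classes is an assumption, and the official LLaVA-OneVision codebase may be required instead.

```python
import re

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

# Placeholder checkpoint id; swap in a transformers-compatible LLaVA-Critic export if available.
MODEL_ID = "llava-hf/llava-onevision-qwen2-7b-ov-hf"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")
question = "Describe the image in detail."
candidate = "A brown dog is running across a grassy field."

# Pointwise evaluation prompt in the spirit of the critic training data
# (the exact wording used during training is not reproduced here).
eval_prompt = (
    f"Question: {question}\nResponse: {candidate}\n"
    "Rate the response on a scale of 1 to 10 for accuracy and helpfulness, "
    "then explain your rating. Begin your answer with 'Score:'."
)

conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": eval_prompt}]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
judgment = processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Extract the numeric score; the rest of the text is the justification.
match = re.search(r"Score:\s*(\d+)", judgment)
score = int(match.group(1)) if match else None
print(score, judgment)
```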

Scenario 2: Preference Learning

LLaVA-Critic produces AI-generated feedback datasets, improving the visual chat performance of supervised fine-tuned LMMs through preference alignment. Notably, the reward signals generated by our critic can be used in any preference learning algorithm, including RLHF and DPO. Here, we focus on incorporating LLaVA-Critic into the iterative DPO training process:

  • Step 1: Response generation. The iterative DPO process begins with a pretrained LMM \(\pi_0\) as the initial checkpoint and a set of multimodal instructions \(\{(x_k, v_k)\}_{k=1}^N\). For each question-image pair \((x_k, v_k)\), the pretrained LMM \(\pi_0\) independently samples \(K\) responses \(\{y_1, y_2, \ldots, y_K\}\) from its distribution.
  • Step 2: Scoring. To mitigate order-related variance in LLaVA-Critic's preferences, we form all possible ordered pairs from these responses, resulting in \(K \times (K-1)\) pairs. For each response pair \((y_i, y_j)\), we apply LLaVA-Critic with an evaluation prompt to generate a relative score \(a_{ij}\), which normalizes the score of \(y_j\) based on \(y_i\).
  • Step 3: Reward Preference. The overall reward score \(r_i\) for each response \(y_i\) is calculated by aggregating these preference scores:
    \( r_i = \sum_{k \ne i} a_{ki} - \sum_{l \ne i} a_{il} \)
    We then select the responses with the highest and lowest reward scores as the best and worst responses, denoted as \(y^+\) and \(y^-\), respectively. These form the pairwise feedback data \((y^+, y^-)\) used for DPO training (see the code sketch after this list).
  • Iterative Improvement. After each round of DPO training, the updated LMM becomes the new starting checkpoint. The process is then repeated iteratively for another \(M-1\) rounds.
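As a reference for Steps 2-3, here is a minimal sketch of the pairwise scoring and reward aggregation. The `pairwise_score` callable is a hypothetical stand-in for one LLaVA-Critic call per ordered pair, returning the relative score \(a_{ij}\) of \(y_j\) against \(y_i\).

```python
import itertools
from typing import Callable, List, Sequence


def aggregate_rewards(
    responses: Sequence[str],
    pairwise_score: Callable[[str, str], float],  # hypothetical wrapper around one critic call
) -> List[float]:
    """Turn pairwise critic scores into one reward per response (Steps 2-3)."""
    K = len(responses)
    a = [[0.0] * K for _ in range(K)]
    # Step 2: score all K*(K-1) ordered pairs to average out order-related bias.
    for i, j in itertools.permutations(range(K), 2):
        a[i][j] = pairwise_score(responses[i], responses[j])
    # Step 3: r_i = sum_{k != i} a_{ki} - sum_{l != i} a_{il}
    return [
        sum(a[k][i] for k in range(K) if k != i) - sum(a[i][l] for l in range(K) if l != i)
        for i in range(K)
    ]


def select_preference_pair(responses: Sequence[str], rewards: Sequence[float]):
    """Pick the best/worst responses (y+, y-) to form one DPO preference pair."""
    best = max(range(len(responses)), key=lambda i: rewards[i])
    worst = min(range(len(responses)), key=lambda i: rewards[i])
    return responses[best], responses[worst]
```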

LLaVA-OneVision-Chat. In our experiments, we take LLaVA-OneVision as the base policy model and use the question-image pairs from LLaVA-RLHF as multimodal instructions. We conduct iterative DPO training for \(M=3\) rounds to obtain the final LMM checkpoint, referred to as LLaVA-OneVision-Chat. For both the LLaVA-OV-7B and LLaVA-OV-72B base models, feedback from LLaVA-Critic progressively improves their performance based on their own self-generated responses, yielding consistent gains across 6 open-ended multimodal benchmarks. The gains from learning with LLaVA-Critic's AI feedback are more pronounced than those obtained with the LLaVA-RLHF reward model trained on human preferences, indicating a promising path toward learning from superhuman feedback for self-improving AI.
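For completeness, the selected \((y^+, y^-)\) pairs enter the standard DPO objective. The function below is a generic PyTorch sketch of that loss; the \(\beta\) value and other training details are assumptions rather than the exact configuration used here, and in practice training would go through the LLaVA-OneVision training code or an off-the-shelf DPO trainer.

```python
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # sum of log-probs of y+ under the current policy
    policy_rejected_logps: torch.Tensor,  # sum of log-probs of y- under the current policy
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # assumed value; not specified on this page
) -> torch.Tensor:
    """Standard DPO loss on the critic-selected (y+, y-) preference pairs."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (log-ratio of y+ minus log-ratio of y-)), averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```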

Figure 3: Performance gain from preference learning with LLaVA-Critic. The delta numbers above the bars indicate the improvement of the iterative DPO-trained variant (7B/72B) over its base model LLaVA-OneVision.

Citation

@article{xiong2024llavacritic,
  title={LLaVA-Critic: Learning to Evaluate Multimodal Models},
  author={Xiong, Tianyi and Wang, Xiyao and Guo, Dong and Ye, Qinghao and Fan, Haoqi and Gu, Quanquan and Huang, Heng and Li, Chunyuan},
  year={2024},
  eprint={2410.02712},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.02712},
}