We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios. Our experiments demonstrate the model's effectiveness in two key areas: (i) LMM-as-a-Judge, where LLaVA-Critic provides reliable evaluation scores, performing on par with or surpassing GPT models on multiple evaluation benchmarks; and (ii) Preference Learning, where it generates reward signals for preference learning, enhancing model alignment capabilities. This work underscores the potential of open-source LMMs in self-critique and evaluation, setting the stage for future research into scalable, superhuman alignment feedback mechanisms for LMMs.
Explore the sections below to learn more about the project:
We open-source LLaVA-Critic to facilitate future development of LMM evaluators in the community.
🤗 LLaVA-Critic Data: Explore the 113k critic instruction-following data across various evaluation scenarios
Training Code: Build LLaVA-Critic with the standard LLaVA-OneVision training code
🤗 LLaVA-Critic Checkpoints: Access pre-trained model checkpoints (7B, 72B)
🤗 LLaVA-OneVision-Chat [7B]/[72B]: Enjoy enhanced visual chat through preference alignment with LLaVA-Critic
To develop a generalist evaluator for LMM responses, as with GPT-4/4V, we curate LLaVA-Critic-113k, a high-quality dataset tailored to follow instructions in complex evaluation settings, providing quantitative judgments and the corresponding reasoning process. It consists of 46k images with 113k evaluation instruction samples, primarily covering two evaluation settings (sketched below):
Pointwise Scoring: assigning a score to assess an individual candidate response.
Pairwise Ranking: comparing two candidate responses to determine their relative quality.
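To make the two settings concrete, here is a minimal sketch of what a critic instruction-following sample might look like. All field names and values below are illustrative assumptions for exposition, not the released schema of LLaVA-Critic-113k:

```python
# Hypothetical critic instruction-following samples; field names and
# values are assumptions for illustration, not the released data schema.
pointwise_sample = {
    "image": "images/000001.jpg",
    "question": "What is the weather like in this image?",
    "response": "It appears sunny, with clear skies and strong shadows.",
    "instruction": "Rate the response from 1 to 10 for helpfulness and "
                   "accuracy, then justify the rating using the image.",
    "critic_output": "Rating: 7. The response correctly identifies ...",
}

pairwise_sample = {
    "image": "images/000002.jpg",
    "question": "Describe the scene in detail.",
    "response_a": "A kitchen with a wooden table, two chairs, and ...",
    "response_b": "An indoor scene.",
    "instruction": "Compare the two responses and state which is better, "
                   "with a justification grounded in the image.",
    "critic_output": "Response A is better: it is detailed and ...",
}
```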
LLaVA-Critic serves as a general evaluator for LMM responses, reducing labor costs by automating the evaluation process. It consistently provides reliable judgments and justifications aligned with GPT-4o or human evaluations across a range of widely used multimodal benchmarks. This consistency holds true for both instance-level scoring and model-level ranking.
Compared to LLaVA-OneVision, LLaVA-Critic delivers more accurate judgments and provides more concrete, image-grounded justifications. This is crucial for reliable AI: offering well-supported reasons for evaluations establishes LLaVA-Critic as a transparent evaluator of LMM-generated responses.
By accurately recognizing the visual content of the input image and grounding the differences between the responses, LLaVA-Critic offers judgments consistent with human evaluators, along with clear justifications.
LLaVA-Critic closely follows the evaluation prompt and, by referring to the image content, accurately identifies the strengths and weaknesses of the response at both overall and fine-grained levels.
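As a concrete illustration of LMM-as-a-Judge, here is a minimal sketch of pointwise scoring with a LLaVA-Critic checkpoint via Hugging Face Transformers. The model ID and its compatibility with the `LlavaOnevisionForConditionalGeneration` class are assumptions (the official release targets the LLaVA-OneVision codebase), and the evaluation prompt is an illustrative paraphrase, not the exact template:

```python
# Minimal sketch: pointwise scoring with LLaVA-Critic via Transformers.
# Assumptions: the checkpoint name and Transformers compatibility; the
# prompt below is a paraphrase, not the paper's exact evaluation template.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "lmms-lab/llava-critic-7b"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")
question = "What is unusual about this image?"
response = "A man is ironing clothes on the back of a moving taxi."

critic_prompt = (
    "You are a fair judge of multimodal responses. Given the image and the "
    f"question:\n{question}\n\nEvaluate the following response:\n{response}\n\n"
    "Provide a rating from 1 to 10 and explain your reasoning, grounding "
    "your justification in the image content."
)

conversation = [
    {"role": "user",
     "content": [{"type": "image"}, {"type": "text", "text": critic_prompt}]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens (the critic's judgment).
print(processor.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```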
LLaVA-Critic produces AI-generated feedback datasets, thereby improving the visual chat performance of supervised fine-tuned LMMs through preference alignment. Notably, the reward signals generated by our critic can be utilized by any preference learning algorithm, including RLHF and DPO. Here, we focus on incorporating LLaVA-Critic into the iterative DPO training process:
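The following sketch outlines one way to wire the critic into an iterative DPO loop, under stated assumptions: `policy`, `critic`, `instructions`, and `run_dpo` are hypothetical interfaces standing in for the actual LLaVA-OneVision training stack (e.g., a standard DPO trainer such as TRL's `DPOTrainer`), not the released training code:

```python
# Sketch of critic-guided iterative DPO. `policy`, `critic`, `instructions`,
# and `run_dpo` are hypothetical placeholders, not the released API.
from typing import List, Tuple

def build_preference_pairs(policy, critic,
                           instructions: List[Tuple[str, str]], k: int = 5):
    """For each (image, question) pair, sample k candidate responses from
    the current policy, score them with the critic (pointwise setting),
    and keep the best/worst as the chosen/rejected pair for DPO."""
    pairs = []
    for image, question in instructions:
        candidates = [policy.generate(image, question) for _ in range(k)]
        scores = [critic.score(image, question, c) for c in candidates]
        pairs.append({
            "image": image,
            "prompt": question,
            "chosen": candidates[scores.index(max(scores))],
            "rejected": candidates[scores.index(min(scores))],
        })
    return pairs

# Iterative DPO: each round builds fresh pairs from the current policy's
# own responses, so the critic's feedback tracks the improving policy.
M = 3  # number of rounds used for LLaVA-OneVision-Chat
for _ in range(M):
    preference_pairs = build_preference_pairs(policy, critic, instructions)
    policy = run_dpo(policy, preference_pairs)  # standard DPO update
```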
LLaVA-OneVision-Chat. In our experiments, we take LLaVA-OneVision as the base policy model and use the question-image pairs from LLaVA-RLHF as multimodal instructions. We conduct iterative DPO training for \(M=3\) rounds to obtain the final LMM checkpoint, referred to as LLaVA-OneVision-Chat. For both the LLaVA-OV-7B and LLaVA-OV-72B base models, feedback from LLaVA-Critic progressively improves performance based on their self-generated responses, leading to consistent gains across 6 open-ended multimodal benchmarks. The gains are more pronounced when learning from the AI feedback of LLaVA-Critic than from the LLaVA-RLHF reward model trained on human preferences, indicating a promising path toward learning from superhuman feedback for self-improving AI.
@article{xiong2024llavacritic,
title={LLaVA-Critic: Learning to Evaluate Multimodal Models},
author={Xiong, Tianyi and Wang, Xiyao and Guo, Dong and Ye, Qinghao and Fan, Haoqi and Gu, Quanquan and Huang, Heng and Li, Chunyuan},
year={2024},
eprint={2410.02712},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.02712},
}