On January 30, 2024, we unveiled LLaVA-NeXT, a state-of-the-art Large Multimodal Model (LMM) developed with a cost-effective training method that leverages open resources. It improves reasoning, OCR, and world knowledge across multimodal tasks, building on the leading open LLM of that time, Yi-34B. LLaVA-NeXT has shown outstanding performance across various multimodal understanding tasks, even surpassing Gemini-Pro on benchmarks such as MMMU and MathVista. Recently, the community has witnessed the emergence of open-source LLMs with stronger language capabilities, exemplified by the LLaMA3 and Qwen-1.5 families. At the same time, there is speculation that proprietary LMMs like GPT-4V are backed by stronger LLMs such as GPT-4. This naturally raises the question: as the disparity between open and proprietary LLMs diminishes with the introduction of these potent new language models, does the gap between open and proprietary multimodal models also narrow when they are powered by stronger LLMs?

Today, we expand LLaVA-NeXT with recent, stronger open LLMs and report our findings on these more capable language models:

  1. Increased multimodal capabilities with stronger and larger language models, up to 3x the model size. This allows LMMs to inherit better visual world knowledge and logical reasoning from the LLM. We support LLaMA3 (8B) and Qwen-1.5 (72B and 110B).
  2. Better visual chat for more real-life scenarios, covering different applications. To evaluate the improved multimodal capabilities in the wild, we collect and develop a new evaluation benchmark, LLaVA-Bench (Wilder), which inherits the spirit of LLaVA-Bench (In-the-Wild) to study daily-life visual chat and enlarges the data size for comprehensive evaluation.

To clearly highlight the impact of the LLM in supercharging multimodal performance, we re-use the same training recipe as LLaVA-NeXT, thereby maintaining the minimalist design and data efficiency of the LLaVA family. The largest 110B variant finishes training in 18 hours on 128 H800s. Our code, data, and models will be made publicly available.

Open-Source Release

We open-source LLaVA-NeXT to facilitate future LMM development in the community. Code, data, and models will be made publicly available.

Benchmark Results

Results with LMMs-Eval. The first result column is GPT4-V; the next three (Qwen1.5-110B, Qwen1.5-72B, LLaMA3-8B) are the LLaVA-NeXT 2024-05 release; the remaining four (Yi-34B, Vicuna-1.5-13B, Vicuna-1.5-7B, Mistral-7B) are the LLaVA-NeXT 2024-01 release.

| Datasets | Split | Metric | Instances | GPT4-V | Qwen1.5-110B | Qwen1.5-72B | LLaMA3-8B | Yi-34B | Vicuna-1.5-13B | Vicuna-1.5-7B | Mistral-7B |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AI2D* | test | Acc. | 3088 | 78.2 | 80.4 | 77.4 | 71.6 | 74.9 | 70.0 | 66.6 | 60.8 |
| ChartQA* | test | RelaxedAcc. | 2500 | 78.5 | 79.7 | 77.0 | 69.5 | 68.7 | 62.2 | 54.8 | 38.8 |
| DocVQA* | val | ANLS | 5349 | - | 85.7 | 84.4 | 78.2 | 84.0 | 77.5 | 74.4 | 72.2 |
| MathVista | test | Acc. | 1000 | 49.9 | 49.0 | 46.6 | 37.5 | 46.0 | 35.1 | 34.4 | 37.4 |
| MMBench | dev | Acc. | 4377 | 75.0 | 80.5 | 80.5 | 72.1 | 79.3 | - | - | - |
| MME-Cognition | test | Total Score | 2374 | 517.1 | 453.9 | 459.6 | 367.8 | 397.1 | 316.8 | 322.5 | 323.9 |
| MME-Perception | test | Total Score | | 1409.4 | 1746.5 | 1699.3 | 1603.7 | 1633.2 | 1575.1 | 1519.3 | 1500.9 |
| MMMU | val | Acc. | 900 | 56.8 | 49.1 | 46.4 | 41.7 | 46.7 | 35.9 | 35.1 | 33.4 |
| RealWorldQA | test | Acc. | 765 | 61.4 | 63.1 | 65.4 | 60.0 | 61.0 | - | - | 54.4 |
| LLaVA-W** | test | GPT4-Eval | 60 | 98.0 | 90.4 | 89.2 | 80.1 | 88.8 | 72.3 | 72.3 | 71.7 |
| LLaVA-Bench (Wilder) | Small | GPT4V-Eval | 120 | 71.5 | 70.5 | 71.2 | 62.5 | - | - | - | - |
| LLaVA-Bench (Wilder) | Medium | GPT4V-Eval | 1020 | 78.5 | 72.5 | 73.4 | 63.1 | - | - | - | - |
*Train split observed during SFT stage.
**We report the evaluation results with GPT-4-0613 on LLaVA-W.

Highlights

  1. SoTA-level performance! LLaVA-NeXT achieves consistently better performance than prior open-source LMMs by simply increasing the LLM capability, and it catches up to GPT4-V on selected benchmarks.
  2. Low training cost! We maintain the efficient training strategy of previous LLaVA models, supervised-finetuning on the same data as the previous LLaVA-NeXT 7B/13B/34B models. Our current largest model, LLaVA-NeXT-110B, is trained on 128 H800-80G GPUs for 18 hours.

Full comparison with the LLaVA family and SoTA LMMs (results with LMMs-Eval). The LLaVA-NeXT 2024-05 release columns are Qwen1.5-110B, Qwen1.5-72B, and LLaMA3-8B; Yi-34B is the 2024-01 release; Vicuna-1.5-13B is LLaVA-1.5.

| Datasets | Split | Metric | Instances | Claude3-Opus | GPT4-V | Gemini 1.5 Pro | Qwen-VL Max | Qwen1.5-110B | Qwen1.5-72B | LLaMA3-8B | Yi-34B | Vicuna-1.5-13B |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AI2D* | test | Acc. | 3088 | 88.1 | 78.2 | 80.3 | 79.3 | 80.4 | 77.4 | 71.6 | 74.9 | 54.8 |
| ChartQA* | test | RelaxedAcc. | 2500 | 80.8 | 78.5 | 81.3 | 79.8 | 79.7 | 77.0 | 69.5 | 68.7 | 18.2 |
| DocVQA* | val | ANLS | 5349 | - | - | - | - | 85.7 | 84.4 | 78.2 | 84.0 | 28.1 |
| MathVista | test | Acc. | 1000 | 50.5 | 49.9 | 52.1 | 51.0 | 49.0 | 46.6 | 37.5 | 46.0 | 26.7 |
| MMBench | dev | Acc. | 4377 | - | 75.0 | - | - | 80.5 | 80.5 | 72.1 | 79.3 | 67.8 |
| MME-Cognition | test | Total Score | 2374 | - | 517.1 | - | 2281.7 | 453.9 | 459.6 | 367.8 | 397.1 | 348.2 |
| MME-Perception | test | Total Score | | - | 1409.4 | - | - | 1746.5 | 1699.3 | 1603.7 | 1633.2 | 1510.8 |
| MMMU | val | Acc. | 900 | 59.4 | 56.8 | 58.5 | 51.4 | 49.1 | 46.4 | 41.7 | 46.7 | 35.3 |
| RealWorldQA | test | Acc. | 765 | 51.9 | 61.4 | 67.5 | - | 63.1 | 65.4 | 60.0 | 61.0 | - |
| LLaVA-W** | test | GPT4-Eval | 60 | - | 98.0 | - | 82.3 | 90.4 | 89.2 | 80.1 | 88.8 | 59.6 |
| LLaVA-Bench (Wilder) | Small | GPT4V-Eval | 120 | 68.6 | 71.5 | 70.5 | - | 70.5 | 71.2 | 62.5 | - | - |
| LLaVA-Bench (Wilder) | Medium | GPT4V-Eval | 1020 | 79.7 | 78.5 | - | - | 72.5 | 73.4 | 63.1 | - | - |
*Train split observed during SFT stage.
**We report the evaluation results with GPT-4-0613 on LLaVA-W.


Exploring the Capability Limit of Large Language Models

In our exploration with LLaVA-NeXT, we witnessed a significant performance leap when scaling the LLM from 13B to 34B. With the emergence of more powerful open LLMs, there arises a natural curiosity to push the boundaries of multimodal performance, prompting the question: how effectively can the language capabilities of LLMs be transferred to multimodal settings? To measure the language capability of LLMs, we use evaluation scores from the Massive Multitask Language Understanding (MMLU) benchmark. To measure the multimodal capability after applying the same LLaVA-NeXT training recipe, we examine four key benchmarks: MMMU for multidisciplinary understanding, MathVista for visual math reasoning, AI2D for science diagram comprehension, and LLaVA-W for daily visual chat scenarios. These benchmarks encapsulate diverse real-world applications of LMMs in the wild.
The correlation between multimodal and language capabilities is visually depicted in Figure 1, utilizing regression lines to illustrate trends across each benchmark.

  1. Improved Language Capability: Across LLMs of comparable sizes (e.g., 7B Mistral/Vicuna, 7B Qwen, 8B LLaMA3), there exists a consistent pattern where higher language proficiency, as measured by MMLU scores, corresponds to improved multimodal capabilities.
  2. Influence of Model Size: Within the same LLM family (e.g., Qwen LLM: 7B, 72B, 110B), larger models consistently demonstrate superior performance on multimodal benchmarks. This underscores the notion that larger models tend to possess enhanced language capabilities, leading to improved performance across multimodal tasks.
In both of the aforementioned analyses, stronger LLMs tend to yield superior multimodal capabilities. This can be attributed to the broader world knowledge, robust logical reasoning, and conversational prowess typically associated with stronger LLMs. These language capabilities are well maintained and transferred to the vision-language domain by the lightweight LLaVA-NeXT training, owing to the alignment of cross-modality concepts as well as the alignment with human intent achieved in visual instruction tuning.

Figure 1: The performance correlation between multimodal and language capability on four benchmarks. The circle size represents model size.


Language vs. multimodal performance (details for Figure 1):

| Models | MMLU (Language) | MMMU | MathVista | AI2D | LLaVA-W |
|---|---|---|---|---|---|
| GPT4-V | 86.4 | 56.8 | 49.9 | 78.2 | 98.0 |
| Qwen1.5 (110B) | 80.4 | 49.1 | 49.0 | 80.4 | 90.4 |
| Qwen1.5 (72B) | 77.5 | 46.4 | 46.6 | 77.4 | 89.2 |
| Yi (34B) | 76.3 | 46.7 | 46.0 | 74.9 | 88.8 |
| Llama 3 (8B) | 66.6 | 41.7 | 37.5 | 71.6 | 80.1 |
| Qwen1.5 (7B) | 61.0 | 37.3 | 33.5 | 72.5 | 74.5 |
| Mistral-Instruct-v0.2 (7B) | 60.1 | 33.4 | 37.4 | 60.8 | 71.7 |
| Vicuna1.5 (13B) | 52.1 | 35.9 | 35.1 | 70.0 | 72.3 |
| Vicuna1.5 (7B) | 47.1 | 35.1 | 34.4 | 66.6 | 72.3 |
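The trend lines in Figure 1 can be reproduced from the table above. The snippet below is a minimal sketch (not our plotting code) that fits a least-squares line of each multimodal benchmark score against MMLU with NumPy; the values are copied from the table, and the GPT4-V row can be dropped if only the LLaVA-NeXT variants are of interest.

```python
import numpy as np

# Language (MMLU) and multimodal scores copied from the table above.
# Order: GPT4-V, Qwen1.5-110B, Qwen1.5-72B, Yi-34B, Llama3-8B,
#        Qwen1.5-7B, Mistral-v0.2-7B, Vicuna1.5-13B, Vicuna1.5-7B
mmlu = np.array([86.4, 80.4, 77.5, 76.3, 66.6, 61.0, 60.1, 52.1, 47.1])
multimodal = {
    "MMMU":      np.array([56.8, 49.1, 46.4, 46.7, 41.7, 37.3, 33.4, 35.9, 35.1]),
    "MathVista": np.array([49.9, 49.0, 46.6, 46.0, 37.5, 33.5, 37.4, 35.1, 34.4]),
    "AI2D":      np.array([78.2, 80.4, 77.4, 74.9, 71.6, 72.5, 60.8, 70.0, 66.6]),
    "LLaVA-W":   np.array([98.0, 90.4, 89.2, 88.8, 80.1, 74.5, 71.7, 72.3, 72.3]),
}

for name, scores in multimodal.items():
    slope, intercept = np.polyfit(mmlu, scores, deg=1)   # least-squares regression line
    r = np.corrcoef(mmlu, scores)[0, 1]                  # Pearson correlation
    print(f"{name:9s}  score ~= {slope:.2f} * MMLU {intercept:+.2f}   (r = {r:.2f})")
```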

LLaVA-Bench (Wilder): Daily-life Visual Chat Benchmarks

One of the ultimate goals of developing LMMs is to build a general-purpose assistant that aids humans in various multimodal tasks in their daily lives. It is thus important to have robust benchmarks that precisely measure the related progress. LLaVA-Bench (In-the-Wild), also known as LLaVA-W, is such a benchmark for measuring the daily-life visual chat capability of LMMs. However, with only 60 examples available, we recognized the need for a more expansive dataset. In this spirit, we introduce LLaVA-Bench (Wilder), comprising two versions: a smaller iteration featuring 120 examples for swift assessment, and a medium-sized version with 1020 examples for comprehensive measurement. These datasets encompass diverse scenarios such as mathematical problem-solving, image comprehension, code generation, visual AI assistance, and image-based reasoning. To construct these datasets, we gathered instructions and images reflecting real-world user requests from an online service. Subsequently, we meticulously filtered samples to address privacy concerns and mitigate potential harm. Responses to these prompts were generated using GPT4-V.

Comparison with other benchmarks. Figure 2 provides a visual comparison between LLaVA-Bench (Wilder) and existing LMM evaluation benchmarks. Many current benchmarks adopt a fixed-form question-and-answer (QA) format, chosen for its ease of metric computation and model comparison. Reflecting this trend, benchmarks like MMMU, MathVista, and AI2D are tailored to assess LMM performance in specific knowledge-intensive domains. In contrast, RealWorldQA focuses on everyday scenarios but is confined to short-answer formats. However, for assistant models, the ability to engage users in free-form conversation is crucial for eliciting interest, surpassing the limitations of simple short-answer interactions. Hence, including free-form conversation in daily-life visual chat scenarios becomes pivotal. LLaVA-W set the precedent by introducing such a benchmark prototype, and LLaVA-Bench (Wilder) builds upon it by including more daily-life scenarios and covering different applications.
Figure 2: Comparison of different benchmarks. The circle size indicates the dataset size.

Benchmark details:

| Benchmarks | Instances | Multimodal Capabilities | Instruction Format | Response Format | Evaluation Metric |
|---|---|---|---|---|---|
| AI2D | 3088 | Science Diagram Understanding | Multiple Choice | Options | Exact Match |
| MMMU | 900 | Multi-dimensional Understanding & Reasoning | Multiple Choice, Short Responses | Options & Short Responses | Exact Match |
| MathVista | 1000 | Math Reasoning | Multiple Choice, Short Responses | Options & Short Responses | GPT-4 Extract & Exact Match |
| RealWorldQA | 765 | Real-world Visual Question Answering | Multiple Choice, Short Responses | Options & Short Responses | Filtered Match |
| LLaVA-Bench (In-the-Wild) | 60 | Real-life Visual Chat | Free-form | Free-form | GPT-4 Evaluation |
| LLaVA-Bench (Wilder) | Small: 120 | Real-life Visual Chat | Free-form | Free-form | GPT4-V Evaluation |
| LLaVA-Bench (Wilder) | Medium: 1020 | Real-life Visual Chat | Free-form | Free-form | GPT4-V Evaluation |
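For reference, the fixed-form metrics listed above reduce to simple string or number comparisons. The sketch below illustrates exact match and ChartQA-style relaxed accuracy (numeric answers may deviate by up to 5% relative error); it is an illustrative approximation, not the exact lmms-eval implementation.

```python
def exact_match(prediction: str, target: str) -> bool:
    """Case-insensitive exact string match after stripping whitespace."""
    return prediction.strip().lower() == target.strip().lower()

def relaxed_accuracy(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """ChartQA-style relaxed accuracy: numeric answers may deviate by up to 5%
    relative error; non-numeric answers fall back to exact match."""
    try:
        pred, tgt = float(prediction), float(target)
    except ValueError:
        return exact_match(prediction, target)
    if tgt == 0.0:
        return pred == 0.0
    return abs(pred - tgt) / abs(tgt) <= tolerance

print(relaxed_accuracy("102", "100"))   # True: within 5% of the target
print(relaxed_accuracy("cat", "dog"))   # False
```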

Construction & Evaluation Metrics. For a large set of queries from the online service, we used the ONE-PEACE embedding model to generate embeddings. Next, we applied weighted K-Means clustering, using the min-max-normalized total pixel count of each image as its weight, ensuring that images with more pixels were more likely to be included in our test set. After removing duplicates, we ended up with a small version containing 120 questions and a medium version containing 1020 questions. We also conducted decontamination checks to ensure the dataset is clean: both versions have less than 2% image overlap (the original LLaVA-W had 5%), and the evaluation data is excluded from and decontaminated against LLaVA-NeXT's training data.
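The sketch below illustrates this pixel-weighted clustering step with scikit-learn. It is a minimal sketch under assumptions: the embeddings are mocked with random vectors (the actual queries and ONE-PEACE features are not included here), the embedding dimension and cluster count are placeholders, and picking the member nearest to each centroid is just one plausible way to draw a representative query per cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder inputs: one embedding and one image per query.
# embeddings: (N, D) float array; pixel_counts[i] = width_i * height_i.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5_000, 512)).astype(np.float32)   # mocked ONE-PEACE features
pixel_counts = rng.integers(200_000, 4_000_000, size=5_000)

# Min-max normalize total pixel counts and use them as sample weights, so
# higher-resolution images pull cluster centers toward themselves.
weights = (pixel_counts - pixel_counts.min()) / (pixel_counts.max() - pixel_counts.min())

# Weighted K-Means; the cluster budget is a placeholder for the small version.
n_clusters = 120
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
km.fit(embeddings, sample_weight=weights)

# One plausible selection rule: keep the member closest to each centroid.
selected = []
for c in range(n_clusters):
    members = np.flatnonzero(km.labels_ == c)
    dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
    selected.append(members[np.argmin(dists)])
print(f"selected {len(set(selected))} candidate queries before deduplication")
```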
Reference Answer Construction. For each screened question, we first used GPT4-V to generate a reference response and then had human annotators manually verify the accuracy of both the question and the reference answer. A considerable number of user inquiries were vague, asked about image resolution, contained grammar errors, or were unrelated to the uploaded images. In these cases, GPT4-V may decline to respond, or the reference answer provided could be incorrect. To maintain the quality of the evaluation data, we manually reviewed and revised problematic answers, ensuring accuracy and reliability.
Scoring Methodology. We adopted the same evaluation process as LLaVA-W, but substituted GPT-4 with GPT4-V as the judge. Instead of using multiple categories as in LLaVA-W, we simply compute the overall score ratio between the GPT4-V reference answer and the model's response. In our evaluation, we noticed that this ratio did not differentiate models well and could unfairly lower the scores of the reference answers, so a model's failure cases were not correctly reflected in the overall score. To fix this, we instruct GPT4-V to always treat the reference answer as perfect and assign it a score of ten. As a result, other models receive lower scores and are penalized more heavily for their mistakes, which lets us better evaluate model abilities in real-life situations.
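A minimal sketch of this relative scoring is shown below. It assumes the GPT4-V judge returns a pair of 1-10 scores (reference first, candidate second) for each question, as in the LLaVA-W protocol; in practice the reference answer is pinned to a perfect score through the judging prompt, whereas this sketch emulates that in post-processing, and the output-parsing format is an assumption.

```python
import re
from statistics import mean

def parse_pair(judge_output: str) -> tuple[float, float]:
    """Extract the (reference, candidate) scores from the judge's first line."""
    nums = re.findall(r"\d+(?:\.\d+)?", judge_output.splitlines()[0])
    return float(nums[0]), float(nums[1])

def overall_ratio(judge_outputs: list[str]) -> float:
    """LLaVA-W-style relative score: mean candidate score over mean reference score.

    The reference (GPT4-V) answer is treated as perfect (score 10), so the
    candidate is always judged against a 'perfect' answer and penalized more
    heavily for mistakes.
    """
    ref_scores, cand_scores = [], []
    for out in judge_outputs:
        _ref, cand = parse_pair(out)
        ref_scores.append(10.0)          # reference pinned to a full score
        cand_scores.append(cand)
    return 100.0 * mean(cand_scores) / mean(ref_scores)

# Example: three judged questions, each line formatted as "ref_score cand_score".
print(overall_ratio(["10 8", "10 6", "9 7"]))   # -> 70.0
```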

Comparison of Benchmarks & Models

Quantitative Results. The distinctive measurement provided by LLaVA-Bench (Wilder) compared to other benchmarks becomes evident in the substantial performance gaps among state-of-the-art (SoTA) LMMs: certain LMMs that are highly proficient in knowledge-intensive tasks may not excel in the daily-life visual chat scenarios assessed by LLaVA-Bench (Wilder). The LLaVA-NeXT models featured in this release continue to advance performance across various domains.
| Models | LLaVA-Bench Wilder (Small) | LLaVA-Bench Wilder (Medium) | LLaVA-W | RealWorldQA | AI2D | MME | MMMU | MathVista |
|---|---|---|---|---|---|---|---|---|
| *LLaVA-NeXT in this release* | | | | | | | | |
| LLaVA-NeXT-110B | 70.5 | 72.5 | 90.4 | 63.2 | 80.4 | 2200.4 | 49.1 | 49.0 |
| LLaVA-NeXT-72B | 71.2 | 73.4 | 89.2 | 65.4 | 77.4 | 2158.9 | 46.4 | 46.6 |
| LLaMA3-LLaVA-NeXT-8B | 62.5 | 63.1 | 80.1 | 60.0 | 71.6 | 1971.5 | 41.7 | 37.5 |
| *Previous open-source state-of-the-art models* | | | | | | | | |
| LLaVA-NeXT-34B | - | - | 88.8 | 61.7 | 74.9 | 2030.4 | 46.7 | 46.5 |
| Intern-VL-1.5 | 62.4 | - | 83.3 | 66.0 | 80.7 | 2187.8 | 46.8 | 54.7 |
| *Commercial state-of-the-art models* | | | | | | | | |
| Qwen-VL-Max | - | - | - | - | 79.3 | 2281.7 | 51.4 | 51.0 |
| GPT4-V | 71.5 | 78.5 | 98.0 | 61.4 | 78.2 | 1926.0 | 56.8 | 49.9 |
| Claude-3-Opus | 68.6 | 79.7 | 98.5 | 49.8 | 88.1 | - | 59.4 | 50.5 |

Daily-life Scenarios & Qualitative Comparisons


The detailed model outputs for the HTML code scenario are available here.


Model Card

| | LLaMA-3-LLaVA-NeXT-8B | LLaVA-NeXT-72B | LLaVA-NeXT-110B |
|---|---|---|---|
| Model Size (Total) | 8.35B | 72.7B | 111.5B |
| Vision Encoder | 303.5M | 303.5M | 303.5M |
| Connector | 20.0M | 72.0M | 72.0M |
| LLM | 8.03B | 72.3B | 111.0B |
| Compute (#GPU x #Hours) | 16 A100-80G x 15~20 hours | 128 A100-80G x ~18 hours | 128 H800-80G x ~18 hours |

Shared training configuration (all three models):

  • Resolution: 336 x [(2,2), (1,2), (2,1), (1,3), (3,1), (1,4), (4,1)]
  • Stage-1: 558K training samples; trainable module: connector
  • Stage-2: ~790K training samples; trainable module: full model
  • Total training data: 1348K samples
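The resolution setting listed above gives the "anyres" grid configurations: each tuple tiles the 336x336 base resolution into a grid of patches. The sketch below enumerates the candidate input resolutions implied by that list and picks a best fit for a given image size; it is a simplified sketch of the best-resolution selection used in the LLaVA codebase, not the exact implementation, and the (width, height) orientation of the tuples is assumed.

```python
# Candidate grids from the model card: base 336x336 tiled into (a, b) patches.
GRIDS = [(2, 2), (1, 2), (2, 1), (1, 3), (3, 1), (1, 4), (4, 1)]
BASE = 336

def candidate_resolutions() -> list[tuple[int, int]]:
    """All (width, height) input resolutions implied by the grid list (orientation assumed)."""
    return [(BASE * a, BASE * b) for a, b in GRIDS]

def select_best_resolution(orig_w: int, orig_h: int) -> tuple[int, int]:
    """Pick the candidate that keeps the most image content with the least padding.

    Simplified best-fit rule: maximize the effective (downscaled-but-not-cropped)
    area, then minimize the wasted candidate area.
    """
    best, best_eff, best_waste = None, -1, float("inf")
    for cand_w, cand_h in candidate_resolutions():
        scale = min(cand_w / orig_w, cand_h / orig_h)     # resize to fit, no cropping
        eff = min(int(orig_w * scale) * int(orig_h * scale), orig_w * orig_h)
        waste = cand_w * cand_h - eff
        if eff > best_eff or (eff == best_eff and waste < best_waste):
            best, best_eff, best_waste = (cand_w, cand_h), eff, waste
    return best

print(select_best_resolution(1024, 768))  # -> (672, 672): the 2x2 grid fits best
```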

Team

  • Bo Li: Nanyang Technological University (Work collaborated with ByteDance/TikTok)
  • Kaichen Zhang: Nanyang Technological University (Work collaborated with ByteDance/TikTok)
  • Hao Zhang: Hong Kong University of Science and Technology (Work collaborated with ByteDance/TikTok)
  • Dong Guo: ByteDance/TikTok
  • Renrui Zhang: The Chinese University of Hong Kong (Work collaborated with ByteDance/TikTok)
  • Feng Li: Hong Kong University of Science and Technology (Work collaborated with ByteDance/TikTok)
  • Yuanhan Zhang: Nanyang Technological University (Work collaborated with ByteDance/TikTok)
  • Ziwei Liu: Nanyang Technological University
  • Chunyuan Li: ByteDance/TikTok

Acknowledgement

  • We thank Fanyi Pu, Shuai Liu, and Kairui Hu for their continuous contributions to lmms-eval, which accelerated our development of LLaVA-NeXT.

Related Blogs

Citation



@misc{li2024llavanext-strong,
    title={LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild},
    url={https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/},
    author={Li, Bo and Zhang, Kaichen and Zhang, Hao and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Yuanhan and Liu, Ziwei and Li, Chunyuan},
    month={May},
    year={2024}
}
    
@misc{liu2024llavanext,
    title={LLaVA-NeXT: Improved reasoning, OCR, and world knowledge},
    url={https://llava-vl.github.io/blog/2024-01-30-llava-next/},
    author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae},
    month={January},
    year={2024}
}

@misc{liu2023improvedllava,
      title={Improved Baselines with Visual Instruction Tuning}, 
      author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae},
      publisher={arXiv:2310.03744},
      year={2023},
}

@misc{liu2023llava,
      title={Visual Instruction Tuning}, 
      author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
      publisher={NeurIPS},
      year={2023},
}