🌋 LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models



HKUST · SCUT · Microsoft Research, Redmond · IDEA Research · University of Wisconsin-Madison · Tsinghua University · CUHK
* Equal Contribution    Equal Advisory Contribution    🚩 Directional Lead

Highlights

LLaVA-Grounding makes the following contributions:

  1. New grounded visual chat data. We introduce a data annotation pipeline to label high-quality Grounded Visual Chat (GVC) data. Leveraging human-labeled object detection data and the robust matching capability of GPT-4, we labeled 150K GVC instances based on the LLaVA instruction-tuning dataset (a schematic example of one instance follows this list).
  2. 🌋 LLaVA-Grounding Model. We present an end-to-end model that connects a Large Multimodal Model (LMM) with a grounding model to facilitate grounded visual chat. Our model supports both object- and pixel-level grounding, accommodating visual prompts such as mark, click, box, and scribble, and it offers a broader range of input and output prompt types than other LMMs.
  3. Grounding Bench. We establish Grounding Bench for evaluating grounded visual chat and propose an auto-evaluation pipeline aided by GPT-4. This benchmark assesses grounded visual chat capabilities and provides performance metrics for other state-of-the-art methods.
  4. Performance. Our empirical study validates the effectiveness of LLaVA-Grounding with the best overall performance on our Grounding Bench and competitive performance on traditional grounding tasks such as RefCOCO and Flickr30K.
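
For concreteness, a single GVC instance ties a chat answer to the detection boxes that its phrases were matched to. The snippet below is a hypothetical illustration of such an instance; the field names, file name, and coordinates are made up and do not reflect the released data format.

# Hypothetical GVC instance (illustration only; not the released schema).
gvc_instance = {
    "image": "coco/train2017/000000000001.jpg",           # made-up file name
    "question": "Describe the image with grounding.",
    "answer": "A woman is walking her dog along the beach.",
    "groundings": [                                        # phrase-to-box links used for grounded supervision
        {"phrase": "A woman", "box": [120.5, 80.2, 210.0, 300.4]},  # [x1, y1, x2, y2]
        {"phrase": "her dog", "box": [230.1, 250.7, 310.9, 320.0]},
    ],
}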

🌋 LLaVA-Grounding Network Architecture

LLaVA-Grounding enables grounding and visual prompts through two additional modules: a prompt encoder and a grounding model.

Prompt encoder.

    For an input image \(X_{\texttt{v}}\) and a visual prompt \(X_{\texttt{p}}\), we employ the pre-trained Semantic-SAM as the prompt encoder. This encoder extracts visual features based on the input image and visual prompts, denoted as \(Z_{\texttt{p}}=h(X_{\texttt{v}},X_{\texttt{p}})\). To convert these prompt features into language embedding tokens \(H_{\texttt{p}}\) of the same dimensionality as the word embedding space in the language model, we use a simple linear layer with a trainable projection matrix \(W_{\texttt{p}}\): \begin{equation} H_{\texttt{p}}=W_{\texttt{p}} \cdot Z_{\texttt{p}}, \text{ where } Z_{\texttt{p}}=h\left(X_{\texttt{v}},X_{\texttt{p}}\right) \end{equation}
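
As a minimal sketch, the projection above is a single linear layer applied to the prompt-encoder output; the dimensions below are illustrative assumptions, not the released configuration.

import torch
import torch.nn as nn

# Minimal sketch of H_p = W_p . Z_p (illustrative dimensions: 256 for
# Semantic-SAM prompt features, 4096 for the LLM hidden size).
prompt_dim, hidden_dim = 256, 4096
W_p = nn.Linear(prompt_dim, hidden_dim)  # trainable projection matrix W_p

# Z_p = h(X_v, X_p): prompt-encoder features for, e.g., three clicks/boxes.
Z_p = torch.randn(3, prompt_dim)
H_p = W_p(Z_p)  # (3, 4096): prompt tokens in the LLM's word-embedding space
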
Grounding model.
    In addition to the language response \(X_{\texttt{a}}\), our model produces grounding features \(X_{\texttt{g}}\). We employ a pretrained OpenSeeD model as the grounding model to generate bounding boxes \(\mathbf{B}\) and masks \(\mathbf{M}\). This process is defined as follows: \begin{equation} \mathbf{B, M}=s\left(X_{\texttt{v}},W_{\texttt{g}} \cdot X_{\texttt{g}}\right) \end{equation}
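
The sketch below shows how these grounding features would be consumed; the grounding_model callable stands in for a pretrained OpenSeeD-style model s(.), and its call signature and the dimensions are assumptions for illustration.

import torch.nn as nn

# Hypothetical sketch of B, M = s(X_v, W_g . X_g).
hidden_dim, query_dim = 4096, 256
W_g = nn.Linear(hidden_dim, query_dim)  # maps LMM grounding features into the grounding model's query space

def ground(X_v, X_g, grounding_model):
    """Return boxes B and masks M for the phrases the LMM marked for grounding."""
    queries = W_g(X_g)                    # (num_phrases, query_dim)
    B, M = grounding_model(X_v, queries)  # B: (num_phrases, 4) boxes; M: (num_phrases, H, W) masks
    return B, M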

Comparison with other LMMs: Grounded detailed description

Example 1: A real-life image.
Example 2: An open-set concept "dragon".
Example 3: A real-life image.

BibTeX


@misc{zhang2023llavagrounding,
      title={LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models},
      author={Hao Zhang and Hongyang Li and Feng Li and Tianhe Ren and Xueyan Zou and Shilong Liu and Shijia Huang and Jianfeng Gao and Lei Zhang and Chunyuan Li and Jianwei Yang},
      year={2023},
      howpublished={arXiv preprint}
}
  

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, as well as the open-source projects Alpaca and Vicuna.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaVA, CLIP, LLaMA, Vicuna, and GPT-4. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research.

Related Links: [REACT] [GLIGEN] [Computer Vision in the Wild (CVinW)] [Instruction Tuning with GPT-4]