πŸŒ‹ LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills

Learning to Use Tools For Creating Multimodal Agents



Tsinghua University Microsoft Research University of Wisconsin-Madison HKUST IDEA Research
* Work performed during an internship at Microsoft    🚩 Project Lead

LLaVA-Plus capabilities enabled by plugging in and learning to use skills

Highlights

LLaVA-Plus maintains a skill repository that contains a wide range of pre-trained vision and vision-language models (tools). Given users’ multimodal inputs, it activates the relevant tools and composes their execution results on the fly to fulfill many real-world tasks.

  1. New multimodal instruction-following tool-use data. We present a new pipeline for curating vision-language instruction-following data dedicated to tool use in human-AI interaction sessions, leveraging ChatGPT and GPT-4 as labeling tools.
  2. πŸŒ‹ LLaVA-Plus Model. We have developed LLaVA-Plus, a general-purpose multimodal assistant that extends LLaVA by incorporating a large and diverse set of external tools that can be selected, composed, and activated on the fly to perform tasks.
  3. Performance. Our empirical study validates the effectiveness of LLaVA-Plus, with consistently improved results on multiple benchmarks and, in particular, a new SoTA on VisIT-Bench, which covers a diverse set of real-life tasks.
  4. Open-source. We will release the following assets to the public: the generated multimodal instruction data, the codebase, the LLaVA-Plus checkpoints, and a visual chat demo.
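
The skill repository described above can be viewed as a simple registry that maps tool names to callable models, with the assistant's structured skill-use output deciding which entry to invoke. Below is a minimal sketch of that idea in Python; the tool name, the api_name/api_params fields, and the placeholder detector are illustrative assumptions, not the released LLaVA-Plus implementation.

# Minimal sketch of a skill repository: a registry mapping tool names to
# callables, plus a dispatcher for the assistant's structured skill calls.
# Names and fields here are illustrative, not the released implementation.
from typing import Any, Callable, Dict

SKILL_REPOSITORY: Dict[str, Callable[..., Any]] = {}

def register_skill(name: str):
    """Register a vision/vision-language tool under a string name."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        SKILL_REPOSITORY[name] = fn
        return fn
    return decorator

@register_skill("detection")
def detect_objects(image: Any, prompt: str):
    # Placeholder: in practice this would call an open-set detector
    # (e.g. a grounding model) with the text prompt written by the assistant.
    return [{"phrase": prompt, "box": [0.0, 0.0, 1.0, 1.0], "score": 1.0}]

def execute_skill(skill_use: Dict[str, Any], image: Any):
    """Dispatch the assistant's skill call to the matching tool and return its result."""
    tool = SKILL_REPOSITORY[skill_use["api_name"]]
    return tool(image, **skill_use.get("api_params", {}))

For example, a call like execute_skill({"api_name": "detection", "api_params": {"prompt": "frisbee"}}, image) returns the detector's output, which is then serialized as the skill result and fed back to the assistant.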

πŸŒ‹ LLaVA-Plus Human-AI Interaction Process

LLaVA-Plus enables tool use in four steps.

    β‘  Humans provide a task instruction \(X_{q}\) related to an image \(I_{q}\).
    β‘‘ The LMM-powered assistant analyzes both \(X_{q}\) and \(I_{q}\), and outputs \(X_{skill\_use}\), which selects a tool from the skill repository and writes the appropriate prompt as the tool argument.
    β‘’ By executing the tool, the result \(X_{skill\_result}\) is returned to the assistant.
    β‘£ The assistant aggregates \(X_{skill\_result}\) with \(X_{q}\) and \(I_{q}\), and outputs \(X_{answer}\) to humans.
The interaction can be represented as:
    Humans: \(I_q\) <\n> \(X_{q}\) <STOP> Assistant: \(X_{skill\_use}\) <STOP>
    Humans: \(X_{skill\_result}\) <STOP> Assistant: \(X_{answer}\) <STOP>
Only the green sub-sequences (or tokens) are used to compute the loss, so the model learns to predict skill use, answers, and when to stop. One example of a training data sequence is shown below.

Training Data Example
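
To make the training format concrete, below is a hedged sketch of how such a two-round sequence could be serialized with loss masking, so that only the assistant's outputs (skill use, answer, and the stop signal) are supervised. It assumes a HuggingFace-style tokenizer and treats <STOP> and the <image> placeholder as plain strings; the exact special tokens and preprocessing in LLaVA-Plus may differ.

# Sketch: build one two-round training sequence and mask the loss so that only
# the assistant's tokens are supervised. The tokenizer interface and
# special-token handling are assumptions, not the exact LLaVA-Plus preprocessing.
IGNORE_INDEX = -100  # standard ignore index for cross-entropy loss in PyTorch

def build_training_sequence(tokenizer, x_q, x_skill_use, x_skill_result, x_answer):
    rounds = [
        # (prompt the model conditions on, target the model is trained to emit)
        (f"Humans: <image>\n{x_q} <STOP> Assistant: ", f"{x_skill_use} <STOP>"),
        (f"Humans: {x_skill_result} <STOP> Assistant: ", f"{x_answer} <STOP>"),
    ]
    input_ids, labels = [], []
    for prompt, target in rounds:
        prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
        target_ids = tokenizer.encode(target, add_special_tokens=False)
        input_ids += prompt_ids + target_ids
        # Human turns are masked out; only the assistant's outputs contribute
        # to the loss, so the model learns skill use, answers, and when to stop.
        labels += [IGNORE_INDEX] * len(prompt_ids) + target_ids
    return input_ids, labels

At inference time the same template is used: the model generates up to the first <STOP>, the chosen tool is executed, and its result is appended as the next "Humans" turn before generation resumes.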

Preliminary Evaluation: New Application Scenarios with Learning to Use Tools

Selected examples (figure captions from the demo gallery):

  1. Comparison of detection capabilities and their impact on visual chat. LLaVA-Plus is the only system able to detect the frisbee and leverage the location information to describe the object's motion and status, as well as the human activity, revealing the importance of object localization for the LMM response.
  2. Detection for counting and actions.
  3. Language-enriched detection and description.
  4. External knowledge retrieval helps improve entity- and fact-based responses.
  5. LLaVA-Plus rewrites user instructions into SD-favored language prompts for image generation.
  6. Semantic segmentation and mask-based conditional image generation with LLaVA-Plus. Purple denotes human questions, green denotes LLaVA-Plus responses. The semantic segmentation task is fulfilled via OpenSeeD. Based on the segmented images, the new editing instructions, and the dialogue history, InstructPix2Pix and ControlNet can be called to complete the tasks. The captions of the target edited images are generated by LLaVA-Plus, revealing the unique advantage of an LMM for tool use.
  7. Multimodal social media posts created by editing an image and writing a message (two examples). Four seasonal edits of the same image are generated, each paired with text written to attract attention on Instagram.
  8. Multimodal social media post on fireworks.
  9. Visual prompt: multi-granularity segmentation from a user-input point, using Semantic-SAM.
  10. Visual prompt: visual referring image segmentation with LLaVA-Plus. Purple denotes human questions, green denotes LLaVA-Plus responses. Users can draw a stroke on the reference image (a red curve) as the visual target to segment; LLaVA-Plus calls the SEEM model to predict the corresponding masks in the target image.

BibTeX


@misc{liu2023llavaplus,
  title={LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents},
  author={Shilong Liu and Hao Cheng and Haotian Liu and Hao Zhang and Feng Li and Tianhe Ren and Xueyan Zou and Jianwei Yang and Hang Su and Jun Zhu and Lei Zhang and Jianfeng Gao and Chunyuan Li},
  year={2023},
  booktitle={arXiv}
}
  

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, and the open-source projects Alpaca and Vicuna.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaVA, CLIP, LLaMA, Vicuna, and GPT-4. The dataset is licensed under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

Related Links: [REACT] [GLIGEN] [Computer Vision in the Wild (CVinW)] [Instruction Tuning with GPT-4]