Image Chat, Segmentation and Generation/Editing -- All-in-one demo

Microsoft Research, Redmond

LLaVA-Interactive is a large language-and-vision assistant demo, built to demonstrate the possibilities of multimodal human-machine interaction: visual input, visual output, and visual interaction. It combines complementary skills from three models: visual chat from LLaVA, visual prompting for segmentation from SEEM, and visual prompting for image generation/editing from GLIGEN. Together, these enable multimodal interaction that goes beyond the language-only interaction of LLaVA/GPT-4V.

LLaVA-Interactive is a system-level combination of the inference stages of three models, with no additional model training, which makes it surprisingly cheap to build. Check out our code release on GitHub.

For a better demo experience, please open LLaVA-Interactive in a separate tab.


The rapid advancement of large language models (LLMs) has revolutionized chatbot systems, resulting in unprecedented levels of intelligence as seen in OpenAI's ChatGPT. The success of ChatGPT on language tasks has led the community to anticipate a similar success in the multimodal space, where both language and vision (LV) modalities are involved in human-machine interaction to unlock many new scenarios. This has made building general-purpose assistants an increasingly popular research topic. GPT-4V is one such example, taking a step forward by showcasing the capabilities of chatbots with language-image input and language output. However, despite its impressive performance, GPT-4V is limited in two ways: (1) it is largely a language interaction system, where input images only provide additional context for chat; (2) its training and architecture details remain unclear, hindering research and open-source innovation in this field.

To demonstrate new application scenarios for general-purpose assistants in the multimodal space, we introduce LLaVA-Interactive, an open-source demo system backed by three powerful LV models and an easy-to-use, extensible framework. LLaVA-Interactive offers two advantages:

  1. Visual Interaction. In addition to visual chat, it supports visual prompts by allowing users to draw strokes and bounding boxes to better express their intent in the visual creation process (including image segmentation and generation/editing). As a result, LLaVA-Interactive offers a more engaging human-machine interaction experience than GPT-4V/LLaVA in terms of following human intent.
  2. Open-source. We make our demo system and code base publicly available to facilitate future improvement by the community.

This blog post provides a preliminary evaluation of LLaVA-Interactive's new capabilities and describes its workflow and serving infrastructure. We also invite the community to interact with our online demo to test the capabilities of this multimodal chatbot.


The figure shows the workflow of LLaVA-Interactive. A typical visual creation process proceeds as follows:

  1. Image Input: To begin, an image is needed. The user can either upload an image or generate one by specifying a language caption and drawing bounding boxes for the intended spatial layout of the objects. Once the image is ready, the user can work with it through any of the following three steps: chat, segmentation, or editing.
  2. Visual Chat: Ask any question about the image, e.g., for suggestions on how to revise it. Based on the editing suggestions, one may remove or add objects using Step 3 or 4, respectively.
  3. Interactive Segmentation: One may segment an object mask using either stroke drawing or a text prompt. To remove the object, drag the mask out of the image, and the background will be filled in automatically. Alternatively, the mask can be dragged to a different location. To fill in a new object, provide a text prompt for the mask.
  4. Grounded Editing: One may place new objects directly on the image by drawing bounding boxes and associating a concept with each intended object.
  5. Multi-turn Interaction: Repeat Step 2, 3, or 4 to iteratively refine the visual creation.
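
The drag behaviour in Step 3 can be thought of as simple mask bookkeeping: moving a mask leaves behind a hole that needs background filling, and dragging it fully out of the image amounts to removing the object. The following is a minimal sketch of that idea, not the demo's actual implementation. The names `shift_mask` and `plan_drag` are hypothetical, and the mask is assumed to be a plain nested list of booleans.

```python
from typing import List, Tuple

Mask = List[List[bool]]  # mask[y][x] is True where the segmented object is


def shift_mask(mask: Mask, dx: int, dy: int) -> Mask:
    """Translate a binary mask by (dx, dy), dropping pixels that leave the image."""
    height, width = len(mask), len(mask[0])
    shifted = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                new_x, new_y = x + dx, y + dy
                if 0 <= new_x < width and 0 <= new_y < height:
                    shifted[new_y][new_x] = True
    return shifted


def plan_drag(mask: Mask, dx: int, dy: int) -> Tuple[Mask, Mask]:
    """Return (hole, moved) for dragging the masked object by (dx, dy).

    hole:  pixels the object vacated, which need background filling.
    moved: the object's footprint at its new location (all False if the
           object was dragged completely out of the image).
    """
    moved = shift_mask(mask, dx, dy)
    hole = [
        [mask[y][x] and not moved[y][x] for x in range(len(mask[0]))]
        for y in range(len(mask))
    ]
    return hole, moved


def count(mask: Mask) -> int:
    """Number of True pixels in a mask."""
    return sum(cell for row in mask for cell in row)
```

With this bookkeeping, a drag that ends outside the image yields an empty `moved` mask and a `hole` equal to the original mask, which corresponds to the removal case described in Step 3.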

Capability Comparisons

LLaVA accepts image input for visual chat only. LLaVA-Interactive extends it to support visual interaction, such as user-drawn strokes and bounding boxes, as well as image generation/editing. The capabilities compare as follows:
System | Visual Input | Visual Output | Visual Interaction

Behind the Scenes: Individual Models

LLaVA-Interactive is an all-in-one demo that connects three LV models in one interactive session for image chat, segmentation, and generation/editing, and can therefore complete more complex tasks than any single model alone. As background, we briefly describe the individual models for readers interested in the key techniques:
  • LLaVA: Large Language and Vision Assistant, the first open-source alternative to GPT-4V. It is an end-to-end trained large multimodal model that combines a CLIP vision encoder and Vicuna for general-purpose visual understanding and reasoning, achieving impressive chat capabilities in the spirit of GPT-4V.
  • SEEM: Segment Everything Everywhere with Multi-modal prompts all at once. SEEM allows users to easily segment an image using prompts of different types including visual prompts (points, marks, boxes, scribbles) and language prompts. It can also work with any combination of prompts or generalize to custom prompts.
  • GLIGEN: Grounded-Language-to-Image Generation, an open-source model that extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on visual prompts such as bounding boxes.
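
To make the idea of conditioning on bounding boxes more concrete, here is a minimal sketch of how user-drawn boxes and their concepts might be packaged before being passed to a grounded generation model. It is an illustration only: the `GroundedObject` class and `to_grounding_inputs` function are hypothetical, and the normalized `[x0, y0, x1, y1]` box format in the range [0, 1] is an assumption that should be checked against the interface of whichever GLIGEN implementation is used.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GroundedObject:
    phrase: str                        # concept the user associates with the box
    box_px: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in canvas pixels


def to_grounding_inputs(
    objects: List[GroundedObject], width: int, height: int
) -> Tuple[List[str], List[List[float]]]:
    """Convert drawn boxes into parallel lists of phrases and normalized boxes.

    Assumption: the downstream model expects [x0, y0, x1, y1] in [0, 1].
    """
    phrases: List[str] = []
    boxes: List[List[float]] = []
    for obj in objects:
        x0, y0, x1, y1 = obj.box_px
        # A user may drag in any direction, so order the corners first.
        x0, x1 = sorted((x0, x1))
        y0, y1 = sorted((y0, y1))
        # Clamp to the canvas so normalized values stay within [0, 1].
        x0, x1 = max(0, x0), min(width, x1)
        y0, y1 = max(0, y0), min(height, y1)
        if x1 <= x0 or y1 <= y0:
            continue  # skip boxes with no area inside the canvas
        phrases.append(obj.phrase)
        boxes.append([x0 / width, y0 / height, x1 / width, y1 / height])
    return phrases, boxes
```

Keeping phrases and boxes as parallel lists makes it straightforward to pair each concept with its layout, alongside the overall caption, when calling the generation model.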

Development Challenges

LLaVA-Interactive is a system-level demo that leverages existing model checkpoints to build general-purpose assistants/agents, without any additional model training. Although the model training requirements are low, building the system posed several technical challenges, which we addressed along the way:

  1. The GLIGEN inpainting model was not designed to fill the background hole left by a removed object. We used LAMA for background filling instead.
  2. Gradio did not have enough support for user interaction such as drag-and-drop. We solved this by implementing a new Gradio Image component that enables this functionality.
  3. Integrating several projects and models, each already complex on its own, was difficult. We overcame this by experimenting with different approaches and settling on a very clean UI layout and an efficient data sharing scheme.
  4. The models have different package requirements and dependencies. We dealt with this by running different models, such as LAMA, as separate web services.
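
The approach of running a model as a separate web service, so that its dependencies stay isolated from the main demo, can be illustrated with a small sketch that uses only the Python standard library. Everything here is an assumption for illustration: the JSON payload shape and the function names are hypothetical, and `fill_holes` is a trivial stand-in (mean fill over a flat pixel list), not LAMA. In a real deployment the service would run in its own process and environment; a background thread is used here only to keep the example self-contained.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def fill_holes(pixels, hole):
    """Stand-in for an inpainting model: fill hole pixels with the mean of the rest."""
    known = [p for p, h in zip(pixels, hole) if not h]
    fill = sum(known) / len(known) if known else 0.0
    return [fill if h else p for p, h in zip(pixels, hole)]


class FillHandler(BaseHTTPRequestHandler):
    """Handles POST requests carrying {"pixels": [...], "hole": [...]} as JSON."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length))
        result = fill_holes(request["pixels"], request["hole"])
        body = json.dumps({"pixels": result}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress per-request logging for the example


def start_service(port=0):
    """Start the service on localhost; port 0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", port), FillHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_service(url, pixels, hole):
    """Client side: the main demo only needs HTTP and JSON, not the model's packages."""
    payload = json.dumps({"pixels": pixels, "hole": hole}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    # Bypass any system proxy settings, since the service is local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=10) as response:
        return json.loads(response.read())["pixels"]
```

A caller would start the service once, build a URL from `server.server_address`, and then treat `call_service` like a local function. The design choice is that the main application and each model only have to agree on a small wire format, so their package versions never need to be compatible with each other.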

Case Study: Multimodal Interactive Creation for Photographic Artists

Preliminary Evaluation: Sparks of New Application Scenarios



    author      = {Chen, Wei-Ge and Spiridonova, Irina and Yang, Jianwei and Gao, Jianfeng and Li, Chunyuan},
    title       = {LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing},
    publisher   = {https://llava-vl.github.io/llava-interactive},
    year        = {2023}


This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, and open-source projects, including Alpaca and Vicuna.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of SEEM, GLIGEN, CLIP, LLaMA, Vicuna, and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

Related Links: [LLaVA] [SEEM] [GLIGEN] [Computer Vision in the Wild (CVinW)]