![[LISA++] Making Vision Models Talk and Point at the Same Time](https://i.imgur.com/L8uRLQu.png)
[LISA++] Making Vision Models Talk and Point at the Same Time

Paper at a Glance
- Paper Title: LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model
- Authors: Senqiao Yang, Tianyuan Qu, Xin Lai, Zhuotao Tian, Bohao Peng, Shu Liu, Jiaya Jia
- Affiliation: The Chinese University of Hong Kong, SmartMore
- Published in: arXiv, December 2023
- Link to Paper: https://arxiv.org/abs/2312.17240
- Models: https://huggingface.co/collections/Senqiao/lisa-67713837a32d6abf516a162e
The Gist of It: TL;DR
In one sentence: This paper introduces LISA++, an enhanced version of the LISA model that teaches a large language model not only to segment objects in an image based on complex instructions but also to distinguish between different instances of the same object and seamlessly embed these segmentations into a natural, multi-turn conversation.
Why It Matters: The Big Picture
We’ve seen an explosion of Large Multimodal Models (LMMs) like GPT-4V that can look at an image and describe it with stunning detail. You can ask, “What’s happening in this picture?” and get a rich, textual answer. But what if you ask, “Which horse is the darker one?” A text-only model can describe it, but it can’t show you. It can’t point.
This is the gap that “reasoning segmentation” models aim to fill. The original LISA model was a major step forward, introducing a clever way for a language model to output not just text, but also a segmentation mask—effectively highlighting a specific region in an image. However, LISA had its own limitations. It struggled to tell apart two different objects of the same class (e.g., two different horses) and its responses were clunky and robotic, often starting with a fixed phrase like “Sure, it is [SEG]”.
LISA++ is the authors’ answer to these problems. It’s not a complete architectural redesign but a powerful upgrade that makes the model smarter, more precise, and a much better conversationalist, bringing us closer to a truly interactive visual assistant.
The Core Idea: How It Works
The magic of LISA++ isn’t in adding new complex modules to the model. Instead, it’s all about a smarter way of training the model by curating better data. The authors retain the core mask-as-embedding paradigm from the original LISA but teach it two crucial new skills.
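To make the mask-as-embedding idea concrete, here is a minimal sketch of the inference flow. The component names (`mm_llm`, `seg_projector`, `mask_decoder`) and the token id are hypothetical stand-ins for the multimodal LLM, a projection layer, and a SAM-style mask decoder; this is an illustration of the paradigm described in this post, not the authors’ actual implementation.

```python
# Minimal sketch of the mask-as-embedding paradigm (illustrative, not the authors' code).
import torch

SEG_TOKEN_ID = 32003  # hypothetical vocabulary id assigned to the special <SEG> token

def generate_text_and_masks(mm_llm, seg_projector, mask_decoder, image, prompt):
    """Sketch of the flow: a text reply, plus one mask per generated <SEG> token."""
    # 1. The multimodal LLM autoregressively generates a reply that may contain
    #    one or more <SEG> tokens; we also keep its last-layer hidden states.
    token_ids, hidden_states = mm_llm.generate(image, prompt, return_hidden_states=True)

    # 2. Grab the hidden state at every position where a <SEG> token was emitted
    #    and project it into the mask decoder's embedding space.
    seg_positions = (token_ids == SEG_TOKEN_ID).nonzero(as_tuple=True)[0]
    seg_embeddings = seg_projector(hidden_states[seg_positions])  # (num_seg, d_mask)

    # 3. A mask decoder turns each embedding into a binary mask for the image.
    masks = [mask_decoder(image, emb) for emb in seg_embeddings]

    text = mm_llm.decode(token_ids)  # natural-language reply containing <SEG> markers
    return text, masks
```

The key point is that each generated `<SEG>` token carries an embedding that a lightweight decoder can turn into a pixel-level mask, so the language model itself needs no architectural surgery.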
1. The Problem They’re Solving
The original LISA model faced two key challenges:
- No Instance Awareness: If an image contained two cars, and you asked a question about “the cars,” LISA would likely segment both as a single blob. It couldn’t differentiate between “car A” and “car B.”
- Unnatural Dialogue: The model’s responses were rigid. The segmentation token `<SEG>` was always tacked onto a canned phrase, making the dialogue feel unnatural and limiting its flexibility.
2. The Key Innovation
LISA++ solves these problems through a refined instruction-tuning process. The authors used GPT-4V to generate new training data from existing datasets (like COCO), specifically designed to teach the model two new abilities:
- Reasoning Instance Segmentation (InstSeg): This teaches the model to distinguish between individual objects. For example, given an image with two cats, the training data might include the question, “Please identify all the furry companions,” with an answer that maps to two separate segmentation masks.
- Segmentation in Dialogue (SiD): This teaches the model to weave the segmentation tokens directly and naturally into its text responses. Instead of a rigid format, the model learns to say things like, “The image features a large teddy bear [SEG] sitting on a wooden armchair [SEG].”
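To give a feel for what this GPT-4V-curated data might look like, here is a hypothetical SiD training record. The field names, file path, and RLE placeholders are illustrative assumptions; the post does not reproduce the exact schema.

```python
# Hypothetical example of a Segmentation-in-Dialogue (SiD) training record.
sid_example = {
    "image": "coco/train2017/000000000123.jpg",  # made-up file name
    "conversation": [
        {"role": "user",
         "content": "Describe the scene and point out the main objects."},
        {"role": "assistant",
         # Each [SEG] token in the text is paired, in order, with one mask below.
         "content": "The image features a large teddy bear [SEG] "
                    "sitting on a wooden armchair [SEG]."},
    ],
    # Ground-truth binary masks (e.g., RLE-encoded COCO annotations), one per [SEG].
    "masks": ["<rle mask for the teddy bear>", "<rle mask for the armchair>"],
}
```

An InstSeg record would look similar, except the answer maps a single query (“all the furry companions”) to several separate masks, one per instance.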
3. The Method, Step-by-Step
The overall process builds directly on the original LISA framework.
- LISA’s Foundation: The model is built on an LMM (LLaVA v1.5). It learns a special token, `<SEG>`. When the LMM generates this token as part of its text output, a separate decoder simultaneously generates a segmentation mask corresponding to that token (sketched in the code block above).
- Teaching Instance Awareness: During training for instance segmentation, if the model predicts multiple masks for a query like “segment the horses,” the system uses a technique called bipartite matching (popularized by models like DETR). This finds the best one-to-one mapping between the predicted masks and the ground-truth masks, allowing the model to be properly trained to separate each instance (see the matching sketch after this list).
- Teaching Natural Conversation: For SiD, the training data is structured as natural dialogue or detailed captions where `<SEG>` tokens are embedded within sentences. The model learns through standard language modeling to generate these fluid responses.
- Unified and Controllable Training: LISA++ is trained on a mixture of data for old and new tasks (semantic segmentation, instance segmentation, SiD, pure conversation). Crucially, the model is guided by task-specific instruction templates (shown in Table 1 of the paper). This allows a user to control the model’s behavior at inference time by simply changing the instruction: asking for instance segmentation, semantic segmentation, or just a text-only answer.
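The bipartite matching step can be illustrated with a small, self-contained sketch: a DETR-style one-to-one assignment between predicted and ground-truth masks via the Hungarian algorithm. The negative-IoU cost used here is a simplifying assumption; the actual training objective likely combines several loss terms.

```python
# Sketch of DETR-style bipartite matching between predicted and ground-truth masks.
import numpy as np
from scipy.optimize import linear_sum_assignment

def mask_iou(pred, gt):
    """IoU between two binary masks of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

def match_masks(pred_masks, gt_masks):
    """Return (pred_idx, gt_idx) pairs giving the lowest-cost one-to-one matching."""
    cost = np.zeros((len(pred_masks), len(gt_masks)))
    for i, p in enumerate(pred_masks):
        for j, g in enumerate(gt_masks):
            cost[i, j] = -mask_iou(p, g)  # lower cost = better overlap
    pred_idx, gt_idx = linear_sum_assignment(cost)  # Hungarian algorithm
    return list(zip(pred_idx, gt_idx))

# Toy usage: two 4x4 predicted masks matched against two ground-truth masks.
gt = [np.eye(4, dtype=bool), ~np.eye(4, dtype=bool)]
preds = [~np.eye(4, dtype=bool), np.eye(4, dtype=bool)]
print(match_masks(preds, gt))  # -> [(0, 1), (1, 0)]
```

Once each predicted mask is paired with its ground-truth counterpart, the per-mask losses can be computed pair by pair, which is what lets the model learn to keep “horse A” and “horse B” apart instead of blending them into one blob.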
As shown in the paper’s first figure, this leads to a huge improvement. While LISA lumps both horses together, LISA++ can distinguish them and discuss them separately in a natural, multi-turn dialogue.
Key Experimental Results
LISA++ was evaluated on both the original semantic segmentation benchmark and a new, more challenging instance segmentation benchmark.
- Mastering Instance Segmentation: On the new ReasonSeg-Inst benchmark, LISA++ demonstrates a massive leap in performance (Table 2). The 7B parameter version of LISA++ achieves an AP50 of 34.1, more than doubling the 15.7 score of the larger LISA-13B model. This clearly shows the effectiveness of the new instance-focused training data.
- No Compromise on Existing Skills: Impressively, adding these new capabilities didn’t degrade the model’s original skills. On the ReasonSeg-Sem (semantic segmentation) benchmark, LISA++ performs on par with, and even slightly better than, the original LISA model (Table 3). This highlights the generalizability of the framework.
- Qualitative Prowess: The visual examples (Figures 4, 5, and 6) are compelling. They show LISA++ successfully captioning an image with embedded segmentations, correctly identifying multiple “white cars” as separate instances, and engaging in a logical multi-turn dialogue about a baseball scene, correctly segmenting the player and the bat in response to follow-up questions.


A Critical Look: Strengths & Limitations
Strengths / Contributions
- Elegant Extension, Major Impact: The primary contribution is showing that a simple and effective paradigm (“mask-as-embedding”) can be extended to handle significantly more complex tasks like instance segmentation and natural dialogue integration without any architectural changes. It’s a powerful case study in the effectiveness of data curation and instruction tuning.
- Vastly Improved User Interaction: The Segmentation in Dialogue (SiD) capability is a significant step toward making these models practical assistants. It transforms the interaction from a rigid command-response system into a fluid, natural conversation where the AI can point things out contextually.
- New Public Benchmark: The creation of the ReasonSeg-Inst dataset provides a valuable resource for the research community to properly evaluate and benchmark models on the challenging task of instance-level reasoning segmentation.
Limitations / Open Questions
- Dependency on GPT-4V for Data Curation: The quality of the new instruction-tuning datasets is fundamentally tied to the capabilities of GPT-4V. Any inherent biases, inaccuracies, or logical gaps in GPT-4V could be baked into LISA++'s training data, potentially limiting its robustness.
- Incremental by Design: The work is explicitly framed as an improvement (“++”) on LISA. While the improvements are substantial and highly valuable, it builds upon an existing foundation rather than proposing a fundamentally new approach to reasoning segmentation.
- Generalization to the Wild: The experiments are based on standard academic datasets (COCO, ADE20K). How the model performs on truly “in-the-wild” images with complex, crowded scenes or in specialized domains (e.g., medical imaging, satellite imagery) remains an open question.
Contribution Level: Significant Improvement. LISA++ makes a substantial leap forward from its predecessor. It addresses LISA’s most critical shortcomings and introduces new capabilities that are essential for real-world applications. While it doesn’t reinvent the wheel, it polishes it to a brilliant shine, setting a new and much stronger baseline for future work in conversational visual reasoning.
Conclusion: Potential Impact
LISA++ is a prime example of how progress in AI is not always about building bigger, more complex architectures. Sometimes, the most impactful work comes from teaching existing models new tricks through clever data and training strategies. By enabling models to distinguish between individual objects and discuss them naturally in conversation, this work paves the way for more sophisticated and intuitive AI assistants. Future research can now build on this stronger baseline to tackle even more nuanced visual dialogues, making our interactions with AI more like collaborating with a perceptive human partner.
- Title: [LISA++] Making Vision Models Talk and Point at the Same Time
- Author: Jellyfish
- Created at: 2025-10-10 14:59:15
- Updated at: 2025-10-11 06:17:56
- Link: https://makepaperseasy.com/posts/20251010145915/
- License: This work is licensed under CC BY-NC-SA 4.0.