
Is GPT-4V a True Expert? A Deep Dive into MMMU, the AI 'College Exam' That Even Top Models Fail

Paper at a Glance
- Paper Title: MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
- Authors: Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen
- Affiliation: A multi-institutional collaboration including IN.AI Research, University of Waterloo, and The Ohio State University.
- Published in: Conference on Computer Vision and Pattern Recognition (CVPR), 2024
- Link to Paper: https://openaccess.thecvf.com//content/CVPR2024/html/Yue_MMMU_A_Massive_Multi-discipline_Multimodal_Understanding_and_Reasoning_Benchmark_for_CVPR_2024_paper.html
- Project Page: https://mmmu-benchmark.github.io/
The Gist of It: TL;DR
In one sentence: This paper introduces MMMU, a massive and diverse benchmark of 11,500 college-level multimodal questions across six disciplines, designed to rigorously evaluate the expert-level reasoning and perception capabilities of advanced AI models on the path to Artificial General Intelligence (AGI).
Why It Matters: The Big Picture
Large Multimodal Models (LMMs) like GPT-4V and Gemini have shown astounding abilities to understand and discuss images. But are they truly intelligent, or are they just very good at recognizing common objects? Most existing benchmarks test these models with everyday scenarios—like identifying a cat in a photo—which is the equivalent of giving a PhD student a middle school quiz. It tells you they have basic competency, but it doesn’t measure their expert knowledge or their ability to perform complex, specialized reasoning.
The authors of MMMU argue that to measure progress towards “Expert AGI”—an AI that can perform on par with a skilled human professional—we need a much, much harder test. We need to move beyond common sense and into the realm of specialized, domain-specific knowledge. This is where MMMU comes in. It is designed to be the “college final exams” for AI, testing models on tasks that require genuine expertise, from interpreting medical MRIs to analyzing circuit diagrams.
The Core Idea: How It Works
The key innovation here isn’t a new model, but a new, meticulously crafted benchmark designed to push current models to their limits. The MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning) benchmark presents four core challenges that set it apart from previous evaluation datasets.
1. The Problem They’re Solving
Existing multimodal benchmarks are too simple. They lack both the breadth (covering many expert fields) and depth (requiring college-level thinking) to truly challenge state-of-the-art models. An AI can ace a test on identifying objects in a living room but fail spectacularly when asked to solve a calculus problem presented as a graph or identify an abnormality in a pathology image. MMMU was created to fill this critical evaluation gap.
2. The Key Innovation
The MMMU benchmark is built on the principle that an expert AI must demonstrate proficiency across a wide array of specialized domains. As shown in Figure 1 of the paper, it achieves this through four key design pillars (a short data-loading sketch follows the list):
- Comprehensiveness: The benchmark includes 11,500 questions spanning six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These are further broken down into 30 subjects and 183 subfields.
- Highly Heterogeneous Images: This isn’t just about photos. MMMU contains 30 distinct image types, including charts, diagrams, chemical structures, sheet music, medical scans, and paintings, testing a model’s ability to perceive and interpret complex, symbolic information.
- Interleaved Text and Images: Many questions require the model to understand a sequence of text and images together, forcing a deeper, contextual understanding of how they relate.
- Expert-Level Skills: The problems demand a sophisticated blend of visual perception, recalled domain-specific knowledge, and deliberate, multi-step reasoning to arrive at the correct answer.
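To make this structure concrete, here is a minimal sketch of how one might browse a few MMMU questions with the Hugging Face `datasets` library. It assumes the public release under the `MMMU/MMMU` namespace with one configuration per subject; the specific config and field names shown are illustrative and may differ across dataset versions.

```python
# Minimal sketch for inspecting MMMU questions via the Hugging Face Hub.
# Assumes the public release at "MMMU/MMMU" with per-subject configurations
# and dev/validation/test splits; exact config and field names may vary.
from datasets import load_dataset

# Each of the 30 subjects is a separate configuration
# (e.g. "Clinical_Medicine"; the name here is illustrative).
ds = load_dataset("MMMU/MMMU", "Clinical_Medicine", split="validation")

for example in ds.select(range(3)):
    # Questions interleave text with placeholders such as "<image 1>",
    # which refer to the example's image fields (image_1, image_2, ...).
    print(example["question"])
    print("Options:", example["options"])      # stored as a stringified list
    print("Answer:", example["answer"])        # e.g. "B" for multiple-choice
    print("Type:", example["question_type"])   # e.g. "multiple-choice" or "open"
    print("-" * 40)
```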
3. The Method, Step-by-Step
Creating this “AI college exam” was a massive undertaking:
- Data Collection: A team of over 50 university students and researchers, specializing in the target fields, manually collected questions from college exams, quizzes, and textbooks.
- Curating for Difficulty: The team meticulously filtered the collected data, removing problems that were too simple or didn’t require genuine multimodal understanding. The goal was to create a dataset that tests the upper limits of AI capability.
- Quality Control: A rigorous multi-stage review process was used to check for duplicates, fix typos, standardize formatting, and ensure the overall quality and difficulty of the benchmark.
Key Experimental Results
The authors tested 28 open-source LMMs, plus the powerful proprietary models GPT-4V and Gemini. The results, detailed in Table 2, are sobering for anyone thinking AGI is just around the corner.
Leaderboard: https://mmmu-benchmark.github.io/#leaderboard
- Even the Best Models Struggle: The top-performing models, Gemini Ultra and GPT-4V, achieved accuracies of only 59% and 56%, respectively. This is a far cry from the 88.6% scored by the best human experts (college seniors) on the same questions, highlighting a massive performance gap.
- A Stark Divide Between Open-Source and Proprietary Models: The best open-source models at the time of publication hovered around 34-46% accuracy, showing a significant disparity in capabilities compared to their closed-source counterparts.
- Simple Tricks Don’t Work: The researchers found that simply feeding text extracted from an image (via OCR) or an image caption to a powerful text-only LLM like GPT-4 yielded poor results (a minimal sketch of this kind of pipeline follows the list). This confirms that MMMU requires a deep, integrated understanding of both vision and text, not just surface-level processing.
- Why Models Fail: An error analysis on 150 of GPT-4V’s mistakes (Figure 5) revealed the primary bottlenecks: 35% were perceptual errors (misinterpreting the image), 29% were from a lack of knowledge, and 26% were due to flawed reasoning.
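To illustrate what such a text-only pipeline looks like, here is a minimal sketch of the caption-plus-LLM baseline idea using the OpenAI Python client. The prompt wording and the `get_caption()` helper are my own assumptions for illustration, not the authors' exact setup.

```python
# Sketch of a "caption + text-only LLM" baseline in the spirit of the paper's
# ablation: the image is replaced by a caption (or OCR text) and the question
# is answered by a text-only model. Prompt wording and get_caption() are
# illustrative assumptions, not the authors' code.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def get_caption(image_path: str) -> str:
    """Placeholder for any off-the-shelf captioning or OCR model."""
    raise NotImplementedError("Plug in a captioner or OCR engine here.")

def answer_with_text_only_llm(question: str, options: list[str], image_path: str) -> str:
    caption = get_caption(image_path)
    option_block = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    prompt = (
        f"Image description: {caption}\n\n"
        f"Question: {question}\n{option_block}\n\n"
        "Answer with the letter of the correct option."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```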

A Critical Look: Strengths & Limitations
Strengths / Contributions
- Unprecedented Scope and Depth: MMMU is the first benchmark to comprehensively evaluate AI on college-grade, expert-level tasks across such a wide range of disciplines. It moves the goalposts from “common sense” to “expert knowledge.”
- Challenging and Diverse Visuals: By including 30 types of complex, symbolic images, the benchmark forces the community to build models with far more robust and flexible perceptual abilities.
- Exposes a Clear Performance Ceiling: The results clearly show that even today’s most advanced AI is far from human expert-level performance, providing a vital, difficult challenge to guide future research.
- Actionable Insights from Error Analysis: The breakdown of why GPT-4V fails offers a clear roadmap for the community, highlighting that improvements are needed across perception, knowledge, and reasoning simultaneously.
Limitations / Open Questions
- Static Knowledge Test: Like any exam, MMMU tests established knowledge and reasoning. It doesn’t evaluate creativity, the ability to use tools for problem-solving, or the discovery of new information.
- Potential for Future Contamination: The authors took care to source questions from materials not easily found online. However, as foundation models are trained on ever-larger swaths of the internet, there’s a risk that these questions could inadvertently become part of future models’ training data, which would invalidate the benchmark.
- Primarily Multiple-Choice: About 94% of the questions are multiple-choice. While this makes evaluation straightforward and objective (see the scoring sketch after this list), it might not fully capture a model’s free-form reasoning capabilities in the same way open-ended questions would.
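For context on what "straightforward and objective" evaluation means in practice, here is a minimal scoring sketch: extract a predicted option letter from a model's free-form response and compute accuracy. The regex heuristic is a simplified stand-in, not the benchmark's actual rule-based answer extraction.

```python
# Minimal sketch of multiple-choice scoring: pull an option letter out of the
# model's response and compare it to the gold answer. The regex heuristic is a
# simplified stand-in for the benchmark's real answer-extraction rules.
import re

def extract_choice(response: str, num_options: int = 4) -> str | None:
    valid = "ABCDEFGHIJ"[:num_options]
    # Look for a standalone option letter, e.g. "(B)", "B.", or "Answer: B".
    match = re.search(rf"\b([{valid}])\b", response.upper())
    return match.group(1) if match else None

def accuracy(predictions: list[str], answers: list[str]) -> float:
    correct = sum(
        extract_choice(pred) == gold for pred, gold in zip(predictions, answers)
    )
    return correct / len(answers)

# Example: two of three responses resolve to the gold answer.
print(accuracy(["The answer is (B).", "C", "I think A is correct."],
               ["B", "C", "D"]))  # -> 0.666...
```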
Contribution Level: Foundational. This paper does not introduce a new model but instead provides a foundational resource for the entire AI community. By creating a difficult, comprehensive, and well-designed benchmark, it defines the next frontier for evaluating multimodal AI. Such ambitious benchmarks are critical for measuring true progress and preventing the field from stagnating on solved problems.
Conclusion: Potential Impact
MMMU is more than just another leaderboard; it’s a new standard for what it means for an AI to be “intelligent.” It challenges the research community to move beyond models that can merely describe what they see and toward models that can understand, analyze, and reason like human experts. For any organization building the next generation of AI, demonstrating strong performance on MMMU will likely become a crucial rite of passage. This work lays down a clear, difficult, and necessary path toward building more capable and truly expert-level AI systems.