![[Paper2Poster] This AI Agent Turns Your 22-Page Paper into a Conference Poster for Less Than a Cent](https://i.imgur.com/B8z4eT2.png)
# [Paper2Poster] This AI Agent Turns Your 22-Page Paper into a Conference Poster for Less Than a Cent

## Paper at a Glance
- Paper Title: Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
- Authors: Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, Philip Torr
- Affiliation: University of Waterloo, National University of Singapore, University of Oxford
- Published in: arXiv, 2025
- Link to Paper: https://arxiv.org/abs/2505.21497
- Project Page: https://paper2poster.github.io
## The Gist of It: TL;DR
In one sentence: This paper introduces Paper2Poster, the first benchmark and metric suite for automatically generating academic posters from scientific papers, and proposes PosterAgent, a multi-agent system that uses a top-down, visual-feedback loop to transform long-form papers into concise, editable `.pptx` posters.
## Why It Matters: The Big Picture
Anyone in research knows the pre-conference scramble: your paper is accepted, and now you have to distill dozens of pages of dense text, figures, and tables into a single, visually compelling A0 poster. This task is more art than science, requiring summarization, design sense, and a knack for spatial arrangement. While AI has made strides in generating slide decks, posters remain a uniquely hard problem. A slide deck can spread information across many simple layouts, but a poster must condense everything onto one canvas, demanding a complex interplay of text and graphics without becoming a cluttered mess.
Current large language models (LLMs) and vision-language models (VLMs), on their own, are not up to the task. They struggle to reason about spatial constraints, leading to text overflowing its boundaries or poorly aligned elements. Furthermore, how do we even measure if an AI-generated poster is “good”?
This is where `Paper2Poster` comes in. The authors make two key contributions:
- A Benchmark: They create the first comprehensive dataset and evaluation framework specifically for the paper-to-poster task.
- An Agent: They build `PosterAgent`, a system that intelligently mimics the human workflow of creating a poster, from high-level planning down to fine-grained visual tweaks.
## The Core Idea: How It Works
`PosterAgent` breaks down the daunting task of poster creation into a structured, three-step pipeline that combines global planning with local, visually grounded refinement.
### 1. The Problem They’re Solving
The core challenge is multimodal context compression. A 20-page paper with 20,000 tokens and 20+ figures must be transformed into a single page with ~1,500 tokens and ~8 figures. This requires not just summarizing text, but also selecting the right visuals, arranging them logically, and ensuring the final layout is both readable and aesthetically pleasing. End-to-end models like GPT-4o can generate beautiful images, but as the study shows, the embedded text is often nonsensical, and the informational content gets lost.
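To make the compression budget concrete, here is a minimal sketch. The numbers are the approximate figures quoted above, not exact counts from the paper, and the helper is purely illustrative:

```python
# Rough sketch of the compression budget for the paper-to-poster task.
# The constants mirror the approximate figures quoted in the text.
PAPER_TOKENS, PAPER_FIGURES = 20_000, 20
POSTER_TOKENS, POSTER_FIGURES = 1_500, 8

def compression_ratios() -> tuple[float, float]:
    """Return (text, figure) compression ratios implied by the budget."""
    return PAPER_TOKENS / POSTER_TOKENS, PAPER_FIGURES / POSTER_FIGURES

text_ratio, figure_ratio = compression_ratios()
print(f"~{text_ratio:.0f}x text compression, {figure_ratio:.1f}x figure selection")
# → ~13x text compression, 2.5x figure selection
```

In other words, the agent must discard over 90% of the paper's text while keeping the poster self-contained, which is why naive summarization is not enough.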
### 2. The Key Innovation
The standout idea is the visual-in-the-loop, multi-agent framework. Instead of trying to generate the entire poster in one shot, `PosterAgent` acts like a team of specialists. A Parser organizes the content, a Planner sketches the layout, and a Painter-Commenter duo works iteratively to perfect each section, with the Commenter acting as a “critic” that provides visual feedback. This mirrors how a human designer would draft, review, and revise their work.
### 3. The Method, Step-by-Step
As illustrated in Figure 4 of the paper, the process unfolds in three stages:
- Parser (Global Organization): The agent first ingests the raw paper PDF. Using tools like MARKER, it converts the paper into Markdown and then uses an LLM to distill it into a structured “asset library.” This library contains paragraph-level summaries for each section (Introduction, Methods, etc.) and all the extracted figures and tables.
- Planner (Layout Generation): Next, the Planner acts as a high-level designer. It semantically matches each visual asset (e.g., a results graph) to its corresponding text summary. Then, based on the estimated length of the content for each section, it generates a binary-tree layout that maps out the poster’s panels, preserving reading order and ensuring a balanced composition.
- Painter-Commenter Loop (Local Refinement): This is where the magic happens. For each panel defined by the Planner:
  - The Painter takes the section summary and figure, distills the text into concise bullet points, and generates `python-pptx` code to render a draft of that panel.
  - The Commenter, a VLM, then “looks” at the rendered image of the panel. Using a “zoom-in” focus and guided by examples of good and bad layouts, it provides targeted feedback like “text is overflowing” or “layout is too blank.”
  - This feedback is sent back to the Painter, which revises the content and code. The loop continues until the Commenter signals that the panel is “good to go.”

This iterative process ensures each part of the poster is coherent and visually sound before the final, editable `.pptx` file is assembled.
## Key Experimental Results
The authors introduce a powerful new metric called PaperQuiz, where various VLMs act as “readers” with different expertise levels (from student to professor). These AI readers try to answer multiple-choice questions about the original paper based only on the generated poster. A higher score means the poster is better at conveying the core content.
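Conceptually, PaperQuiz boils down to mean accuracy over multiple-choice questions, averaged across reader models. A minimal sketch, where the reader is a stub standing in for a real VLM call and the question schema is assumed, not the paper's exact format:

```python
# Minimal sketch of a PaperQuiz-style score: each "reader" (a VLM in
# the paper, a plain function here) answers multiple-choice questions
# about the paper while seeing only the poster image; the score is
# mean accuracy over all (reader, question) pairs.
from typing import Callable

Question = dict  # assumed shape: {"question": ..., "options": [...], "answer": "B"}

def paperquiz_score(poster_image: bytes,
                    questions: list[Question],
                    readers: list[Callable[[bytes, Question], str]]) -> float:
    correct = sum(
        reader(poster_image, q) == q["answer"]
        for reader in readers
        for q in questions
    )
    return correct / (len(readers) * len(questions))

# Toy usage: a single "reader" that always answers "B".
always_b = lambda img, q: "B"
qs = [{"question": "?", "options": ["A", "B"], "answer": "B"},
      {"question": "?", "options": ["A", "B"], "answer": "A"}]
print(paperquiz_score(b"", qs, [always_b]))  # → 0.5
```

In the paper, the readers differ in simulated expertise (student through professor), so a single poster gets scored by how well it informs a whole range of audiences rather than one idealized reader.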
- Pixel-level generation is not enough: GPT-4o’s image generation (`4o-Image`) created visually appealing posters but scored poorly on `PaperQuiz` and had terrible text quality (high perplexity). This shows that aesthetics alone don’t make a good scientific poster.
- `PaperQuiz` is a robust metric: The scores from the `PaperQuiz` metric showed strong correlation with human evaluations (Figure 6), confirming it’s a reliable proxy for how effectively a poster communicates information.
- `PosterAgent` excels in communication and efficiency: `PosterAgent` consistently achieved the highest `PaperQuiz` scores, outperforming all other baselines (Table 2). The fully open-source version, `PosterAgent-Qwen`, surpassed more resource-intensive systems while using 87% fewer tokens. As shown in Figure 7, this translates to an astonishingly low cost of just $0.005 per poster.


## A Critical Look: Strengths & Limitations
### Strengths / Contributions
- A Foundational Benchmark: The `Paper2Poster` benchmark, and especially the `PaperQuiz` metric, provides the community with the first standardized way to measure progress on this complex task. This is a crucial contribution that will enable future research.
- Elegant and Effective Agent Design: The top-down `Parser -> Planner -> Painter/Commenter` architecture is a smart solution to the spatial-reasoning problem. The “visual-in-the-loop” refinement step is a clever mechanism for correcting layout errors that plague single-shot generation methods.
- Highly Practical and Accessible: By producing an editable `.pptx` file at an extremely low cost, the authors have created a tool with real-world utility for researchers. The open-source variant further democratizes this capability.
### Limitations / Open Questions
- Dependency on Conventional Paper Structure: The Parser’s success seems tied to the standard IMRAD (Introduction, Methods, Results, and Discussion) structure of scientific papers. It’s unclear how it would perform on papers with non-traditional formats or from different academic disciplines.
- The Aesthetic Gap: While the generated posters are functionally excellent, they remain visually generic (see Figure 8b). They lack the creative, high-impact visual design choices that distinguish the best human-made posters. The VLM-as-Judge “Engagement” score for `PosterAgent` still trails human-designed posters.
- Reliability of VLM Feedback: The framework’s success, particularly in the Painter-Commenter loop and in evaluation, hinges on the visual-reasoning capabilities of state-of-the-art VLMs. The paper notes that GPT-4o was a better “Commenter” than open-source alternatives, suggesting that the quality of this feedback loop may be a bottleneck.
Contribution Level: Significant Improvement. This paper carves out a new and important problem space for AI. While not introducing a new foundational model, it presents a highly effective system and, more importantly, a robust benchmark to measure success. The `Paper2Poster` framework and the `PosterAgent` solution together represent a major step forward in AI-powered scientific communication.
## Conclusion: Potential Impact
`Paper2Poster` and `PosterAgent` offer a glimpse into a future where researchers can offload tedious design tasks to intelligent AI assistants. This work moves beyond simple text generation to tackle a complex, multimodal, and layout-sensitive problem. By creating a practical, efficient, and open tool, the authors have not only advanced the state of generative AI but also provided a tangible benefit to the scientific community. The next steps will likely involve improving the aesthetic creativity of the agent and extending its capabilities to even more diverse and complex document types.
- Title: [Paper2Poster] This AI Agent Turns Your 22-Page Paper into a Conference Poster for Less Than a Cent
- Author: Jellyfish
- Created at: 2025-10-14 17:09:42
- Updated at: 2025-10-14 08:13:55
- Link: https://makepaperseasy.com/posts/20251014170942/
- License: This work is licensed under CC BY-NC-SA 4.0.