
From Paper to Presentation in Minutes: the Paper2Video AI Agent

Paper at a Glance
- Paper Title: Paper2Video: Automatic Video Generation from Scientific Papers
- Authors: Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou
- Affiliation: Show Lab, National University of Singapore
- Published in: arXiv Preprint, 2025
- Link to Paper: https://arxiv.org/abs/2510.05096
- Project Page: https://github.com/showlab/Paper2Video, https://showlab.github.io/Paper2Video/
The Gist of It: TL;DR
In one sentence: This paper introduces Paper2Video, a comprehensive benchmark and a multi-agent framework called PaperTalker, which fully automates the creation of high-quality academic presentation videos directly from research papers, handling everything from slide design and narration to adding a personalized talking head of the author.
Why It Matters: The Big Picture
For any researcher, the process of creating a presentation video for a conference is a familiar, and often dreaded, task. It’s a time-consuming grind that involves hours of designing slides, writing a script, recording audio, and painstakingly editing everything together. A short 5-10 minute video can easily consume an entire day’s work—time that could be spent on research.
With the rise of virtual and hybrid conferences, these videos have become essential for communicating science. Yet, the tools to create them have not kept pace. This is the critical bottleneck that “Paper2Video” aims to solve. The authors argue that generating a good academic presentation is a “superproblem” that goes far beyond standard text-to-video models. It requires understanding long, dense scientific documents, coordinating multiple aligned channels (slides, voice, subtitles, presenter, cursor), and—most importantly—effectively conveying complex knowledge. This paper is the first to systematically tackle this entire pipeline, from raw paper to polished video.
The Core Idea: How It Works
The authors identify two fundamental challenges: how to create a good presentation video automatically, and how to evaluate whether the result is actually any good. They address both.
1. The Problem They’re Solving
Generating an academic presentation video is uniquely difficult. Unlike a typical “cat jumping over a fence” video, it involves:
- Long-Context Input: A research paper is dense with text, figures, tables, and equations.
- Multi-Channel Coordination: The final video must seamlessly sync slides, subtitles, a presenter’s speech, a talking-head video, and even cursor movements.
- Measuring Success: What makes a presentation “good”? It’s not just about visual quality. It’s about how well it transfers knowledge to the audience. Existing video metrics are not designed for this.
To solve this, the authors introduce two key components: the Paper2Video Benchmark to measure success, and the PaperTalker Agent to create the videos.
2. The Key Innovation
The core innovations are a new way of thinking about evaluation and a modular, agent-based system for generation.
First, the Paper2Video Benchmark is a high-quality dataset of 101 research papers paired with their author-created videos, slides, and speaker metadata. This allows for direct comparison against a human-created ground truth. More importantly, the authors propose four new evaluation metrics focused on knowledge transfer:
- Meta Similarity: How similar are the AI-generated slides and speech to the human version?
- PresentArena: Using a VideoLLM as a proxy audience, which video is preferred in a head-to-head comparison?
- PresentQuiz: Can a VideoLLM answer questions about the paper correctly after watching the video? This directly measures information coverage.
- IP Memory: Can the “audience” correctly associate the work with its author after watching? This measures the video’s impact and memorability.
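The PresentQuiz idea reduces to a simple accuracy score: generate multiple-choice questions from the paper, have a VideoLLM "audience" answer them after watching the video, and count correct answers. A minimal sketch, where `ask_video_llm` is a hypothetical stand-in for whatever VideoLLM API is actually used:

```python
# Sketch of the PresentQuiz metric: a VideoLLM "audience" answers
# multiple-choice questions derived from the paper, and the score is
# plain answer accuracy. `ask_video_llm` is a hypothetical stand-in,
# not the paper's actual interface.

def present_quiz_score(video, questions, ask_video_llm):
    """Fraction of paper-derived quiz questions answered correctly
    after 'watching' the generated presentation video."""
    correct = 0
    for q in questions:
        answer = ask_video_llm(video, q["question"], q["choices"])
        if answer == q["gold"]:
            correct += 1
    return correct / len(questions)

# Toy run with a fake audience that always picks choice "A":
questions = [
    {"question": "What does the Slide Builder emit?", "choices": ["A", "B"], "gold": "A"},
    {"question": "Which tool aligns the speech?", "choices": ["A", "B"], "gold": "B"},
]
score = present_quiz_score("video.mp4", questions, lambda v, q, c: "A")
print(score)  # 0.5
```

Higher coverage of the paper's content in the video should translate directly into higher quiz accuracy, which is what makes this a knowledge-transfer metric rather than a visual-quality one.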
Second, the PaperTalker is a multi-agent framework designed to mimic the human workflow, with different “builders” handling specialized tasks. This makes the system robust and efficient.
3. The Method, Step-by-Step
PaperTalker’s pipeline, illustrated in Figure 4 of the paper, is a masterclass in breaking down a complex problem into manageable steps:
- Slide Builder: This agent takes the paper’s LaTeX source code as input and generates presentation slides using Beamer (a popular LaTeX class for presentations). This is a smart choice because it enforces a formal, academic style. The key magic here is a novel module called Tree Search Visual Choice. If a slide has layout problems (like a figure being too large and running off the page), the system doesn’t just ask an LLM to “fix it.” Instead, it generates several variations (e.g., with the figure at 100%, 75%, and 50% scale), and then uses a Vision-Language Model (VLM) to pick the best-looking one. This is a much more reliable way to handle fine-grained visual adjustments.
- Subtitle Builder: Once the slides are ready, a VLM analyzes each slide to generate a corresponding script (the subtitles) and also creates prompts for where a cursor should point to highlight key information.
- Cursor Builder: This builder brings the presentation to life. It takes the cursor prompts and uses a GUI-grounding model to find the exact (x, y) coordinates on the slide. It then uses WhisperX, a time-accurate speech transcription tool, to synchronize the cursor’s appearance with the spoken words, perfectly mimicking how a presenter uses a laser pointer.
- Talker Builder: Using a short voice sample and a portrait of the author, this module generates a personalized talking-head video. It uses a state-of-the-art text-to-speech model to create audio that sounds like the author, and then generates a lip-synced video of them speaking. For efficiency, it generates the video on a slide-by-slide basis and runs these jobs in parallel, achieving a greater than 6x speedup.
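The Tree Search Visual Choice step above can be sketched as a small generate-then-rank loop. Here `compile_slide` and `vlm_score` are hypothetical stand-ins for the actual LaTeX renderer and VLM judge; the scale values are illustrative:

```python
# Sketch of Tree Search Visual Choice: instead of asking an LLM to
# "fix" an overflowing slide, render several candidate layouts
# (e.g. the figure at different scales) and let a VLM pick the one
# that looks best. `compile_slide` and `vlm_score` are hypothetical
# stand-ins for the real renderer and judge.

FIGURE_SCALES = [1.00, 0.75, 0.50]  # candidate branches of the search

def choose_best_layout(slide_tex, compile_slide, vlm_score):
    """Render each candidate variant; keep the one the VLM rates highest."""
    candidates = []
    for s in FIGURE_SCALES:
        image = compile_slide(slide_tex, figure_scale=s)
        candidates.append((vlm_score(image), s, image))
    best_score, best_scale, best_image = max(candidates, key=lambda c: c[0])
    return best_scale, best_image

# Toy run: pretend the VLM prefers the 75% variant.
fake_scores = {1.00: 0.2, 0.75: 0.9, 0.50: 0.6}
scale, _ = choose_best_layout(
    r"\begin{frame}...\end{frame}",
    compile_slide=lambda tex, figure_scale: figure_scale,  # "image" = scale
    vlm_score=lambda image: fake_scores[image],
)
print(scale)  # 0.75
```

The design choice is the point: discrete candidate generation plus VLM ranking sidesteps the unreliability of asking a language model to reason about pixel-level layout directly.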
Finally, all these components—the slides, the synthesized speech, the subtitles, the cursor track, and the talking-head video—are composed into the final presentation video.
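The Cursor Builder's timing step can be sketched as a join between word-level timestamps (as produced by a WhisperX-style aligner) and the slide coordinates returned by the GUI-grounding model. The data shapes below are illustrative assumptions, not the paper's actual format:

```python
# Sketch of cursor synchronization: given word-level narration
# timestamps and a sentence-to-coordinate map from the GUI-grounding
# model, emit (time, x, y) events so the pointer jumps to each
# highlighted region as its sentence begins. Data shapes are
# illustrative, not the paper's actual format.

def cursor_track(word_timestamps, sentence_targets):
    """word_timestamps: [(word, start_sec)], in narration order.
    sentence_targets: [(first_word_index, (x, y))] per sentence."""
    events = []
    for first_word_index, (x, y) in sentence_targets:
        start = word_timestamps[first_word_index][1]
        events.append((start, x, y))
    return events

words = [("Our", 0.0), ("method", 0.4), ("Results", 2.1), ("show", 2.5)]
targets = [(0, (120, 80)), (2, (300, 420))]  # sentence starts -> slide coords
print(cursor_track(words, targets))  # [(0.0, 120, 80), (2.1, 300, 420)]
```

Because the alignment is word-accurate, the cursor lands on the relevant figure or equation at the moment the narration mentions it, mimicking a presenter's laser pointer.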
Key Experimental Results
The results show that PaperTalker is remarkably effective, not only outperforming previous automated methods but in some ways even surpassing human-made videos.
- Outperforming Baselines: PaperTalker consistently scored higher than other methods (like PresentAgent and end-to-end models like Veo3) across all the new evaluation metrics.
- More Informative than Humans: In the PresentQuiz evaluation, videos generated by PaperTalker enabled the AI audience to answer questions with higher accuracy (0.842) than the human-made videos (0.738), despite being significantly shorter. This suggests the automated process creates more concise and information-dense presentations.
- Comparable to Human Quality: In a human user study, PaperTalker’s videos were rated 3.8 out of 5, second only to the human-made videos (4.6) and far ahead of the next best AI method (2.8). This indicates the generated quality is approaching that of manual creation.
- Cursor is Crucial: An ablation study showed that adding the cursor highlight increased the accuracy on a content localization task from 8.4% to 63.3%, proving its vital role in guiding audience attention and improving comprehension.
A Critical Look: Strengths & Limitations
Strengths / Contributions
- Holistic Framework: This is the first work to address the entire academic video generation pipeline from start to finish. It defines the problem, provides the data to measure it, and offers a powerful solution.
- Meaningful Evaluation: The proposed metrics (PresentQuiz, IP Memory) are a major contribution. They shift evaluation from superficial visual similarity to the true goal of a presentation: effective communication and knowledge transfer.
- Robust and Clever Engineering: The PaperTalker agent is filled with smart design choices. The Tree Search Visual Choice for fixing layouts and the parallelized, slide-wise video generation are practical solutions that make the system work reliably.
- A Foundational Benchmark: The Paper2Video dataset is a valuable public resource that will enable further research in this challenging area.
Limitations / Open Questions
- Reliance on LaTeX Source: The current method for slide generation requires the paper’s original LaTeX project files. It is unclear how it would perform on papers only available as a PDF, which would require a much more complex document parsing system.
- Static Presenter: While the talking head is personalized, it appears to be limited to the head and upper body without the hand gestures or dynamic body language that can make a human presentation more engaging.
- Simplified Cursor Movement: The cursor logic is a smart simplification (moving between sentences), but it doesn’t capture the more fluid, dynamic ways a human might use a pointer, such as circling an area or underlining a key term.
- Domain Specificity: The benchmark focuses on AI conference papers from fields like CV, NLP, and ML. The system’s effectiveness on papers from vastly different domains (e.g., medicine, history) with different structural and visual conventions remains to be tested.
Contribution Level: Significant Improvement / Foundational. This paper establishes a new, practical, and challenging task within AI for Research. It provides the first comprehensive benchmark, novel evaluation metrics focused on the right goals, and a highly effective agent-based framework that serves as a powerful baseline for all future work in this area.
Conclusion: Potential Impact
Paper2Video and the PaperTalker agent represent a major step towards automating a crucial part of the scientific communication process. This work has the potential to save researchers worldwide thousands of hours, allowing them to focus on what they do best: research. By making it trivial to create a high-quality presentation, it could also help democratize science, making it easier for researchers to share their work broadly and effectively. While there are still open questions, this paper lays a strong foundation for a future where every scientific paper can be accompanied by an engaging, informative, and instantly generated video presentation.
- Title: From Paper to Presentation in Minutes: the Paper2Video AI Agent
- Author: Jellyfish
- Created at: 2025-10-08 20:01:04
- Updated at: 2025-10-08 11:50:14
- Link: https://makepaperseasy.com/posts/20251008200104/
- License: This work is licensed under CC BY-NC-SA 4.0.









