
Why Can't Self-Driving AIs Turn Left? A Look at the DrivingDojo Dataset for Smarter World Models

Paper at a Glance
- Paper Title: DrivingDojo Dataset: Advancing Interactive and Knowledge-Enriched Driving World Model
- Authors: Yuqi Wang, Ke Cheng, Jiawei He, Qitai Wang, Hengchen Dai, Yuntao Chen, Fei Xia, and Zhaoxiang Zhang
- Affiliations: Chinese Academy of Sciences, Meituan Inc., HKISI CAS
- Published in: NeurIPS 2024 (Track on Datasets and Benchmarks)
- Link to Paper: https://arxiv.org/abs/2410.10738
- Project Page: https://drivingdojo.github.io
The Gist of It: TL;DR
In one sentence: This paper introduces DrivingDojo, a large-scale video dataset specifically curated to train driving “world models,” featuring a rich diversity of driving actions, complex multi-agent interactions, and rare real-world events to overcome the limitations of existing perception-focused datasets.
Why It Matters: The Big Picture
The holy grail for autonomous vehicles is not just seeing the world but understanding it—predicting how a situation will unfold and simulating possible futures to make the safest decision. This is the promise of world models: AI systems that learn the underlying physics and dynamics of an environment, acting as a general-purpose simulator. Imagine an AI that could play out a scenario in its “mind” before acting: “If I change lanes now, how will the car behind me react?”
However, these powerful models are incredibly data-hungry. A major roadblock has been that existing autonomous driving datasets, like nuScenes and Waymo, were primarily designed for perception tasks—identifying cars, pedestrians, and lane lines. While excellent for that purpose, their data is heavily skewed towards routine, “boring” driving like staying in a lane. They lack the rich variety of complex maneuvers (like U-turns or sudden braking) and intricate interactions (like aggressive cut-ins) needed to train a world model that can truly understand the dynamic “dance” of traffic. This data gap means current world models are good at cruising straight but often fail when asked to follow more complex instructions, limiting their real-world utility.
The Core Idea: How It Works
The authors of DrivingDojo argue that to build better world models, we don’t just need more data; we need the right data. Their core contribution is not a new algorithm but a meticulously curated dataset designed from the ground up to address the shortcomings of previous collections.
1. The Problem They’re Solving
Current driving datasets suffer from a lack of diversity in actions and interactions. As the authors show in Figure 3a of the paper, the frequency of events like “Turn Left” or “Lane Change” in datasets like nuScenes and ONCE is incredibly low compared to straight-line driving. A model trained on such data will naturally be terrible at simulating or executing turns, because it has barely seen any examples. It’s like trying to learn to play chess by only watching the opening pawn moves.
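To make the imbalance concrete, here is a minimal sketch of how one could tally coarse ego-action labels from per-clip motion signals. The yaw-rate/speed inputs, thresholds, and label names are illustrative assumptions, not the paper's actual curation pipeline:

```python
from collections import Counter
import numpy as np

def label_clip(yaw_rate, speed, dt=0.1):
    """Assign a coarse ego-action label to one clip from its yaw-rate (rad/s)
    and speed (m/s) traces. Thresholds are illustrative guesses, not values
    from the paper; detecting lane changes would additionally need the
    lateral offset from the lane centre, which is omitted here."""
    heading_change = np.sum(yaw_rate) * dt   # net heading change (rad)
    accel = np.gradient(speed, dt)           # longitudinal acceleration (m/s^2)
    if np.min(accel) < -4.0:
        return "hard_brake"
    if heading_change > np.deg2rad(60):
        return "turn_left"
    if heading_change < -np.deg2rad(60):
        return "turn_right"
    return "go_straight"

def action_histogram(clips):
    """clips: iterable of (yaw_rate_array, speed_array) pairs."""
    return Counter(label_clip(y, v) for y, v in clips)

# Toy usage: one straight cruise and one left turn.
rng = np.random.default_rng(0)
straight = (rng.normal(0.0, 0.01, 100), np.full(100, 10.0))
left_turn = (np.full(100, 0.2), np.full(100, 5.0))
print(action_histogram([straight, left_turn]))
# e.g. Counter({'go_straight': 1, 'turn_left': 1})
```

Run over a perception-oriented dataset, a tally like this would be dominated overwhelmingly by the "go_straight" bucket, which is exactly the skew DrivingDojo sets out to correct.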
2. The Key Innovation
The key innovation is purpose-built curation for interaction. Instead of just recording hours of driving, the researchers actively sought out and collected video clips that were rich in three specific areas that are crucial for world models:
- Rich Ego Actions: A balanced distribution of the ego-car’s own maneuvers, including acceleration, emergency braking, lane changes, U-turns, and stopping.
- Diverse Multi-agent Interplay: Scenarios that explicitly capture complex interactions between the ego-car and other agents, such as vehicles cutting in, pedestrians crossing unexpectedly, and navigating blocked roads.
- Rich Open-world Knowledge: The “long tail” of rare but critical events that define real-world driving—falling objects, crossing animals, construction barriers, and unusual weather conditions.
3. The Method, Step-by-Step
As illustrated in Figure 2, the creation of DrivingDojo was a multi-stage process:
- Data Collection: The team started with a massive pool of raw data—approximately 7,500 hours of driving footage collected from Meituan’s fleet of autonomous delivery vehicles across multiple major Chinese cities.
- Strategic Curation: To find the “interesting” moments, they used several strategies. They pulled data from safety driver interventions and automatic emergency braking events, which are by nature non-routine. They also used manually defined rules and even employed GPT-4 to identify and label rare scenarios from text descriptions.
- Dataset Organization: The final dataset contains around 18,000 video clips and is organized into three subsets to help researchers focus on specific problems:
- DrivingDojo-Action: Focused on the ego-vehicle’s maneuvers.
- DrivingDojo-Interplay: Focused on interactions with other road users.
- DrivingDojo-Open: Focused on rare, open-world events.
- A New Benchmark: Crucially, they also introduce the Action Instruction Following (AIF) benchmark. This measures how accurately a model’s generated video follows a sequence of action commands (e.g., a series of steering and acceleration values). This moves evaluation beyond just “does the video look real?” to “can the model be controlled?”.
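The paper's exact AIF formulation isn't reproduced here, but the idea can be sketched as a trajectory comparison: integrate the commanded actions into a target path, recover the ego trajectory from the generated video (e.g., with a visual-odometry or structure-from-motion tool; that step is abstracted to a given array below), and measure the gap between the two. A minimal, assumed reading of the metric:

```python
import numpy as np

def integrate_actions(actions):
    """actions: (T, 2) array of per-frame ego displacements (dx, dy) in metres
    on the ground plane. Returns the (T+1, 2) commanded trajectory, starting
    at the origin."""
    return np.vstack([np.zeros(2), np.cumsum(actions, axis=0)])

def aif_error(commanded_actions, estimated_traj):
    """Mean Euclidean gap between the trajectory implied by the action
    instructions and the trajectory recovered from the generated video.
    `estimated_traj` stands in for the output of a visual-odometry or
    structure-from-motion stage, which is not reproduced here."""
    target = integrate_actions(commanded_actions)
    n = min(len(target), len(estimated_traj))
    return float(np.mean(np.linalg.norm(target[:n] - estimated_traj[:n], axis=1)))

# Toy usage: command a gentle left drift; pretend the model just went straight.
T = 24
commanded = np.stack([np.full(T, 0.5), np.linspace(0.0, 0.3, T)], axis=1)
straight = integrate_actions(np.stack([np.full(T, 0.5), np.zeros(T)], axis=1))
print(f"AIF-style error: {aif_error(commanded, straight):.2f} m")
```

A model that ignores the "drift left" instruction and keeps driving straight accumulates a large error, which is precisely the failure mode the benchmark is designed to expose.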
Key Experimental Results
The paper demonstrates the value of DrivingDojo by fine-tuning a standard video generation model (Stable Video Diffusion) and evaluating it against models trained on other datasets.
- Better Action Following: The headline result, shown in Table 5, is that the model trained on DrivingDojo has a significantly lower AIF error. This means it can follow instructions like “turn left” or “change lanes” far more accurately than models trained on nuScenes or ONCE. In fact, the model trained on ONCE was reportedly unable to generate videos of turning or lane changing, instead just continuing straight, highlighting the critical importance of training data diversity.
- Higher Visual Quality: Even when evaluated on a generic video prediction task, the model fine-tuned on DrivingDojo produced higher-quality videos (lower FID and FVD scores in Table 3) than models trained on other driving datasets. This suggests that the diversity of scenarios also leads to a more robust visual understanding (see the metric sketch after this list).
- Qualitative Prowess: Figures 5 and 7 showcase the model’s ability to generate multiple, plausible futures from the same starting frame based on different action commands (e.g., go straight, turn left, or turn right) and even simulate interactions with other agents.
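For reference, the FID numbers mentioned above can be computed at the frame level with off-the-shelf tooling. The snippet below uses the torchmetrics library purely as an illustrative assumption, not as the authors' evaluation code; FVD would additionally require a video (I3D) feature extractor:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance  # pip install "torchmetrics[image]"

# Frame-level FID; FVD would swap the Inception backbone for a video (I3D)
# feature extractor and is not shown here.
fid = FrechetInceptionDistance(feature=64)

# Stand-in tensors: batches of uint8 RGB frames, shape (N, 3, H, W).
# In practice these would be frames decoded from held-out real clips and
# from the world model's generated rollouts.
real_frames = torch.randint(0, 256, (128, 3, 256, 256), dtype=torch.uint8)
fake_frames = torch.randint(0, 256, (128, 3, 256, 256), dtype=torch.uint8)

fid.update(real_frames, real=True)
fid.update(fake_frames, real=False)
print(f"FID: {fid.compute().item():.2f}")
```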
A Critical Look: Strengths & Limitations
Strengths / Contributions
- Addresses a Critical Data Gap: DrivingDojo is the first large-scale dataset specifically designed for training interactive driving world models. It directly confronts the well-known problem of action and interaction scarcity in existing datasets.
- Thoughtful Curation and Benchmarking: The focus on action completeness, multi-agent interplay, and rare events is precisely what the field needs. The accompanying AIF benchmark provides a much-needed tool for measuring the controllability of these generative models.
- Clear Empirical Validation: The experiments provide strong evidence that training on this dataset leads to models that are not only visually better but, more importantly, far more controllable and faithful to action commands.
Limitations / Open Questions
- Single-Camera View: The dataset is limited to a single, forward-facing camera. Real-world autonomous systems use a 360-degree sensor suite (multiple cameras, LiDAR, etc.). A world model trained on this data can’t learn the full dynamics of its surroundings, such as what’s happening behind or beside it.
- Geographic and Cultural Bias: The data is collected exclusively from major Chinese cities. Driving behaviors, road infrastructure, and traffic rules differ globally, which may limit the direct generalization of models trained on DrivingDojo to other regions.
- Short-Term Simulation: The experiments focus on generating short video clips (around 6 seconds). While useful for immediate interactions, this doesn’t tackle the challenge of long-horizon prediction, a key requirement for strategic planning in driving.
- Persistent Hallucination: As the authors acknowledge in Figure 8, the baseline model can still “hallucinate” or generate unrealistic scenes, such as objects suddenly disappearing or inventing a new road where one doesn’t exist. This highlights that the gap between simulation and reality remains a fundamental challenge.
Contribution Level: Significant Improvement. While it doesn’t introduce a new model architecture, DrivingDojo provides an essential resource—a high-quality, purpose-built dataset and benchmark—that enables the entire research community to overcome a major bottleneck in developing next-generation driving world models. Foundational datasets are a cornerstone of AI progress, and this one addresses a clear and pressing need.
Conclusion: Potential Impact
DrivingDojo represents a crucial step forward for research in autonomous driving. By shifting the focus from passive perception to active interaction, this dataset and its accompanying benchmark will empower researchers to build and evaluate more capable, controllable, and ultimately safer driving world models. It lays a stronger foundation for developing AIs that don’t just see the road but can reason about and navigate the complex, dynamic world of human driving. The next steps will likely involve expanding this concept to multi-sensor data and tackling the challenge of long-horizon, strategic planning.
- Title: Why Can't Self-Driving AIs Turn Left? A Look at the DrivingDojo Dataset for Smarter World Models
- Author: Jellyfish
- Created at: 2025-10-07 17:18:32
- Updated at: 2025-10-07 08:28:44
- Link: https://makepaperseasy.com/posts/20251007171832/
- License: This work is licensed under CC BY-NC-SA 4.0.









