Depth Anything: How 62 Million Unlabeled Photos Created a New State-of-the-Art Vision Model

Paper at a Glance

The Gist of It: TL;DR

In one sentence: This paper introduces Depth Anything, a powerful foundation model for monocular depth estimation that achieves remarkable zero-shot performance by training on a massive dataset of 62 million unlabeled images, guided by two simple yet effective learning strategies.

Why It Matters: The Big Picture

Estimating the depth of a scene from a single 2D image—a task called Monocular Depth Estimation (MDE)—is a cornerstone of computer vision. It’s the magic that allows self-driving cars to perceive distance, robots to navigate complex spaces, and augmented reality apps to realistically place virtual objects in our world.

For years, progress in MDE was driven by meticulously collected datasets where each image came with a corresponding “ground truth” depth map, usually captured by expensive LiDAR sensors or complex stereo camera setups. Models like MiDaS pushed the boundaries by combining many of these labeled datasets. However, this approach has a ceiling: creating large, diverse, high-quality labeled datasets is incredibly expensive and time-consuming. The real world is infinitely more varied than our labeled datasets can ever be.

This is where “Depth Anything” comes in. The authors asked a powerful question: What if we could tap into the vast ocean of unlabeled images available on the internet? Instead of chasing more costly labeled data, they developed a system to effectively “unleash the power” of 62 million everyday photos, dramatically improving the model’s ability to understand depth in any scene it encounters.

The Core Idea: How It Works

The authors’ approach is not about inventing a complex new model architecture. Instead, it’s a brilliant data-centric pipeline focused on scaling up training data and making the learning process more effective. The overall workflow is illustrated in Figure 2 of the paper.

1. The Problem They’re Solving

How do you make a model learn from unlabeled data when you already have a decent amount of labeled data? A common technique is “self-training” (sketched in code after the list below):

  1. Train a “teacher” model on your labeled data.
  2. Use this teacher to predict “pseudo-labels” for your unlabeled data.
  3. Train a new “student” model on both the original labeled data and the new pseudo-labeled data.
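
To make the pipeline concrete, here is a minimal PyTorch-style sketch of that naive self-training loop. All names (`teacher`, `student`, `depth_loss`, the data loaders) are illustrative placeholders, not the authors’ actual code.

```python
# Minimal sketch of naive self-training, assuming PyTorch-style models and loaders.
# `teacher`, `student`, `depth_loss`, and the loaders are hypothetical placeholders.
import torch

@torch.no_grad()
def make_pseudo_labels(teacher, unlabeled_loader):
    """Step 2: the frozen teacher predicts a depth map for every unlabeled image."""
    teacher.eval()
    return [(images, teacher(images)) for images in unlabeled_loader]

def train_student(student, labeled_loader, pseudo_data, depth_loss, optimizer):
    """Step 3: the student trains on real labels plus the teacher's pseudo-labels."""
    student.train()
    for (x_l, y_l), (x_u, y_u) in zip(labeled_loader, pseudo_data):
        loss = depth_loss(student(x_l), y_l) + depth_loss(student(x_u), y_u)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```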

The problem is, if the student model is too similar to the teacher, it doesn’t learn much. It simply re-learns what the teacher already knows, including its mistakes. The authors found that this naive approach provided no improvement. Their key innovations were designed to break this cycle.

2. The Key Innovations

The success of Depth Anything hinges on two clever strategies implemented during the student training phase (a rough code sketch of both follows this list):

  1. A More Challenging Target: Instead of simply showing the student an unlabeled image and its pseudo-label, they force the student to work harder. The input image is subjected to aggressive data augmentations: strong color distortions, blurring, and CutMix (where patches of two images are cut and pasted together). To still match the pseudo-label that the teacher produced from the clean, un-augmented image, the student has to look beyond superficial textures and learn deeper, more robust visual knowledge.

  2. Inheriting Semantic Priors: A model that understands what objects are in a scene (a car, a person, the sky) can make better judgments about their depth. Instead of using a traditional auxiliary task like semantic segmentation (which they found to be ineffective), the authors take a more direct route: a feature alignment loss encourages the student’s encoder to produce representations similar to those of a powerful, pre-trained semantic model (DINOv2). This injects rich, high-level understanding into the depth model without collapsing the scene into a small set of discrete class labels.
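
A rough sketch of how these two ideas could look in PyTorch-style code is shown below. The specific augmentation parameters and the similarity margin (which stops pulling features that are already close to DINOv2’s) are illustrative assumptions, not the paper’s exact settings.

```python
# Sketch of the two student-training strategies; hyperparameters are illustrative.
import torch
import torch.nn.functional as F
from torchvision import transforms

# (1) Strong perturbations applied ONLY to the student's copy of an unlabeled image;
#     the teacher pseudo-labels the clean, un-augmented image.
strong_aug = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.GaussianBlur(kernel_size=9),
])

def cutmix(img_a, lab_a, img_b, lab_b, box):
    """Paste a rectangular patch of image B (and its pseudo-label) into image A."""
    x1, y1, x2, y2 = box
    img_a, lab_a = img_a.clone(), lab_a.clone()
    img_a[..., y1:y2, x1:x2] = img_b[..., y1:y2, x1:x2]
    lab_a[..., y1:y2, x1:x2] = lab_b[..., y1:y2, x1:x2]
    return img_a, lab_a

# (2) Feature alignment: pull the student's encoder features toward frozen DINOv2
#     features, ignoring pixels whose features are already similar enough.
def feature_alignment_loss(student_feats, dino_feats, margin=0.85):
    cos = F.cosine_similarity(student_feats, dino_feats, dim=1)  # (B, H, W)
    mask = cos < margin
    return (1.0 - cos[mask]).mean() if mask.any() else cos.new_zeros(())
```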

3. The Method, Step-by-Step

The entire process can be broken down into three main phases:

  1. Train the Teacher: First, a teacher model is trained on 1.5 million labeled images aggregated from six different public datasets. It uses an affine-invariant loss function, which is crucial for handling the different scales and shifts in depth values across varied datasets (a minimal version of this loss is sketched after this list).

  2. Build the Data Engine: The authors collect a massive dataset of 62 million diverse, unlabeled images from eight public sources like SA-1B, Open Images, and ImageNet. The trained teacher model then works through this entire dataset, generating a pseudo-depth map for every single image.

  3. Train the Smarter Student: A new student model is trained from scratch on a combined dataset of the 1.5M labeled images and the 62M pseudo-labeled images. During this phase, the two key innovations are applied: the challenging augmentations are used for the unlabeled inputs, and the semantic feature alignment loss helps guide the encoder. The result is a model that has learned from a vastly larger and more diverse set of visual data than any MDE model before it.
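
For reference, here is a minimal sketch of an affine-invariant loss in the spirit of the one described in step 1: each depth map is normalized by its own median and mean absolute deviation (as in MiDaS) before comparing prediction and target. Masking of invalid pixels is omitted, and details may differ from the paper’s exact implementation.

```python
# Minimal affine-invariant depth loss sketch (MiDaS-style normalization).
import torch

def affine_invariant_loss(pred, target, eps=1e-6):
    """pred, target: (B, H, W) depth/disparity maps. Invalid-pixel masking omitted."""
    def normalize(d):
        t = d.flatten(1).median(dim=1).values.view(-1, 1, 1)     # per-image shift (median)
        s = (d - t).abs().flatten(1).mean(dim=1).view(-1, 1, 1)  # per-image scale (mean abs. deviation)
        return (d - t) / (s + eps)
    return (normalize(pred) - normalize(target)).abs().mean()
```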

Key Experimental Results

The paper presents extensive evaluations that highlight the model’s impressive generalization capabilities.

  • Superior Zero-Shot Performance: As shown in Table 2, Depth Anything significantly outperforms the previous state-of-the-art, MiDaS v3.1, on six unseen evaluation datasets. For instance, on the DDAD autonomous driving dataset, the ViT-L version improves the Absolute Relative error from 0.251 (MiDaS) to 0.230. This holds true across different model sizes, with even the small Depth Anything (ViT-S) beating the much larger MiDaS model in several cases.

  • Effectiveness of the Training Strategies: The ablation studies in Table 9 are crucial. They show that simply adding unlabeled data with pseudo-labels gives almost no benefit. The performance jump comes from first adding the strong augmentations (S) and then adding the semantic constraint (L_feat), confirming that these two strategies are essential for unlocking the value of the unlabeled data.

  • Excellent Fine-tuning Potential: When fine-tuned for metric depth estimation on standard benchmarks like NYUv2 and KITTI (Tables 3 and 4), the pre-trained Depth Anything model sets new state-of-the-art records, demonstrating its value as a powerful foundation for downstream tasks.

  • A Powerful Multi-Task Encoder: The encoder trained by Depth Anything is not just a one-trick pony. When transferred to semantic segmentation, it outperforms strong baselines trained on ImageNet-21K (Tables 7 and 8). This shows that the process of learning depth from a massive, diverse dataset creates a highly capable and general-purpose visual feature extractor.

A Critical Look: Strengths & Limitations

Strengths / Contributions

  • Data-Centric Triumph: This work is a powerful demonstration that a well-designed, data-centric approach can be more effective than inventing complex new architectures. It highlights the immense value of leveraging large-scale unlabeled data.
  • Simple and Effective Strategies: The two core ideas—challenging the student with strong augmentations and using feature-level semantic alignment—are simple to understand and implement, yet proven to be highly effective for semi-supervised learning.
  • State-of-the-Art Practical Model: The paper delivers a set of models that are immediately useful for the community, exhibiting robust zero-shot generalization across a wide variety of indoor, outdoor, and challenging real-world scenes (as seen in Figure 1).
  • Creates a Strong Foundation Encoder: The resulting encoder is not only excellent for depth but also proves to be a top-tier backbone for other dense prediction tasks like semantic segmentation, making it a valuable asset for the broader computer vision community.

Limitations / Open Questions

  • Massive Computational Requirement: The primary barrier to this research is the immense computational cost. Training a model on over 60 million images requires resources that are unavailable to most academic labs, making it difficult to replicate or build upon the data pipeline.
  • Dependence on Teacher Quality: The entire process is bootstrapped from a “teacher” model. Any systemic biases or failures in the teacher (e.g., poor performance on certain types of scenes) could be propagated and potentially amplified across the 62 million pseudo-labeled images.
  • Relative vs. Metric Depth: The foundational zero-shot model produces relative depth (it knows what’s nearer and farther) but not metric depth (it doesn’t know the distance in meters). While it can be fine-tuned to achieve this, it’s an extra step required for many practical applications in robotics or autonomous driving (a common alignment workaround is sketched below).
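
One common workaround, borrowed from standard zero-shot evaluation protocols rather than from this paper specifically, is to fit a single scale and shift between the relative prediction and a handful of metric measurements (e.g., sparse LiDAR points), then apply that fit to the whole map. A minimal NumPy sketch, assuming the prediction and the measurements live in the same (inverse-)depth space:

```python
# Least-squares scale-and-shift alignment of a relative prediction to sparse metric data.
import numpy as np

def align_scale_shift(relative_pred, metric_sparse, mask):
    """Fit metric ≈ a * relative + b over pixels where mask is True, then apply densely."""
    x, y = relative_pred[mask], metric_sparse[mask]
    A = np.stack([x, np.ones_like(x)], axis=1)      # design matrix [x, 1]
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares scale and shift
    return a * relative_pred + b
```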

Contribution Level: Significant Improvement. This paper doesn’t introduce a fundamentally new paradigm. Instead, it takes the existing ideas of foundation models and self-training and executes them masterfully for monocular depth estimation. By solving the key challenges of learning from massive unlabeled datasets, it pushes the state-of-the-art forward by a substantial margin and provides a highly practical, powerful tool for the research community.

Conclusion: Potential Impact

“Depth Anything” provides a clear and effective blueprint for building powerful vision models by tapping into the world’s most abundant resource: unlabeled images. Its success reinforces the idea that scaling data, when done intelligently, is one of the most reliable paths to better generalization. The released models are already being used in a variety of applications, from image editing to 3D reconstruction, and the pre-trained encoder is a valuable starting point for any task requiring a deep understanding of scene geometry and semantics. This work will likely inspire further research into data-centric AI and novel ways to extract knowledge from the web-scale datasets that surround us.
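
As a practical note, the released checkpoints are easy to try. The sketch below uses the Hugging Face transformers depth-estimation pipeline; the model identifier is an assumption based on community-hosted checkpoints and may differ from what you find on the hub.

```python
# Hypothetical quick-start via the Hugging Face depth-estimation pipeline.
# The model id below is an assumption; check the hub for the official checkpoints.
from PIL import Image
from transformers import pipeline

pipe = pipeline(task="depth-estimation", model="LiheYoung/depth-anything-small-hf")
result = pipe(Image.open("photo.jpg"))
result["depth"].save("photo_depth.png")  # relative depth rendered as a grayscale image
```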

  • Title: Depth Anything: How 62 Million Unlabeled Photos Created a New State-of-the-Art Vision Model
  • Author: Jellyfish
  • Created at: 2025-10-06 14:49:30
  • Updated at: 2025-10-06 09:24:26
  • Link: https://makepaperseasy.com/posts/20251006144930.html
  • License: This work is licensed under CC BY-NC-SA 4.0.