
Stop Tuning Your Losses: How Uncertainty Can Automatically Balance Multi-Task Learning Models

Paper at a Glance
- Paper Title: Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics
- Authors: Alex Kendall, Yarin Gal, Roberto Cipolla
- Affiliation: University of Cambridge
- Published in: Conference on Computer Vision and Pattern Recognition (CVPR), 2018
- Link to Paper: https://openaccess.thecvf.com/content_cvpr_2018/html/Kendall_Multi-Task_Learning_Using_CVPR_2018_paper.html
The Gist of It: TL;DR
In one sentence: This paper proposes a principled method to automatically balance the loss functions in a multi-task learning model by framing the loss weights as task-dependent (homoscedastic) uncertainty, which the model learns during training, thereby eliminating the need for manual weight tuning.
Why It Matters: The Big Picture
Multi-Task Learning (MTL) is a powerful concept in deep learning. The idea is to train a single model to perform several different tasks simultaneously—for instance, a self-driving car’s vision system might need to identify objects (semantic segmentation), distinguish between individual cars (instance segmentation), and estimate distances (depth estimation) all at once. By sharing a common “backbone,” the model can learn a richer, more general representation of the world, often leading to better performance and computational efficiency than training separate models for each task.
But there’s a catch, and it’s a big one. To combine the tasks, you have to sum their individual loss functions. The total loss might look something like this:

$ L_{\text{total}} = w_1 L_1 + w_2 L_2 + w_3 L_3 $

The problem is choosing the weights ($w_1$, $w_2$, $w_3$). If the depth loss is naturally much larger than the segmentation loss (e.g., meters vs. a probability), it will dominate the training, and the model will neglect the other tasks. Finding the right balance is a frustrating and expensive process of manual trial-and-error, often involving a massive grid search that can take days or weeks. This practical bottleneck makes MTL prohibitive for many real-world applications. This paper offers an elegant solution: what if the model could learn the optimal weights itself?
The Core Idea: How It Works
The authors reframe the problem from a probabilistic perspective, using the concept of homoscedastic uncertainty to automatically balance the losses. Homoscedastic uncertainty refers to the uncertainty inherent to a task that is constant across all input data. Think of it as a measure of the task’s intrinsic difficulty or noise level.
1. The Problem They’re Solving
Manually setting loss weights is difficult because the optimal weight for a task depends on its units (e.g., meters for depth vs. pixels for instance vectors), its scale, and its inherent noise. A fixed, hand-tuned weight is a crude approximation that fails to capture these dynamics.
2. The Key Innovation
The central idea is to derive a multi-task loss function directly from maximizing the probabilistic likelihood of the model’s outputs. By doing this, learnable parameters representing the uncertainty of each task naturally appear in the loss function, acting as adaptive weights.
3. The Method, Step-by-Step
Let’s break down how this works for two regression tasks, like depth and instance segmentation.
- Probabilistic Framing: For a regression task, we can model the output as a sample from a Gaussian distribution. The model predicts the mean $f(x)$, and the true value $y$ is assumed to be drawn from $p(y \mid f(x)) = \mathcal{N}\big(f(x), \sigma^2\big)$, where $\sigma^2$ is the variance, representing the observation noise.
- Maximizing Log-Likelihood: In machine learning, we train by maximizing the log-likelihood of the data. For the Gaussian above, the log-likelihood is:

  $ \log p(y \mid f(x)) \propto -\frac{1}{2\sigma^2} \lVert y - f(x) \rVert^2 - \log \sigma $

  Notice two things here: $\lVert y - f(x) \rVert^2$ is just the standard L2 loss, and it is weighted by $1/\sigma^2$. The $\log \sigma$ term acts as a regularizer, preventing the model from simply driving the uncertainty $\sigma$ to infinity to make the loss zero.
- Combining Multiple Tasks: When we have multiple tasks, we assume their noise distributions are independent. To get the joint probability, we multiply the individual likelihoods; in log-space, this means adding them up. For two tasks with outputs $y_1$ and $y_2$ and corresponding uncertainties $\sigma_1$ and $\sigma_2$, the combined loss function to minimize becomes:

  $ \mathcal{L}(W, \sigma_1, \sigma_2) = \frac{1}{2\sigma_1^2} \mathcal{L}_1(W) + \frac{1}{2\sigma_2^2} \mathcal{L}_2(W) + \log \sigma_1 + \log \sigma_2 $

  Here, $\mathcal{L}_1(W)$ and $\mathcal{L}_2(W)$ are the individual losses for each task.
- Learning the Weights: The crucial step is to treat $\sigma_1$ and $\sigma_2$ not as fixed hyperparameters but as learnable parameters of the model. During training, the network adjusts $\sigma_1$ and $\sigma_2$ via backpropagation, just like any other weight.
  - If a task is noisy or difficult early in training, the model can increase its corresponding $\sigma$, effectively down-weighting its loss and focusing on easier tasks.
  - As the model gets better at a task, it can decrease the $\sigma$ to focus more on refining its predictions.
  - The $\log \sigma$ regularizer ensures the uncertainties don’t grow uncontrollably.
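To build intuition for what the learnable uncertainties converge to, here is a small pure-Python sketch. For a fixed task loss $L$, each task contributes $\frac{1}{2} e^{-s} L + \frac{1}{2} s$ with $s = \log \sigma^2$, whose minimum sits at $\sigma^2 = L$, so every task's weighted loss is pulled toward the same magnitude. The loss magnitudes below are illustrative assumptions, not numbers from the paper:

```python
import math

# Assumed, representative task losses (illustrative, not from the paper):
# a depth loss of ~100 (meters^2) and a segmentation loss of ~1 (nats).
task_losses = {"depth": 100.0, "segmentation": 1.0}

# Gradient descent on s = log(sigma^2) for each task.
# Per-task objective: f(s) = 0.5 * exp(-s) * L + 0.5 * s,
# so f'(s) = -0.5 * exp(-s) * L + 0.5, with minimum at s* = log(L).
log_vars = {name: 0.0 for name in task_losses}
lr = 0.1
for _ in range(2000):
    for name, L in task_losses.items():
        s = log_vars[name]
        grad = -0.5 * math.exp(-s) * L + 0.5
        log_vars[name] = s - lr * grad

for name in task_losses:
    sigma_sq = math.exp(log_vars[name])
    print(f"{name}: sigma^2 -> {sigma_sq:.2f}, "
          f"effective weight 1/(2*sigma^2) = {0.5 / sigma_sq:.4f}")
```

At convergence $\sigma^2 \approx L$ for each task, so the noisy, large-magnitude depth loss receives a small weight and the segmentation loss a large one: the rebalancing happens with no hand tuning.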
The authors show this framework can be extended to classification tasks (like semantic segmentation) by introducing a temperature term into the softmax function, which also relates to uncertainty. The result is a unified loss function for any combination of regression and classification tasks.
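Putting the pieces together, the combined loss can be expressed as a small PyTorch module. This is a minimal sketch, not the authors' code; following the numerical-stability trick mentioned in the paper, it learns $s_i = \log \sigma_i^2$ rather than $\sigma_i$ directly:

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Combine per-task losses using learned homoscedastic uncertainty.

    For each task i we learn s_i = log(sigma_i^2); the combined loss is
    sum_i [ 0.5 * exp(-s_i) * L_i + 0.5 * s_i ], matching the paper's
    1/(2*sigma^2) weighting plus the log(sigma) regularizer.
    """

    def __init__(self, num_tasks: int):
        super().__init__()
        # Initialized at s = 0, i.e. sigma^2 = 1 (equal weighting at start).
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = torch.zeros(())
        for s, task_loss in zip(self.log_vars, task_losses):
            total = total + 0.5 * torch.exp(-s) * task_loss + 0.5 * s
        return total
```

In a training loop, the module's parameters are simply handed to the same optimizer as the backbone's weights, so the uncertainties are updated by backpropagation alongside everything else.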
Key Experimental Results
The authors applied their method to a challenging scene understanding problem on the CityScapes dataset, combining three tasks: semantic segmentation, instance segmentation, and depth regression.
-
Multi-tasking is better, but only if weighted correctly: As shown in Figure 2 of the paper, manually sweeping through different loss weights reveals a “sweet spot” where the multi-task model outperforms single-task models. However, this spot is narrow; a poor choice of weights leads to worse performance. Their proposed uncertainty weighting method automatically finds a balance that achieves superior performance for all tasks.
-
Outperforming Baselines: Table 1 shows that the uncertainty-weighted model significantly outperforms models trained on each task individually, as well as MTL models using a naive unweighted sum of losses or even carefully hand-tuned “optimal” weights found via grid search. For example, semantic segmentation IoU improves from 59.4% (single task) to 63.4% (3-task with uncertainty weighting).
-
State-of-the-Art System: On the full-resolution CityScapes benchmark (Table 2), their model was the first to tackle all three tasks jointly, achieving competitive performance and demonstrating the real-world applicability of their method.
A Critical Look: Strengths & Limitations
Strengths / Contributions
- Principled and Elegant Solution: The paper provides a theoretically sound, probabilistic foundation for weighting losses in MTL, replacing ad-hoc manual tuning with a learnable mechanism.
- Automation and Practicality: It automates the most painful part of multi-task learning. This makes MTL far more accessible and practical for researchers and engineers, as it removes the need for expensive and time-consuming hyperparameter searches.
- Improved Performance: The method doesn’t just simplify the process; it leads to better results. By dynamically balancing the tasks, it allows the model to leverage shared information more effectively, outperforming both single-task models and manually weighted MTL models.
Limitations / Open Questions
- Task Homogeneity: The experiments focus on dense, per-pixel prediction tasks. It is less clear how this method would perform when combining tasks with very different structures (e.g., a per-pixel task with a single, global image-level classification).
- Interpretation of Uncertainty: While framed as “homoscedastic uncertainty,” the learned $\sigma$ values are a complex function of task scale, units, and training dynamics. Their direct interpretation as a true statistical measure of task noise might be an oversimplification.
- Optimization Stability: The paper reports robust results, but adding more learnable parameters (a $\sigma$ for each task) could, in theory, introduce new optimization challenges in more complex or unstable training setups. The use of log-variance is a clever trick to improve numerical stability.
Contribution Level: Significant Improvement. This work didn’t invent multi-task learning, but it solved a massive, practical bottleneck that was holding the field back. It provided a simple, elegant, and highly effective solution that has been widely adopted and cited, fundamentally improving how MTL is implemented in practice.
Conclusion: Potential Impact
This paper is a landmark in the field of multi-task learning. By introducing a principled way to automatically learn loss weights, Kendall et al. transformed MTL from a powerful but impractical art into a robust and accessible engineering tool. Anyone working with multiple objectives—from robotics and autonomous driving to medical imaging—can benefit from this technique. It encourages a more thoughtful, probabilistic approach to model design and has paved the way for more complex and capable multi-task architectures. The next steps might involve exploring how this concept applies to heterogeneous tasks or how it interacts with other forms of uncertainty in deep learning.
- Author: Jellyfish
- Created at: 2025-10-04 13:35:47
- Updated at: 2025-10-06 09:24:26
- Link: https://makepaperseasy.com/posts/20251004133547.html
- License: This work is licensed under CC BY-NC-SA 4.0.









