Uncertainty Reduction through Ensemble Methods

Reducing uncertainty in medical segmentation using ensemble methods

AUTHOR
Rémy SIAHAAN--GENSOLLEN
PUBLISHED ON
September 7, 2025

This project is based on academic work carried out at ENSAE with my classmates Lucas Cumunel, Tara Leroux, and Léo Leroy, under the supervision of Xavier Coubez, PhD, and Tristan Kirsher. A detailed report (in French) is available by clicking on the PDF icon on the right. The repository can be found by clicking on the GitHub icon.

Automatic organ segmentation, although very useful in medical imaging, remains subject to high uncertainty, especially when it relies on (subjective) manual annotations. This project evaluates the use of an ensemble method to reduce this uncertainty, by training and combining several U-Nets on multiple CT scans annotated by different experts. We evaluate the accuracy of the predictions, as well as their aleatoric and epistemic uncertainty. Results indicate that this simple method significantly reduces the uncertainty of the predictions.

Background and project

Introduction

For several years, artificial intelligence has been revolutionizing medical practice, supporting doctors in their diagnoses and decision-making. Medical imaging, in particular, plays a central role in assessing patients' health and guiding their care [Li et al., 2023]. Automatic segmentation, that is, the precise delineation of organs and structures by algorithms, facilitates diagnosis, treatment planning, and clinical monitoring. These algorithms include convolutional neural networks (CNNs), powerful deep learning tools that have outperformed human experts in many image-understanding tasks [Sarvamangala and Kulkarni, 2022]. One of the most widely used CNN architectures for medical segmentation is the U-Net [Ronneberger et al., 2015].
Figure: 3D segmentation of the pancreas, kidneys and liver, along with a slice of the abdominal CT scan used to delineate them.
However, many of the structures and anomalies analyzed (organs, blood vessels, tumors, etc.) are particularly complex and variable, leading to a certain uncertainty in their delineation. This uncertainty is accentuated by inter-expert variability: different medical specialists may have different opinions on the precise location of the boundaries of segmented entities. It increases even further when multiple structures are predicted simultaneously. Neural networks must deal with these discrepancies, which sometimes leads to inconsistencies in segmentation results.
Quantifying these uncertainties allows for the generation of uncertainty maps on medical images, in order to isolate areas where physicians need to pay extra attention. This provides clinicians with better-calibrated predictions and integrates confidence measures into medical image analysis and subsequent decision-making [Kahl et al., 2024]. This not only improves the safety of AI-assisted diagnostics, but also makes algorithms more transparent and reliable for medical applications. Ensemble learning methods, which combine multiple individual models or their predictions, are a common choice for improving the performance of artificial intelligence models [Ganaie et al., 2022].

Uncertainty quantification

Machine learning models do not always clearly indicate their level of confidence in the predictions they produce: this is the problem of uncertainty in algorithmic predictions. Furthermore, medical experts may annotate the same image differently due to the ambiguity of certain anatomical structures. These disagreements reduce the quality of the annotations used to train the models and complicate the evaluation of their performance. The left figure below shows three slices of the CT scan (tomographic or abdominal scan) of the first patient in the dataset provided for the CURVAS challenge (more details below), together with the three annotations of the pancreas, kidneys and liver. The right figure highlights areas of disagreement:
Figure: contours drawn by three doctors for different organs on three CT-scan slices of the same patient.
Figure: areas of disagreement highlighted in yellow.
Theoretically, we distinguish two types of uncertainty which, when combined, give the Predictive Uncertainty (PU):
  • Aleatoric Uncertainty (AU), which comes from the data itself. It is linked to ambiguities intrinsic to the image: acquisition artifacts, digitization errors, and, as illustrated above, disagreements between annotators.
  • Epistemic Uncertainty (EU), which comes from the learning model itself. Typical causes are a lack of knowledge (not enough diverse data observed during training) or an architecture that cannot properly learn the structures of interest.
The most notable approach to capturing these uncertainties was introduced by [Kendall and Gal, 2017] and relies on Bayesian classifiers. Such a classifier receives an input x and produces probabilities for the classes Y:
\mathbb{P}(Y \mid x) = \mathbb{E}_{\omega \sim \Omega}\left[\mathbb{P}(Y \mid x, \omega)\right]
where the model parameters ω follow the posterior distribution P(ω | D) given the training data D.
This Bayesian framework [Smerkous et al., 2024] assumes that the predictive uncertainty is captured by the Predictive Entropy (PE), which is the sum of the Mutual Information (MI) and the Expected Entropy (EE), representing epistemic and aleatoric uncertainty respectively. Writing H for the Shannon entropy, we have:
\underbrace{\mathbb{H}(Y \mid x)}_{PU = PE} = \underbrace{\mathrm{MI}(Y, \Omega \mid x)}_{EU = MI} + \underbrace{\mathbb{E}_{\omega \sim \Omega}\left[\mathbb{H}(Y \mid \omega, x)\right]}_{AU = EE \ \text{(for } x \text{ i.i.d.)}}
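To make this decomposition concrete, here is a minimal NumPy sketch (not the project's exact code) that estimates PE, EE and MI from the stacked softmax outputs of the ensemble members; the array layout (members, voxels, classes) is an assumption chosen for the example.

```python
import numpy as np

def uncertainty_decomposition(member_probs, eps=1e-12):
    """Split predictive uncertainty from ensemble softmax outputs.

    member_probs: array of shape (M, N, C) holding the class probabilities
    of M ensemble members for N voxels and C classes (illustrative layout).
    Returns per-voxel predictive entropy (PU), expected entropy (aleatoric
    part) and mutual information (epistemic part).
    """
    # Monte Carlo approximation of P(Y | x): average over the members.
    mean_probs = member_probs.mean(axis=0)  # (N, C)

    # Predictive entropy H(Y | x): entropy of the averaged distribution.
    predictive_entropy = -np.sum(mean_probs * np.log(mean_probs + eps), axis=-1)

    # Expected entropy E_w[H(Y | w, x)]: mean of each member's own entropy.
    member_entropy = -np.sum(member_probs * np.log(member_probs + eps), axis=-1)
    expected_entropy = member_entropy.mean(axis=0)

    # Mutual information MI(Y, W | x) = PE - EE.
    mutual_information = predictive_entropy - expected_entropy
    return predictive_entropy, expected_entropy, mutual_information
```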
The interactive figure below, based on the thesis of [Lambert, 2024], illustrates the two types of uncertainty in a one-dimensional regression task. You can hover over the colored regions to see details, adjust their sizes, or change the shape of the function.
[Interactive figure: aleatoric and epistemic uncertainty for a one-dimensional regression g(x)]
Another very important concept is calibration. Neural networks produce probability distributions over possible class labels, which is a natural measure of uncertainty. Ideally, a well-calibrated model should have high confidence for correct predictions and low confidence for incorrect predictions. However, modern architectures often fail to achieve this ideal calibration. To assess calibration, reliability plots (or calibration graphs) are used: they compare predicted confidence to actual accuracy and highlight any deviations between the two.
Mathematically, a perfectly calibrated model satisfies:
\forall p \in [0, 1], \qquad \mathbb{P}\left(\hat{Y} = Y \mid \hat{P} = p\right) = p
In other words, if the model assigns an 80 % probability to a prediction, it should be correct 80 % of the time.

Experiment

Data and model

Held from May to October 2024, the CURVAS Challenge (Calibration and Uncertainty for Multi-Rater Volume Assessment in Multiorgan Segmentation) aimed to develop accurate segmentation models capable of providing both optimal calibration and quantification of inter-expert variability. For this project, we used the dataset released for the challenge, which includes 90 patient CT scans, each annotated by three different experts for the pancreas, kidneys, and liver. The figures above were generated using data from the first patient in the cohort. These CT scans were collected at University Hospital Erlangen between August and October 2023. A total of 20 scans were provided for training (group A), 5 for validation (group A), and 65 for testing (20 in group A, 22 in group B, and 23 in group C) [Riera-Marín et al., 2024].
For training, we used the nnU-Net (no-new-Net) framework [Isensee et al., 2018], a self-configuring library for training U-Net-based architectures [Ronneberger et al., 2015]. It is specifically designed for automated biomedical image segmentation: nnU-Net automatically configures many training parameters based on dataset characteristics. This is particularly valuable in clinical contexts, where medical images often vary in format (2D vs. 3D), resolution, saturation, and acquisition protocols due to the use of different imaging instruments. However, these architectures come with the drawback of being highly computationally intensive, requiring powerful GPUs.
We first trained 9 different models on the training dataset (20 patients): for each annotator, we trained three models with different weight initializations, in order to explore distinct optimization trajectories in the loss landscape. We then ran inference with each model on the test dataset (65 patients). For every model and patient, we systematically generated the predicted probabilities (softmax outputs), which were then used to construct four ensembles by averaging them: one per annotator-specific model triplet, and a general ensemble combining all nine models. Finally, for each patient and all 13 models (individual and ensembles), we computed prediction accuracy, as well as aleatoric and epistemic uncertainty estimates. These computations and results are detailed in the following sections.
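As a simplified sketch of that ensembling step, assuming the per-model softmax volumes are already available as NumPy arrays (the dictionary layout, shapes and names below are illustrative, not the repository's actual structure):

```python
import numpy as np

def build_ensembles(probs):
    """Average softmax volumes into annotator-specific and global ensembles.

    probs: nested dict probs[annotator][seed] -> array of shape (C, Z, Y, X),
    one entry per trained model (three annotators x three initializations).
    """
    # One ensemble per annotator: mean of the three differently initialized models.
    annotator_ensembles = {
        annotator: np.mean(np.stack(list(seeds.values())), axis=0)
        for annotator, seeds in probs.items()
    }
    # Global ensemble: mean of all nine models.
    all_models = np.stack([p for seeds in probs.values() for p in seeds.values()])
    global_ensemble = all_models.mean(axis=0)
    return annotator_ensembles, global_ensemble

# Tiny example with random placeholders standing in for real softmax outputs.
rng = np.random.default_rng(0)
probs = {a: {s: rng.dirichlet(np.ones(4), size=(8, 8, 8)).transpose(3, 0, 1, 2)
             for s in range(3)} for a in ("annotator1", "annotator2", "annotator3")}
per_annotator, overall = build_ensembles(probs)
```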

Evaluation

To analyse accuracy and uncertainty, we use a range of metrics from the CURVAS challenge (consensus-based DICE, Confidence, ECE, CRPS) [Riera-Marín et al., 2024] and the ValUES framework (ACE, AUROC, AURC, EAURC, NCC) [Kahl et al., 2024], as well as classical performance measures such as entropy and the Hausdorff distance. These metrics enable us to capture both types of uncertainty, aleatoric and epistemic, as well as the overall performance of the models. Without going into too much detail, let's look at a few of them:
The consensus-based DICE can be used to evaluate prediction performance and accuracy. It is an ordinary DICE score (a measure of similarity between two sets), but computed between the prediction and the consensus area (i.e., where all experts agree). This allows us to account for inter-expert variability (and thus aleatoric uncertainty). We calculate it for each organ and, as with a classic DICE score, the closer it is to 1, the more accurate the prediction. This measure gives us a general idea of the model's performance while accounting for inter-expert variability; a minimal implementation is sketched after the notation below.
\text{DICE} = \frac{2\,|P \cap G|}{|P| + |G|}
  • P: predicted segmentation.
  • G: consensus area among annotators.
  • |P \cap G|: intersection between the two segmentations (overlapping voxels).
  • |X|: total number of voxels in segmentation X.
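Below is a minimal implementation consistent with the formula above, where the consensus is taken as the voxels labelled by all experts; this is a simplified reading of the challenge's consensus-based DICE, and the names are illustrative.

```python
import numpy as np

def consensus_dice(pred_mask, expert_masks, eps=1e-8):
    """DICE between a predicted binary mask and the expert consensus.

    pred_mask: boolean array (Z, Y, X) for one organ.
    expert_masks: list of boolean arrays, one per annotator.
    """
    # Consensus area G: voxels on which all experts agree.
    consensus = np.logical_and.reduce(expert_masks)
    intersection = np.logical_and(pred_mask, consensus).sum()
    # 2 |P ∩ G| / (|P| + |G|); eps guards against empty masks.
    return 2.0 * intersection / (pred_mask.sum() + consensus.sum() + eps)
```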
The Expected Calibration Error (ECE) is used to evaluate epistemic uncertainty and calibration. It is computed by dividing the predicted probabilities into several intervals B_m between 0 and 1 (referred to as bins). Within each bin, the average confidence and the accuracy are computed. The ECE is the sum, weighted by bin size, of the absolute difference between accuracy and average confidence in each bin (hence the deviations mentioned earlier). The ECE is a key measure of model calibration: it is the sole calibration measure in the CURVAS challenge and is directly available in the torchmetrics library, which facilitates its use. A short NumPy sketch follows the notation below.
\text{ECE} = \sum_{m=1}^{B} \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{conf}(B_m) \right|
  • B: number of bins.
  • B_m: the m-th bin.
  • |B_m|: number of predictions in bin B_m.
  • n: total number of predictions.
  • acc(B_m): accuracy (proportion of correct predictions in bin B_m).
  • conf(B_m): confidence (average of the predicted probabilities in bin B_m).
  • |B_m| / n: weight associated with each bin, i.e. the proportion of predictions it contains.
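For illustration, here is a plain NumPy version of the formula that makes the binning explicit (torchmetrics provides an equivalent metric, as noted above; the variable names here are just for the example):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over voxel-level predictions.

    confidences: array (N,) of probabilities assigned to the predicted class.
    correct: boolean array (N,), True where the prediction matches the label.
    """
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()            # acc(B_m)
            conf = confidences[in_bin].mean()       # conf(B_m)
            ece += in_bin.mean() * abs(acc - conf)  # weight |B_m| / n
    return ece
```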
Two other metrics that can be used for failure detection are the AUROC and the AURC. The Area Under the Receiver Operating Characteristic curve (AUROC) measures the area under the ROC curve (True Positive Rate vs. False Positive Rate). More precisely, this curve plots the true positive rate against the false positive rate for different thresholds (the value above which a predicted probability is counted as positive). An ideal ROC curve would sit in the upper left corner of the axes (100 % true positives and 0 % false positives), so its area (AUROC) would equal 1, meaning that the model perfectly distinguishes positive voxels from negative ones (absence of the organ). Here we see the benefit of adding the ValUES framework, whose metrics capture other causes of epistemic uncertainty (in this case, linked to the identification of errors); a short example is given after the notation below.
\text{AUROC} = \int_{0}^{1} \text{TPR}(t) \, d\text{FPR}(t)
  • TPR(t): true positive rate at threshold t.
  • FPR(t): false positive rate at threshold t.
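Voxel-wise, the AUROC can be computed directly with scikit-learn; the arrays below are synthetic placeholders standing in for a real organ probability map and its consensus annotation:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=10_000)                             # 0/1 organ presence (placeholder)
p_fg = np.clip(0.6 * y_true + rng.normal(0.25, 0.2, 10_000), 0, 1)   # predicted foreground probability

auroc = roc_auc_score(y_true, p_fg)   # 1.0 would mean perfect separation of organ vs. background
print(f"AUROC: {auroc:.3f}")
```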
The Area Under the Risk Curve (AURC) is also an area, this time under a risk curve: for each confidence threshold, we compute the associated risk (here, the difference between annotation and prediction), so each point corresponds to a threshold, just as for the ROC curve. The smaller the area, the lower the risk at every threshold and the smaller the model's error. The AURC complements the AUROC by emphasizing the risk/quality trade-off, another beneficial contribution of ValUES; a sketch is given after the notation below.
\text{Risk}(t) = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i(t) \right|
\text{AURC} = \int_{0}^{1} \text{Risk}(t) \, dt
  • y_i: value from the annotation.
  • \hat{y}_i(t): predicted value at threshold t.
  • N: number of voxels.
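The following sketch applies the definition above literally, thresholding the foreground probability at each value of t and integrating the resulting risk numerically (the threshold grid and names are illustrative choices):

```python
import numpy as np

def area_under_risk_curve(y_true, p_fg, thresholds=np.linspace(0.0, 1.0, 101)):
    """AURC as defined above: integral over t of the mean absolute error
    between the annotation and the prediction thresholded at t."""
    risks = [np.mean(np.abs(y_true - (p_fg >= t).astype(float))) for t in thresholds]
    return np.trapz(risks, thresholds)   # numerical integral of Risk(t) on [0, 1]
```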

Results

We compute four ensembles: three combining, for each annotator, the models trained on that annotator's labels, and one combining all nine models. Several metrics show a reduction in prediction uncertainty and an improvement in prediction accuracy.

Uncertainty reduction

The figure below shows violin plots of the distribution of the mean ECE (averaged across the three organs), overlaid with a classic boxplot. Each point in the distribution corresponds to a patient–model pair (three times as many points for individual models as for annotator ensembles, and nine times as many as for the overall ensemble). We observe a strong decrease in the number of extreme values. In addition, both the mean and the median decrease, which clearly indicates a reduction in uncertainty. This result is statistically significant according to a one-sided ("less") Wilcoxon signed-rank test [Wilcoxon, 1945] (see the PDF report).
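For reference, such a test can be run with SciPy; the paired per-patient ECE values below are synthetic placeholders, not the study's actual measurements:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
ece_ensemble = rng.uniform(0.01, 0.05, size=65)                  # mean ECE per test patient (placeholder)
ece_individual = ece_ensemble + rng.uniform(0.0, 0.03, size=65)  # paired values for an individual model

# One-sided test: is the ensemble's ECE stochastically smaller than the individual model's?
stat, p_value = wilcoxon(ece_ensemble, ece_individual, alternative="less")
print(f"Wilcoxon statistic = {stat:.1f}, p-value = {p_value:.2e}")
```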
The AURC and EAURC distributions are very similar, which means that our models closely approximate a model with a perfect risk/quality trade-off. Likewise, the differences between models are quite small, except that the ensemble models have a lower dispersion, and therefore greater consistency in their results. However, this concentration of the distribution also means that, for some patients, certain randomly initialized models perform slightly better than the ensemble models. We also still observe differences between organs, with the liver in the lead: little risk of error combined with good prediction quality. Despite these differences, the scores remain very good overall and indicate good uncertainty management by the models.

Prediction performance

Regarding prediction performance, we once again observe the superiority of the ensemble models over the others. The average DICE across the three organs is higher for the global ensemble than for the per-annotator ensembles, which in turn is higher than for the individual models. This result is also statistically significant (see the PDF report).

Bibliography

Ganaie, M.A., Hu, M., Malik, A.K., Tanveer, M., and Suganthan, P.N. (2022). Ensemble deep learning: A review. Engineering Applications of Artificial Intelligence, vol. 115, 105151. DOI: 10.1016/j.engappai.2022.105151
Isensee, F., Petersen, J., Klein, A., Zimmerer, D., Jaeger, P.F., Kohl, S., Wasserthal, J., Koehler, G., Norajitra, T., Wirkert, S., and Maier-Hein, K.H. (2018). nnU-Net: Self-adapting Framework for U-Net-Based Medical Image Segmentation.
Kahl, K.-C., Lüth, C.T., Zenk, M., Maier-Hein, K., and Jaeger, P.F. (2024). ValUES: A Framework for Systematic Validation of Uncertainty Estimation in Semantic Segmentation.
Kendall, A. and Gal, Y. (2017). What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?
Lambert, B. (2024). Quantification et caractérisation de l'incertitude de segmentation d'images médicales par des réseaux profonds. Thesis.
Li, M., Jiang, Y., Zhang, Y., and Zhu, H. (2023). Medical image analysis using deep learning algorithms. Frontiers in Public Health, vol. 11. DOI: 10.3389/fpubh.2023.1273253
Riera-Marín, M., Kleiß, J.-M., Aubanell, A., and Antolín, A. (2024). CURVAS dataset. DOI: 10.5281/zenodo.12687192
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation.
Sarvamangala, D.R. and Kulkarni, R.V. (2022). Convolutional neural networks in medical image understanding: a survey. Evolutionary Intelligence, vol. 15(1), pp. 1-22. DOI: 10.1007/s12065-020-00540-3
Smerkous, D., Bai, Q., and Li, F. (2024). Enhancing Diversity in Bayesian Deep Learning via Hyperspherical Energy Minimization of CKA.
Wilcoxon, F. (1945). Individual Comparisons by Ranking Methods. Biometrics Bulletin, vol. 1(6), pp. 80-83.

© Rémy SIAHAAN–GENSOLLEN, 2025
remy-siahaan.com