A Cautionary Analysis of STAPLE Using Direct Inference of Segmentation Truth

  • Koen Van Leemput
  • Mert R. Sabuncu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8673)


In this paper we analyze the properties of the well-known segmentation fusion algorithm STAPLE, using a novel inference technique that analytically marginalizes out all model parameters. We demonstrate both theoretically and empirically that when the number of raters is large, or when consensus regions are included in the model, STAPLE devolves into thresholding the average of the input segmentations. We further show that when the number of raters is small, the STAPLE result may not be the optimal segmentation truth estimate, and its model parameter estimates might not reflect the individual raters’ actual segmentation performance. Our experiments indicate that these intrinsic weaknesses are frequently exacerbated by the presence of undesirable global optima and convergence issues. Together these results cast doubt on the soundness and usefulness of typical STAPLE outcomes.
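To make the phrase "thresholding the average of the input segmentations" concrete, here is a minimal sketch of that baseline for binary label maps. It is not the STAPLE algorithm, nor the authors' inference technique. The function name `threshold_average`, the flat-list voxel layout, and the default threshold of 0.5 are all illustrative assumptions; the abstract does not state which threshold applies.

```python
from typing import List, Sequence


def threshold_average(
    segmentations: Sequence[Sequence[int]],
    threshold: float = 0.5,  # assumed value; not specified in the abstract
) -> List[int]:
    """Fuse binary segmentations by thresholding their voxel-wise average."""
    if not segmentations:
        raise ValueError("need at least one segmentation")
    n_voxels = len(segmentations[0])
    if any(len(s) != n_voxels for s in segmentations):
        raise ValueError("all segmentations must have the same number of voxels")

    n_raters = len(segmentations)
    fused = []
    # zip(*segmentations) yields, for each voxel, the labels given by all raters.
    for voxel_labels in zip(*segmentations):
        average = sum(voxel_labels) / n_raters
        fused.append(1 if average > threshold else 0)
    return fused


# Hypothetical toy data: three raters, five voxels each.
raters = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
fused = threshold_average(raters)
```

With a threshold of 0.5 this reduces to majority voting. Raising the threshold makes the fused segmentation more conservative.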


Keywords: Segmentation Truth; Truth Label; Direct Inference; Segmentation Performance; Consensus Region



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Koen Van Leemput (1, 2)
  • Mert R. Sabuncu (1)

  1. A.A. Martinos Center for Biomedical Imaging, MGH, Harvard Medical School, USA
  2. Technical University of Denmark, Denmark
