Computational Statistics, Volume 30, Issue 2, pp 293–316

Using visual statistical inference to better understand random class separations in high dimension, low sample size data

  • Niladri Roy Chowdhury
  • Dianne Cook
  • Heike Hofmann
  • Mahbubul Majumder
  • Eun-Kyung Lee
  • Amy L. Toth
Original Paper

Abstract

Statistical graphics play an important role in exploratory data analysis, model checking and diagnosis. With high dimensional data, this often means plotting low-dimensional projections; in classification tasks, for example, projection pursuit is used to find low-dimensional projections that reveal differences between labelled groups. In many contemporary data sets the number of observations is small relative to the number of variables, which is known as a high dimension, low sample size (HDLSS) problem. This paper explores the use of visual inference for understanding low-dimensional pictures of HDLSS data. Visual inference helps to quantify the significance of findings made from graphics. This approach may help to broaden the understanding of issues related to HDLSS data in the data analysis community. Methods are illustrated using data from a published paper, which erroneously found real separation in microarray data, and with a simulation study conducted using Amazon’s Mechanical Turk.
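The random class separation described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: it swaps projection pursuit for a plain minimum-norm least-squares direction, and the sample size, dimension, seed, and the helper names `separating_projection` and `gap` are all assumptions made for illustration. The data are pure noise, so any separation found is spurious. The sketch also refits the direction on permuted labels, in the spirit of the null plots of a lineup, and checks the original direction on fresh noise.

```python
import numpy as np


def separating_projection(X, labels):
    """Minimum-norm least-squares direction w with X @ w close to labels.

    Hypothetical helper for illustration; it stands in for a projection
    pursuit search and is not the index used in the paper.
    """
    w, *_ = np.linalg.lstsq(X, labels, rcond=None)
    return w


def gap(scores, labels):
    """Smallest score in class +1 minus largest score in class -1.

    A positive value means the two classes are perfectly separated
    along the projection.
    """
    return scores[labels == 1].min() - scores[labels == -1].max()


rng = np.random.default_rng(0)
n, p = 30, 100                            # assumed sizes: fewer rows than columns
X = rng.standard_normal((n, p))           # pure noise, no real class structure
labels = np.repeat([-1.0, 1.0], n // 2)   # arbitrary class labels

# Direction fitted to the noise data and the arbitrary labels.
w = separating_projection(X, labels)
train_gap = gap(X @ w, labels)

# Null comparisons: permute the labels and refit the direction each time.
null_gaps = []
for _ in range(19):
    permuted = rng.permutation(labels)
    w_null = separating_projection(X, permuted)
    null_gaps.append(gap(X @ w_null, permuted))

# Fresh noise projected onto the original direction.
X_new = rng.standard_normal((n, p))
test_gap = gap(X_new @ w, labels)

print(f"gap on fitted data:      {train_gap:.3f}")
print(f"smallest permuted gap:   {min(null_gaps):.3f}")
print(f"gap on fresh noise data: {test_gap:.3f}")
```

Because there are more variables than observations, the fitted direction separates the classes perfectly for the original labels and for every permutation of them, so the separation in the fitted data looks no different from the null cases. On fresh noise the same direction shows no separation.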

Keywords

  • Statistical graphics
  • Lineup
  • Visualization
  • Projection pursuit
  • Data mining

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Niladri Roy Chowdhury (1)
  • Dianne Cook (1)
  • Heike Hofmann (1)
  • Mahbubul Majumder (2)
  • Eun-Kyung Lee (3)
  • Amy L. Toth (4, 5)

  1. Department of Statistics, Iowa State University, Ames, USA
  2. Department of Mathematics, University of Nebraska, Omaha, USA
  3. Department of Statistics, EWHA Womans University, Seoul, Korea
  4. Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, USA
  5. Department of Entomology, Iowa State University, Ames, USA
