
Statistical Methods in High Dimensions

  • Florian Frommlet (corresponding author)
  • Małgorzata Bogdan
  • David Ramsey
Chapter
Part of the Computational Biology book series (COBO, volume 18)

Abstract

This is the core chapter introducing the theory behind the advanced statistical methods applied in the later chapters on QTL mapping and GWAS analysis. More basic statistical methods are covered in the Appendix. Section 3.2 discusses the use of classical multiple testing procedures, such as the Bonferroni correction, as well as approaches based on permutation and resampling, all of which guarantee control of the familywise error rate (FWER). Afterwards, more modern techniques, such as the Benjamini-Hochberg procedure for controlling the false discovery rate (FDR), are discussed, followed by a more advanced theoretical treatment of optimal multiple testing strategies in high dimensions. The second part of the chapter is concerned with model selection. Section 3.3 starts by introducing the basic concepts of likelihood and then recapitulates the derivation of Akaike's information criterion (AIC) from information theoretic principles. This is compared with the use of the Bayesian information criterion (BIC) in the context of Bayesian model selection. It is then explained why both AIC and BIC fail in a high-dimensional setting, and modifications of BIC designed to control either the FWER or the FDR are presented. The chapter ends with a discussion of various further approaches to model selection in high dimensions.
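
To make the two procedures named for Section 3.2 concrete, here is a minimal Python sketch contrasting the Bonferroni correction (FWER control) with the Benjamini-Hochberg step-up procedure (FDR control) on a vector of p-values. The function names and example p-values are illustrative, not taken from the chapter.

```python
import numpy as np

def bonferroni_reject(pvals, alpha=0.05):
    """Reject H_i whenever p_i <= alpha/m; controls the FWER at level alpha."""
    p = np.asarray(pvals)
    return p <= alpha / p.size

def benjamini_hochberg_reject(pvals, q=0.05):
    """Step-up BH procedure: find the largest k with p_(k) <= k*q/m and
    reject the hypotheses with the k smallest p-values (controls the FDR
    at level q for independent test statistics)."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    passed = np.nonzero(p[order] <= np.arange(1, m + 1) * q / m)[0]
    reject = np.zeros(m, dtype=bool)
    if passed.size > 0:
        reject[order[:passed[-1] + 1]] = True
    return reject

pvals = [0.001, 0.008, 0.012, 0.041, 0.27, 0.65]
print(bonferroni_reject(pvals))          # rejects two hypotheses
print(benjamini_hochberg_reject(pvals))  # rejects three; BH is never more conservative
```

For Section 3.3, the sketch below computes AIC and BIC for a Gaussian linear model from its residual sum of squares, together with one modified-BIC-style penalty of the kind developed for sparse high-dimensional regression (see, e.g., [11, 15]). The extra term 2k log(m/c) and the default constant c = 4 are assumptions made for illustration, not the chapter's definitive formula.

```python
import numpy as np

def gaussian_ic(rss, n, k, m=None, c=4.0):
    """AIC/BIC for a Gaussian linear model with k regressors fitted on n
    observations, based on the profiled log-likelihood; the mBIC-style value
    adds a penalty depending on the number m of candidate regressors."""
    loglik = -0.5 * n * (np.log(2.0 * np.pi * rss / n) + 1.0)
    aic = -2.0 * loglik + 2.0 * k
    bic = -2.0 * loglik + k * np.log(n)
    mbic = bic + 2.0 * k * np.log(m / c) if m is not None else None
    return aic, bic, mbic

# Example: a 5-regressor model chosen from m = 10000 candidate regressors.
print(gaussian_ic(rss=420.0, n=500, k=5, m=10_000))
```

The third return value illustrates why plain BIC fails in high dimensions: with thousands of candidate regressors, the additional 2k log(m/c) term is what prevents the criterion from admitting too many spurious variables.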

Keywords

FDR · False Discovery Rate · Bayesian Information Criterion · Bonferroni Procedure · Bayesian Model Selection · Multiple Testing Procedure

References

  1. Abramovich, F., Benjamini, Y., Donoho, D.L., Johnstone, I.M.: Adapting to unknown sparsity by controlling the false discovery rate. Ann. Stat. 34, 584–653 (2006)
  2. Akaike, H.: A new look at the statistical model identification. IEEE Trans. Autom. Control 19(6), 716–723 (1974)
  3. Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Proceedings of the 2nd International Symposium on Information Theory, pp. 267–281 (1973)
  4. Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57, 289–300 (1995)
  5. Benjamini, Y., Hochberg, Y.: On the adaptive control of the false discovery rate in multiple testing with independent statistics. J. Educ. Behav. Stat. 25, 60–83 (2000)
  6. Benjamini, Y., Yekutieli, D.: The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 29(4), 1165–1188 (2001)
  7. Bera, A.K., Bilias, Y.: Rao’s score, Neyman’s \(C(\alpha )\) and Silvey’s LM tests: an essay on historical developments and some new results. J. Stat. Plan. Infer. 97, 9–44 (2001)
  8. Birgé, L., Massart, P.: Gaussian model selection. J. Eur. Math. Soc. (JEMS) 3, 203–268 (2001)
  9. Bogdan, M., Chakrabarti, A., Frommlet, F., Ghosh, J.K.: Asymptotic Bayes-optimality under sparsity of some multiple testing procedures. Ann. Stat. 39, 1551–1579 (2011)
  10. Bogdan, M., Frommlet, F., Szulc, P., Tang, H.: Model selection approach for genome wide association studies in admixed populations. Technical Report (2013)
  11. Bogdan, M., Ghosh, J.K., Doerge, R.W.: Modifying the Schwarz Bayesian information criterion to locate multiple interacting quantitative trait loci. Genetics 167, 989–999 (2004)
  12. Bogdan, M., Ghosh, J.K., Tokdar, S.T.: A comparison of the Simes-Benjamini-Hochberg procedure with some Bayesian rules for multiple testing. In: Balakrishnan, N., Peña, E., Silvapulle, M.J. (eds.) Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen, IMS Collections, vol. 1, pp. 211–230. Beachwood, OH (2008)
  13. Bogdan, M., van den Berg, E., Sabatti, C., Su, W., Candès, E.J.: SLOPE—adaptive variable selection via convex optimization. Ann. Appl. Stat. 9, 1103–1140 (2015)
  14. Bogdan, M., van den Berg, E., Su, W., Candès, E.J.: Statistical estimation and testing via the sorted \(\ell _1\) norm. arXiv:1310.1969 (2013)
  15. Bogdan, M., Żak-Szatkowska, M., Ghosh, J.K.: Selecting explanatory variables with the modified version of the Bayesian information criterion. Qual. Reliab. Eng. Int. 24, 627–641 (2008)
  16. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
  17. Broberg, P.: A comparative review of estimates of the proportion unchanged genes and the false discovery rate. BMC Bioinform. 6, 199 (2005)
  18. Broman, K.W., Speed, T.P.: A model selection approach for the identification of quantitative trait loci in experimental crosses. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 64(4), 641–656 (2002)
  19. Bühlmann, P., van de Geer, S.: Statistics for High-Dimensional Data. Springer, Heidelberg (2011)
  20. Burnham, K.P., Anderson, D.R.: Model Selection and Multimodel Inference, 2nd edn. Springer, New York (2002)
  21. Cai, T., Jin, J.: Optimal rates of convergence for estimating the null and proportion of non-null effects in large-scale multiple testing. Ann. Stat. 38, 100–145 (2010)
  22. Candès, E.J., Plan, Y.: Near-ideal model selection by \(\ell _1\) minimization. Ann. Stat. 37, 2145–2177 (2009)
  23. Chipman, H., George, E.I., McCulloch, R.E.: The practical implementation of Bayesian model selection. In: Lahiri, P. (ed.) Model Selection (IMS Lecture Notes), pp. 65–116. Beachwood, OH (2001)
  24. Chun, H., Keles, S.: Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 72(1), 3–25 (2010)
  25. Churchill, G.A., Doerge, R.W.: Empirical threshold values for quantitative trait mapping. Genetics 138, 963–971 (1994)
  26. De Leeuw, J., Hornik, K., Mair, P.: Isotone optimization in R: pool-adjacent-violators algorithm (PAVA) and active set methods. J. Stat. Softw. 32(5), 1–24 (2009)
  27. Do, K., Müller, P., Tang, F.: A Bayesian mixture model for differential gene expression. Appl. Stat. 54, 627–644 (2005)
  28. Doerge, R.W., Churchill, G.A.: Permutation tests for multiple loci affecting a quantitative character. Genetics 142, 285–294 (1996)
  29. Donoho, D., Tanner, J.: Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing. Phil. Trans. R. Soc. A 367, 4273–4293 (2009)
  30. Dudoit, S., Shaffer, J.P., Boldrick, J.C.: Multiple hypothesis testing in microarray experiments. Stat. Sci. 18, 71–103 (2003)
  31. Dudoit, S., van der Laan, M.J.: Multiple Testing Procedures with Applications to Genomics. Springer, New York (2008)
  32. Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32(2), 407–499 (2004)
  33. Efron, B., Tibshirani, R., Storey, J.D., Tusher, V.: Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc. 96, 1151–1160 (2001)
  34. Efron, B., Tibshirani, R.: Empirical Bayes methods and false discovery rates for microarrays. Genet. Epidemiol. 23, 70–86 (2002)
  35. Efron, B.: Microarrays, empirical Bayes and the two-group model. Stat. Sci. 23(1), 1–22 (2008)
  36. Ferreira, J.A., Zwinderman, A.H.: On the Benjamini-Hochberg method. Ann. Stat. 34(4), 1827–1849 (2006)
  37. Foster, D.P., Stine, R.A.: Local asymptotic coding and the minimum description length. IEEE Trans. Inf. Theor. 45, 1289–1293 (1999)
  38. Frank, I.E., Friedman, J.H.: A statistical view of some chemometrics regression tools. Technometrics 35, 109–148 (1993)
  39. Frommlet, F., Bogdan, M.: Some optimality properties of FDR controlling rules under sparsity. Technical Report (2012)
  40. Frommlet, F., Chakrabarti, A., Murawska, M., Bogdan, M.: Asymptotic Bayes optimality under sparsity for generally distributed effect sizes under the alternative. arXiv:1005.4753 (2011)
  41. Genovese, C., Wasserman, L.: A stochastic process approach to false discovery control. Ann. Stat. 32, 1035–1061 (2004)
  42. Genovese, C., Wasserman, L.: Operating characteristics and extensions of the false discovery rate procedure. J. R. Stat. Soc. Ser. B 64, 499–517 (2002)
  43. George, E.I., Foster, D.P.: Calibration and empirical Bayes variable selection. Biometrika 87, 731–747 (2000)
  44. Ghosh, J.K., Samanta, T.: Model selection—an overview. Curr. Sci. 80, 1135–1144 (2001)
  45. Hochberg, Y., Tamhane, A.C.: Multiple Comparison Procedures. Wiley, New York (1987)
  46. Hochberg, Y.: A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75, 800–803 (1988)
  47. Hoerl, A.E., Kennard, R.W.: Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55–67 (1970)
  48. Holm, S.: A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6, 65–70 (1979)
  49. Hsu, J.C.: Multiple Comparisons: Theory and Methods. Chapman and Hall, New York (1996)
  50. James, W., Stein, C.: Estimation with quadratic loss. Proc. Fourth Berkeley Symp. Math. Stat. Prob. 1, 361–379 (1961)
  51. Jin, J., Cai, T.T.: Estimating the null and the proportion of non-null effects in large-scale multiple comparisons. J. Am. Stat. Assoc. 102, 495–506 (2007)
  52. Johnstone, I.M., Silverman, B.W.: EbayesThresh: R programs for empirical Bayes thresholding. J. Stat. Softw. 12(8) (2005)
  53. Johnstone, I.M., Silverman, B.W.: Needles and straw in haystacks: empirical Bayes estimates of possibly sparse sequences. Ann. Stat. 32, 1594–1649 (2004)
  54. Korn, E.L., Troendle, J.F., McShane, L.M., Simon, R.: Controlling the number of false discoveries: application to high-dimensional genomic data. J. Stat. Plan. Infer. 124(2), 379–398 (2004)
  55. Kullback, S.: Information Theory and Statistics. John Wiley and Sons, New York (1959)
  56. Lehmann, E.L., Romano, J.P.: Generalizations of the familywise error rate. Ann. Stat. 33, 1138–1154 (2005)
  57. Lehmann, E.L., Romano, J.P.: Testing Statistical Hypotheses. Springer, New York (2005)
  58. Lehmann, E.L., D’Abrera, H.J.M.: Nonparametrics: Statistical Methods Based on Ranks. McGraw-Hill, New York (1975)
  59. Marcus, R., Peritz, E., Gabriel, K.R.: On closed testing procedures with special reference to ordered analysis of variance. Biometrika 63, 655–660 (1976)
  60. Martin, R., Tokdar, S.T.: A nonparametric empirical Bayes framework for large-scale multiple testing. Biostatistics 13, 427–439 (2012)
  61. Müller, P., Parmigiani, G., Rice, K.: FDR and Bayesian multiple comparisons rules. In: Proceedings of the Valencia/ISBA 8th World Meeting on Bayesian Statistics. Oxford University Press (2007)
  62. Neuvial, P., Roquain, E.: On false discovery rate thresholding for classification under sparsity. Ann. Stat. 40, 2572–2600 (2012)
  63. Neyman, J., Pearson, E.: On the problem of the most efficient tests of statistical hypotheses. Phil. Trans. R. Soc. Ser. A 231, 289–337 (1933)
  64. Rao, C.R., Wu, Y.: On model selection. In: Lahiri, P. (ed.) Model Selection (IMS Lecture Notes), pp. 1–57. Beachwood, OH (2001)
  65. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)
  66. Scott, J.G., Berger, J.O.: An exploration of aspects of Bayesian multiple testing. J. Stat. Plan. Infer. 136, 2144–2162 (2006)
  67. Seber, G.A.F., Lee, A.J.: Linear Regression Analysis. John Wiley and Sons (2003)
  68. Seeger, P.: A note on a method for the analysis of significance en masse. Technometrics 10, 586–593 (1968)
  69. Shaffer, J.P.: Multiple hypothesis testing. Annu. Rev. Psychol. 46, 561–584 (1995)
  70. Simes, R.J.: An improved Bonferroni procedure for multiple tests of significance. Biometrika 73(3), 751–754 (1986)
  71. Stein, C.: Inadmissibility of the usual estimator for the mean of a multivariate distribution. Proc. Third Berkeley Symp. Math. Stat. Prob. 1, 197–206 (1956)
  72. Storey, J.D.: The positive false discovery rate: a Bayesian interpretation and the q-value. Ann. Stat. 31(6), 2013–2035 (2003)
  73. Storey, J.D.: A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B 64, 479–498 (2002)
  74. Sun, T., Zhang, C.-H.: Scaled sparse linear regression. Biometrika 99(4), 879–898 (2012)
  75. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)
  76. Tibshirani, R., Knight, K.: The covariance inflation criterion for adaptive model selection. J. R. Stat. Soc. Ser. B 61(3), 529–546 (1999)
  77. Westfall, P.H., Young, S.S.: Resampling-Based Multiple Testing. Wiley, New York (1993)
  78. Wettenhall, J.M., Smyth, G.K.: limmaGUI: a graphical user interface for linear modeling of microarray data. Bioinformatics 20(18), 3705–3706 (2004)
  79. Wold, H.: Estimation of principal components and related models by iterative least squares. In: Krishnaiah, P.R. (ed.) Multivariate Analysis, pp. 391–420. Academic Press, New York (1966)
  80. Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B 68(1), 49–67 (2006)
  81. Żak-Szatkowska, M., Bogdan, M.: Modified versions of the Bayesian information criterion for sparse generalized linear models. Comput. Stat. Data Anal. 55, 2908–2924 (2011)
  82. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 67(2), 301–320 (2005)

Copyright information

© Springer-Verlag London 2016

Authors and Affiliations

  • Florian Frommlet (1), corresponding author
  • Małgorzata Bogdan (2)
  • David Ramsey (3)
  1. Center for Medical Statistics, Informatics, and Intelligent Systems, Section for Medical Statistics, Medical University of Vienna, Vienna, Austria
  2. Institute of Mathematics, University of Wrocław, Wrocław, Poland
  3. Department of Operations Research, Wrocław University of Technology, Wrocław, Poland
