
Statistical Methods in High Dimensions

  • Florian Frommlet (corresponding author)
  • Małgorzata Bogdan
  • David Ramsey
Chapter
Part of the Computational Biology book series (COBO, volume 18)

Abstract

This is the core chapter introducing the theory behind the advanced statistical methods applied in the later chapters on QTL mapping and GWAS analysis. More basic statistical methods are covered in the Appendix. Section 3.2 discusses the use of classical multiple testing procedures, such as the Bonferroni correction, as well as approaches based on permutation and resampling, all of which guarantee control of the familywise error rate (FWER). Afterwards, more modern techniques, such as the Benjamini-Hochberg procedure for controlling the false discovery rate (FDR), are discussed, followed by a more advanced theoretical treatment of optimal multiple testing strategies in high dimensions. The second part of the chapter is concerned with model selection. Section 3.3 starts by introducing the basic concepts of likelihood and then recapitulates the derivation of Akaike's information criterion (AIC) from information theoretic principles. This is compared with the use of the Bayesian information criterion (BIC) in the context of Bayesian model selection. It is then explained why both AIC and BIC fail in a high-dimensional setting, and modifications of BIC designed to control either the FWER or the FDR are presented. The chapter ends with a discussion of various further approaches to model selection in high dimensions.
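
To make the two procedures named for Section 3.2 concrete, here is a minimal Python sketch contrasting the Bonferroni correction (FWER control) with the Benjamini-Hochberg step-up procedure (FDR control) on a vector of p-values. The function names and example p-values are illustrative, not taken from the chapter.

```python
import numpy as np

def bonferroni_reject(pvals, alpha=0.05):
    """Reject H_i whenever p_i <= alpha/m; controls the FWER at level alpha."""
    p = np.asarray(pvals)
    return p <= alpha / p.size

def benjamini_hochberg_reject(pvals, q=0.05):
    """Step-up BH procedure: find the largest k with p_(k) <= k*q/m and
    reject the hypotheses with the k smallest p-values (controls the FDR
    at level q for independent test statistics)."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    passed = np.nonzero(p[order] <= np.arange(1, m + 1) * q / m)[0]
    reject = np.zeros(m, dtype=bool)
    if passed.size > 0:
        reject[order[:passed[-1] + 1]] = True
    return reject

pvals = [0.001, 0.008, 0.012, 0.041, 0.27, 0.65]
print(bonferroni_reject(pvals))          # rejects two hypotheses
print(benjamini_hochberg_reject(pvals))  # rejects three; BH is never more conservative
```

For Section 3.3, the sketch below computes AIC and BIC for a Gaussian linear model from its residual sum of squares, together with one modified-BIC-style penalty of the kind developed for sparse high-dimensional regression (see, e.g., [11, 15]). The extra term 2k log(m/c) and the default constant c = 4 are assumptions made for illustration, not the chapter's definitive formula.

```python
import numpy as np

def gaussian_ic(rss, n, k, m=None, c=4.0):
    """AIC/BIC for a Gaussian linear model with k regressors fitted on n
    observations, based on the profiled log-likelihood; the mBIC-style value
    adds a penalty depending on the number m of candidate regressors."""
    loglik = -0.5 * n * (np.log(2.0 * np.pi * rss / n) + 1.0)
    aic = -2.0 * loglik + 2.0 * k
    bic = -2.0 * loglik + k * np.log(n)
    mbic = bic + 2.0 * k * np.log(m / c) if m is not None else None
    return aic, bic, mbic

# Example: a 5-regressor model chosen from m = 10000 candidate regressors.
print(gaussian_ic(rss=420.0, n=500, k=5, m=10_000))
```

The third return value illustrates why plain BIC fails in high dimensions: with thousands of candidate regressors, the additional 2k log(m/c) term is what prevents the criterion from admitting too many spurious variables.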

Keywords

FDR · False Discovery Rate · Bayesian Information Criterion · Bonferroni Procedure · Bayesian Model Selection · Multiple Testing Procedure

References

  1. Abramovich, F., Benjamini, Y., Donoho, D.L., Johnstone, I.M.: Adapting to unknown sparsity by controlling the false discovery rate. Ann. Stat. 34, 584–653 (2006)
  2. Akaike, H.: A new look at the statistical model identification. IEEE Trans. Autom. Control 19(6), 716–723 (1974)
  3. Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Proceedings of the 2nd International Symposium on Information Theory, pp. 267–281 (1973)
  4. Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57, 289–300 (1995)
  5. Benjamini, Y., Hochberg, Y.: On the adaptive control of the false discovery rate in multiple testing with independent statistics. J. Educ. Behav. Stat. 25, 60–83 (2000)
  6. Benjamini, Y., Yekutieli, D.: The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 29(4), 1165–1188 (2001)
  7. Bera, A.K., Bilias, Y.: Rao’s score, Neyman’s \(C(\alpha )\) and Silvey’s LM tests: an essay on historical developments and some new results. J. Stat. Plan. Infer. 97, 9–44 (2001)
  8. Birgé, L., Massart, P.: Gaussian model selection. J. Eur. Math. Soc. (JEMS) 3, 203–268 (2001)
  9. Bogdan, M., Chakrabarti, A., Frommlet, F., Ghosh, J.K.: Asymptotic Bayes-optimality under sparsity of some multiple testing procedures. Ann. Stat. 39, 1551–1579 (2011)
  10. Bogdan, M., Frommlet, F., Szulc, P., Tang, H.: Model selection approach for genome wide association studies in admixed populations. Technical Report (2013)
  11. Bogdan, M., Ghosh, J.K., Doerge, R.W.: Modifying the Schwarz Bayesian information criterion to locate multiple interacting quantitative trait loci. Genetics 167, 989–999 (2004)
  12. Bogdan, M., Ghosh, J.K., Tokdar, S.T.: A comparison of the Simes-Benjamini-Hochberg procedure with some Bayesian rules for multiple testing. In: Balakrishnan, N., Peña, E., Silvapulle, M.J. (eds.) Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen, IMS Collections, vol. 1, pp. 211–230. Beachwood, OH (2008)
  13. Bogdan, M., van den Berg, E., Sabatti, C., Su, W., Candès, E.J.: SLOPE—adaptive variable selection via convex optimization. Ann. Appl. Stat. 9, 1103–1140 (2015)
  14. Bogdan, M., van den Berg, E., Su, W., Candès, E.J.: Statistical estimation and testing via the sorted \(\ell _1\) norm. arXiv:1310.1969 (2013)
  15. Bogdan, M., Żak-Szatkowska, M., Ghosh, J.K.: Selecting explanatory variables with the modified version of the Bayesian information criterion. Qual. Reliab. Eng. Int. 24, 627–641 (2008)
  16. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
  17. Broberg, P.: A comparative review of estimates of the proportion unchanged genes and the false discovery rate. BMC Bioinform. 6, 199 (2005)
  18. Broman, K.W., Speed, T.P.: A model selection approach for the identification of quantitative trait loci in experimental crosses. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 64(4), 641–656 (2002)
  19. Bühlmann, P., van de Geer, S.: Statistics for High-Dimensional Data. Springer, Heidelberg (2011)
  20. Burnham, K.P., Anderson, D.R.: Model Selection and Multimodel Inference, 2nd edn. Springer, New York (2002)
  21. Cai, T., Jin, J.: Optimal rates of convergence for estimating the null and proportion of non-null effects in large-scale multiple testing. Ann. Stat. 38, 100–145 (2010)
  22. Candès, E.J., Plan, Y.: Near-ideal model selection by \(\ell _1\) minimization. Ann. Stat. 37, 2145–2177 (2009)
  23. Chipman, H., George, E.I., McCulloch, R.E.: The practical implementation of Bayesian model selection. In: Lahiri, P. (ed.) Model Selection (IMS Lecture Notes), pp. 65–116. Beachwood, OH (2001)
  24. Chun, H., Keles, S.: Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 72(1), 3–25 (2010)
  25. Churchill, G.A., Doerge, R.W.: Empirical threshold values for quantitative trait mapping. Genetics 138, 963–971 (1994)
  26. De Leeuw, J., Hornik, K., Mair, P.: Isotone optimization in R: pool-adjacent-violators algorithm (PAVA) and active set methods. J. Stat. Softw. 32(5), 1–24 (2009)
  27. Do, K., Müller, P., Tang, F.: A Bayesian mixture model for differential gene expression. Appl. Stat. 54, 627–644 (2005)
  28. Doerge, R.W., Churchill, G.A.: Permutation tests for multiple loci affecting a quantitative character. Genetics 142, 285–294 (1996)
  29. Donoho, D., Tanner, J.: Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing. Phil. Trans. R. Soc. A 367, 4273–4293 (2009)
  30. Dudoit, S., Shaffer, J.P., Boldrick, J.C.: Multiple hypothesis testing in microarray experiments. Stat. Sci. 18, 71–103 (2003)
  31. Dudoit, S., van der Laan, M.J.: Multiple Testing Procedures with Applications to Genomics. Springer, New York (2008)
  32. Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32(2), 407–499 (2004)
  33. Efron, B., Tibshirani, R., Storey, J.D., Tusher, V.: Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc. 96, 1151–1160 (2001)
  34. Efron, B., Tibshirani, R.: Empirical Bayes methods and false discovery rates for microarrays. Genet. Epidemiol. 23, 70–86 (2002)
  35. Efron, B.: Microarrays, empirical Bayes and the two-group model. Stat. Sci. 23(1), 1–22 (2008)
  36. Ferreira, J.A., Zwinderman, A.H.: On the Benjamini-Hochberg method. Ann. Stat. 34(4), 1827–1849 (2006)
  37. Foster, D.P., Stine, R.A.: Local asymptotic coding and the minimum description length. IEEE Trans. Inf. Theor. 45, 1289–1293 (1999)
  38. Frank, I.E., Friedman, J.H.: A statistical view of some chemometrics regression tools. Technometrics 35, 109–148 (1993)
  39. Frommlet, F., Bogdan, M.: Some optimality properties of FDR controlling rules under sparsity. Technical Report (2012)
  40. Frommlet, F., Chakrabarti, A., Murawska, M., Bogdan, M.: Asymptotic Bayes optimality under sparsity for generally distributed effect sizes under the alternative. arXiv:1005.4753 (2011)
  41. Genovese, C., Wasserman, L.: A stochastic process approach to false discovery control. Ann. Stat. 32, 1035–1061 (2004)
  42. Genovese, C., Wasserman, L.: Operating characteristics and extensions of the false discovery rate procedure. J. R. Stat. Soc. Ser. B 64, 499–517 (2002)
  43. George, E.I., Foster, D.P.: Calibration and empirical Bayes variable selection. Biometrika 87, 731–747 (2000)
  44. Ghosh, J.K., Samanta, T.: Model selection—an overview. Curr. Sci. 80, 1135–1144 (2001)
  45. Hochberg, Y., Tamhane, A.C.: Multiple Comparison Procedures. Wiley, New York (1987)
  46. Hochberg, Y.: A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75, 800–803 (1988)
  47. Hoerl, A.E., Kennard, R.W.: Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55–67 (1970)
  48. Holm, S.: A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6, 65–70 (1979)
  49. Hsu, J.C.: Multiple Comparisons: Theory and Methods. Chapman and Hall, New York (1996)
  50. James, W., Stein, C.: Estimation with quadratic loss. Proc. Fourth Berkeley Symp. Math. Stat. Prob. 1, 361–379 (1961)
  51. Jin, J., Cai, T.T.: Estimating the null and the proportion of non-null effects in large-scale multiple comparisons. J. Am. Stat. Assoc. 102, 495–506 (2007)
  52. Johnstone, I.M., Silverman, B.W.: EbayesThresh: R programs for empirical Bayes thresholding. J. Stat. Softw. 12(8) (2005)
  53. Johnstone, I.M., Silverman, B.W.: Needles and straw in haystacks: empirical Bayes estimates of possibly sparse sequences. Ann. Stat. 32, 1594–1649 (2004)
  54. Korn, E.L., Troendle, J.F., McShane, L.M., Simon, R.: Controlling the number of false discoveries: application to high-dimensional genomic data. J. Stat. Plan. Infer. 124(2), 379–398 (2004)
  55. Kullback, S.: Information Theory and Statistics. John Wiley and Sons, New York (1959)
  56. Lehmann, E.L., Romano, J.P.: Generalizations of the familywise error rate. Ann. Stat. 33, 1138–1154 (2005)
  57. Lehmann, E.L., Romano, J.P.: Testing Statistical Hypotheses. Springer, New York (2005)
  58. Lehmann, E.L., D’Abrera, H.J.M.: Nonparametrics: Statistical Methods Based on Ranks. McGraw-Hill, New York (1975)
  59. Marcus, R., Peritz, E., Gabriel, K.R.: On closed testing procedures with special reference to ordered analysis of variance. Biometrika 63, 655–660 (1976)
  60. Martin, R., Tokdar, S.T.: A nonparametric empirical Bayes framework for large-scale multiple testing. Biostatistics 13, 427–439 (2012)
  61. Müller, P., Parmigiani, G., Rice, K.: FDR and Bayesian multiple comparisons rules. In: Proceedings of the Valencia/ISBA 8th World Meeting on Bayesian Statistics. Oxford University Press (2007)
  62. Neuvial, P., Roquain, E.: On false discovery rate thresholding for classification under sparsity. Ann. Stat. 40, 2572–2600 (2012)
  63. Neyman, J., Pearson, E.: On the problem of the most efficient tests of statistical hypotheses. Phil. Trans. R. Soc. Ser. A 231, 289–337 (1933)
  64. Rao, C.R., Wu, Y.: On model selection. In: Lahiri, P. (ed.) Model Selection (IMS Lecture Notes), pp. 1–57. Beachwood, OH (2001)
  65. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)
  66. Scott, J.G., Berger, J.O.: An exploration of aspects of Bayesian multiple testing. J. Stat. Plan. Infer. 136, 2144–2162 (2006)
  67. Seber, G.A.F., Lee, A.J.: Linear Regression Analysis. John Wiley and Sons (2003)
  68. Seeger, P.: A note on a method for the analysis of significance en masse. Technometrics 10, 586–593 (1968)
  69. Shaffer, J.P.: Multiple hypothesis testing. Annu. Rev. Psychol. 46, 561–584 (1995)
  70. Simes, R.J.: An improved Bonferroni procedure for multiple tests of significance. Biometrika 73(3), 751–754 (1986)
  71. Stein, C.: Inadmissibility of the usual estimator for the mean of a multivariate distribution. Proc. Third Berkeley Symp. Math. Stat. Prob. 1, 197–206 (1956)
  72. Storey, J.D.: The positive false discovery rate: a Bayesian interpretation and the q-value. Ann. Stat. 31(6), 2013–2035 (2003)
  73. Storey, J.D.: A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B 64, 479–498 (2002)
  74. Sun, T., Zhang, C.-H.: Scaled sparse linear regression. Biometrika 99(4), 879–898 (2012)
  75. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)
  76. Tibshirani, R., Knight, K.: The covariance inflation criterion for adaptive model selection. J. R. Stat. Soc. Ser. B 61(3), 529–546 (1999)
  77. Westfall, P.H., Young, S.S.: Resampling-Based Multiple Testing. Wiley, New York (1993)
  78. Wettenhall, J.M., Smyth, G.K.: limmaGUI: a graphical user interface for linear modeling of microarray data. Bioinformatics 20(18), 3705–3706 (2004)
  79. Wold, H.: Estimation of principal components and related models by iterative least squares. In: Krishnaiah, P.R. (ed.) Multivariate Analysis, pp. 391–420. Academic Press, New York (1966)
  80. Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B 68(1), 49–67 (2006)
  81. Żak-Szatkowska, M., Bogdan, M.: Modified versions of the Bayesian information criterion for sparse generalized linear models. Comput. Stat. Data Anal. 55, 2908–2924 (2011)
  82. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 67(2), 301–320 (2005)

Copyright information

© Springer-Verlag London 2016

Authors and Affiliations

  • Florian Frommlet (1), corresponding author
  • Małgorzata Bogdan (2)
  • David Ramsey (3)
  1. Center for Medical Statistics, Informatics, and Intelligent Systems, Section for Medical Statistics, Medical University of Vienna, Vienna, Austria
  2. Institute of Mathematics, University of Wrocław, Wrocław, Poland
  3. Department of Operations Research, Wrocław University of Technology, Wrocław, Poland
