Dimension Reduction for High-Dimensional Data

  • Lexin Li
Part of the Methods in Molecular Biology book series (MIMB, volume 620)


With the advance of modern technologies, high-dimensional data have become prevalent in computational biology. The number of variables p is very large, and in many applications p exceeds the number of observational units n. Such high dimensionality and the unconventional small-n-large-p setting pose new challenges to statistical analysis methods. Dimension reduction, which aims to reduce the predictor dimension prior to any modeling effort, offers a potentially useful avenue for tackling such high-dimensional regression problems. In this chapter, we review a number of commonly used dimension reduction approaches, including principal component analysis, partial least squares, and sliced inverse regression. For each method, we review its background and its applications in computational biology, discuss both its advantages and limitations, and offer enough operational details for implementation. A numerical example analyzing a microarray survival data set is given to illustrate applications of the reviewed reduction methods.
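To make the idea of reducing the predictor dimension before modeling concrete, the following is a minimal sketch of principal component reduction in a small-n-large-p setting. It is not taken from the chapter: the function name `pca_reduce`, the simulated data, and the choice of d = 3 components are all assumptions made for illustration. The sketch uses the singular value decomposition of the centered data matrix, which avoids forming the p x p sample covariance matrix.

```python
import numpy as np


def pca_reduce(X, d):
    """Project the rows of X (n x p) onto the first d principal components.

    Hypothetical helper for illustration only.
    """
    # Center each predictor (column) at its sample mean.
    Xc = X - X.mean(axis=0)
    # Thin SVD of the centered data; valid when p > n.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # n x d matrix of reduced predictors (principal component scores).
    scores = U[:, :d] * s[:d]
    # p x d matrix of orthonormal directions (loadings).
    loadings = Vt[:d].T
    return scores, loadings


# Simulated data (an assumption): n = 20 observations, p = 200 predictors.
rng = np.random.default_rng(0)
n, p, d = 20, 200, 3
X = rng.normal(size=(n, p))

scores, loadings = pca_reduce(X, d)
print(scores.shape, loadings.shape)
```

The reduced n x d matrix `scores` can then be passed to a conventional regression model in place of the original p predictors. Note that this reduction ignores the response, whereas partial least squares and sliced inverse regression use the response when constructing the reduced predictors.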


Dimension reduction; partial least squares; principal component analysis; sliced inverse regression



This work was supported in part by National Science Foundation grant DMS 0706919.



Copyright information

© Humana Press, a part of Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Lexin Li, Department of Statistics, North Carolina State University, Raleigh, USA
