
The Effect of Data Contamination in Sliced Inverse Regression and Finite Sample Breakdown Point


Abstract

Dimension reduction procedures have received increasing attention over the past decades. Despite this attention, the effect of data contamination or outlying data points on dimension reduction is not well understood, an issue compounded by the difficulty of identifying outliers in the presence of many variables. This paper formally investigates the influence of data contamination on sliced inverse regression (SIR), a prototypical dimension reduction procedure that targets the lower-dimensional subspace of a set of regressors needed to explain a response variable. We establish a general theory for how estimated reduction subspaces can be distorted through both the number and direction of outlying data points. The results depend critically on the regressor covariance structure, and the most harmful types of data contamination are shown to differ according to whether this covariance structure is known or must be estimated. For example, if the covariance structure is estimated, data contamination is proven to produce an estimated subspace that is automatically orthogonal to the directions of the outlying data points, constituting a potentially serious loss of information. Our main results quantify the degree to which data contamination causes incorrect dimension reduction, depending on the amount, magnitude, and direction of contamination. Further, by metricizing distances between dimension reduction subspaces, worst-case results for data contamination can be formulated to define a finite sample breakdown point for SIR as a measure of global robustness. Our theoretical findings are illustrated through simulation.
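To make these mechanisms concrete, the sketch below implements standard SIR estimation (slice the response, average the standardized regressors within slices, and extract the leading eigenvectors of the between-slice covariance of those means) and compares clean and contaminated fits using a projection-based distance between subspaces. This is a minimal illustration, not the paper's simulation design: the single-index model, sample sizes, contamination placement, and all function names are assumptions chosen for exposition.

```python
# Minimal sketch of SIR under data contamination (illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)

def sir_directions(X, y, n_slices=10, d=1):
    """Estimate d SIR directions (columns) from regressors X and response y."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    # Standardize with the *estimated* covariance: Sigma = L L^T, Z = Xc L^{-T}.
    L = np.linalg.cholesky(np.cov(X, rowvar=False))
    A = np.linalg.inv(L).T
    Z = Xc @ A
    # Partition observations into slices of roughly equal size by sorted y.
    slices = np.array_split(np.argsort(y), n_slices)
    # Weighted covariance matrix of the within-slice means of Z.
    M = np.zeros((p, p))
    for idx in slices:
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    # Leading eigenvectors of M span the estimated subspace (standardized scale).
    _, vecs = np.linalg.eigh(M)            # eigenvalues in ascending order
    eta = vecs[:, -d:]
    beta = A @ eta                         # map back to the original X scale
    return beta / np.linalg.norm(beta, axis=0)

def subspace_distance(B1, B2):
    """Frobenius distance between projections onto the column spans (1 = orthogonal for d = 1)."""
    Q1, _ = np.linalg.qr(B1)
    Q2, _ = np.linalg.qr(B2)
    return np.linalg.norm(Q1 @ Q1.T - Q2 @ Q2.T) / np.sqrt(2)

# Toy single-index model: y depends on X only through the direction e_1.
n, p = 200, 5
beta_true = np.zeros((p, 1)); beta_true[0, 0] = 1.0
X = rng.standard_normal((n, p))
y = (X @ beta_true).ravel() ** 3 + 0.1 * rng.standard_normal(n)

B_clean = sir_directions(X, y)
# Contaminate a few rows with large values along the *true* direction e_1,
# so the orthogonality effect under an estimated covariance destroys the signal.
X_cont = X.copy()
X_cont[:5] = 0.0
X_cont[:5, 0] = 1e3
B_cont = sir_directions(X_cont, y)

print("distance(clean estimate, truth):       ", subspace_distance(B_clean, beta_true))
print("distance(contaminated estimate, truth):", subspace_distance(B_cont, beta_true))
print("component of contaminated estimate along e_1:", abs(B_cont[0, 0]))
```

Because the covariance matrix is estimated from the contaminated sample, large outliers along the true direction inflate the estimated variance in exactly that direction; the standardization step then deflates it, pushing the estimated subspace away from, and in the limit orthogonal to, the contamination direction. This is the loss-of-information phenomenon described above. The finite sample breakdown point can then be read, under the usual replacement-contamination convention (an assumption about the paper's exact formalization), as the smallest fraction of replaced observations that suffices to move the estimated subspace maximally far in a chosen subspace metric \(\rho\):

\[
\varepsilon_n^{*} \;=\; \min\Big\{\tfrac{m}{n} \,:\, \sup_{\mathbf{X}_m} \rho\big(\hat{\mathcal{S}}(\mathbf{X}), \hat{\mathcal{S}}(\mathbf{X}_m)\big) = \rho_{\max}\Big\},
\]

where \(\mathbf{X}_m\) ranges over samples obtained from \(\mathbf{X}\) by replacing \(m\) observations arbitrarily and \(\rho_{\max}\) is the diameter of the metric.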



Acknowledgements

I am grateful to two anonymous reviewers for thoughtful comments that improved the manuscript. I would also like to thank Prof. Ursula Gather for suggesting and supporting this research direction.

Author information


Corresponding author

Correspondence to Ulrike Genschel.

Electronic supplementary material

Supplementary material for this article is available online (PDF, 344 KB).


About this article


Cite this article

Genschel, U. The Effect of Data Contamination in Sliced Inverse Regression and Finite Sample Breakdown Point. Sankhya A 80, 28–58 (2018). https://doi.org/10.1007/s13171-017-0102-x


