pp 1–3

Comments on: Data science, big data and statistics

  • Stefan Van Aelst
  • Ruben H. Zamar

We praise Professors Galeano and Peña for this paper and for sharing their view on the impact of Big Data on Statistics and the emerging field of Data Science. They draw attention to seven main points which are very interesting and relevant, and present two interesting applications. We will focus our discussion on two of these topics, which are most related to our own expertise: heterogeneous data (including data quality and robustness) and automatic model selection.

Heterogeneous data

Modern big data are often the result of administrative or operational data collection, as is the case for the two examples given by the authors. In a recent discussion paper, Professor David Hand (2018a, b) points out that this type of data also exhibits quality issues. Hence, there is a need for robust methods to automatically analyze such data. As pointed out by the authors, the standard concept of statistical efficiency loses its meaning because the data form the whole population. On the other hand,...

Funding was provided by Canadian Network for Research and Innovation in Machining Technology, Natural Sciences and Engineering Research Council of Canada.


  1. Agostinelli C, Leung A, Yohai VJ, Zamar RH (2015) Robust estimation of multivariate location and scatter in the presence of cellwise and casewise contamination (with discussion). TEST 24:441–461
  2. Alqallaf F, Van Aelst S, Yohai VJ, Zamar RH (2009) Propagation of outliers in multivariate data. Ann Stat 37(1):311–331
  3. Christidis A, Lakshmanan LVS, Smucler E, Zamar R (2017) Ensembles of regularized linear models. arXiv:1712.03561
  4. García-Escudero LA, Gordaliza A, San Martín R, Van Aelst S, Zamar RH (2009) Robust linear clustering. J R Stat Soc B 71(1):301–319
  5. García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2008) A general trimming approach to robust cluster analysis. Ann Stat 36:1324–1345
  6. Hand DJ (2018a) Statistical challenges of administrative and transaction data (with discussion). J R Stat Soc A 181:555–605
  7. Hand DJ (2018b) Hand writing: administrative data. IMS Bull 47(6):8–9
  8. Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction. Springer, New York
  9. Khan JA, Van Aelst S, Zamar RH (2007) Building a robust linear model with forward selection and stepwise procedures. Comput Stat Data Anal 52:239–248
  10. Khan JA, Van Aelst S, Zamar RH (2007) Robust linear model selection based on least angle regression. J Am Stat Assoc 102:1289–1299
  11. Leung A, Yohai VJ, Zamar R (2017) Multivariate location and scatter matrix estimation under cellwise and casewise contamination. Comput Stat Data Anal 111:59–76
  12. Öllerer V, Alfons A, Croux C (2016) The shooting S-estimator for robust regression. Comput Stat 31(3):829–844
  13. Rousseeuw P, Van den Bossche W (2018) Detecting deviating data cells. Technometrics 60(2):135–145
  14. Van Aelst S, Vandervieren E, Willems G (2011) Stahel–Donoho estimators with cellwise weights. J Stat Comput Simul 81(1):1–27
  15. Van Aelst S, Vandervieren E, Willems G (2012) A Stahel–Donoho estimator based on Huberized outlyingness. Comput Stat Data Anal 56(3):531–542
  16. Van Aelst S, Wang X, Zamar RH, Zhu R (2006) Linear grouping using orthogonal regression. Comput Stat Data Anal 50:1287–1312
  17. Wang Y, Van Aelst S (2017) Robust variable screening for regression using factor profiling. Stat Anal Data Min (to appear). arXiv:1711.09586

Copyright information

© Sociedad de Estadística e Investigación Operativa 2019

Authors and Affiliations

  1. KU Leuven, Leuven, Belgium
  2. University of British Columbia, Vancouver, Canada