pp 1–41

Data science, big data and statistics

  • Pedro Galeano
  • Daniel Peña
Invited Paper


This article analyzes how Big Data is changing the way we learn from observations. We describe the changes in statistical methods in seven areas that have been shaped by the Big Data-rich environment: the emergence of new sources of information; visualization in high dimensions; multiple testing problems; analysis of heterogeneity; automatic model selection; estimation methods for sparse models; and merging network information with statistical models. Next, we compare the statistical approach with those in computer science and machine learning and argue that the convergence of different methodologies for data analysis will be the core of the new field of data science. We then present two examples of Big Data analysis that apply several of the tools discussed earlier, such as using network information or combining different sources of data. The article concludes with some final remarks.


Machine learning, Sparse model selection, Statistical learning, Network analysis, Multivariate data, Time series

Mathematics Subject Classification

62A01 62H99 



The invitation to write this article came from the editor Jesús López-Fidalgo, and we are very grateful to him for his encouragement. The applications presented in this paper were carried out with Federico Liberatore, Lara Quijano-Sánchez and Carlo Sguera, post-docs at the UC3M-BS Institute of Financial Big Data. Iván Blanco and Jose Luis Torrecilla, also post-docs at the Institute, contributed useful discussions. The ideas in this article have been clarified by the comments of Andrés Alonso, Anibal Figueiras, Rosa Lillo, Juan Romo and Rubén Zamar. To all of them, our gratitude.



Copyright information

© Sociedad de Estadística e Investigación Operativa 2019

Authors and Affiliations

  1. Departamento de Estadística and Institute of Financial Big Data, Universidad Carlos III de Madrid, Getafe, Spain
