Emergence of Statistical Methodologies with the Rise of BIG Data

  • Nedret Billor
  • Asuman S. Turkmen
Part of the Women in Engineering and Science book series (WES)


Due to the acceleration of electronic computation and the generation of "BIG" datasets at an unprecedented pace in many fields, there have been great advancements in the development of statistical/machine learning methodologies as we enter the twenty-first century. The ability to analyze such complex, high dimensional datasets yields valuable information that deepens understanding, improves decision making, and enhances the performance of predictive models. For instance, current problems encountered in the manufacturing industry, such as quality improvement initiatives, determination of user expectations for a new product, and manufacturing cost estimation, have become more difficult to solve as high dimensional data with complex structure have become available. To overcome some of today's challenges in complex manufacturing systems, statistical/machine learning techniques have been utilized and have proven remarkably helpful for the problems arising in this field. We have three objectives in this chapter. The first is to highlight the developments in new statistical/machine learning algorithms, including, but not limited to, deep learning, random forests, support vector machines, dimension reduction, and sparse modeling. The second is to present how these enormously ambitious data-driven algorithms play an important role in analyzing datasets generated continuously, every second, across scientific disciplines such as engineering, biology, neuroscience, and chemistry. As the complexity and size of generated data have increased over the past two decades, the invention of data-driven algorithms has flourished and has become more prominent than their inferential justifications, which are an important aspect of a statistical analysis. Therefore, the third objective of the chapter is to explain how inferential analyses, the theories by which statisticians choose among competing methods, evolve in the twenty-first century as the invention of these new algorithms continues at a fast pace.
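As an illustrative sketch (not taken from the chapter itself), sparse modeling, one of the techniques the chapter highlights, can be demonstrated with the lasso. The standard cyclic coordinate descent solver with soft-thresholding fits a linear model while driving most coefficients exactly to zero; the synthetic "manufacturing-style" data below, with only 3 informative features out of 20, is purely hypothetical:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent with soft-thresholding.

    Minimizes (1/2n) * ||y - X b||^2 + lam * ||b||_1.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n  # per-feature curvature terms
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with feature j's current contribution removed
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r / n
            # Soft-threshold: small correlations are set exactly to zero
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta

# Synthetic example: 20 candidate predictors, only 3 truly relevant
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[:3] = [2.0, -1.5, 1.0]
y = X @ true_beta + 0.1 * rng.standard_normal(n)

beta_hat = lasso_cd(X, y, lam=0.1)
print("nonzero coefficients:", np.flatnonzero(np.abs(beta_hat) > 1e-6))
```

Note that the fitted nonzero coefficients are shrunk toward zero by roughly the penalty amount; quantifying the uncertainty of such penalized, data-driven estimates is exactly the kind of post-selection inference question the chapter's third objective addresses.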



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Department of Mathematics and Statistics, Auburn University, Auburn, USA
  2. Department of Statistics, The Ohio State University, Columbus, USA
