Abstract
As our information infrastructure evolves, our ability to store, extract, and analyze data is changing rapidly. Big data is a popular term for the large, diverse, complex, and/or longitudinal datasets generated by a variety of instruments, sensors, and/or computer-based transactions. The term refers not only to the size or volume of data, but also to the variety of data and the velocity or speed of data accrual. As the volume, variety, and velocity of data increase, our existing analytical methodologies are stretched to new limits. These changes create new opportunities for researchers in statistical methodology, including those interested in surveillance and statistical process control methods. Although it is well documented that harnessing big data to make better decisions can serve as a basis for innovative solutions in industry, healthcare, and science, these solutions can be found more easily with sound statistical methodologies. In this paper, we discuss several big data applications to highlight the opportunities and challenges for applied statisticians interested in surveillance and statistical process control. Our goal is to bring the research issues into better focus and to encourage methodological developments for big data analysis in these areas.
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Megahed, F.M., Jones-Farmer, L.A. (2015). Statistical Perspectives on “Big Data”. In: Knoth, S., Schmid, W. (eds) Frontiers in Statistical Quality Control 11. Frontiers in Statistical Quality Control. Springer, Cham. https://doi.org/10.1007/978-3-319-12355-4_3
DOI: https://doi.org/10.1007/978-3-319-12355-4_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12354-7
Online ISBN: 978-3-319-12355-4
eBook Packages: Mathematics and Statistics (R0)