Statistical Perspectives on “Big Data”

  • Chapter
Frontiers in Statistical Quality Control 11

Part of the book series: Frontiers in Statistical Quality Control (FSQC)

Abstract

As our information infrastructure evolves, our ability to store, extract, and analyze data is rapidly changing. Big data is a popular term used to describe the large, diverse, complex and/or longitudinal datasets generated from a variety of instruments, sensors and/or computer-based transactions. The term refers not only to the size or volume of data, but also to the variety of data and the velocity or speed of data accrual. As the volume, variety, and velocity of data increase, our existing analytical methodologies are stretched to new limits. These changes create new opportunities for researchers in statistical methodology, including those interested in surveillance and statistical process control methods. It is well documented that harnessing big data to make better decisions can serve as a basis for innovative solutions in industry, healthcare, and science; such solutions can be found more easily with sound statistical methodologies. In this paper, we discuss several big data applications to highlight the opportunities and challenges for applied statisticians interested in surveillance and statistical process control. Our goal is to bring the research issues into better focus and encourage methodological developments for big data analysis in these areas.



Author information

Corresponding author

Correspondence to Fadel M. Megahed.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Megahed, F.M., Jones-Farmer, L.A. (2015). Statistical Perspectives on “Big Data”. In: Knoth, S., Schmid, W. (eds) Frontiers in Statistical Quality Control 11. Frontiers in Statistical Quality Control. Springer, Cham. https://doi.org/10.1007/978-3-319-12355-4_3
