State of the Science in Big Data Analytics

Big Data-Enabled Nursing

Part of the book series: Health Informatics ((HI))

Abstract

Big data analysis is made feasible by the recent emergence, operational maturity, and convergence of methods and technologies for data capture, representation, and discovery. This chapter describes key methods for tackling hard discovery (analysis and modeling) questions with large datasets. Particular emphasis is placed on answering predictive and causal questions, coping with very high dimensionality, and producing models that generalize well beyond the samples used for discovery. Within these areas the chapter emphasizes exemplary methods, such as regularized and kernel-based methods, causal graphs, and Markov Boundary induction, that have strong theoretical properties as well as strong empirical performance. Other notable developments are also addressed, such as robust protocols for model selection and error estimation, analysis of unstructured data, analysis of multimodal data, network science approaches, deep learning, and active learning. The chapter concludes with a discussion of several open and challenging areas.

Notes

  1. Although “predictive modeling” in everyday usage implies forecasting the future, recent use of the term in the data science literature covers both prospective and retrospective classification and regression.

  2. The sample size is the second major element that determines model error according to statistical machine learning theory. For an introduction to the topic, see the discussion of the “bias-variance” tradeoff in Aliferis et al. (2006).

  3. For simplicity and brevity, we omit a variant of the above (the “soft margin” SVM formulation), which further allows for noisy data and mild nonlinearity in the data.

  4. A Markov Blanket of T is a set of variables in the data such that, once the Markov Blanket is known, all variables outside it are independent of the response T. Because the Markov Boundary, as the minimal Markov Blanket, is what matters in practice, the literature often uses the term “Markov Blanket” where the more precise “Markov Boundary” is meant.
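
The definition in note 4 can be written compactly in symbols. The following is only a sketch: the notation is an assumption introduced here for illustration and is not taken from the chapter. V stands for the set of all variables in the data, M for a candidate set of variables not containing T, and the symbol \perp for (conditional) independence.

```latex
% M \subseteq V \setminus \{T\} is a Markov Blanket of T when every
% variable outside M is independent of T given M:
T \perp \bigl( V \setminus (M \cup \{T\}) \bigr) \mid M

% M is a Markov Boundary of T when it is a Markov Blanket of T
% and no proper subset of M is also a Markov Blanket of T:
\nexists\, M' \subsetneq M \;:\; T \perp \bigl( V \setminus (M' \cup \{T\}) \bigr) \mid M'
```

Read this way, the minimality condition in the second statement is what separates a Markov Boundary from an arbitrary Markov Blanket.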

References

  • Albert R, Jeong H, Barabasi AL. Error and attack tolerance of complex networks. Nature. 2000;406:378–82.

  • Aliferis CF, Tsamardinos I, Statnikov A. HITON: a novel Markov blanket algorithm for optimal variable selection. In: AMIA 2003 annual symposium proceedings; 2003. p. 21–25.

  • Aliferis CF, Statnikov A, Tsamardinos I. Challenges in the analysis of mass-throughput data: a technical commentary from the statistical machine learning perspective. Cancer Inform. 2006;2.

  • Aliferis CF, Statnikov A, Tsamardinos I, Schildcrout JS, Shepherd BE, Harrell Jr FE. Factors influencing the statistical power of complex data analysis protocols for molecular signature development from microarray data. PLoS One. 2009;4(3):e4922. doi:10.1371/journal.pone.0004922.

  • Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD. Local causal and Markov blanket induction for causal discovery and feature selection for classification. Part II: analysis and extensions. J Mach Learn Res. 2010a;11:235–84.

  • Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD. Local causal and Markov blanket induction for causal discovery and feature selection for classification. Part I: algorithms and empirical evaluation. J Mach Learn Res. 2010b;11:171–234.

  • Aphinyanaphongs Y, Tsamardinos I, Statnikov A, Hardin D, Aliferis CF. Text categorization models for retrieval of high quality articles in internal medicine. J Am Med Inform Assoc. 2005;12(2):207–16.

  • Barabasi AL. Scale-free networks: a decade and beyond. Science. 2009;325:412–3. doi:10.1126/science.1173299.

  • Barabasi AL, Oltvai ZN. Network biology: understanding the cell’s functional organization. Nat Rev Genet. 2004;5:101–13.

  • Barrenas F, Chavali S, Holme P, Mobini R, Benson M. Network properties of complex human disease genes identified through genome-wide association studies. PLoS One. 2009;4(11):e8090. doi:10.1371/journal.pone.0008090.

  • Breiman L. Random forests. Mach Learn. 2001;45(1):5–32. doi:10.1023/A:1010933404324.

  • Cheng J, Greiner R. Comparing Bayesian network classifiers. In: Proceedings of the 15th conference on uncertainty in artificial intelligence (UAI); 1999. p. 101–7.

  • Cheng J, Greiner R. Learning Bayesian belief network classifiers: algorithms and system. In: Proceedings of 14th biennial conference of the Canadian society for computational studies of intelligence; 2001.

  • Chickering DM. Optimal structure identification with greedy search. J Mach Learn Res. 2003;3(3):507–54.

  • Cooper GF, Herskovits E. A Bayesian method for the induction of probabilistic networks from data. Mach Learn. 1992;9(4):309–47.

  • Cooper GF, Aliferis CF, Ambrosino R, Aronis J, Buchanan BG, Caruana R, Fine MJ, Glymour C, Gordon G, Hanusa BH. An evaluation of machine-learning methods for predicting pneumonia mortality. Artif Intell Med. 1997;9(2):107–38.

  • Daemen A, et al. A kernel-based integration of genome-wide data for clinical decision support. Genome Med. 2009;1(4):39. doi:10.1186/gm39.

  • Dobbin K, Simon R. Sample size determination in microarray experiments for class comparison and prognostic classification. Biostatistics. 2005;6(1):27–38. doi:10.1093/biostatistics/kxh015.

  • Duda RO, Hart PE, Stork DG. Pattern classification. New York: John Wiley & Sons; 2012.

  • Dupuy A, Simon RM. Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. J Natl Cancer Inst. 2007;99(2):147–57. doi:10.1093/jnci/djk018.

  • Friedman C, Hripcsak G. Natural language processing and its future in medicine. Acad Med. 1999;74(8):890–5.

  • Friedman C, Alderson PO, Austin JH, Cimino JJ, Johnson SB. A general natural-language text processor for clinical radiology. J Am Med Inform Assoc. 1994;1(2):161.

  • Friedman N, Geiger D, Goldszmidt M. Bayesian network classifiers. Mach Learn. 1997;29(2):131–63.

  • Friedman J, Hastie T, Tibshirani R. The elements of statistical learning, vol. 1. Berlin: Springer; 2001.

  • Friedman C, Shagina L, Lussier Y, Hripcsak G. Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc. 2004;11(5):392–402. doi:10.1197/jamia.M1552.

  • Fu LD, Aliferis CF. Using content-based and bibliometric features for machine learning models to predict citation counts in the biomedical literature. Scientometrics. 2010;85(1):257–70. doi:10.1007/s11192-010-0160-5.

  • Genkin A, Lewis DD, Madigan D. Large-scale Bayesian logistic regression for text categorization. Technometrics. 2007;49(3):291–304.

  • Gevaert O, De Smet F, Timmerman D, Moreau Y, De Moor B. Predicting the prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks. Bioinformatics. 2006;22:e184–90. doi:10.1093/bioinformatics/btl230.

  • Granger CW. Investigating causal relations by econometric models and cross-spectral methods. Econometrica. 1969;37(3):424–38.

  • Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res. 2003;3:1157–82.

  • Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Mach Learn. 2002;46(1–3):389–422.

  • Harrell F. Regression modeling strategies: with applications to linear models, logistic and ordinal regression, and survival analysis. New York: Springer; 2015.

  • Heckerman D, Geiger D, Chickering DM. Learning Bayesian networks: the combination of knowledge and statistical data. Mach Learn. 1995;20(3):197–243.

  • Holme P, Kim BJ, Yoon CN, Han SK. Attack vulnerability of complex networks. Phys Rev E. 2002;65:056109.

  • Kohavi R, John GH. Wrappers for feature subset selection. Artif Intell. 1997;97(1):273–324.

  • Koller D, Sahami M. Toward optimal feature selection. In: Proceedings of the international conference on machine learning; 1996.

  • LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44. doi:10.1038/nature14539.

  • Lee S, Kim E, Monsen KA. Public health nurse perceptions of Omaha System data visualization. Int J Med Inform. 2015;84(10):826–34. doi:10.1016/j.ijmedinf.2015.06.010.

  • Margaritis D, Thrun S. Bayesian network induction via local neighborhoods. Adv Neural Inf Process Syst. 1999;12:505–11.

  • Markou M, Singh S. Novelty detection: a review—part 1: statistical approaches. Sig Process. 2003;83(12):2481–97.

  • Meganck S, Leray P, Manderick B. Learning causal Bayesian networks from observations and experiments: a decision theoretic approach. MDAI. 2006;3885:58–69.

  • Mitchell TM. Machine learning, vol. 45. Burr Ridge, IL: McGraw Hill; 1997. p. 995.

  • Monsen KA, Peterson JJ, Mathiason MA, Kim E, Lee S, Chi CL, Pieczkiewicz DS. Data visualization techniques to showcase nursing care quality. Comput Inform Nurs. 2015;33(10):417–26. doi:10.1097/CIN.0000000000000190.

  • Narendra V, Lytkin N, Aliferis C, Statnikov A. A comprehensive assessment of methods for de-novo reverse-engineering of genome-scale regulatory networks. Genomics. 2011;97(1):7–18. doi:10.1016/j.ygeno.2010.10.003.

  • Neapolitan RE. Probabilistic reasoning in expert systems: theory and algorithms. New York: Wiley; 1990.

  • Newman MEJ, Barabasi AL, Watts DJ. The structure and dynamics of networks. Princeton, NJ: Princeton University Press; 2003.

  • Pearl J. Probabilistic reasoning in intelligent systems: networks of plausible inference. San Mateo, CA: Morgan Kaufmann Publishers; 1988.

  • Pearl J. Causality: models, reasoning, and inference. Cambridge, UK: Cambridge University Press; 2000.

  • Pieczkiewicz DS, Finkelstein SM. Evaluating the decision accuracy and speed of clinical data visualizations. J Am Med Inform Assoc. 2010;17(2):178–81.

  • Pieczkiewicz DS, Finkelstein SM, Hertz MI. Design and evaluation of a web-based interactive visualization system for lung transplant home monitoring data. In: AMIA annual symposium proceedings; 2007. p. 598–602.

  • Ray B, Henaff M, Ma S, Efstathiadis E, Peskin ER, Picone M, Poli T, Aliferis CF, Statnikov A. Information content and analysis methods for multi-modal high-throughput biomedical data. Sci Rep. 2014;4. doi:10.1038/srep04411.

  • Schapire RE. The boosting approach to machine learning: an overview. In: Nonlinear estimation and classification. New York: Springer; 2003. p. 149–71.

  • Schmidhuber J. Deep learning in neural networks: an overview. Neural Netw. 2015;61:85–117.

  • Spirtes P, Glymour CN, Scheines R. Causation, prediction, and search, vol. 2. Cambridge, MA: MIT Press; 2000.

  • Statnikov A, Aliferis CF, Hardin DP, Guyon I. A gentle introduction to support vector machines in biomedicine, vol. 1: theory and methods. Singapore: World Scientific; 2011.

  • Statnikov A, Aliferis CF, Hardin DP, Guyon I. A gentle introduction to support vector machines in biomedicine, vol. 2: case studies and benchmarks. Singapore: World Scientific; 2012.

  • Tong S, Koller D. Support vector machine active learning with applications to text classification. J Mach Learn Res. 2002;2:45–66.

  • Tsamardinos I, Aliferis CF. Towards principled feature selection: relevancy, filters and wrappers. In: Proceedings of the ninth international workshop on artificial intelligence and statistics (AI & Stats); 2003.

  • Tsamardinos I, Aliferis CF, Statnikov A. Time and sample efficient discovery of Markov blankets and direct causal relations. In: Proceedings of the ninth international conference on knowledge discovery and data mining (KDD); 2003. p. 673–8.

  • Tsamardinos I, Brown LE, Aliferis CF. The max-min hill-climbing Bayesian network structure learning algorithm. Mach Learn. 2006;65(1):31–78.

  • Vapnik V. The nature of statistical learning theory. New York: Springer Science & Business Media; 2013.

  • Wang L, Zhu J, Zou H. The doubly regularized support vector machine. Stat Sin. 2006;16:589–615.

  • West VL, Borland D, Hammond WE. Innovative information visualization of electronic health record data: a systematic review. J Am Med Inform Assoc. 2015;22(2):330–9. doi:10.1136/amiajnl-2014-002955.

  • Weston J, Elisseeff A, Scholkopf B, Tipping M. Use of the zero-norm with linear models and kernel methods. J Mach Learn Res. 2003;3(7):1439–61.

  • Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Series B (Stat Methodol). 2005;67(2):301–20.

Author information

Corresponding author

Correspondence to C. F. Aliferis, M.D., Ph.D., F.A.C.M.I.

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Aliferis, C.F. (2017). State of the Science in Big Data Analytics. In: Delaney, C., Weaver, C., Warren, J., Clancy, T., Simpson, R. (eds) Big Data-Enabled Nursing. Health Informatics. Springer, Cham. https://doi.org/10.1007/978-3-319-53300-1_14

  • DOI: https://doi.org/10.1007/978-3-319-53300-1_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-53299-8

  • Online ISBN: 978-3-319-53300-1

  • eBook Packages: Medicine, Medicine (R0)
