Big Data, Real-World Data, and Machine Learning

  • Chapter
Statistical Methods in Biomarker and Early Clinical Development

Abstract

Complex human diseases result from the cumulative effect of multiple genomic components and environmental factors. Any individual marker therefore has limited value for diagnosing a complex polygenic disease or guiding efficient treatment. Large-scale studies of gene expression have a much better chance of capturing the disease signal. Meanwhile, sequencing technology is advancing rapidly, enabling us to evaluate millions of genomic features simultaneously. Combined with clinical, demographic, proteomic, and imaging data, each patient provides an unprecedented amount of information on a meta-omics level. Machine learning is becoming key to mining this big data efficiently and providing each patient with the most effective personalized care.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Lu, J., Hao, Y., Huang, J., Kim, S.Y. (2019). Big Data, Real-World Data, and Machine Learning. In: Fang, L., Su, C. (eds) Statistical Methods in Biomarker and Early Clinical Development. Springer, Cham. https://doi.org/10.1007/978-3-030-31503-0_9
