A Machine Learning Perspective on Big Data Analysis

  • Nathalie Japkowicz
  • Jerzy Stefanowski
Part of the Studies in Big Data book series (SBD, volume 16)


This chapter surveys the field of Big Data analysis from a machine learning perspective. In particular, it contrasts Big Data analysis with data mining, which is itself based on machine learning, reviews the achievements of Big Data analysis, and discusses its impact on science and society. The chapter concludes with a summary of the book’s contributing chapters, divided into problem-centric and domain-centric essays.


Keywords (machine-generated, not supplied by the authors): Link Prediction, Concept Drift, Hadoop Distributed File System, Graph Mining, Data Stream Mining



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. School of Electrical Engineering & Computer Science, University of Ottawa, Ottawa, Canada
  2. Institute of Computing Sciences, Poznań University of Technology, Poznań, Poland
