What Is Data Science?

  • Michael L. Brodie
Chapter

Abstract

Data science, a new discovery paradigm, is potentially one of the most significant advances of the early twenty-first century. Originating in scientific discovery, it is being applied to every human endeavor for which there is adequate data. While remarkable successes have been achieved, even greater claims have been made. Benefits, challenges, and risks abound. The science underlying data science has yet to emerge, and maturity is more than a decade away. This claim is based firstly on observing the centuries-long development of its predecessor paradigms: empirical, theoretical, and Jim Gray’s Fourth Paradigm of Scientific Discovery (Hey et al., The fourth paradigm: data-intensive scientific discovery, Microsoft Research, 2009), also known as eScience, data-intensive, computational, or procedural. It is based secondly on my studies of over 150 data science use cases and several data science-based startups, and on my scientific advisory role for Insight (https://www.insight-centre.org/), a Data Science Research Institute (DSRI), which requires that I understand the opportunities, state of the art, and research challenges for the emerging discipline of data science. This chapter addresses essential questions for a DSRI: What is data science? What is world-class data science research? A companion chapter (Brodie, On Developing Data Science, in Braschler et al. (Eds.), Applied data science – Lessons learned for the data-driven business, Springer 2019) addresses the development of data science applications and of the data science discipline itself.


References

  1. Brodie, M. L. (2014a, June). The first law of data science: Do umbrellas cause rain? KDnuggets.
  2. Brodie, M. L. (2014b, October). Piketty revisited: Improving economics through data science – How data curation can enable more faithful data science (in much less time). KDnuggets.
  3. Brodie, M. L. (2015a, June). Understanding data science: An emerging discipline for data-intensive discovery. In S. Cutt (Ed.), Getting data right: Tackling the challenges of big data volume and variety. Sebastopol, CA: O’Reilly Media.
  4. Brodie, M. L. (2015b, July). Doubt and verify: Data science power tools. KDnuggets. Republished on ODBMS.org.
  5. Brodie, M. L. (2015c, November). On political economy and data science: When a discipline is not enough. KDnuggets. Republished on ODBMS.org, November 20, 2015.
  6. Brodie, M. L. (2018, January 1). Why understanding truth is important in data science? KDnuggets. Republished on Experfy.com, February 16, 2018.
  7. Brodie, M. L. (2019). On developing data science, to appear. In M. Braschler, T. Stadelmann, & K. Stockinger (Eds.), Applied data science – Lessons learned for the data-driven business. Berlin: Springer.
  8. Cambridge Mobile Telematics. (2018, April 2). Distraction 2018: Data from over 65 million trips shows that distracted driving is increasing.
  9. Castanedo, F. (2015, August). Data preparation in the big data era: Best practices for data integration. Boston: O’Reilly.
  10. Dasu, T., & Johnson, T. (2003). Exploratory data mining and cleaning. Hoboken, NJ: Wiley-IEEE.
  11. Data Science. (2018). Opportunities to transform chemical sciences and engineering. A Chemical Sciences Roundtable Workshop, National Academies of Science, February 27–28, 2018.
  12. Demirkan, H., & Dal, B. (2014, July/August). The data economy: Why do so many analytics projects fail? Analytics Magazine.
  13. Dietterich, T. G. (2000, June). Ensemble methods in machine learning. In International workshop on multiple classifier systems (pp. 1–15). Berlin: Springer.
  14. Dingus, T. A., et al. (2016). Driver crash risk factors and prevalence evaluation using naturalistic driving data. Proceedings of the National Academy of Sciences, 113(10), 2636–2641. https://doi.org/10.1073/pnas.1513271113
  15. Economist. (2017a, April 12). How Germany’s Otto uses artificial intelligence. The Economist.
  16. Economist. (2017b, May 4). The World’s most valuable resource. The Economist.
  17. Economist. (2018a, January 6). Many happy returns: New data reveal long-term investment trends. The Economist.
  18. Economist. (2018b, February 24). Economists cannot avoid making value judgments: Lessons from the “repugnant” market for organs. The Economist.
  19. Economist. (2018c, March 28). In algorithms we trust: How AI is spreading throughout the supply chain. The Economist.
  20. Eriksson, J., Girod, L., Hull, B., Newton, R., Madden, S., & Balakrishnan, H. (2008). The pothole patrol: Using a mobile sensor network for road surface monitoring. In Proceedings of the 6th International Conference on Mobile Systems, Applications, and Services (MobiSys ’08). ACM, New York, NY.
  21. Forrester. (2015, November 9). Predictions 2016: The path from data to action for marketers: How marketers will elevate systems of insight. Forrester Research.
  22. Forrester. (2017, March 7). The Forrester wave: Predictive analytics and machine learning solutions, Q1 2017.
  23. Gartner G00301536. (2017, February 14). 2017 magic quadrant for data science platforms.
  24. Gartner G00310700. (2016, September 19). Survey analysis: Big data investments begin tapering in 2016. Gartner.
  25. Gartner G00315888. (2017, December 14). Market guide for data preparation. Gartner.
  26. Gartner G00326671. (2017, June 7). Critical capabilities for data science platforms. Gartner.
  27. Hey, T., Tansley, S., & Tolle, K. (Eds.). (2009). The fourth paradigm: Data-intensive scientific discovery. Microsoft Research.
  28. Jenkins, J. M., Caldwell, D. A., Chandrasekaran, H., Twicken, J. D., Bryson, S. T., Quintana, E. V., et al. (2010). Overview of the Kepler science processing pipeline. The Astrophysical Journal Letters, 713(2), L87.
  29. Liu, J. T. (2012). Shadow theory, data model design for data integration. CoRR, 1209, 2012. arXiv:1209.2647.
  30. Lohr, S. (2014, August 17). For big-data scientists, ‘Janitor Work’ is key hurdle to insights. New York Times.
  31. Lohr, S., & Singer, N. (2016). How data failed us in calling an election. The New York Times, 10, 2016.
  32. Mayo, M. (2017, May 31). Data preparation tips, tricks, and tools: An interview with the insiders. KDnuggets.
  33. Nagarajan, M., et al. (2015). Predicting future scientific discoveries based on a networked analysis of the past literature. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’15). ACM, New York, NY, pp. 2019–2028.
  34. NSF. (2016, December). Realizing the potential of data science. Final Report from the National Science Foundation Computer and Information Science and Engineering Advisory Committee Data Science Working Group.
  35. Pearl, J. (2009a). Causality: Models, reasoning, and inference. New York: Cambridge University Press.
  36. Pearl, J. (2009b). Epilogue: The art and science of cause and effect. In J. Pearl (Ed.), Causality: Models, reasoning, and inference (pp. 401–428). New York: Cambridge University Press.
  37. Pearl, J. (2009c). Causal inference in statistics: An overview. Statistics Surveys, 3, 96–146.
  38. Piketty, T. (2014). Capital in the 21st century. Cambridge: The Belknap Press.
  39. Press, G. (2016, May 23). Cleaning big data: Most time-consuming, least enjoyable data science task, survey says. Forbes.
  40. Reimsbach-Kounatze, C. (2015). The proliferation of “big data” and implications for official statistics and statistical agencies: A preliminary analysis. OECD Digital Economy Papers, No. 245, OECD Publishing, Paris. https://doi.org/10.1787/5js7t9wqzvg8-en
  41. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., et al. (2017). Mastering chess and Shogi by self-play with a general reinforcement learning algorithm. ArXiv E-Prints, cs.AI.
  42. Singh, G., et al. (2007). Optimizing workflow data footprint. Special issue of the Scientific Programming Journal dedicated to dynamic computational workflows: Discovery, optimisation and scheduling.
  43. Spangler, S., et al. (2014). Automated hypothesis generation based on mining scientific literature. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’14). ACM, New York, NY, pp. 1877–1886.
  44. Stoica, I., et al. (2017, October 16). A Berkeley view of systems challenges for AI. Technical Report No. UCB/EECS-2017-159.
  45. Thakur, A. (2016, July 21). Approaching (almost) any machine learning problem. The Official Blog of Kaggle.com.
  46. Veeramachaneni, K. (2016, December 7). Why you’re not getting value from your data science. Harvard Business Review.
  47. Waller, M. A., & Fawcett, S. E. (2013). Data science, predictive analytics, and big data: A revolution that will transform supply chain design and management. Journal of Business Logistics, 34, 77–84. https://doi.org/10.1111/jbl.12010
  48. Winship, C., & Morgan, S. L. (1999). The estimation of causal effects from observational data. Annual Review of Sociology, 25(1), 659–706. https://doi.org/10.1146/annurev.soc.25.1.659

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, USA
