Data science vs. statistics: two cultures?

Abstract

Data science is the business of learning from data, which is traditionally the business of statistics. Data science, however, is often understood as a broader, task-driven and computationally oriented version of statistics. Both the term data science and the broader idea it conveys have origins in statistics and are a reaction to a narrower view of data analysis. Expanding upon the views of a number of statisticians, this paper encourages a big-tent view of data analysis. We examine how evolving approaches to modern data analysis (e.g. exploratory analysis, machine learning, reproducibility, computation, communication and the role of theory) relate to the existing discipline of statistics. Finally, we discuss what these trends mean for the future of statistics by highlighting promising directions for communication, education and research.

Notes

  1. 1.

    https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century.

  2. 2.

    https://www.wired.com/2008/06/pb-theory/.

  3. 3.

    https://simplystatistics.org/2015/10/29/the-statistics-identity-crisis-am-i-really-a-data-scientist/.

  4. 4.

    https://twitter.com/cdixon/status/428914681911070720.

  5. 5.

    https://simplystatistics.org/.

  6. 6.

    http://andrewgelman.com/.

  7. 7.

    https://normaldeviate.wordpress.com/.

  8. 8.

    https://idc9.github.io/stor390/.

  9. 9.

    This idea is usually communicated through a Venn diagram, e.g. http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram.

  10. 10.

    https://simplystatistics.org/2014/07/25/academic-statisticians-there-is-no-shame-in-developing-statistical-solutions-that-solve-just-one-problem/.

  11. 11.

    http://www.tandfonline.com/doi/pdf/10.1080/10618600.2017.1384734.

  12. 12.

    This includes both databases and mathematical representations of data.

  13. 13.

    For example, reproducible research would fall under this category and point 5.

  14. 14.

    We take the position that data science is the practice of broader statistics.

  15. 15.

    “A data scientist is a statistician who lives in San Francisco” (Bhardwaj 2017).

  16. 16.

    http://andrewgelman.com/2013/11/14/statistics-least-important-part-data-science/.

  17. 17.

    We do not claim this list is exhaustive.

  18. 18.

    Peter Naur uses the term “data science” but in a narrower sense, focusing more on computation.

  19. 19.

    http://bulletin.imstat.org/2014/10/ims-presidential-address-let-us-own-data-science/ and https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/.

  20. 20.

    https://priceonomics.com/whats-the-difference-between-data-science-and/.

  21. 21.

    e.g. see https://priceonomics.com/hadley-wickham-the-man-who-revolutionized-r/ and the quote about “The fact that data science exists as a field is a colossal failure of statistics”.

  22. 22.

    http://stat-computing.org/computing/.

  23. 23.

    http://stat-graphics.org/graphics/.

  24. 24.

    https://www.nsf.gov/funding/pgmsumm.jsp/pimsid=505347.

  25. 25.

    The literature is not consistent about the definitions of reproducibility and replicability. In this paper we use the definitions given here.

  26. 26.

    Writing code that continues to work over time is non-trivial; it involves maintaining the same computing environment and managing dependencies correctly. For example, the software packages the code uses change over time, so version 1.1.1 might not work the same as version 2.1.1.

  27. 27.

    Understanding the nitty-gritty details of how statistical software works is not trivial: how does the optimization routine determine convergence? Are the data mean centered by default? There is a lack of uniformity in how statistical software is written; we believe this is exacerbated by the lack of statisticians writing statistical software.

  28. 28.

    Even if the code for a study is available, someone may still want to rewrite the code, say in another language. In this case, having the original code available to base the new code on is helpful.

  29. 29.

    Publishing messy code is still beneficial and certainly better than not publishing code (Barnes 2010).

  30. 30.

    The use of “shallow” here means that we can view a generalized linear model as a neural network with 0 layers. The more layers a network has, the more complex the patterns it can find (Goodfellow et al. 2016).

  31. 31.

    i.e. the output of a predictive model may be interesting insofar as it helps us do something.

  32. 32.

    In other words, in many cases understanding is primarily a means to an end for predictive problems, and vice versa.

  33. 33.

    From Talking Machines season 3, episode 5. https://www.thetalkingmachines.com/.

  34. 34.

    This statement probably applies to non-quantitative fields. For example, some academics in comparative literature are more “empirical” in the sense that they examine a particular body of work, draw conclusions and possibly generalize/relate their conclusions to other bodies of work. Other people in comparative literature apply “theoretical methods”.

  35. 35.

    https://simplystatistics.org/2014/07/25/academic-statisticians-there-is-no-shame-in-developing-statistical-solutions-that-solve-just-one-problem/.

  36. 36.

    https://simplystatistics.org/2014/03/20/the-8020-rule-of-statistical-methods-development/.

  37. 37.

    e.g. L2 regularized (ridge) linear regression has a closed form while L1 regularization (LASSO) does not.

  38. 38.

    For example, if Google’s algorithms mistakenly think someone is dead, then likely the rest of the world will too; see https://www.nytimes.com/2017/12/16/business/google-thinks-im-dead.html.

  39. 39.

    Our argument is that computation can help communication. Others have taken this idea further and use computation, specifically information theory, as a metaphor for communication, e.g. Doumont (2009).

  40. 40.

    For example, it is often suggested that code comments should describe why the code was written the way it was, not what the code is doing. For data analysis, where the target audience is probably less experienced with programming, describing the what may also be useful.

  41. 41.

    For more information and examples see http://rmarkdown.rstudio.com/.

  42. 42.

    e.g. see the list of people discussed in https://simplystatistics.org/2015/12/11/instead-of-research-on-reproducibility-just-do-reproducible-research/.

  43. 43.

    The complexity and time costs of making research reproducible are, in part, a technical issue.

  44. 44.

    https://simplystatistics.org/2017/06/13/the-future-of-education-is-plain-text/.

  45. 45.

    e.g. see each of the notes from https://idc9.github.io/stor390/.

  46. 46.

    e.g. see https://github.com/scipy/scipy/blob/master/HACKING.rst.txt.

  47. 47.

    http://mooc.org/.

  48. 48.

    https://simplystatistics.org/2017/02/01/reproducible-research-limits/.

  49. 49.

    https://www.coursera.org/specializations/jhu-data-science.

  50. 50.

    http://data8.org/.

  51. 51.

    e.g. including a lecture on communication in an undergraduate data science course: https://idc9.github.io/stor390/notes/communication/communication.html.

  52. 52.

    https://simplystatistics.org/2017/12/20/thoughts-on-david-donoho-s-fifty-years-of-data-science/.

  53. 53.

    e.g. in an upper level undergraduate course such as UNC’s STOR 455: Statistical Methods I.

  54. 54.

    See for example https://www.inferentialthinking.com/chapters/13/prediction.html.

  55. 55.

    e.g. see the order of the chapters in the textbook: https://www.inferentialthinking.com/.

  56. 56.

    http://community.amstat.org/cmis/home.

  57. 57.

    http://wimlworkshop.org/.

  58. 58.

    http://ww2.amstat.org/meetings/wsds/2018/.
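As a rough illustration of note 37, the sketch below computes the ridge estimate from its closed form, (X'X + λI)⁻¹X'y, in plain Python. The function names and the toy data are made up for this example and do not come from the paper. The LASSO has no analogous general formula, so in practice it is fitted with an iterative solver, which is not shown here.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not mutated.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_closed_form(X, y, lam):
    """Minimize ||y - X b||^2 + lam * ||b||^2 via (X'X + lam I) b = X'y."""
    p = len(X[0])
    gram = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve_linear_system(gram, xty)


def ridge_gradient(X, y, lam, beta):
    """Gradient of the ridge objective: 2 X'(X b - y) + 2 lam b."""
    resid = [sum(xij * bj for xij, bj in zip(row, beta)) - yi
             for row, yi in zip(X, y)]
    return [2 * sum(row[j] * r for row, r in zip(X, resid)) + 2 * lam * beta[j]
            for j in range(len(beta))]


# Hypothetical toy data: y equals x1 + x2 exactly.
X = [[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]]
y = [3.0, 2.5, 4.5, 7.0]

beta_ols = ridge_closed_form(X, y, lam=0.0)    # no penalty: least squares
beta_ridge = ridge_closed_form(X, y, lam=1.0)  # penalized: shrunk towards zero
```

One way to sanity-check the result is to confirm that `ridge_gradient(X, y, 1.0, beta_ridge)` is numerically zero, since the closed form is exactly the point where the gradient of the objective vanishes.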

References

  1. Alivisatos, P. (2017). STEM and computer science education: Preparing the 21st century workforce. Research and Technology Subcommittee House Committee on Science, Space, and Technology.

  2. Anderson, C. (2008). The end of theory: The data deluge makes the scientific method obsolete. Wired Magazine, 16(7), 16-07.

  3. Aravkin, A., & Davis, D. (2016). A smart stochastic algorithm for nonconvex optimization with applications to robust machine learning. arXiv preprint arXiv:1610.01101.

  4. American Statistical Association. (2014). Curriculum guidelines for undergraduate programs in statistical science. Retrieved March 3, 2009, from http://www.amstat.org/education/curriculumguidelines.cfm.

  5. Barnes, N. (2010). Publish your computer code: It is good enough. Nature News, 467(7317), 753–753.

  6. Barocas, S., Boyd, D., Friedler, S., & Wallach, H. (2017). Social and technical trade-offs in data science.

  7. Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798–1828.

  8. Bhardwaj, A. (2017). What is the difference between data science and statistics? https://priceonomics.com/whats-the-difference-between-data-science-and/.

  9. Blei, D. M., & Smyth, P. (2017). Science and data science. Proceedings of the National Academy of Sciences, 114(33), 8689–8692.

  10. Bolukbasi, T., Chang, K. W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In: Advances in Neural Information Processing Systems (pp. 4349–4357).

  11. Bottou, L., Curtis, F. E., & Nocedal, J. (2016). Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838.

  12. Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199–231.

  13. Buckheit, J. B., & Donoho, D. L. (1995). Wavelab and reproducible research. In: Wavelets and statistics (pp. 55–81), Springer.

  14. Bühlmann, P., & van de Geer, S. (2018). Statistics for big data: A perspective. Statistics and Probability Letters.

  15. Bühlmann, P., & Meinshausen, N. (2016). Magging: maximin aggregation for inhomogeneous large-scale data. Proceedings of the IEEE, 104(1), 126–135.

  16. Bühlmann, P., & Stuart, A. M. (2016). Mathematics, statistics and data science. EMS Newsletter, 100, 28–30.

  17. Chambers, J. M. (1993). Greater or lesser statistics: A choice for future research. Statistics and Computing, 3(4), 182–184.

  18. Cleveland, W. S. (2001). Data science: an action plan for expanding the technical areas of the field of statistics. International Statistical Review, 69(1), 21–26.

  19. Conway, D. (2010). The data science Venn diagram.

  20. Crawford, K. (2017). The trouble with bias. Conference on Neural Information Processing Systems, invited speaker.

  21. De Veaux, R. D., Agarwal, M., Averett, M., Baumer, B. S., Bray, A., Bressoud, T. C., et al. (2017). Curriculum guidelines for undergraduate programs in data science. Annual Review of Statistics and Its Application, 4, 15–30.

  22. Donoho, D. (2017). 50 years of data science. Journal of Computational and Graphical Statistics, 26(4), 745–766.

  23. Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.

  24. Efron, B., & Hastie, T. (2016). Computer age statistical inference (Vol. 5). Cambridge: Cambridge University Press.

  25. Eick, S. G., Graves, T. L., Karr, A. F., Marron, J., & Mockus, A. (2001). Does code decay? Assessing the evidence from change management data. IEEE Transactions on Software Engineering, 27(1), 1–12.

  26. Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in knowledge discovery and data mining (Vol. 21). Menlo Park: AAAI press.

  27. Felder, R. M., & Brent, R. (2016). Teaching and learning STEM: A practical guide. Hoboken: Wiley.

  28. Freitas, A. A. (2014). Comprehensible classification models: A position paper. ACM SIGKDD Explorations Newsletter, 15(1), 1–10.

  29. Gentleman, R., Carey, V., Huber, W., Irizarry, R., & Dudoit, S. (2006). Bioinformatics and computational biology solutions using R and Bioconductor. Berlin: Springer.

  30. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Book in preparation for MIT Press. http://www.deeplearningbook.org.

  31. Graves, T. L., Karr, A. F., Marron, J., & Siy, H. (2000). Predicting fault incidence using software change history. IEEE Transactions on Software Engineering, 26(7), 653–661.

  32. Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (2011). Robust statistics: the approach based on influence functions (Vol. 114). Hoboken: Wiley.

  33. Hand, D. J. (2006). Classifier technology and the illusion of progress. Statistical Science, 21(1), 1–14.

  34. Hardin, J., Hoerl, R., Horton, N. J., Nolan, D., Baumer, B., Hall-Holt, O., et al. (2015). Data science in statistics curricula: Preparing students to “think with data”. The American Statistician, 69(4), 343–353.

  35. Hicks, S. C., & Irizarry, R. A. (2017). A guide to teaching data science. The American Statistician (just-accepted).

  36. Hooker, G., & Hooker, C. (2017). Machine learning and the future of realism. arXiv preprint arXiv:1704.04688.

  37. Huber, P. J. (2011). Robust statistics. In: International Encyclopedia of Statistical Science (pp. 1248–1251). Springer.

  38. Doumont, J. L. (2009). Trees, maps, and theorems. Brussels: Principiae.

  39. Kiar, G., Bridgeford, E., Chandrashekhar, V., Mhembere, D., Burns, R., Roncal, W. G., et al. (2017). A comprehensive cloud framework for accurate and reliable human connectome estimation and meganalysis. bioRxiv, 188706.

  40. Knuth, D. E. (1984). Literate programming. The Computer Journal, 27(2), 97–111.

  41. Kross, S., Peng, R. D., Caffo, B. S., Gooding, I., & Leek, J. T. (2017). The democratization of data science education. Peer J (PrePrints).

  42. Leek, J. T., & Peng, R. D. (2015). Opinion: Reproducible research can still be wrong: Adopting a prevention approach. Proceedings of the National Academy of Sciences, 112(6), 1645–1646.

  43. Lipton, Z. C. (2016). The mythos of model interpretability. arXiv preprint arXiv:1606.03490.

  44. Lu, X., Marron, J., & Haaland, P. (2014). Object-oriented data analysis of cell images. Journal of the American Statistical Association, 109(506), 548–559.

  45. Maronna, R., Martin, R. D., & Yohai, V. (2006). Robust statistics (Vol. 1). Chichester: Wiley.

  46. Marron, J. (1999). Effective writing in mathematical statistics. Statistica Neerlandica, 53(1), 68–75.

  47. Marron, J. (2017). Big data in context and robustness against heterogeneity. Econometrics and Statistics, 2, 73–80.

  48. Marron, J., & Alonso, A. M. (2014). Overview of object oriented data analysis. Biometrical Journal, 56(5), 732–753.

  49. R Project Members. (2017). The R project for statistical computing. https://www.r-project.org/.

  50. Naur, P. (1974). Concise survey of computer methods.

  51. Cancer Genome Atlas Network. (2012). Comprehensive molecular characterization of human colon and rectal cancer. Nature, 487(7407), 330–337.

  52. Nolan, D., & Temple Lang, D. (2010). Computing in the statistics curricula. The American Statistician, 64(2), 97–107.

  53. O’Neil, C. (2017). Weapons of math destruction: How big data increases inequality and threatens democracy. Broadway Books.

  54. Patil, D. (2011). Building data science teams. O’Reilly Media, Inc.

  55. Patil, P., Peng, R. D., & Leek, J. (2016). A statistical definition for reproducibility and replicability. bioRxiv, 066803.

  56. Peng, R. D. (2011). Reproducible research in computational science. Science, 334(6060), 1226–1227.

  57. Perez, F., & Granger, B. E. (2015). Project Jupyter: Computational narratives as the engine of collaborative data science. Technical report, Project Jupyter.

  58. Pizer, S. M., & Marron, J. (2017). Object statistics on curved manifolds. In Statistical Shape and Deformation Analysis: Methods, Implementation and Applications (p. 137).

  59. Reid, N. (2018). Statistical science in the world of big data. Statistics and Probability Letters.

  60. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). Why should I trust you? Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135–1144). ACM.

  61. Russell, S., & Norvig, P. (2009). Artificial intelligence: A modern approach. Englewood Cliffs: Prentice-Hall.

  62. Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10), e1003285.

  63. Smith, M. T., Zwiessele, M., & Lawrence, N. D. (2016). Differentially private Gaussian processes. arXiv preprint arXiv:1606.00720.

  64. Sonnenburg, S., Braun, M. L., Ong, C. S., Bengio, S., Bottou, L., Holmes, G., et al. (2007). The need for open source software in machine learning. Journal of Machine Learning Research, 8(oct), 2443–2466.

  65. Staudte, R. G., & Sheather, S. J. (2011). Robust estimation and testing (Vol. 918). Hoboken: Wiley.

  66. Stodden, V. (2012). Reproducible research for scientific computing: Tools and strategies for changing the culture. Computing in Science and Engineering, 14(4), 13–17.

  67. Tao, T. (2007). What is good mathematics? Bulletin of the American Mathematical Society, 44(4), 623–634.

  68. Tukey, J. W. (1962). The future of data analysis. The Annals of Mathematical Statistics, 33(1), 1–67.

  69. Wang, H., & Marron, J. (2007). Object oriented data analysis: Sets of trees. The Annals of Statistics, 1849–1873.

  70. Wasserman, L. (2014). Rise of the machines. In Past, present, and future of statistical science (pp. 1–12).

  71. Wickham, H. (2015). R packages: Organize, test, document, and share your code. O’Reilly Media, Inc.

  72. Wilson, G., Aruliah, D. A., Brown, C. T., Hong, N. P. C., Davis, M., Guy, R. T., et al. (2014). Best practices for scientific computing. PLoS Biology, 12(1), e1001745.

  73. Wilson, G., Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., & Teal, T. K. (2017). Good enough practices in scientific computing. PLoS Computational Biology, 13(6), e1005510.

  74. Wu, C. (1998). Statistics = data science? http://www2.isye.gatech.edu/~jeffwu/presentations/datascience.pdf.

  75. Xie, Y. (2015). Dynamic Documents with R and knitr (Vol. 29). Boca Raton: CRC Press.

  76. Yu, B. (2014). IMS presidential address: Let us own data science. http://bulletin.imstat.org/2014/10/ims-presidential-address-let-us-own-data-science/.

  77. Zarsky, T. (2016). The trouble with algorithmic decisions: An analytic road map to examine efficiency and fairness in automated and opaque decision making. Science, Technology, and Human Values, 41(1), 118–132.

Acknowledgements

This research was supported in part by the National Science Foundation under Grant No. 1633074. We would like to thank Deborah Carmichael for editorial comments.

Author information

Corresponding author

Correspondence to Iain Carmichael.

About this article

Cite this article

Carmichael, I., Marron, J.S. Data science vs. statistics: two cultures?. Jpn J Stat Data Sci 1, 117–138 (2018). https://doi.org/10.1007/s42081-018-0009-3

Keywords

  • Computation
  • Literate programming
  • Machine learning
  • Reproducibility
  • Robustness