Principles of Data Science: Primer

Chapter in: Data Driven

Part of the book series: Management for Professionals (MANAGPROF)

Abstract

Let us face it. Statistics and mathematics deter almost everyone except those who choose to specialize in them. If you kept reading and reached this far in the book, you are probably now considering skipping the chapters on Data Science and moving on to the next on Strategy because, well, it sounds more exciting. Thus, let us start this chapter on statistics with a simple example that illustrates why it is worth reading and why consultants may increasingly use mathematics.

Notes

  1. One may recommend “Naked Statistics” by Charles Wheelan [89], which introduces the overall field of statistics in a simple and humorous way … technical expertise not required.

  2. The software-hardware interface defines the field of Robotics as an application of Cybernetics, a field invented by the late Norbert Wiener and from which Machine Learning emerged as a subfield.

  3. In loose usage, “correlation” most commonly refers to the Pearson correlation.

  4. By general purpose, I mean the assumption of a linear relationship between variables, which is often what is meant by a “simple” model in mathematics.

  5. Eq. 6.5 is formally the divergence of $p_2$ from $p_1$. An unbiased degree of association according to Kullback and Leibler [157] is obtained by taking the sum of each one-sided divergence: $D(p_1, p_2) + D(p_2, p_1)$.

  6. All one-dimensional values in mathematics are referred to as scalars; multi-dimensional objects may bear different names, the most common of which are vectors, matrices and tensors.

  7. Hyperspace is the name given to a space made of more than three dimensions (i.e. more than three variables). A plane that lies in a hyperspace is defined by more than two vectors, and is called a hyperplane. It does not have a physical representation in our 3D world. Scientists present “hyper-” objects such as hyperplanes by showing consecutive 2D planes along different values of the 4th variable, the 5th variable, etc. This is why the use of functions, matrices and tensors is strictly needed to handle computations in multivariable spaces.

  8. As mentioned in Sect. 6.1, the standard error is the standard deviation of the means of different sub-samples drawn from the original sample or population.

  9. The 80/20 rule, or Pareto principle, is a principle commonly used in business and economics that states that 80% of a problem stems from only 20% of its causes. It was first suggested by the late Joseph Juran, one of the most prominent management consultants of the twentieth century.
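
Notes 5 and 8 can be illustrated with a short sketch. Eq. 6.5 is not reproduced on this page, so the code assumes it has the standard Kullback-Leibler form, D(p, q) = Σ p_i log(p_i / q_i). The function names, the example distributions and the simulated population are hypothetical and serve only as illustration.

```python
import math
import random
import statistics


def kl_divergence(p, q):
    """One-sided divergence D(p, q) = sum_i p_i * log(p_i / q_i).

    Assumed form of Eq. 6.5 (standard Kullback-Leibler divergence).
    Terms with p_i == 0 contribute nothing to the sum.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_divergence(p1, p2):
    """Sum of both one-sided divergences, as described in note 5."""
    return kl_divergence(p1, p2) + kl_divergence(p2, p1)


def standard_error_by_subsampling(sample, subsample_size, n_subsamples, seed=0):
    """Standard deviation of the means of random sub-samples (note 8)."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.sample(sample, subsample_size))
        for _ in range(n_subsamples)
    ]
    return statistics.stdev(means)


# Two hypothetical discrete distributions over three outcomes.
p1 = [0.5, 0.3, 0.2]
p2 = [0.4, 0.4, 0.2]
forward = kl_divergence(p1, p2)    # one-sided divergence D(p1, p2)
backward = kl_divergence(p2, p1)   # generally differs from D(p1, p2)
symmetric = symmetric_divergence(p1, p2)

# Hypothetical population: 10,000 draws from a normal distribution.
rng = random.Random(42)
population = [rng.gauss(100.0, 15.0) for _ in range(10_000)]
se_empirical = standard_error_by_subsampling(population, 50, 2000)
se_formula = statistics.stdev(population) / math.sqrt(50)

print(f"D(p1,p2)={forward:.5f}  D(p2,p1)={backward:.5f}  sum={symmetric:.5f}")
print(f"standard error: sub-sampling={se_empirical:.3f}  s/sqrt(n)={se_formula:.3f}")
```

The two one-sided divergences differ, whereas their sum does not depend on the order of the arguments. The sub-sampling estimate of the standard error should land close to the familiar s/√n shortcut when the sub-samples are small relative to the population.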

References

  1. Sarkar et al (2011) Translational bioinformatics: linking knowledge across biological and clinical realms. J Am Med Inform Assoc 18:354–357

  2. Marx V (2013) The big challenges of big data. Nature 498:255–260

  3. Siegel E (2013) Predictive analytics: the power to predict who will click, buy, lie, or die. Wiley, Hoboken

  4. Wheelan C (2013) Naked statistics. Norton, New York

  5. Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. IJCAI 14(2):1137–1145

  6. Lee Rodgers J, Nicewander WA (1988) Thirteen ways to look at the correlation coefficient. Am Stat 42(1):59–66

  7. Cover TM, Thomas JA (2012) Elements of information theory. Wiley, New York

  8. Kullback S (1959) Information theory and statistics. Wiley, New York

  9. Gower JC (1985) Properties of Euclidean and non-Euclidean distance matrices. Linear Algebra Appl 67:81–97

  10. Legendre A (1805) Nouvelles méthodes pour la détermination des orbites des comètes. Didot, Paris

  11. Ozer DJ (1985) Correlation and the coefficient of determination. Psychol Bull 97(2):307

  12. Nagelkerke NJ (1991) A note on a general definition of the coefficient of determination. Biometrika 78(3):691–692

  13. Aiken LS, West SG, Reno RR (1991) Multiple regression: testing and interpreting interactions. Sage, London

  14. Gibbons MR (1982) Multivariate tests of financial models: a new approach. J Financ Econ 10(1):3–27

  15. Berger JO (2013) Statistical decision theory and Bayesian analysis. Springer, New York

  16. Ng A (2008) Artificial intelligence and machine learning, online video lecture series. Stanford University, Stanford. www.see.stanford.edu

  17. Ott RL, Longnecker M (2001) An introduction to statistical methods and data analysis. Cengage Learning, Belmont

  18. Tsitsiklis (2010) Probabilistic systems analysis and applied probability, online video lecture series. MIT, Cambridge. www.ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-041-probabilistic-systems-analysis-and-applied-probability-fall-2010/video-lectures/

  19. Nuzzo R (2014) Statistical errors. Nature 506(7487):150–152

  20. Goodman SN (1999) Toward evidence-based medical statistics: the p-value fallacy. Ann Intern Med 130(12):995–1004

  21. Lyapunov A (1901) Nouvelle forme du théorème sur la limite de probabilité. Mémoires de l'Académie de St-Petersbourg 12

  22. Baesens B (2014) Analytics in a big data world: the essential guide to data science and its applications. Wiley, New York

  23. Curuksu J (2012) Adaptive conformational sampling based on replicas. J Math Biol 64:917–931

  24. Pidd M (1998) Computer simulation in management science. Wiley, Chichester

  25. Löytynoja A (2014) Machine learning with Matlab, Nordic Matlab expo 2014. MathWorks, Stockholm. www.mathworks.com/videos/machine-learning-with-matlab-92623.html

  26. Becla J, Lim KT, Wang DL (2010) Report from the 3rd workshop on extremely large databases. Data Sci J 8:MR1–MR16

  27. Treinen W (2014) Big data value strategic research and innovation agenda. European Commission Press, New York

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Curuksu, J.D. (2018). Principles of Data Science: Primer. In: Data Driven. Management for Professionals. Springer, Cham. https://doi.org/10.1007/978-3-319-70229-2_6
