Principles of Data Science: Primer

Curuksu, Jeremy David

doi:10.1007/978-3-319-70229-2_6

Part of the book series: Management for Professionals (MANAGPROF)

Abstract

Let us face it. Statistics and mathematics deter almost everyone except those who choose to specialize in them. If you kept reading and reached this far in the book, you are probably now considering skipping the chapters on Data Science and moving on to the next one on Strategy because, well, it sounds more exciting. Thus, let us start this chapter on statistics with a simple example that illustrates why it is worth reading and why consultants may increasingly use mathematics.

Notes

1.
One may recommend “Naked Statistics” by Charles Wheelan [89], which introduces the overall field of statistics in a simple and humorous way, with no technical expertise required.
2.
The software-hardware interface defines the field of Robotics as an application of Cybernetics, a field invented by the late Norbert Wiener and from which Machine Learning emerged as a subfield.
3.
In loose usage, “correlation” most commonly refers to the Pearson correlation.
4.
By general purpose, I mean the assumption of a linear relationship between variables, which is often what is meant by a “simple” model in mathematics.
5.
Eq. 6.5 is formally the divergence of p₂ from p₁. An unbiased degree of association according to Kullback and Leibler [157] is obtained by taking the sum of each one-sided divergence: D(p₁, p₂) + D(p₂, p₁).
6.
All one-dimensional values in mathematics are referred to as scalars; multi-dimensional objects may bear different names, the most common of which are vectors, matrices and tensors.
7.
Hyperspace is the name given to a space of more than three dimensions (i.e. more than three variables). A plane that lies in a hyperspace is defined by more than two vectors and is called a hyperplane. It has no physical representation in our 3D world. Scientists present “hyper-” objects such as hyperplanes by showing consecutive 2D planes along different values of the 4th variable, the 5th variable, etc. This is why functions, matrices and tensors are strictly needed to handle computations in multivariable spaces.
8.
As mentioned in Sect. 6.1, the standard error is the standard deviation of the means of different sub-samples drawn from the original sample or population.
9.
The 80/20 rule, or Pareto principle, is a principle commonly used in business and economics which states that 80% of a problem stems from only 20% of its causes. It was first suggested by the late Joseph Juran, one of the most prominent management consultants of the twentieth century.
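
Notes 5 and 8 describe two computations that may be easier to follow in code. The sketch below is only illustrative. Eq. 6.5 is not reproduced on this page, so the one-sided divergence is assumed here to take the usual Kullback-Leibler form, D(p, q) = Σ pᵢ log(pᵢ / qᵢ). The function names, the example distributions and the sub-sampling parameters are hypothetical and do not come from the chapter.

```python
import math
import random
import statistics


def kl_divergence(p, q):
    """One-sided divergence D(p, q) = sum_i p_i * log(p_i / q_i), in nats.

    Assumption: p and q are discrete distributions over the same outcomes,
    with q_i > 0 wherever p_i > 0. Terms with p_i == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_divergence(p, q):
    """Sum of the two one-sided divergences, D(p, q) + D(q, p) (Note 5)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def standard_error_by_subsampling(data, subsample_size, n_subsamples, seed=0):
    """Standard deviation of the means of random sub-samples (Note 8).

    Each sub-sample is drawn without replacement from `data`.
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.sample(data, subsample_size))
        for _ in range(n_subsamples)
    ]
    return statistics.stdev(means)


# Illustrative distributions (hypothetical values, not from the chapter).
p1 = [0.5, 0.3, 0.2]
p2 = [0.4, 0.4, 0.2]

# The two one-sided divergences generally differ; their sum is symmetric.
print(kl_divergence(p1, p2), kl_divergence(p2, p1))
print(symmetric_divergence(p1, p2))
```

Swapping the arguments of `symmetric_divergence` leaves the result unchanged, which is the point of summing the two one-sided terms. For `standard_error_by_subsampling`, when the sub-samples are small relative to the data, the result should come out close to the sample standard deviation divided by the square root of the sub-sample size.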

References

Sarkar et al (2011) Translational bioinformatics: linking knowledge across biological and clinical realms. J Am Med Inform Assoc 18:354–357
Marx V (2013) The big challenges of big data. Nature 498:255–260
Siegel E (2013) Predictive analytics: the power to predict who will click, buy, lie, or die. Wiley, Hoboken
Wheelan C (2013) Naked statistics. Norton, New York
Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. IJCAI 14(2):1137–1145
Lee Rodgers J, Nicewander WA (1988) Thirteen ways to look at the correlation coefficient. Am Stat 42(1):59–66
Cover TM, Thomas JA (2012) Elements of information theory. Wiley, New York
Kullback S (1959) Information theory and statistics. Wiley, New York
Gower JC (1985) Properties of Euclidean and non-Euclidean distance matrices. Linear Algebra Appl 67:81–97
Legendre A (1805) Nouvelles méthodes pour la détermination des orbites des comètes. Didot, Paris
Ozer DJ (1985) Correlation and the coefficient of determination. Psychol Bull 97(2):307
Nagelkerke NJ (1991) A note on a general definition of the coefficient of determination. Biometrika 78(3):691–692
Aiken LS, West SG, Reno RR (1991) Multiple regression: testing and interpreting interactions. Sage, London
Gibbons MR (1982) Multivariate tests of financial models: a new approach. J Financ Econ 10(1):3–27
Berger JO (2013) Statistical decision theory and Bayesian analysis. Springer, New York
Ng A (2008) Artificial intelligence and machine learning, online video lecture series. Stanford University, Stanford. www.see.stanford.edu
Ott RL, Longnecker M (2001) An introduction to statistical methods and data analysis. Cengage Learning, Belmont
Tsitsiklis (2010) Probabilistic systems analysis and applied probability, online video lecture series. MIT, Cambridge. www.ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-041-probabilistic-systems-analysis-and-applied-probability-fall-2010/video-lectures/
Nuzzo R (2014) Statistical errors. Nature 506(7487):150–152
Goodman SN (1999) Toward evidence-based medical statistics: the p-value fallacy. Ann Intern Med 130(12):995–1004
Lyapunov A (1901) Nouvelle forme du théorème sur la limite de probabilité. Mémoires de l'Académie de St-Petersbourg 12
Baesens B (2014) Analytics in a big data world: the essential guide to data science and its applications. Wiley, New York
Curuksu J (2012) Adaptive conformational sampling based on replicas. J Math Biol 64:917–931
Pidd M (1998) Computer simulation in management science. Wiley, Chichester
Löytynoja A (2014) Machine learning with Matlab, Nordic Matlab expo 2014. MathWorks, Stockholm. www.mathworks.com/videos/machine-learning-with-matlab-92623.html
Becla J, Lim KT, Wang DL (2010) Report from the 3rd workshop on extremely large databases. Data Sci J 8:MR1–MR16
Treinen W (2014) Big data value strategic research and innovation agenda. European Commission Press, New York

Author information

Authors and Affiliations

Amazon Web Services, Inc, New York, NY, USA
Jeremy David Curuksu

About this chapter

Cite this chapter

Curuksu, J.D. (2018). Principles of Data Science: Primer. In: Data Driven. Management for Professionals. Springer, Cham. https://doi.org/10.1007/978-3-319-70229-2_6

DOI: https://doi.org/10.1007/978-3-319-70229-2_6
Published: 07 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-70228-5
Online ISBN: 978-3-319-70229-2
eBook Packages: Business and Management (R0)