Abstract
It is not uncommon for a single computer to be inadequate for a massively large data set. Two problems are typical: processing the data takes too long, and the data volume exceeds the storage capacity of the host. Cleverly designed algorithms can sometimes reduce the processing time to an acceptable level, but the single-host solution eventually fails if the data volume is sufficiently great. A far-reaching solution to the data volume problem replaces the single host with a network of computers across which the data are distributed and processed. However, the hardware solution is incomplete until the data processing algorithms are adapted to the distributed computing environment. A complete solution requires algorithms that are scalable. Scalability depends on the statistics that the algorithm computes, and the statistics that allow for scalability are associative statistics. Scalability and associative statistics are the subject of this chapter.
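As a rough illustration of the idea, the sketch below assumes that "associative" means a statistic computed separately on each block of a partitioned data set can be combined into the statistic for the whole data set. The function names (`partial_stats`, `combine`, `mean_and_variance`) and the toy data are hypothetical and chosen only for this example; they are not taken from the chapter.

```python
from functools import reduce


def partial_stats(block):
    """Return (count, sum, sum of squares) for one block of data."""
    return (len(block), sum(block), sum(x * x for x in block))


def combine(s, t):
    """Merge two partial results by elementwise addition."""
    return tuple(a + b for a, b in zip(s, t))


def mean_and_variance(stats):
    """Derive the mean and population variance from merged totals."""
    n, total, total_sq = stats
    mean = total / n
    variance = total_sq / n - mean ** 2
    return mean, variance


# Hypothetical data, split into blocks as if stored on three hosts.
blocks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

# Each block is summarized independently; only the small tuples are merged.
merged = reduce(combine, map(partial_stats, blocks))
mean, variance = mean_and_variance(merged)
```

Because only the three totals travel between hosts, the cost of combining results does not grow with the size of the blocks. The sum-of-squares formula for the variance is used here for brevity; it can lose precision when the mean is large relative to the spread.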
Notes
- 1.
- 2.
Right-skewness is common when a variable is bounded below as is the case with body mass index since no one may have a body mass index less than or equal to zero.
- 3.
Of course, once it’s been determined that the observation belongs to an interval, there’s no need to test any other intervals.
- 4.
Chapter 7 uses these BRFSS data files in all of the tutorials.
- 5.
The first character in the string in Python is s[0].
- 6.
It will not begin with /home/... if your operating system is Windows.
- 7.
In Spyder, close the console, thereby killing the kernel, and start a new console to restart the interpreter.
- 8.
Execute functions.py if your function in functions.py is not compiling despite calling the reload function.
- 9.
The codebook contains a wealth of information about the data and data file structure.
- 10.
It’s informative to submit the instruction a = b = 1 at the console. Then, submit a = 2 and print the value of b. The moral of this lesson is be careful when you set two variables equal.
- 11.
We think so.
- 12.
NASDAQ is the abbreviation for the National Association of Securities Dealers Automated Quotations system.
- 13.
- 14.
Pearson’s correlation coefficient is a measure of linear association. Linear association is meaningful when the variables are quantitative or ordinal.
- 15.
The LU-factorization method is faster and more accurate than computing the inverse and then multiplying.
- 16.
Numpy matrices and arrays cannot be rounded even if they are of length 1 or 1 × 1 in dimension.
- 17.
Interpretation of regression coefficients is discussed at length in Chap. 6.
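The experiment suggested in note 10 can be sketched as follows. The second half, which uses a list, is an addition for illustration and is not part of the note; it shows the case where setting two variables equal is most likely to cause surprises.

```python
# Chained assignment binds both names to the same object.
a = b = 1
a = 2                 # rebinds a to a new object; b is unaffected
ints_after = (a, b)

# With a mutable object, a change made through one name is visible
# through the other, because both names still refer to one list.
x = y = [1, 2]
x.append(3)
same_object = x is y
```

After this runs, `b` is still 1, whereas `y` contains the appended element because `x` and `y` name the same list.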
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Steele, B., Chandler, J., Reddy, S. (2016). Scalable Algorithms and Associative Statistics. In: Algorithms for Data Science. Springer, Cham. https://doi.org/10.1007/978-3-319-45797-0_3
DOI: https://doi.org/10.1007/978-3-319-45797-0_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45795-6
Online ISBN: 978-3-319-45797-0
eBook Packages: Computer Science (R0)