Scalable Algorithms and Associative Statistics

Steele, Brian; Chandler, John; Reddy, Swarna

doi:10.1007/978-3-319-45797-0_3



Abstract

It’s not uncommon that a single computer is inadequate to handle a massively large data set. The common problems are that it takes too long to process the data and the data volume exceeds the storage capacity of the host. Cleverly designed algorithms sometimes can reduce the processing time to an acceptable point, but the single host solution will eventually fail if data volume is sufficiently great. A far-reaching solution to the data volume problem replaces the single host with a network of computers across which the data are distributed and processed. However, the hardware solution is incomplete until the data processing algorithms are adapted to the distributed computing environment. A complete solution requires algorithms that are scalable. Scalability depends on the statistics that are being computed by the algorithm, and the statistics that allow for scalability are associative statistics. Scalability and associative statistics are the subject of this chapter.
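The idea can be sketched in a few lines of Python. This is a minimal illustration and not the chapter's own code. It assumes the usual definition, which the preview text does not spell out: a statistic is associative when its value on the union of disjoint data blocks equals the sum of its values on the individual blocks. The function names and the toy blocks below are hypothetical.

```python
from functools import reduce

def partial_stats(block):
    """Return (count, sum, sum of squares) for one block of data."""
    n = len(block)
    s = sum(block)
    ss = sum(x * x for x in block)
    return (n, s, ss)

def combine(a, b):
    """Merge the statistics of two disjoint blocks by element-wise addition."""
    return tuple(x + y for x, y in zip(a, b))

def mean_and_variance(stats):
    """Derive the sample mean and sample variance from the combined statistics."""
    n, s, ss = stats
    mean = s / n
    variance = (ss - n * mean * mean) / (n - 1)
    return mean, variance

# Hypothetical data, split into blocks as if stored on three different hosts.
blocks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]

# Each host computes its own partial statistics; only these small tuples
# need to be sent over the network and added together.
combined = reduce(combine, (partial_stats(b) for b in blocks))
mean, variance = mean_and_variance(combined)

# The same statistics computed on the undivided data, for comparison.
whole = partial_stats([x for b in blocks for x in b])
```

Because `combine` is plain addition, the blocks can be processed in any order and on any number of hosts. The mean and variance are not associative themselves, but both are functions of the three associative statistics.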

Notes

1. We discussed the BRFSS data briefly in Chap. 1, Sect. 1.2.
2. Right-skewness is common when a variable is bounded below, as is the case with body mass index, since no one can have a body mass index less than or equal to zero.
3. Of course, once it’s been determined that the observation belongs to an interval, there’s no need to test any other intervals.
4. Chapter 7 uses these BRFSS data files in all of the tutorials.
5. The first character in the string in Python is s[0].
6. It will not begin with /home/... if your operating system is Windows.
7. In Spyder, close the console, thereby killing the kernel, and start a new console to restart the interpreter.
8. Execute functions.py if your function in functions.py is not compiling despite calling the reload function.
9. The codebook contains a wealth of information about the data and data file structure.
10. It’s informative to submit the instruction a = b = 1 at the console. Then, submit a = 2 and print the value of b. The moral of this lesson is to be careful when you set two variables equal.
11. We think so.
12. NASDAQ is the abbreviation for the National Association of Securities Dealers Automated Quotations system.
13. The inner product of a vector w with itself is the scalar w^T w (Chap. 1, Sect. 1.10.1).
14. Pearson’s correlation coefficient is a measure of linear association. Linear association is meaningful when the variables are quantitative or ordinal.
15. The LU-factorization method is faster and more accurate than computing the inverse and then multiplying.
16. Numpy matrices and arrays cannot be rounded even if they are of length 1 or 1 × 1 in dimension.
17. Interpretation of regression coefficients is discussed at length in Chap. 6.
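The console experiment in Note 10 can be written out as a short script. The note does not state the outcome, so the sketch below records what Python does. The second half, with a list, is an added assumption about what the caution is aimed at, and the variable names x, y, and z are hypothetical.

```python
# The experiment from Note 10: both names are bound to the same integer object.
a = b = 1
a = 2                  # rebinds a to a new object; b is left alone
print(b)               # b still refers to 1

# Added illustration (not from the note): the same pattern with a mutable list.
x = y = [1, 2, 3]
x.append(4)            # modifies the one list that both names refer to
print(y)               # y reflects the change because x and y are the same object

# An explicit copy gives an independent list.
z = list(y)
z.append(5)
print(y, z)
```

Rebinding a name never affects the other name. Modifying a shared mutable object in place does, which is one reason for care when two variables are set equal.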

References

K. Bache, M. Lichman, University of California Irvine Machine Learning Repository (University of California, Irvine, 2013). http://archive.ics.uci.edu/ml
R. Ecob, G.D. Smith, Income and health: what is the nature of the relationship? Soc. Sci. Med. 48, 693–705 (1999)
S.L. Ettner, New evidence on the relationship between income and health. J. Health Econ. 15 (1), 67–85 (1996)
Harvard T.H. Chan School for Public Health, Obesity prevention source (2015). http://www.hsph.harvard.edu/obesity-prevention-source/us-obesity-trends-map/
J.A. Levine, Poverty and obesity in the U.S. Diabetes 60 (11), 2667–2668 (2011)
A.H. Mokdad, M.K. Serdula, W.H. Dietz, B.A. Bowman, J.S. Marks, J.P. Koplan, The spread of the obesity epidemic in the United States, 1991–1998. J. Am. Med. Assoc. 282 (16), 1519–1522 (1999)
A.M. Prentice, The emerging epidemic of obesity in developing countries. Int. J. Epidemiol. 35 (1), 93–99 (2006)


Author information

Authors and Affiliations

University of Montana, Missoula, MT, USA
Brian Steele
School of Business Administration, University of Montana, Missoula, MT, USA
John Chandler
SoftMath Consultants, LLC, Missoula, MT, USA
Swarna Reddy



About this chapter

Cite this chapter

Steele, B., Chandler, J., Reddy, S. (2016). Scalable Algorithms and Associative Statistics. In: Algorithms for Data Science. Springer, Cham. https://doi.org/10.1007/978-3-319-45797-0_3


DOI: https://doi.org/10.1007/978-3-319-45797-0_3
Published: 27 December 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45795-6
Online ISBN: 978-3-319-45797-0
eBook Packages: Computer Science, Computer Science (R0)
