Scalable Algorithms and Associative Statistics

Algorithms for Data Science

Abstract

A single computer is often inadequate for a massively large data set. Two problems are common: processing the data takes too long, and the data volume exceeds the storage capacity of the host. Cleverly designed algorithms can sometimes reduce the processing time to an acceptable level, but the single-host solution will eventually fail if the data volume is sufficiently great. A far-reaching solution to the data volume problem replaces the single host with a network of computers across which the data are distributed and processed. However, the hardware solution is incomplete until the data processing algorithms are adapted to the distributed computing environment. A complete solution requires algorithms that are scalable. Scalability depends on the statistics computed by the algorithm, and the statistics that allow for scalability are associative statistics. Scalability and associative statistics are the subject of this chapter.
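As a rough illustration of the idea in the abstract, the sketch below assumes that an associative statistic is one whose value on the whole data set equals the sum of its values on disjoint blocks of the data. Under that assumption, the count, sum, and sum of squares can be computed block by block (for example, one block per host) and added together, and the mean and sample variance follow from the combined totals. The function names and the toy data are hypothetical and are not taken from the chapter.

```python
from functools import reduce

def partial_stats(block):
    """Return (count, sum, sum of squares) for one block of observations."""
    n = len(block)
    s1 = sum(block)
    s2 = sum(x * x for x in block)
    return (n, s1, s2)

def combine(a, b):
    """Combine the statistics of two disjoint blocks by elementwise addition."""
    return tuple(x + y for x, y in zip(a, b))

def mean_and_variance(stats):
    """Compute the mean and sample variance from the combined totals."""
    n, s1, s2 = stats
    mean = s1 / n
    variance = (s2 - n * mean * mean) / (n - 1)
    return mean, variance

# Toy data split into three blocks, standing in for data held on three hosts.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
blocks = [data[0:3], data[3:5], data[5:8]]

# Each block is summarized independently; only the small tuples are combined.
combined = reduce(combine, (partial_stats(b) for b in blocks))
mean, variance = mean_and_variance(combined)
```

Because only the three-number summaries travel between blocks, the result does not depend on how the data are partitioned, which is what lets this kind of computation scale across hosts.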


Notes

  1. We discussed the BRFSS data briefly in Chap. 1, Sect. 1.2.

  2. Right-skewness is common when a variable is bounded below as is the case with body mass index since no one may have a body mass index less than or equal to zero.

  3. Of course, once it’s been determined that the observation belongs to an interval, there’s no need to test any other intervals.

  4. Chapter 7 uses these BRFSS data files in all of the tutorials.

  5. The first character in the string in Python is s[0].

  6. It will not begin with /home/... if your operating system is Windows.

  7. In Spyder, close the console, thereby killing the kernel, and start a new console to restart the interpreter.

  8. Execute functions.py if your function in functions.py is not compiling despite calling the reload function.

  9. The codebook contains a wealth of information about the data and data file structure.

  10. It’s informative to submit the instruction a = b = 1 at the console. Then, submit a = 2 and print the value of b. The moral of this lesson is be careful when you set two variables equal.

  11. We think so.

  12. NASDAQ is the abbreviation for the National Association of Securities Dealers Automated Quotations system.

  13. The inner product of a vector w with itself is the scalar wᵀw (Chap. 1, Sect. 1.10.1).

  14. Pearson’s correlation coefficient is a measure of linear association. Linear association is meaningful when the variables are quantitative or ordinal.

  15. The LU-factorization method is faster and more accurate than computing the inverse and then multiplying.

  16. Numpy matrices and arrays cannot be rounded even if they are of length 1 or 1 × 1 in dimension.

  17. Interpretation of regression coefficients is discussed at length in Chap. 6.
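
The experiment described in note 10 can be sketched as follows. The first half follows the note directly; the second half, using a list, is an added assumption about the kind of situation the note's warning is aimed at, since two names bound to the same mutable object both see changes made in place. The variable names x and y are illustrative only.

```python
# The experiment from the note: bind two names to the same integer.
a = b = 1
a = 2            # rebinds a to a new object; b is not affected
print(b)         # b still refers to 1

# Illustrative contrast (an assumption, not from the note): a mutable list.
x = y = [1]
x.append(2)      # modifies the shared list in place
print(y)         # y refers to the same list as x, so it shows the change
```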

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Steele, B., Chandler, J., Reddy, S. (2016). Scalable Algorithms and Associative Statistics. In: Algorithms for Data Science. Springer, Cham. https://doi.org/10.1007/978-3-319-45797-0_3

  • DOI: https://doi.org/10.1007/978-3-319-45797-0_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45795-6

  • Online ISBN: 978-3-319-45797-0

  • eBook Packages: Computer Science, Computer Science (R0)
