Cluster Analysis

Algorithms for Data Science

Abstract

Sometimes a collection of observations can be divided into distinct subgroups based on nothing more than the observation attributes. When this can be done, understanding the population or process that generated the observations becomes easier. The intent of cluster analysis is to divide a data set into clusters of observations that are more alike within clusters than between clusters. Clusters are formed either by aggregating observations or by dividing a single glob of observations into a collection of smaller sets. Cluster formation involves two varieties of algorithms. The first shuffles observations among a fixed number of clusters to maximize within-cluster similarity. The second begins with singleton clusters and recursively merges them; alternatively, it may begin with one cluster and recursively split off new clusters. In this chapter, we discuss two popular cluster analysis algorithms, one representing each variety: the k-means algorithm and hierarchical agglomerative clustering.
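
The two varieties of algorithm named in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's own implementation. The function names (`k_means`, `agglomerative`), the use of squared Euclidean distance, the caller-supplied initial centroids, the single-linkage merge rule, and the toy data are all assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def k_means(data, initial_centroids, max_iter=100):
    """Shuffle observations among a fixed number of clusters.

    Assumption: the caller supplies the starting centroids, which keeps
    the sketch deterministic. Returns (labels, centroids).
    """
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    labels = [None] * len(data)
    for _ in range(max_iter):
        # Assignment step: each observation joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in data
        ]
        if new_labels == labels:
            break  # no observation changed cluster, so we have converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return labels, centroids


def agglomerative(data, n_clusters):
    """Start with singleton clusters and merge the closest pair repeatedly.

    Assumption: single linkage (distance between the closest members).
    Expects 1 <= n_clusters <= len(data). Returns lists of row indices.
    """
    clusters = [[i] for i in range(len(data))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    squared_distance(data[i], data[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge cluster b into cluster a
        del clusters[b]
    return clusters


# Hypothetical toy data: two well-separated groups of three observations.
data = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
        [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]]

labels, centroids = k_means(data, initial_centroids=[data[0], data[3]])
merged = agglomerative(data, n_clusters=2)
```

On this toy data both procedures recover the same two groups. On real data the result of k-means depends on the starting centroids, and the result of agglomerative clustering depends on the linkage rule, so neither choice made here should be read as a recommendation.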

Notes

  1. Section 10.6 of Chap. 10 works with data originating from grocery store receipts.

  2. We worked with the mathematical form of the histogram in Chap. 3, Sect. 3.4.2.

  3. The data shown in this figure may be plotted as a set of histograms. However, we use a simple line plot instead because it is easier to see the similarities among the empirical distributions.

  4. Section 3.4.2 of Chap. 3 discusses histograms in detail.

  5. The statement $x_j \in b_i$ is true if $l_i < x_j \le u_i$.

  6. It can be proved that the distance between any two clusters will be less than 2.

  7. Recall from the tutorial of Sect. 8.4 that there are actually 54 geographic entities that we are loosely referring to as states.

  8. The previous notation for the estimated proportion of individuals in interval $l$ and observation $j$ was $p_{j,l}$.

  9. The pickle file was created in instruction 12 of the tutorial of Sect. 8.4.

  10. Other criteria are usually considered and may outweigh theoretical considerations.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Steele, B., Chandler, J., Reddy, S. (2016). Cluster Analysis. In: Algorithms for Data Science. Springer, Cham. https://doi.org/10.1007/978-3-319-45797-0_8

  • DOI: https://doi.org/10.1007/978-3-319-45797-0_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45795-6

  • Online ISBN: 978-3-319-45797-0

  • eBook Packages: Computer Science, Computer Science (R0)
