Abstract
Sometimes it’s possible to divide a collection of observations into distinct subgroups based on nothing more than the observation attributes. If this can be done, then understanding the population or process generating the observations becomes easier. The intent of cluster analysis is to divide a data set into clusters whose observations are more alike within a cluster than between clusters. Clusters are formed either by aggregating observations or by dividing a single glob of observations into a collection of smaller sets. The process of cluster formation involves two varieties of algorithms. The first shuffles observations among a fixed number of clusters to maximize within-cluster similarity. The second begins with singleton clusters and recursively merges them; alternatively, it may begin with one cluster and recursively split off new clusters. In this chapter, we discuss two popular cluster analysis algorithms, each representative of one of the two varieties: the k-means algorithm and hierarchical agglomerative clustering.
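The two varieties of algorithms described above can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's own implementation. It assumes squared Euclidean distance, k-means initialized from the first k observations, and centroid distance as the merging criterion for the agglomerative step. The function names and the toy data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(data, k, max_iter=100):
    """Reassign observations among k clusters until the labels stop changing."""
    centers = list(data[:k])  # assumption: start from the first k observations
    labels = None
    for _ in range(max_iter):
        # Assign each observation to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each center as the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = centroid(members)
    return labels


def agglomerate(data, k):
    """Start from singleton clusters and merge the closest pair until k remain."""
    clusters = [[p] for p in data]
    while len(clusters) > k:
        # Assumption: closeness is the squared distance between cluster centroids.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: sq_dist(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])),
        )
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters


# Hypothetical toy data: two well-separated groups of three observations.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
km_labels = k_means(data, k=2)
hac_clusters = agglomerate(data, k=2)
```

On data this well separated, both functions should recover the same two groups. In general, k-means results depend on the initial centers, and agglomerative results depend on the chosen merging criterion.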
Notes
- 1.
- 2.
- 3.
The data shown in this figure may be plotted as a set of histograms. However, we use a simple line plot instead as it’s easier to see the similarities among empirical distributions.
- 4.
- 5.
The statement $x_j \in b_i$ is true if $l_i < x_j \le u_i$.
- 6.
It can be proved that the distance between any two clusters will be less than 2.
- 7.
Recall from the tutorial of Sect. 8.4 that there are actually 54 geographic entities that we are loosely referring to as states.
- 8.
The previous notation for the estimated proportion of individuals in interval l and observation j was $p_{j,l}$.
- 9.
The pickle file was created in instruction 12 of the tutorial of Sect. 8.4.
- 10.
Other criteria are usually considered and may outweigh theoretical considerations.
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Steele, B., Chandler, J., Reddy, S. (2016). Cluster Analysis. In: Algorithms for Data Science. Springer, Cham. https://doi.org/10.1007/978-3-319-45797-0_8
DOI: https://doi.org/10.1007/978-3-319-45797-0_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45795-6
Online ISBN: 978-3-319-45797-0
eBook Packages: Computer Science, Computer Science (R0)