Cluster Analysis

Steele, Brian; Chandler, John; Reddy, Swarna

doi:10.1007/978-3-319-45797-0_8



Abstract

Sometimes it is possible to divide a collection of observations into distinct subgroups based on nothing more than the observation attributes. If this can be done, then understanding the population or process generating the observations becomes easier. The intent of cluster analysis is to divide a data set into clusters of observations that are more alike within a cluster than between clusters. Clusters are formed either by aggregating observations or by dividing a single group of observations into a collection of smaller sets. Cluster formation involves two varieties of algorithms. The first shuffles observations among a fixed number of clusters to maximize within-cluster similarity. The second begins with singleton clusters and recursively merges them; alternatively, we may begin with one cluster and recursively split off new clusters. In this chapter, we discuss two popular cluster analysis algorithms, one representing each variety: the k-means algorithm and hierarchical agglomerative clustering.
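
The two varieties of algorithms named in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the function names, the squared Euclidean distance, the random initialization, and the centroid-based merge rule are all assumptions chosen for brevity.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, n_iter=100, seed=0):
    """First variety: shuffle observations among a fixed number of clusters.

    Assumption: initial centroids are k observations drawn at random.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Assign every observation to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # No observation changed cluster, so stop.
        labels = new_labels
        # Recompute each centroid as the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids


def agglomerative(points, n_clusters):
    """Second variety: start with singleton clusters and merge recursively.

    Assumption: the two clusters with the closest centroids are merged.
    Returns clusters as lists of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]

    def centroid_distance(a, b):
        return squared_distance(
            mean([points[i] for i in a]), mean([points[i] for i in b])
        )

    while len(clusters) > n_clusters:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(
            pairs, key=lambda ij: centroid_distance(clusters[ij[0]], clusters[ij[1]])
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
    print(k_means(data, k=2)[0])
    print(agglomerative(data, n_clusters=2))
```

The k-means function keeps the number of clusters fixed and only moves observations between them, whereas the agglomerative function reduces the number of clusters by one with each merge. Other merge criteria (single, complete, or average linkage, for example) would change only `centroid_distance`.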


Notes

1. Section 10.6, Chap. 10 works with data originating from grocery store receipts.
2. We worked with the mathematical form of the histogram in Chap. 3, Sect. 3.4.2.
3. The data shown in this figure may be plotted as a set of histograms. However, we use a simple line plot instead, as it’s easier to see the similarities among the empirical distributions.
4. Section 3.4.2 of Chap. 3 discusses histograms in detail.
5. The statement x_j ∈ b_i is true if l_i < x_j ≤ u_i.
6. It can be proved that the distance between any two clusters will be less than 2.
7. Recall from the tutorial of Sect. 8.4 that there are actually 54 geographic entities that we are loosely referring to as states.
8. The previous notation for the estimated proportion of individuals in interval l and observation j was p_{j,l}.
9. The pickle file was created in instruction 12 of the tutorial of Sect. 8.4.
10. Other criteria are usually considered and may outweigh theoretical considerations.
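
The interval rule in note 5 can be sketched in Python as follows. The function names and the `edges` argument are hypothetical, and the city-block distance is assumed purely for illustration, since the chapter's distance definition does not appear in the text here. Under that assumption, the distance between two vectors of proportions cannot exceed 2, which is in line with the bound mentioned in note 6.

```python
def histogram_proportions(values, edges):
    """Proportion of values falling in each interval (edges[i], edges[i + 1]].

    Intervals are open on the left and closed on the right, matching the
    rule l_i < x_j <= u_i. Values outside every interval are ignored, so the
    proportions are taken over the values that were actually binned.
    """
    counts = [0] * (len(edges) - 1)
    for x in values:
        for i in range(len(counts)):
            if edges[i] < x <= edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no value falls inside any interval")
    return [c / total for c in counts]


def city_block_distance(p, q):
    """Sum of absolute differences between two proportion vectors.

    Assumption: this distance is used here only as an example.
    """
    return sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    edges = [0, 10, 20, 30]
    p = histogram_proportions([5, 10, 15], edges)
    q = histogram_proportions([25, 30], edges)
    print(p, q, city_block_distance(p, q))
```

Because each vector of proportions is non-negative and sums to 1, the sum of absolute differences is largest when the two histograms share no occupied interval, as in the usage example.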

References

C.C. Aggarwal, Data Mining - The Textbook (Springer, New York, 2015)
G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning with Applications in R (Springer, New York, 2013)
G. McLachlan, T. Krishnan, The EM Algorithm and Extensions, 2nd edn. (Wiley, Hoboken, 2008)


Author information

Authors and Affiliations

University of Montana, Missoula, MT, USA
Brian Steele
School of Business Administration, University of Montana, Missoula, MT, USA
John Chandler
SoftMath Consultants, LLC, Missoula, MT, USA
Swarna Reddy



About this chapter

Cite this chapter

Steele, B., Chandler, J., Reddy, S. (2016). Cluster Analysis. In: Algorithms for Data Science. Springer, Cham. https://doi.org/10.1007/978-3-319-45797-0_8


DOI: https://doi.org/10.1007/978-3-319-45797-0_8
Published: 27 December 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45795-6
Online ISBN: 978-3-319-45797-0
eBook Packages: Computer Science, Computer Science (R0)
