Statistical Learning and Data Mining

Gentle, James E.

doi:10.1007/978-0-387-98144-4_16

Part of the book series: Statistics and Computing (SCO)

Abstract

A major objective in data analysis is to identify interesting features or structure in the data. In this chapter, we consider the use of some of the tools and measures discussed in Chapters 9 and 10 to identify interesting structure. The graphical methods discussed in Chapter 8 are also very useful in discovering structure, but we do not consider those methods further in the present chapter.

There are basically two ways of thinking about “structure”. One has to do with counts of observations. In this approach, patterns in the density are the features of interest. We may be interested in whether the density is multimodal, whether it is skewed, whether there are holes in the density, and so on. The other approach seeks to identify relationships among the variables. The two approaches are related in the sense that if there are relationships among the variables, the density of the observations is higher in regions in which the relationships hold. Relationships among variables are generally not exact, and the relationships are identified by the higher density of observations that exhibit the approximate relationships.

An important kind of pattern in data is a relationship to time. Often, even though data are collected at different times, the time itself is not represented by a variable on the dataset. A simple example is one in which the data are collected sequentially at roughly equal intervals. In this case, the index of the observations may serve as a surrogate variable. Consider the small univariate dataset in Table 16.1, for example. A static view of a histogram of these univariate data, as in Figure 16.1, shows a univariate bimodal dataset. Figure 16.2, however, in which the data are plotted against the index (by rows in Table 16.1), shows a completely different structure. The data appear to be sinusoidal with an increasing frequency. The sinusoidal responses at roughly equal sampling intervals result in a bimodal static distribution, which is the structure seen in the histogram.

Interesting structure may also be groups or clusters of data based on some measure of similarity, as discussed in Section 9.2 beginning on page 383. When there are separate groups in the data, but the observations do not contain an element or an index variable representing group membership, identifying nearby elements or clusters in the data requires some measure of similarity (or, equivalently, of dissimilarity).
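
The contrast between the static histogram and the index plot described in the abstract can be reproduced on synthetic data. The following Python sketch does not use the data of Table 16.1, which are not reproduced here. It assumes a sinusoid whose frequency rises linearly over the index, and the names `chirp_series`, `static_counts`, and `sign_changes`, together with the parameters `n`, `f0`, and `f1`, are hypothetical choices made only for illustration.

```python
import numpy as np


def chirp_series(n=1000, f0=5.0, f1=25.0):
    """Synthetic series: a sinusoid sampled at n equally spaced index
    positions, with frequency rising linearly from f0 to f1 (assumed
    values, in cycles per full length of the series)."""
    t = np.arange(n) / n
    phase = 2.0 * np.pi * (f0 * t + 0.5 * (f1 - f0) * t**2)
    return np.sin(phase)


def static_counts(x, bins=4):
    """Histogram counts over [-1, 1]; the order of observations is ignored."""
    counts, _ = np.histogram(x, bins=bins, range=(-1.0, 1.0))
    return counts


def sign_changes(x):
    """Number of sign changes between consecutive observations,
    a rough proxy for how fast the series oscillates."""
    negative = np.signbit(x)
    return int(np.count_nonzero(negative[1:] != negative[:-1]))


if __name__ == "__main__":
    x = chirp_series()

    # Static view: the two outer bins hold more points than the two
    # inner bins, so the histogram looks bimodal.
    print("histogram counts:", static_counts(x))

    # Index view: the second half of the series changes sign more often
    # than the first half, reflecting the increasing frequency.
    half = len(x) // 2
    print("sign changes, first half: ", sign_changes(x[:half]))
    print("sign changes, second half:", sign_changes(x[half:]))
```

Plotting `x` against its index would show the oscillation speeding up, whereas the bin counts alone reveal only that the values pile up near the two extremes.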

Author information

Authors and Affiliations

Department of Computational & Data Sciences, George Mason University, 4400 University Drive, Fairfax, VA 22030-4444, USA
James E. Gentle

Corresponding author

Correspondence to James E. Gentle.

About this chapter

Cite this chapter

Gentle, J.E. (2009). Statistical Learning and Data Mining. In: Computational Statistics. Statistics and Computing. Springer, New York, NY. https://doi.org/10.1007/978-0-387-98144-4_16

DOI: https://doi.org/10.1007/978-0-387-98144-4_16
Published: 25 June 2009
Publisher Name: Springer, New York, NY
Print ISBN: 978-0-387-98143-7
Online ISBN: 978-0-387-98144-4
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)