Statistical Learning and Data Mining
A major objective in data analysis is to identify interesting features or structure in the data. In this chapter, we consider the use of some of the tools and measures discussed in Chapters 9 and 10 to identify interesting structure. The graphical methods discussed in Chapter 8 are also very useful in discovering structure, but we do not consider those methods further in the present chapter. There are basically two ways of thinking about “structure”. One has to do with counts of observations. In this approach, patterns in the density are the features of interest. We may be interested in whether the density is multimodal, whether it is skewed, whether there are holes in the density, and so on. The other approach seeks to identify relationships among the variables. The two approaches are related in the sense that if there are relationships among the variables, the density of the observations is higher in regions in which the relationships hold. Relationships among variables are generally not exact, and the relationships are identified by the higher density of observations that exhibit the approximate relationships. An important kind of pattern in data is a relationship to time. Often, even though data are collected at different times, the time itself is not represented by a variable on the dataset. A simple example is one in which the data are collected sequentially at roughly equal intervals. In this case, the index of the observations may serve as a surrogate variable. Consider the small univariate dataset in Table 16.1, for example. A static view of a histogram of these univariate data, as in Figure 16.1, shows a univariate bimodal dataset. Figure 16.2, however, in which the data are plotted against the index (by rows in Table 16.1), shows a completely different structure. The data appear to be sinusoidal with an increasing frequency. The sinusoidal responses at roughly equal sampling intervals result in a bimodal static distribution, which is the structure seen in the histogram. Interesting structure may also be groups or clusters of data based on some measure of similarity, as discussed in Section 9.2 beginning on page 383. When there are separate groups in the data, but the observations do not contain an element or an index variable representing group membership, identifying nearby elements or clusters in the data requires some measure of similarity (or, equivalently, of dissimilarity).
KeywordsPrincipal Component Analysis Data Mining Independent Component Analysis Statistical Learning Minimal Span Tree
Unable to display preview. Download preview PDF.