Statistical findings on subgroups are among the most popular and simplest forms of knowledge we encounter in all domains of science, business, and even daily life. We read or hear messages such as: the lung cancer mortality rate has increased considerably for women during the last 10 years; the unemployment rate is disproportionately high for young men with a low educational level; the potential for violence is highest for males between 14 and 18. In this paper, we first compare the knowledge expressed by subgroup patterns with other popular knowledge types of Knowledge Discovery in Databases (KDD), introduce types of description languages for subgroups, and summarize general pattern classes for subgroup deviations and associations. A deviation pattern describes a deviating behavior of a target variable in a subgroup. Deviation patterns rely on statistical tests and thus capture knowledge about a subgroup in the form of a verified (alternative) hypothesis on the distribution of a target variable. The search for deviating subgroups is organized in two phases. In the first, brute-force search phase, alternative search heuristics can be applied to find a set of deviating subgroups. In the second, refinement phase, redundancy elimination operators identify a system of subgroups. We discuss the role of tests for subgroup mining, introduce specializations of the general deviation pattern, summarize search approaches, and deal with navigation and visualization operations that support an analyst in interactively constructing a best system of deviating subgroups.
Keywords: Quality Function, Target Variable, Description Language, Deviation Pattern, Subgroup Size
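The two-phase search described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names (`z_quality`, `find_deviating_subgroups`, `eliminate_redundant`), the toy data, the thresholds, and the choice of a binomial-test z-score as the quality function for a binary target variable. The redundancy rule shown (drop a subgroup when a more general one scores at least as well) is only one simple possibility.

```python
from itertools import combinations
from math import sqrt


def z_quality(n, p, p0):
    """Assumed quality function: z-score of the subgroup share p (size n)
    against the population share p0. Requires 0 < p0 < 1."""
    return sqrt(n) * (p - p0) / sqrt(p0 * (1 - p0))


def find_deviating_subgroups(rows, target, max_depth=2, min_size=2, z_min=1.0):
    """Phase 1 (brute force): score every conjunction of attribute=value
    selectors up to max_depth and keep those with quality >= z_min."""
    p0 = sum(r[target] for r in rows) / len(rows)
    selectors = sorted({(a, v) for r in rows for a, v in r.items() if a != target})
    found = []
    for depth in range(1, max_depth + 1):
        for desc in combinations(selectors, depth):
            if len({a for a, _ in desc}) < depth:
                continue  # same attribute used twice: contradictory description
            members = [r for r in rows if all(r[a] == v for a, v in desc)]
            if len(members) < min_size:
                continue  # subgroup too small to test
            p = sum(r[target] for r in members) / len(members)
            z = z_quality(len(members), p, p0)
            if z >= z_min:
                found.append((desc, len(members), p, z))
    return sorted(found, key=lambda s: -s[3])


def eliminate_redundant(subgroups):
    """Phase 2 (simplified): drop a subgroup if an already kept, strictly
    more general subgroup has at least the same quality."""
    kept = []
    for desc, n, p, z in sorted(subgroups, key=lambda s: -s[3]):
        if not any(set(k[0]) < set(desc) and k[3] >= z for k in kept):
            kept.append((desc, n, p, z))
    return kept


# Hypothetical toy data: binary target "unemployed" (1 = yes).
rows = [
    {"sex": "m", "edu": "low", "unemployed": 1},
    {"sex": "m", "edu": "low", "unemployed": 1},
    {"sex": "m", "edu": "low", "unemployed": 1},
    {"sex": "m", "edu": "high", "unemployed": 0},
    {"sex": "f", "edu": "low", "unemployed": 0},
    {"sex": "f", "edu": "low", "unemployed": 0},
    {"sex": "f", "edu": "high", "unemployed": 0},
    {"sex": "f", "edu": "high", "unemployed": 0},
]

for desc, n, p, z in eliminate_redundant(find_deviating_subgroups(rows, "unemployed")):
    label = " AND ".join(f"{a}={v}" for a, v in desc)
    print(f"{label}: size={n}, share={p:.2f}, z={z:.2f}")
```

The quality function weights the deviation of the subgroup share by the square root of the subgroup size, so a small subgroup needs a stronger deviation than a large one to rank equally. Other tests or weightings could be substituted without changing the structure of the two phases.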