Finding Patterns

Berthold, Michael R.; Borgelt, Christian; Höppner, Frank; Klawonn, Frank

doi:10.1007/978-1-84882-260-3_7

Part of the book series: Texts in Computer Science ((TCS))

Abstract

This chapter introduces a variety of methods for getting an overview of the data, which includes a summary of the whole database as well as the identification of areas that deviate exceptionally from the remainder. These methods answer questions such as: Does the data naturally subdivide into groups? How do attributes depend on each other? Are there certain conditions that lead to exceptions from the average behaviour? The chapter provides an overview of clustering methods (hierarchical clustering, k-Means, density-based clustering), association analysis, self-organizing maps and deviation analysis. It also addresses the definition and choice of distance or similarity measures, which almost every technique requires in order to compare different cases in the database.

Notes

1. Isotropic means that the density is symmetric around the mean, that is, the cluster shapes are roughly hyperspheres. Using the Mahalanobis distance instead would allow for ellipsoidal shapes, but we restrict ourselves to the isotropic case here.
2. The nodes are usually called neurons, since self-organizing maps are a special form of neural network.
3. As n is relatively large, we approximate the binomial distribution by a normal distribution. We use the z-test for reasons of simplicity; typically p₀ is not known but has to be estimated from the sample, and then Student’s t-test is used.
4. We will discuss the use of other types of distance metrics in KNIME later.
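
The contrast drawn in note 1 between the isotropic case and the Mahalanobis distance can be illustrated with a small sketch. It is not taken from the chapter: the function names, the restriction to two dimensions, and the example covariance matrix are assumptions chosen only to keep the example short.

```python
import math


def euclidean(x, mean):
    """Euclidean distance: every direction is weighted equally (isotropic case)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, mean)))


def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance for 2-D points.

    cov is a symmetric 2x2 covariance matrix given as ((a, b), (b, c)).
    This helper is hypothetical and only covers the 2-D case.
    """
    (a, b), (_, c) = cov
    det = a * c - b * b
    if a <= 0 or det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # d^2 = diff^T * cov^{-1} * diff, with the 2x2 inverse written out explicitly
    d2 = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.sqrt(d2)


mean = (0.0, 0.0)
cov = ((4.0, 0.0), (0.0, 1.0))  # assumed cluster: variance 4 along x, 1 along y

p_along_x = (2.0, 0.0)
p_along_y = (0.0, 2.0)

# Both points are equally far from the mean in the Euclidean sense ...
d_euc_x = euclidean(p_along_x, mean)
d_euc_y = euclidean(p_along_y, mean)

# ... but the Mahalanobis distance treats the stretched x direction as "closer".
d_mah_x = mahalanobis_2d(p_along_x, mean, cov)
d_mah_y = mahalanobis_2d(p_along_y, mean, cov)
```

With the identity matrix as covariance, the two distances coincide, which corresponds to the hyperspherical cluster shapes the note describes.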
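
The z-test mentioned in note 3 can be sketched as follows, assuming a known reference proportion p₀ and a sample of n cases of which k show the property of interest. The function name and the example counts are hypothetical and do not come from the chapter; the sketch covers only the z-test, not the Student’s t-test variant.

```python
import math


def proportion_z_test(k, n, p0):
    """Two-sided z-test of H0: p == p0, given k successes in n trials.

    Relies on the normal approximation to the binomial distribution,
    which is reasonable only when n is relatively large.
    """
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    p_hat = k / n
    # Standard error of the sample proportion under the null hypothesis
    se = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical numbers: 60 of 400 cases in a subgroup show the property,
# while the assumed overall rate is p0 = 0.1.
z, p_value = proportion_z_test(k=60, n=400, p0=0.1)
```

A large absolute value of z (and a correspondingly small p-value) suggests that the subgroup deviates from the overall rate by more than chance alone would explain.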

Author information

Authors and Affiliations

FB Informatik und Informationswissenschaft, Universität Konstanz, 78457, Konstanz, Germany
Prof. Dr. Michael R. Berthold
Intelligent Data Analysis & Graphical Models Research Unit, European Centre for Soft Computing, C/ Gonzalo Gutiérrez Quirós s/n Edificio Científico-Technológico Campus Mieres, 3ª Planta, 33600, Mieres, Asturias, Spain
Dr. Christian Borgelt
FB Wirtschaft, Ostfalia University of Applied Sciences, Robert-Koch-Platz 10-14, 38440, Wolfsburg, Germany
Prof. Dr. Frank Höppner
FB Informatik, Ostfalia University of Applied Sciences, Salzdahlumer Str. 46/48, 38302, Wolfenbüttel, Germany
Prof. Dr. Frank Klawonn

Corresponding author

Correspondence to Michael R. Berthold.

About this chapter

Cite this chapter

Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. (2010). Finding Patterns. In: Guide to Intelligent Data Analysis. Texts in Computer Science. Springer, London. https://doi.org/10.1007/978-1-84882-260-3_7

DOI: https://doi.org/10.1007/978-1-84882-260-3_7
Publisher Name: Springer, London
Print ISBN: 978-1-84882-259-7
Online ISBN: 978-1-84882-260-3
eBook Packages: Computer Science, Computer Science (R0)
