Abstract
Data mining, also referred to as database mining or knowledge discovery in databases (KDD), is a new research area that aims at the discovery of useful information from large datasets. Data mining uses statistical analysis and inference to extract interesting trends and events, create useful reports, support decision making, etc. It exploits massive amounts of data to achieve business, operational, or scientific goals.
In this chapter we give an overview of the data mining process and we describe the fundamental data mining problems: mining association rules and sequential patterns, classification and prediction, and clustering. Basic algorithms developed to efficiently process data mining tasks are discussed and illustrated with examples of their operation on real data sets.
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Morzy, T., Zakrzewicz, M. (2003). Data Mining. In: Błażewicz, J., Kubiak, W., Morzy, T., Rusinkiewicz, M. (eds) Handbook on Data Management in Information Systems. International Handbooks on Information Systems. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24742-5_11
DOI: https://doi.org/10.1007/978-3-540-24742-5_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-53441-6
Online ISBN: 978-3-540-24742-5
eBook Packages: Springer Book Archive