Data Mining

Morzy, Tadeusz; Zakrzewicz, Maciej

doi:10.1007/978-3-540-24742-5_11

Part of the book series: International Handbooks on Information Systems (INFOSYS)

Abstract

Data mining, also referred to as database mining or knowledge discovery in databases (KDD), is a new research area that aims at the discovery of useful information from large datasets. Data mining uses statistical analysis and inference to extract interesting trends and events, create useful reports, and support decision making. It exploits massive amounts of data to achieve business, operational, or scientific goals.

In this chapter we give an overview of the data mining process and we describe the fundamental data mining problems: mining association rules and sequential patterns, classification and prediction, and clustering. Basic algorithms developed to efficiently process data mining tasks are discussed and illustrated with examples of their operation on real data sets.

Author information

Authors and Affiliations

Institute of Computing Science, Poznań University of Technology, Poznań, Poland
Tadeusz Morzy & Maciej Zakrzewicz

Editor information

Editors and Affiliations

Institute of Bioorganic Chemistry, Polish Academy of Sciences, ul. Noskowskiego 12, 61-704, Poznań, Poland
Jacek Błażewicz
Faculty of Business Administration, Memorial University of Newfoundland, NF A1B 3X5, St. John’s, Canada
Wieslaw Kubiak
Institute of Computing Science, Poznań University of Technology, ul. Piotrowo 3a, 60-965, Poznań, Poland
Tadeusz Morzy
Information and Computer Science Laboratory, Telcordia Technologies, 445 South Street MCC-1J346B, 07960, Morristown, NJ, USA
Marek Rusinkiewicz

About this chapter

Cite this chapter

Morzy, T., Zakrzewicz, M. (2003). Data Mining. In: Błażewicz, J., Kubiak, W., Morzy, T., Rusinkiewicz, M. (eds) Handbook on Data Management in Information Systems. International Handbooks on Information Systems. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24742-5_11

DOI: https://doi.org/10.1007/978-3-540-24742-5_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-53441-6
Online ISBN: 978-3-540-24742-5
eBook Packages: Springer Book Archive