Abstract
Handling numerical data stored in a relational database differs from handling numerical data stored in a single table, because an individual may occur multiple times in a non-target table and because relations between tables may be non-determinate. Most traditional data mining methods deal only with a single table and discretize columns that contain continuous numbers into nominal values. In a relational database, multiple records with numerical attributes are stored separately from the target table, and these records are usually associated with a single structured individual stored in the target table. Numbers in multi-relational data mining (MRDM) are often discretized, after considering the schema of the relational database, in order to reduce the continuous domains to more manageable symbolic domains of low cardinality, and the resulting loss of precision is assumed to be acceptable. In this paper, we consider different alternatives for dealing with continuous attributes in MRDM. The discretization procedures considered include both algorithms that do not depend on the multi-relational structure of the data and algorithms that are sensitive to this structure. In our experiments, we study the effect of taking one-to-many associations into account when discretizing continuous numbers. We implement a new discretization method, called the entropy-instance-based discretization method, and evaluate it with respect to C4.5 on three variants of a well-known multi-relational database (Mutagenesis), in which numeric attributes play an important role. The empirical results show that entropy-based discretization can be improved by taking the multiple-instance problem into account.
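To make the one-to-many setting concrete, the following is a minimal sketch of entropy-based cut-point selection over records from a non-target table, where each record inherits the class of its individual in the target table. It is not the entropy-instance-based method of the paper, whose details the abstract does not give. The table contents, the function names `weighted_entropy` and `best_cut`, and the per-individual weighting (each individual contributes a total weight of 1 regardless of how many records it owns) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def weighted_entropy(labels, weights):
    """Class entropy (in bits) of a weighted collection of labels."""
    totals = defaultdict(float)
    for label, w in zip(labels, weights):
        totals[label] += w
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return -sum((t / total) * log2(t / total) for t in totals.values() if t > 0)


def best_cut(rows, weights):
    """Return (cut, entropy) for the single binary cut that minimises the
    weighted class entropy of the two resulting intervals.

    rows    -- list of (value, label) pairs
    weights -- one non-negative weight per row
    """
    order = sorted(range(len(rows)), key=lambda i: rows[i][0])
    values = [rows[i][0] for i in order]
    labels = [rows[i][1] for i in order]
    w = [weights[i] for i in order]
    total = sum(w)
    best = (None, float("inf"))
    for k in range(1, len(values)):
        if values[k] == values[k - 1]:
            continue  # a cut is only possible between distinct values
        left_w = sum(w[:k])
        score = (left_w / total) * weighted_entropy(labels[:k], w[:k]) \
            + ((total - left_w) / total) * weighted_entropy(labels[k:], w[k:])
        if score < best[1]:
            best = ((values[k - 1] + values[k]) / 2, score)
    return best


# Hypothetical target table: one class label per individual.
target = {"a": "pos", "b": "neg", "c": "neg"}

# Hypothetical non-target table: several numeric records per individual.
records = [("a", 1.0), ("a", 2.0), ("a", 3.0), ("b", 6.0), ("c", 7.0), ("c", 8.0)]

# Each record inherits the class label of the individual it belongs to.
rows = [(value, target[ind]) for ind, value in records]

# Structure-blind weighting: every record counts once.
record_weights = [1.0] * len(records)

# Structure-aware weighting (assumption): every individual counts once,
# so its weight is shared equally among its records.
counts = Counter(ind for ind, _ in records)
individual_weights = [1.0 / counts[ind] for ind, _ in records]

print(best_cut(rows, record_weights))
print(best_cut(rows, individual_weights))
```

On this toy data the two weightings select the same cut because the classes are perfectly separable. On data where one individual owns many more records than the others, the per-record weighting lets that individual dominate the entropy estimate, which is the kind of effect that a structure-sensitive discretization is meant to control.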
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Alfred, R., Kazakov, D. (2007). Discretization Numbers for Multiple-Instances Problem in Relational Database. In: Ioannidis, Y., Novikov, B., Rachev, B. (eds) Advances in Databases and Information Systems. ADBIS 2007. Lecture Notes in Computer Science, vol 4690. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75185-4_6
DOI: https://doi.org/10.1007/978-3-540-75185-4_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-75184-7
Online ISBN: 978-3-540-75185-4
eBook Packages: Computer Science (R0)