Abstract
Attribute noise can affect classification learning. Previous work on handling attribute noise has focused on predictable attributes, that is, attributes whose values can be predicted from the class and the other attributes. However, attributes can often be predictive but unpredictable. Being predictive, they are essential to classification learning, so it is important to handle their noise. Being unpredictable, they require strategies different from those used for predictable attributes. This paper presents a study on identifying, cleansing and measuring noise for predictive-but-unpredictable attributes, and proposes new strategies accordingly. Both theoretical analysis and empirical evidence suggest that these strategies are more effective and more efficient than previous alternatives.
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yang, Y., Wu, X., Zhu, X. (2004). Dealing with Predictive-but-Unpredictable Attributes in Noisy Data Sources. In: Boulicaut, JF., Esposito, F., Giannotti, F., Pedreschi, D. (eds) Knowledge Discovery in Databases: PKDD 2004. Lecture Notes in Computer Science, vol 3202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30116-5_43
DOI: https://doi.org/10.1007/978-3-540-30116-5_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23108-0
Online ISBN: 978-3-540-30116-5
eBook Packages: Springer Book Archive