Improving Rule Induction Precision for Automated Annotation by Balancing Skewed Data Sets

Batista, Gustavo E. A. P. A.; Monard, Maria C.; Bazzan, Ana L. C.

doi:10.1007/978-3-540-30478-4_3

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3303)

Included in the following conference series:

International Symposium on Knowledge Exploration in Life Science Informatics

Abstract

Submissions to genomic databases are increasing at an overwhelming rate, which poses a problem for database maintenance, especially for the annotation of fields left blank during submission. Rather than including all data as submitted, one alternative is to perform the annotation manually. A less resource-demanding alternative is automatic annotation. The latter helps the curator, since manually predicting the properties of each protein sequence is becoming a bottleneck, at least for protein databases. Machine Learning (ML) techniques have been used to generate automatic annotation and to help curators. A challenging problem for automatic annotation is that traditional ML algorithms assume a balanced training set. However, real-world data sets are predominantly imbalanced (skewed), i.e., there is a large number of examples of one class compared with just a few examples of the other class. This is the case for protein databases, where a large number of proteins are not annotated for every feature. In this work we discuss some over- and under-sampling techniques that deal with class imbalance. We also propose a new method that combines two known over- and under-sampling methods. Experimental results show that the symbolic classifiers induced by C4.5 on data sets balanced with the known over- and under-sampling methods, as well as with the newly proposed method, are always more accurate than those induced from the original imbalanced data sets. This is therefore a step towards producing more accurate rules for automating annotation.
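
To make the idea of balancing a skewed training set concrete, the following is a minimal sketch of the two simplest strategies: random over-sampling (duplicating minority-class examples) and random under-sampling (discarding majority-class examples). It is an illustration only. The abstract does not name the specific methods evaluated in the chapter, so these random variants, the function names `random_oversample` and `random_undersample`, and the toy class labels are all assumptions made for this example.

```python
import random
from collections import Counter


def random_oversample(examples, labels, seed=0):
    """Duplicate randomly chosen examples of smaller classes
    until every class is as large as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(examples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(examples, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_x.extend(extra)
        out_y.extend([cls] * len(extra))
    return out_x, out_y


def random_undersample(examples, labels, seed=0):
    """Keep a random subset of each class so that every class
    is as small as the smallest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_x, out_y = [], []
    for cls in counts:
        pool = [x for x, y in zip(examples, labels) if y == cls]
        kept = rng.sample(pool, target)  # sampling without replacement
        out_x.extend(kept)
        out_y.extend([cls] * target)
    return out_x, out_y


# Hypothetical toy data: 5 annotated proteins versus 95 unannotated ones.
X = [[i] for i in range(100)]
y = ["annotated"] * 5 + ["not_annotated"] * 95

ox, oy = random_oversample(X, y)
ux, uy = random_undersample(X, y)
print(Counter(oy))
print(Counter(uy))
```

Either balanced set could then be passed to a rule or tree learner in place of the original data. Over-sampling keeps every original example but repeats minority examples, while under-sampling discards majority examples, so the two trade off overfitting risk against information loss.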

Author information

Authors and Affiliations

Instituto de Ciências Matemáticas e de Computação, USP, Caixa Postal 668, 13560–970, São Carlos, SP, Brazil
Gustavo E. A. P. A. Batista & Maria C. Monard
Instituto de Informática, UFRGS, Caixa Postal 15064, 91501–970, Porto Alegre, RS, Brazil
Ana L. C. Bazzan

Editor information

Editors and Affiliations

Faculty of Sciences, School of Mathematics and Computing, University of Southern Queensland, 4350, Toowoomba, QLD, Australia
Jesús A. López
Istituto di Ricerche Farmacologiche “Mario Negri”, Via Eritrea 62, 20157, Milano, Italy
Emilio Benfenati
School of Biomedical Sciences, Bioinformatics Research Group, University of Ulster, Cromore Road, BT52 1SA, Coleraine, Northern Ireland, UK
Werner Dubitzky

About this paper

Cite this paper

Batista, G.E.A.P.A., Monard, M.C., Bazzan, A.L.C. (2004). Improving Rule Induction Precision for Automated Annotation by Balancing Skewed Data Sets. In: López, J.A., Benfenati, E., Dubitzky, W. (eds) Knowledge Exploration in Life Science Informatics. KELSI 2004. Lecture Notes in Computer Science, vol 3303. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30478-4_3

DOI: https://doi.org/10.1007/978-3-540-30478-4_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23927-7
Online ISBN: 978-3-540-30478-4
eBook Packages: Springer Book Archive
