Optimization of Cost Sensitive Models to Improve Prediction of Molecular Functions

García-López, Sebastián; Jaramillo-Garzón, Jorge Alberto; Castellanos-Dominguez, German

doi:10.1007/978-3-662-44485-6_15

Sebastián García-López⁸, Jorge Alberto Jaramillo-Garzón⁸˒⁹ & German Castellanos-Dominguez⁸

Conference paper
First Online: 01 January 2014

Part of the book series: Communications in Computer and Information Science (CCIS, volume 452)

Abstract

The prediction of unknown protein functions is one of the main concerns in the field of computational biology. This is reflected specifically in the prediction of molecular functions such as catalytic and binding activities. Together with the massive amount of available information, it has made tools based on machine learning techniques increasingly popular in recent years. However, these tools face several problems associated with the data they handle, one of which is learning under a large imbalance between categories. Several techniques exist to overcome class imbalance, but most of them have weaknesses that make it difficult to obtain reliable results. Models based on cost-sensitive learning seem to be a good choice for dealing with imbalanced data, yet obtaining an optimal cost matrix remains an open issue. In this paper, a methodology to calculate an optimal cost matrix for models based on cost-sensitive learning is proposed. The results show the superiority of this approach over several state-of-the-art techniques for handling class imbalance. Tests were carried out on the prediction of molecular functions in Embryophyta plants.
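To make the role of the cost matrix concrete, the following is a minimal sketch of cost-sensitive classification for a two-class problem. It is not the methodology proposed in the paper, which the abstract does not detail. It assumes the common convention that `cost[i][j]` is the cost of predicting class `i` when the true class is `j`, assumes the geometric mean of sensitivity and specificity as the selection criterion, and uses a plain grid search over the false-negative cost as a stand-in for whatever optimizer the authors employ. All function names and the toy data are hypothetical.

```python
from math import sqrt


def min_expected_cost_label(p_pos, cost):
    """Return the label (0 or 1) with the lower expected cost.

    p_pos is the estimated probability of the positive (minority) class.
    cost[i][j] is the assumed cost of predicting i when the truth is j.
    Ties are resolved in favour of label 0.
    """
    probs = (1.0 - p_pos, p_pos)
    expected = [sum(cost[i][j] * probs[j] for j in (0, 1)) for i in (0, 1)]
    return 0 if expected[0] <= expected[1] else 1


def geometric_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        return 0.0
    return sqrt((tp / pos) * (tn / neg))


def search_fn_cost(y_true, p_pos, candidates):
    """Grid search (illustrative stand-in) over the false-negative cost.

    The false-positive cost is fixed to 1 and correct decisions cost 0.
    Returns the best candidate and its geometric-mean score.
    """
    best_cost, best_score = None, -1.0
    for c_fn in candidates:
        cost = [[0.0, c_fn], [1.0, 0.0]]
        y_pred = [min_expected_cost_label(p, cost) for p in p_pos]
        score = geometric_mean(y_true, y_pred)
        if score > best_score:
            best_cost, best_score = c_fn, score
    return best_cost, best_score


# Hypothetical imbalanced validation set: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
p_pos = [0.05, 0.10, 0.12, 0.15, 0.22, 0.25, 0.30, 0.45, 0.35, 0.60]

best_cost, best_score = search_fn_cost(y_true, p_pos, [1.0, 2.0, 4.0, 8.0])
print(best_cost, round(best_score, 3))
```

With a false-positive cost of 1, raising the false-negative cost lowers the probability threshold above which a sample is labeled positive, so the search trades specificity for sensitivity. A grid over a single cost is the simplest possible search; a full cost matrix for more classes would need a search over several entries.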

Acknowledgements

This work is partially funded by the Research Office (DIMA) of the Universidad Nacional de Colombia at Manizales and by the Colombian National Research Centre (COLCIENCIAS) through grant No. 111952128388 and the “jovenes investigadores e innovadores - 2010 Virginia Gutierrez de Pineda” fellowship.

Author information

Authors and Affiliations

Signal Processing and Recognition Group, Universidad Nacional de Colombia, Campus la Nubia, Km 7 vía al Magdalena, Manizales, Colombia
Sebastián García-López, Jorge Alberto Jaramillo-Garzón & German Castellanos-Dominguez
Grupo de Automática y Electrónica, Instituto Tecnológico Metropolitano, Cll 54A No 30-01, Medellín, Colombia
Jorge Alberto Jaramillo-Garzón

Corresponding author

Correspondence to Jorge Alberto Jaramillo-Garzón.

Editor information

Editors and Affiliations

Universitat Politècnica de Catalunya, Barcelona, Spain
Mireya Fernández-Chimeno
Instituto Gulbenkian de Ciência, Oeiras, Portugal
Pedro L. Fernandes
Boston College, Chestnut Hill, Massachusetts, USA
Sergio Alvarez
University of Guelph, Guelph, Canada
Deborah Stacey
University of Vic, Vic, Spain
Jordi Solé-Casals
Technical University of Lisbon, Lisbon, Portugal
Ana Fred
New University of Lisbon, Lisboa, Portugal
Hugo Gamboa

Cite this paper

García-López, S., Jaramillo-Garzón, J.A., Castellanos-Dominguez, G. (2014). Optimization of Cost Sensitive Models to Improve Prediction of Molecular Functions. In: Fernández-Chimeno, M., et al. Biomedical Engineering Systems and Technologies. BIOSTEC 2013. Communications in Computer and Information Science, vol 452. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44485-6_15

DOI: https://doi.org/10.1007/978-3-662-44485-6_15
Published: 02 November 2014
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44484-9
Online ISBN: 978-3-662-44485-6
eBook Packages: Computer Science, Computer Science (R0)
