Automating the Extraction of Essential Genes from Literature

Rodrigues, Ruben; Costa, Hugo; Rocha, Miguel

doi:10.1007/978-3-319-95786-9_6

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10933)

Included in the following conference series:

Industrial Conference on Data Mining

Abstract

Building repositories of curated information on gene essentiality for organisms of interest in biotechnology is an important task, mainly for the design of cell factories for the enhanced production of added-value products. However, it requires retrieving and extracting relevant information from the literature, which makes manual curation costly. Text mining tools that implement methods for tasks such as information retrieval, named entity recognition and event extraction have been developed to automate this process and reduce the time required to obtain relevant information from the literature in many biomedical fields. However, current tools are not designed or optimized for identifying mentions of essential genes in scientific texts.

In this work, we propose a pipeline to automatically extract mentions of genes and classify them according to their essentiality for a specific organism. The pipeline implements a machine learning approach trained on a manually curated set of documents related to gene essentiality in yeast. This corpus is provided as a resource for the community, as a benchmark for the development of new methods. Our pipeline was evaluated by resampling and cross-validation over this curated dataset, achieving an accuracy of over 80% and an F1-score of over 75%.
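
To make the reported metrics concrete, the following minimal Python sketch shows how accuracy and F1-score could be averaged over k-fold cross-validation for a binary essential/non-essential labelling task. It is not the pipeline's implementation: the function names, the majority-class baseline and the label strings are hypothetical, the folds are plain contiguous splits, and the resampling step mentioned above is omitted.

```python
from collections import Counter
from typing import Callable, Iterator, List, Sequence, Tuple

POSITIVE = "essential"  # hypothetical label name for the positive class


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: Sequence[str], y_pred: Sequence[str],
             positive: str = POSITIVE) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correctly found
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def k_fold_splits(n: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def majority_baseline(train_x: Sequence, train_y: Sequence[str]) -> Callable:
    """Hypothetical stand-in classifier: always predicts the most frequent label."""
    most_common = Counter(train_y).most_common(1)[0][0]
    return lambda x: most_common


def cross_validate(xs: Sequence, ys: Sequence[str],
                   fit: Callable, k: int = 5) -> Tuple[float, float]:
    """Return (mean accuracy, mean F1) over k folds."""
    accs, f1s = [], []
    for train_idx, test_idx in k_fold_splits(len(xs), k):
        predict = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        y_true = [ys[i] for i in test_idx]
        y_pred = [predict(xs[i]) for i in test_idx]
        accs.append(accuracy(y_true, y_pred))
        f1s.append(f1_score(y_true, y_pred))
    return sum(accs) / k, sum(f1s) / k
```

Any trained classifier can replace `majority_baseline`, provided its `fit` function returns a callable that maps one example to a label.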

Acknowledgments

This work is co-funded by the North Portugal Regional Operational Programme, under “Portugal 2020”, through the European Regional Development Fund (ERDF), within project SISBI (Ref. NORTE-01-0247-FEDER-003381).

The Centre of Biological Engineering (CEB), University of Minho, sponsored all computational hardware and software required for this work.

Author information

Authors and Affiliations

CEB - Centre of Biological Engineering, University of Minho, Braga, Portugal
Ruben Rodrigues & Miguel Rocha
Silicolife Lda, Braga, Portugal
Hugo Costa

Corresponding author

Correspondence to Miguel Rocha.

Editor information

Editors and Affiliations

Institute of Computer Vision and Applied Computer Sciences, Leipzig, Germany
Petra Perner

Ethics declarations

The authors declare that they have no conflicts of interest regarding this article.

Cite this paper

Rodrigues, R., Costa, H., Rocha, M. (2018). Automating the Extraction of Essential Genes from Literature. In: Perner, P. (ed.) Advances in Data Mining. Applications and Theoretical Aspects. ICDM 2018. Lecture Notes in Computer Science (LNAI), vol 10933. Springer, Cham. https://doi.org/10.1007/978-3-319-95786-9_6

DOI: https://doi.org/10.1007/978-3-319-95786-9_6
Published: 04 July 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-95785-2
Online ISBN: 978-3-319-95786-9
eBook Packages: Computer Science, Computer Science (R0)