Abstract
Automated analysis of full-text life science research articles and technical documents is becoming increasingly important. Accessing and processing full text is considerably more complex than working with abstracts. GetItFull is a tool for downloading and pre-processing full-text journal articles. It automatically connects to a journal's Web site, downloads the journal content, and performs several commonly used pre-processing steps. The output is a structured XML document for each article, with tags identifying the article's sections and the journal information. This output may then be used as the basis for text mining applications or exported to a database for further processing.
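To make the described output concrete, the following is a minimal sketch of what one structured XML document per article might look like. The abstract does not give GetItFull's actual schema, so the element names (`article`, `journal`, `section`), the `title` attribute, the helper `build_article_xml`, and the sample values are all hypothetical and serve only as illustration.

```python
import xml.etree.ElementTree as ET


def build_article_xml(journal_info, sections):
    """Wrap journal metadata and section texts in one <article> element.

    All tag and attribute names here are assumptions, not GetItFull's schema.
    """
    article = ET.Element("article")

    # Journal information: one child element per metadata field.
    journal = ET.SubElement(article, "journal")
    for field, value in journal_info.items():
        ET.SubElement(journal, field).text = value

    # One <section> element per article section, labelled by its heading.
    for title, text in sections:
        section = ET.SubElement(article, "section", {"title": title})
        section.text = text

    return article


# Hypothetical sample data standing in for a downloaded article.
article = build_article_xml(
    {"name": "Example Journal", "volume": "1", "year": "2006"},
    [
        ("Introduction", "Full text is harder to process than abstracts."),
        ("Methods", "Articles were downloaded and split into sections."),
    ],
)
xml_text = ET.tostring(article, encoding="unicode")
print(xml_text)
```

A document in this shape can be parsed again with `ET.fromstring` and queried by section title, which is the kind of access a downstream text mining step or database export would need.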
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Natarajan, J. et al. (2006). GetItFull – A Tool for Downloading and Pre-processing Full-Text Journal Articles. In: Bremer, E.G., Hakenberg, J., Han, E.-H., Berrar, D., Dubitzky, W. (eds) Knowledge Discovery in Life Science Literature. KDLL 2006. Lecture Notes in Computer Science, vol 3886. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11683568_12
DOI: https://doi.org/10.1007/11683568_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32809-4
Online ISBN: 978-3-540-32810-0
eBook Packages: Computer Science (R0)