GetItFull – A Tool for Downloading and Pre-processing Full-Text Journal Articles

Natarajan, Jeyakumar; Haines, Cliff; Berglund, Brian; DeSesa, Catherine; Hack, Catherine J.; Dubitzky, Werner; Bremer, Eric G.

doi:10.1007/11683568_12

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 3886)

Included in the following conference series:

International Workshop on Knowledge Discovery in Life Science Literature

Abstract

Automated analysis of full-text life science research articles and technical documents is becoming increasingly important. In contrast to abstracts, accessing and processing full-text is considerably more complex. GetItFull is a tool for downloading and pre-processing full-text journal articles. GetItFull automatically connects to a journal’s Web site, downloads the journal content and performs various commonly used pre-processing steps. The output comprises a structured XML document for each article with tags identifying the various sections and journal information. The output may then be used as the basis for text mining applications or exported to a database for further processing.
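
The kind of output the abstract describes, one structured XML document per article with tags for the sections and the journal information, can be illustrated with a minimal sketch. It is not GetItFull's actual schema or code: the element names (article, journal, section), the helper build_article_xml, and the sample values are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def build_article_xml(journal_info, sections):
    """Return an XML string with journal metadata and one element per section.

    journal_info: dict of hypothetical metadata fields (e.g. title, volume).
    sections: list of (section_name, section_text) pairs in document order.
    """
    article = ET.Element("article")

    # Journal information goes into its own container element.
    journal = ET.SubElement(article, "journal")
    for key, value in journal_info.items():
        ET.SubElement(journal, key).text = value

    # Each article section becomes a tagged element, so later text mining
    # steps can select, say, only the methods or results text.
    for name, text in sections:
        section = ET.SubElement(article, "section", name=name)
        section.text = text

    return ET.tostring(article, encoding="unicode")


# Illustrative input only; these values are not taken from the paper.
doc = build_article_xml(
    {"title": "Example Journal", "volume": "1", "year": "2006"},
    [("abstract", "Short summary."), ("methods", "What was done.")],
)
print(doc)
```

A document in this shape can be parsed again with any standard XML library, which is what makes it a convenient starting point for text mining or for export to a database.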

Author information

Authors and Affiliations

Bioinformatics Research Group, University of Ulster, UK
Jeyakumar Natarajan, Catherine J. Hack & Werner Dubitzky
CTH Technologies, Inc., Oak Brook Terrace, IL, USA; SPSS, Inc., Chicago, IL, USA
Cliff Haines & Brian Berglund
Brain Tumor Research Program, Children’s Memorial Hospital, and Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
Catherine DeSesa & Eric G. Bremer

Editor information

Editors and Affiliations

Brain Tumor Research Program, Children’s Memorial Hospital, and Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
Eric G. Bremer
Computer Science Department, Knowledge Management in Bioinformatics, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099, Berlin, Germany
Jörg Hakenberg
iXmatch Inc., 5555 West 78th Street Suite E, 55439-2702, Minneapolis, MN, USA
Eui-Hong (Sam) Han
School of Biomedical Sciences, University of Ulster, Cromore Road, BT52 1SA, Coleraine, Northern Ireland, UK
Daniel Berrar
School of Biomedical Sciences, Bioinformatics Research Group, University of Ulster, Cromore Road, BT52 1SA, Coleraine, Northern Ireland, UK
Werner Dubitzky

Cite this paper

Natarajan, J. et al. (2006). GetItFull – A Tool for Downloading and Pre-processing Full-Text Journal Articles. In: Bremer, E.G., Hakenberg, J., Han, E.-H. (Sam), Berrar, D., Dubitzky, W. (eds) Knowledge Discovery in Life Science Literature. KDLL 2006. Lecture Notes in Computer Science, vol 3886. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11683568_12

DOI: https://doi.org/10.1007/11683568_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32809-4
Online ISBN: 978-3-540-32810-0
eBook Packages: Computer Science, Computer Science (R0)
