A Scalable and Distributed NLP Architecture for Web Document Annotation

Deriviere, Julien; Hamon, Thierry; Nazarenko, Adeline

doi:10.1007/11816508_8

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4139)

Abstract

In the context of the ALVIS project, which aims at integrating linguistic information into topic-specific search engines, we develop an NLP architecture to linguistically annotate large collections of web documents. This context requires us to address the scalability of Natural Language Processing. The platform can be viewed as a framework that uses existing NLP tools. We focus on the efficiency of the platform by distributing linguistic processing across several machines. We carry out an experiment on 55,329 web documents focusing on biology. This 79-million-word collection of web documents was processed in 3 days on 16 computers.
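
The figures in the abstract can be turned into a rough throughput estimate. The sketch below is only back-of-the-envelope arithmetic, not a measurement from the paper: it assumes the 3 days are full 24-hour days of wall-clock time, that all 16 machines were busy for the whole run, and that the load was spread evenly. The function name `throughput` and the dictionary keys are illustrative choices.

```python
# Figures quoted in the abstract.
DOCUMENTS = 55_329
WORDS = 79_000_000   # "79 million" is a rounded figure in the source
DAYS = 3             # assumption: three full 24-hour days of wall-clock time
MACHINES = 16        # assumption: all machines busy for the whole run


def throughput(words, documents, days, machines):
    """Return rough per-document and per-machine rates for a batch run."""
    seconds = days * 24 * 3600
    return {
        "words_per_document": words / documents,
        "documents_per_machine": documents / machines,
        "words_per_second_overall": words / seconds,
        "words_per_second_per_machine": words / (seconds * machines),
    }


stats = throughput(WORDS, DOCUMENTS, DAYS, MACHINES)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions the collection averages roughly 1,400 words per document, and each machine handles on the order of 19 words per second. That low per-machine rate suggests why the abstract treats distribution across machines as the route to scalability.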

Author information

Authors and Affiliations

LIPN – UMR CNRS 7030, 99 av. J.B. Clément, F-93430, Villetaneuse, France
Julien Deriviere, Thierry Hamon & Adeline Nazarenko

Editor information

Editors and Affiliations

Turku Centre for Computer Science (TUCS), Department of Information Technology, University of Turku, Joukahaisenkatu 3-5 B, FIN-20520, Turku, Finland
Tapio Salakoski
Turku Centre for Computer Science (TUCS) and Department of IT, University of Turku, Lemminkäisenkatu 14 A, 20520, Turku, Finland
Filip Ginter & Sampo Pyysalo
Department of Information Technology, University of Turku, Lemminkäisenkatu 14–18 A, FIN-20520, Turku, Finland
Tapio Pahikkala

About this paper

Cite this paper

Deriviere, J., Hamon, T., Nazarenko, A. (2006). A Scalable and Distributed NLP Architecture for Web Document Annotation. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds) Advances in Natural Language Processing. FinTAL 2006. Lecture Notes in Computer Science, vol 4139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11816508_8

DOI: https://doi.org/10.1007/11816508_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37334-6
Online ISBN: 978-3-540-37336-0
eBook Packages: Computer Science, Computer Science (R0)
