
A Scalable and Distributed NLP Architecture for Web Document Annotation

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4139)

Abstract

In the context of the ALVIS project, which aims at integrating linguistic information into topic-specific search engines, we develop an NLP architecture to linguistically annotate large collections of web documents. This context leads us to address the scalability of Natural Language Processing. The platform can be viewed as a framework that uses existing NLP tools. We focus on the efficiency of the platform by distributing linguistic processing across several machines. We carry out an experiment on 55,329 web documents focusing on biology. This 79-million-word collection of web documents was processed in 3 days on 16 computers.
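The figures reported in the abstract can be turned into rough throughput estimates. The sketch below is only back-of-the-envelope arithmetic: it assumes the 3 days are continuous wall-clock time and that the load was spread evenly over the 16 computers, neither of which the abstract states. The function and key names are illustrative, not part of the ALVIS platform.

```python
# Figures taken from the abstract.
DOCS = 55_329          # web documents in the biology collection
WORDS = 79_000_000     # total words in the collection
MACHINES = 16          # computers used for the experiment
DAYS = 3               # reported processing time


def throughput(docs, words, machines, days):
    """Estimate average throughput, assuming continuous wall-clock time
    and an even split of the work across machines (both assumptions)."""
    hours = days * 24
    seconds = hours * 3600
    return {
        "words_per_doc": words / docs,
        "words_per_sec_total": words / seconds,
        "words_per_sec_per_machine": words / seconds / machines,
        "docs_per_hour_per_machine": docs / hours / machines,
    }


stats = throughput(DOCS, WORDS, MACHINES, DAYS)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions, the collection averages roughly 1,400 words per document, and each machine processes on the order of 19 words per second, or about 48 documents per hour.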




Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Deriviere, J., Hamon, T., Nazarenko, A. (2006). A Scalable and Distributed NLP Architecture for Web Document Annotation. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds) Advances in Natural Language Processing. FinTAL 2006. Lecture Notes in Computer Science, vol 4139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11816508_8


  • DOI: https://doi.org/10.1007/11816508_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-37334-6

  • Online ISBN: 978-3-540-37336-0

  • eBook Packages: Computer Science, Computer Science (R0)
