Abstract
In the context of the ALVIS project, which aims to integrate linguistic information into topic-specific search engines, we develop an NLP architecture to linguistically annotate large collections of web documents. This context forces us to address the scalability of Natural Language Processing. The platform can be viewed as a framework that integrates existing NLP tools. We focus on the efficiency of the platform, which distributes linguistic processing across several machines. We carry out an experiment on a collection of 55,329 biology-focused web documents. This 79-million-word collection was processed in 3 days on 16 computers.
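To put the reported figures in perspective, the average throughput they imply can be worked out with simple arithmetic. The sketch below is not from the paper: it assumes the load was spread evenly over the 16 machines and that "3 days" means three full 24-hour days, and the function name `throughput` is purely illustrative. Because "79 million" and "3 days" are rounded in the abstract, the results are rough orders of magnitude only.

```python
# Figures taken from the abstract; "79 million" and "3 days" are rounded,
# so every derived number below is an approximation.
DOCUMENTS = 55_329
WORDS = 79_000_000
DAYS = 3
MACHINES = 16


def throughput(documents, words, days, machines):
    """Return average rates, assuming an even load across all machines
    and full 24-hour days (both are assumptions, not stated in the paper)."""
    machine_days = days * machines
    machine_seconds = machine_days * 24 * 60 * 60
    return {
        "words_per_document": words / documents,
        "documents_per_machine_day": documents / machine_days,
        "words_per_machine_day": words / machine_days,
        "words_per_machine_second": words / machine_seconds,
    }


if __name__ == "__main__":
    for name, value in throughput(DOCUMENTS, WORDS, DAYS, MACHINES).items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions, each machine handles on the order of a thousand documents per day, which gives a sense of the per-node cost of the full annotation pipeline.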
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Deriviere, J., Hamon, T., Nazarenko, A. (2006). A Scalable and Distributed NLP Architecture for Web Document Annotation. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds) Advances in Natural Language Processing. FinTAL 2006. Lecture Notes in Computer Science, vol 4139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11816508_8
DOI: https://doi.org/10.1007/11816508_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37334-6
Online ISBN: 978-3-540-37336-0
eBook Packages: Computer Science, Computer Science (R0)