Conquering Language: Using NLP on a Massive Scale to Build High Dimensional Language Models from the Web

Grefenstette, Gregory

doi:10.1007/978-3-540-70939-8_4

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4394)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

Dictionaries only contain some of the information we need to know about a language. The growth of the Web, the maturation of linguistic processing tools, and the decline in price of memory storage allow us to envision descriptions of languages that are much larger than before. We can conceive of building a complete language model for a language using all the text that is found on the Web for this language. This article describes our current project to do just that.

Author information

Authors and Affiliations

Commissariat à l’Energie Atomique, CEA LIST, SRCI, BP 6, 92265 Fontenay aux Roses Cedex, France
Gregory Grefenstette

Editor information

Alexander Gelbukh

About this paper

Cite this paper

Grefenstette, G. (2007). Conquering Language: Using NLP on a Massive Scale to Build High Dimensional Language Models from the Web. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2007. Lecture Notes in Computer Science, vol 4394. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70939-8_4

DOI: https://doi.org/10.1007/978-3-540-70939-8_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70938-1
Online ISBN: 978-3-540-70939-8
eBook Packages: Computer Science, Computer Science (R0)
