Abstract
Dictionaries contain only some of the information we need about a language. The growth of the Web, the maturation of linguistic processing tools, and the falling price of storage allow us to envision descriptions of languages that are much larger than before. We can conceive of building a complete model of a language using all the text found on the Web in that language. This article describes our current project to do just that.
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Grefenstette, G. (2007). Conquering Language: Using NLP on a Massive Scale to Build High Dimensional Language Models from the Web. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2007. Lecture Notes in Computer Science, vol 4394. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70939-8_4
DOI: https://doi.org/10.1007/978-3-540-70939-8_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70938-1
Online ISBN: 978-3-540-70939-8
eBook Packages: Computer Science (R0)