Conquering Language: Using NLP on a Massive Scale to Build High Dimensional Language Models from the Web

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2007)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4394)

Abstract

Dictionaries only contain some of the information we need to know about a language. The growth of the Web, the maturation of linguistic processing tools, and the decline in price of memory storage allow us to envision descriptions of languages that are much larger than before. We can conceive of building a complete language model for a language using all the text that is found on the Web for this language. This article describes our current project to do just that.

Editor information

Alexander Gelbukh

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Grefenstette, G. (2007). Conquering Language: Using NLP on a Massive Scale to Build High Dimensional Language Models from the Web. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2007. Lecture Notes in Computer Science, vol 4394. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70939-8_4

  • DOI: https://doi.org/10.1007/978-3-540-70939-8_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-70938-1

  • Online ISBN: 978-3-540-70939-8

  • eBook Packages: Computer Science, Computer Science (R0)
