Web as a Corpus: Going Beyond the n-gram

  • Preslav Nakov
Chapter
Part of the Communications in Computer and Information Science book series (CCIS, volume 505)

Abstract

The 60-year-old dream of computational linguistics is to make computers capable of communicating with humans in natural language. This has proven hard, and thus research has focused on sub-problems. Even so, the field was stuck with manual rules until the early 90s, when computers became powerful enough to enable the rise of statistical approaches. Eventually, this shifted the main research attention to machine learning from text corpora, thus triggering a revolution in the field.

Today, the Web is the biggest available corpus, providing access to quadrillions of words; and, in corpus-based natural language processing, size does matter. Unfortunately, while there has been substantial research on the Web as a corpus, it has typically been restricted to using page hit counts as an estimate for n-gram word frequencies; this has led some researchers to conclude that the Web should only be used as a baseline. We show that much better results are possible for structural ambiguity problems when going beyond the n-gram.

Keywords

Web as a corpus; Surface features; Paraphrases; Noun compound bracketing; Prepositional phrase attachment; Noun phrase coordination; Syntactic parsing

Notes

Acknowledgements

This research was supported by NSF DBI-0317510, and a gift from Genentech.

Copyright information

© Springer International Publishing Switzerland 2015

Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Authors and Affiliations

  1. Qatar Computing Research Institute, Doha, Qatar
