Collecting Comparable Corpora

  • Monica Lestari Paramita
  • Ahmet Aker
  • Paul Clough
  • Robert Gaizauskas
  • Nikos Glaros
  • Nikos Mastropavlos
  • Olga Yannoutsou
  • Radu Ion
  • Dan Ștefănescu
  • Alexandru Ceauşu
  • Dan Tufiș
  • Judita Preiss
Chapter
Part of the Theory and Applications of Natural Language Processing book series (NLP)

Abstract

Parallel corpora are scarce, especially for under-resourced languages and narrow domains. In contrast, the number of comparable documents in these areas that are freely available on the Web is continuously increasing. Algorithmic approaches for identifying such documents on the Web are therefore needed to build comparable corpora automatically for these languages and domains. How can comparable documents be identified? Which approaches should be used to collect them from different Web sources? In this chapter, we first review previous techniques for collecting comparable documents from the Web. We then describe in detail three new techniques for gathering comparable documents from three types of Web source: Wikipedia, news articles, and narrow domains.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Monica Lestari Paramita (1)
  • Ahmet Aker (1)
  • Paul Clough (1)
  • Robert Gaizauskas (1)
  • Nikos Glaros (2)
  • Nikos Mastropavlos (2)
  • Olga Yannoutsou (2)
  • Radu Ion (3)
  • Dan Ștefănescu (3)
  • Alexandru Ceauşu (3)
  • Dan Tufiș (3)
  • Judita Preiss (1)

  1. University of Sheffield, Sheffield, UK
  2. Institute for Language and Speech Processing (ILSP), Athens, Greece
  3. Research Institute for Artificial Intelligence, Romanian Academy Center for Artificial Intelligence (RACAI), Bucharest, Romania