
Mixing Textual Data Selection Methods for Improved In-Domain Data Adaptation

Krzysztof Wołk
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 746)

Abstract

The efficient use of machine translation (MT) training data is being transformed by advanced data selection techniques, which extract sentences from broad-domain corpora and adapt them to in-domain MT. In this research, we attempt to improve in-domain data adaptation methodologies. We focus on three techniques for selecting sentences. The first is term frequency–inverse document frequency (TF-IDF), which originated in information retrieval (IR). The second, drawn from the language modeling literature, is a perplexity-based approach. The third applies the Levenshtein distance, a novel criterion in this context, which we discuss herein. We propose an effective combination of the three data selection techniques applied at the corpus level. The results of this study reveal that the individual techniques are not particularly successful in practical applications; however, multilingual resources together with a combination-based IR methodology were found to be an effective approach.
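The paper itself gives no code; as an illustrative sketch only (the function names, the add-one-smoothed unigram language model, and the rank-sum combination are our own assumptions, not the authors' exact method), the three selection criteria named above might be combined like this:

```python
import math
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))      # substitution
        prev = cur
    return prev[-1]

def tfidf_similarity(candidate, in_domain):
    """Cosine similarity between a candidate sentence (token list) and
    the centroid of the in-domain corpus, in TF-IDF space."""
    docs = in_domain + [candidate]
    df = Counter()
    for d in docs:
        df.update(set(d))
    n = len(docs)
    def vec(tokens):
        tf = Counter(tokens)
        return {t: tf[t] * math.log(n / df[t]) for t in tf}
    cand = vec(candidate)
    centroid = Counter()
    for d in in_domain:
        centroid.update(vec(d))
    dot = sum(cand[t] * centroid.get(t, 0.0) for t in cand)
    norm = (math.sqrt(sum(v * v for v in cand.values()))
            * math.sqrt(sum(v * v for v in centroid.values())))
    return dot / norm if norm else 0.0

def perplexity(candidate, in_domain):
    """Perplexity of the candidate under an add-one-smoothed unigram
    language model estimated from the in-domain corpus."""
    counts = Counter(t for d in in_domain for t in d)
    total = sum(counts.values())
    vocab = len(counts) + 1                # +1 for unseen tokens
    logp = sum(math.log((counts[t] + 1) / (total + vocab))
               for t in candidate)
    return math.exp(-logp / len(candidate))

def min_edit_distance(candidate, in_domain):
    """Smallest length-normalised Levenshtein distance from the
    candidate to any in-domain sentence."""
    s = " ".join(candidate)
    return min(levenshtein(s, " ".join(d)) / max(len(s), 1)
               for d in in_domain)

def select(candidates, in_domain, k):
    """Rank-sum combination: each candidate's ranks under TF-IDF
    similarity (high is good), perplexity (low is good) and edit
    distance (low is good) are added; the k best are kept."""
    tf = sorted(candidates, key=lambda c: -tfidf_similarity(c, in_domain))
    pp = sorted(candidates, key=lambda c: perplexity(c, in_domain))
    ed = sorted(candidates, key=lambda c: min_edit_distance(c, in_domain))
    return sorted(candidates,
                  key=lambda c: tf.index(c) + pp.index(c) + ed.index(c))[:k]
```

For example, given a small medical in-domain corpus, a medically worded candidate scores better than an off-topic one under all three criteria and is therefore selected first.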

Keywords

Text domain adaptation · In-domain adaptation · Data filtration · Corpora adaptation · Machine learning

References

  1. Brown, P., Della Pietra, V., Della Pietra, S., Mercer, R.: The mathematics of statistical machine translation: parameter estimation. Comput. Linguist. 19, 263–311 (1993)
  2. Axelrod, A., He, X., Gao, J.: Domain adaptation via pseudo in-domain data selection. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), pp. 355–362. Association for Computational Linguistics, Stroudsburg (2011)
  3. Daumé III, H., Jagarlamudi, J.: Domain adaptation for machine translation by mining unseen words. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pp. 407–412. Association for Computational Linguistics, Stroudsburg (2011)
  4. Koehn, P., Haddow, B.: Towards effective use of training data in statistical machine translation. In: Proceedings of the 7th ACL Workshop on Statistical Machine Translation, pp. 317–321. Association for Computational Linguistics, Stroudsburg (2012)
  5. Civera, J., Juan, A.: Domain adaptation in statistical machine translation with mixture modelling. In: Proceedings of the 2nd ACL Workshop on Statistical Machine Translation, pp. 177–180. Association for Computational Linguistics, Stroudsburg (2007)
  6. Foster, G., Kuhn, R.: Mixture-model adaptation for SMT. In: Proceedings of the 2nd ACL Workshop on Statistical Machine Translation, pp. 128–136. Association for Computational Linguistics, Stroudsburg (2007)
  7. Eidelman, V., Boyd-Graber, J., Resnik, P.: Topic models for dynamic translation model adaptation. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers (ACL 2012), vol. 2, pp. 115–119. Association for Computational Linguistics, Stroudsburg (2012)
  8. Matsoukas, S., Rosti, A., Zhang, B.: Discriminative corpus weight estimation for machine translation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2009), vol. 2, pp. 708–717. Association for Computational Linguistics, Stroudsburg (2009)
  9. Hildebrand, A.S., Eck, M., Vogel, S., Waibel, A.: Adaptation of the translation model for statistical machine translation based on information retrieval. In: Proceedings of the 10th Annual Conference of the European Association for Machine Translation (EAMT 2005), Budapest, Hungary, 30–31 May 2005, pp. 133–142 (2005)
  10. Lü, Y., Huang, J., Liu, Q.: Improving statistical machine translation performance by training data selection and optimization. In: Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pp. 343–350. Association for Computational Linguistics, Stroudsburg (2007)
  11. Lin, S., Tsai, C., Chien, L., Chen, K., Lee, L.: Chinese language model adaptation based on document classification and multiple domain-specific language models. In: Kokkinakis, G., Fakotakis, N., Dermatas, E. (eds.) Proceedings of the 5th European Conference on Speech Communication and Technology, pp. 1463–1466. International Speech Communication Association, Grenoble (1997)
  12. Gao, J., Goodman, J., Li, M., Lee, K.: Toward a unified approach to statistical language modeling for Chinese. ACM Trans. Asian Lang. Inf. Process. 1, 3–33 (2002). https://doi.org/10.1145/595576.595578
  13. Moore, R., Lewis, W.: Intelligent selection of language model training data. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010), pp. 220–224. Association for Computational Linguistics, Stroudsburg (2010)
  14. Koehn, P.: Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In: Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA 2004), pp. 115–124. Springer, Berlin (2004)
  15. Mansour, S., Ney, H.: A simple and effective weighted phrase extraction for machine translation adaptation. In: Proceedings of the 9th International Workshop on Spoken Language Translation (IWSLT 2012), pp. 193–200 (2012)
  16. Papineni, K., Roukos, S., Ward, T., Zhu, W.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), pp. 311–318. Association for Computational Linguistics, Stroudsburg (2002). https://doi.org/10.3115/1073083.1073135
  17. Stolcke, A.: SRILM – an extensible language modeling toolkit. Paper presented at the 7th International Conference on Spoken Language Processing (ICSLP 2002 – INTERSPEECH), Denver, Colorado, USA (2002)
  18. Chen, S., Goodman, J.: An empirical study of smoothing techniques for language modeling. In: Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL 1996), pp. 310–318. Association for Computational Linguistics, Stroudsburg (1996). https://doi.org/10.3115/981863.981904
  19. Wang, L., Wong, D., Chao, L., Lu, Y., Xing, J.: A systematic comparison of data selection criteria for SMT domain adaptation. Sci. World J. 2014, 745485 (2014). https://doi.org/10.1155/2014/745485
  20. Hovy, E.: Toward finely differentiated evaluation metrics for machine translation. Paper presented at the EAGLES Workshop on Standards and Evaluation, Pisa, Italy (1999)
  21. Reeder, F.: Additional MT-eval references. Technical report, International Standards for Language Engineering, Evaluation Working Group (2001)
  22. Oyeka, I.C.A., Ebuh, G.U.: Modified Wilcoxon signed-rank test. Open J. Stat. 2, 172–176 (2012). https://doi.org/10.4236/ojs.2012.22019
  23. Junczys-Dowmunt, M., Szał, A.: SyMGiza++: symmetrized word alignment models for statistical machine translation. In: Proceedings of the 2011 International Conference on Security and Intelligent Information Systems, pp. 379–390. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25261-7_30
  24. Durrani, N., Koehn, P., Hoang, H., Sajjad, H.: Integrating an unsupervised transliteration model into statistical machine translation. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 148–153. Association for Computational Linguistics, Stroudsburg (2014). https://doi.org/10.3115/v1/e14-4029
  25. Wołk, K., Marasek, K.: Enhanced bilingual evaluation understudy. Lect. Notes Inf. Theory 2, 191–197 (2014). https://doi.org/10.12720/lnit.2.2.191-197

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

Polish-Japanese Academy of Information Technology, Warsaw, Poland
