Segmenting Long Sentence Pairs to Improve Word Alignment in English-Hindi Parallel Corpora

  • Conference paper
Advances in Natural Language Processing (JapTAL 2012)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7614)

Abstract

This paper presents an approach that improves word alignment for the English-Hindi language pair. Long sentences in a corpus cause serious problems: high computational requirements and poor-quality word alignments. We address these problems by breaking long sentence pairs into shorter ones. Our approach first splits the source and target sentences into clauses and then treats the resulting clause pairs as sentence pairs for training the word alignment model. We also report preliminary work on automatically identifying the clause boundaries that are appropriate for improving word alignment. For IBM Model 1, the approach increases precision, recall and F-measure by approximately 11%, 7% and 10% respectively, and reduces Alignment Error Rate (AER) by approximately 10%. These results were obtained by training on 270 sentence pairs and testing on 30 sentence pairs. The experiments are based on the TDIL corpus.
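To make the pipeline in the abstract concrete, the following is a minimal sketch of IBM Model 1 training on clause pairs, with the AER definition of Och and Ney. It is not the authors' implementation. It assumes the clause pairs are already segmented and paired (the clause boundary identification step is not shown), it omits the NULL source word for brevity, and the function names `train_ibm1`, `align` and `aer` as well as the transliterated toy data are hypothetical.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate t(f | e) with EM (IBM Model 1, NULL word omitted).

    pairs: list of (source_tokens, target_tokens). In the setting of the
    abstract, each pair would be a clause pair, not a full sentence pair.
    """
    # Uniform start over co-occurring word pairs only (assumption: any
    # positive constant works, since the E-step normalises).
    t = {(f, e): 1.0 for src, tgt in pairs for e in src for f in tgt}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected counts of each (f, e) link.
        for src, tgt in pairs:
            for f in tgt:
                z = sum(t[(f, e)] for e in src)
                for e in src:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise the counts per source word.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def align(src, tgt, t):
    """Link each target position j to its most probable source position i."""
    links = set()
    for j, f in enumerate(tgt):
        best_i, best_p = None, 0.0
        for i, e in enumerate(src):
            p = t.get((f, e), 0.0)
            if p > best_p:
                best_i, best_p = i, p
        if best_i is not None:
            links.add((best_i, j))
    return links


def aer(predicted, sure, possible):
    """Alignment Error Rate: 1 - (|A∩S| + |A∩P|) / (|A| + |S|)."""
    possible = possible | sure  # sure links are also possible links
    hits = len(predicted & sure) + len(predicted & possible)
    return 1.0 - hits / (len(predicted) + len(sure))


# Toy clause pairs (hypothetical, transliterated Hindi), not TDIL data.
clause_pairs = [
    (["red", "house"], ["laal", "ghar"]),
    (["red", "book"], ["laal", "kitaab"]),
    (["new", "book"], ["nayi", "kitaab"]),
]
t = train_ibm1(clause_pairs)
links = align(["red", "house"], ["laal", "ghar"], t)
```

Because Model 1 sums over every source word for every target word, the cost of the E-step grows with the product of the two lengths, which is one reason shorter clause pairs are cheaper to train on than long sentence pairs.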

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Srivastava, J., Sanyal, S. (2012). Segmenting Long Sentence Pairs to Improve Word Alignment in English-Hindi Parallel Corpora. In: Isahara, H., Kanzaki, K. (eds) Advances in Natural Language Processing. JapTAL 2012. Lecture Notes in Computer Science, vol 7614. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33983-7_10

  • DOI: https://doi.org/10.1007/978-3-642-33983-7_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-33982-0

  • Online ISBN: 978-3-642-33983-7

  • eBook Packages: Computer Science, Computer Science (R0)
