Abstract
This paper presents a new method for aligning bilingual parallel texts based on punctuation statistics and lexical information. We show that punctuation statistics are an effective means of achieving good alignment results. Sentence alignment is reportedly more difficult for texts in disparate language pairs such as English and Chinese. We examine the feasibility of using punctuation for high-accuracy sentence alignment. Aligning sentences in bilingual parallel corpora based solely on punctuation statistics yields an encouraging precision rate, and results improve when both punctuation statistics and lexical information are used. We experimented with an implementation of the proposed method on the parallel corpora of Sinorama Magazine and the Records of the Hong Kong Legislative Council, with satisfactory results.
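To make the idea of punctuation-based alignment concrete, here is a minimal sketch in Python. It is not the authors' model: the abstract does not specify the probability model, the punctuation inventories, or the bead types used, so the sets `EN_PUNCT` and `ZH_PUNCT`, the penalties in `BEADS`, and the absolute-difference cost are all illustrative assumptions. The sketch only shows how per-sentence punctuation counts could drive a dynamic-programming alignment of the kind used in length-based aligners.

```python
from typing import List, Tuple

# Hypothetical punctuation inventories (assumption; the paper's sets are not given here).
EN_PUNCT = set(",.;:!?\"'()")
ZH_PUNCT = set("，。；：！？、「」『』（）")

# Allowed alignment beads (source count, target count) with illustrative penalties.
BEADS = {(1, 1): 0.0, (1, 0): 3.0, (0, 1): 3.0, (2, 1): 1.0, (1, 2): 1.0}


def punct_count(sentence: str, inventory: set) -> int:
    """Count the characters of `sentence` that belong to `inventory`."""
    return sum(1 for ch in sentence if ch in inventory)


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[int], List[int]]]:
    """Return beads as (source indices, target indices) minimising total cost.

    The cost of a bead is its penalty plus the absolute difference between
    the punctuation counts on the two sides (a stand-in for a real model).
    """
    s = [punct_count(x, EN_PUNCT) for x in src]
    t = [punct_count(x, ZH_PUNCT) for x in tgt]
    n, m = len(s), len(t)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward pass: extend every reachable state by each allowed bead.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = cost[i][j] + penalty + abs(sum(s[i:ni]) - sum(t[j:nj]))
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)

    # Trace back from the final state to recover the bead sequence.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Calling `align(english_sentences, chinese_sentences)` returns a list of index pairs, one per bead. A fuller system would replace the absolute-difference cost with a probabilistic score and, as the abstract suggests, combine it with lexical information.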
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chuang, T.C., Wu, JC., Lin, T., Shei, WC., Chang, J.S. (2005). Bilingual Sentence Alignment Based on Punctuation Statistics and Lexicon. In: Su, KY., Tsujii, J., Lee, JH., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2004. Lecture Notes in Computer Science, vol 3248. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30211-7_24
DOI: https://doi.org/10.1007/978-3-540-30211-7_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24475-2
Online ISBN: 978-3-540-30211-7
eBook Packages: Computer Science (R0)