Abstract
This paper presents a novel, language-independent, context-based technique for aligning sentences in parallel corpora. Sentence alignment can be viewed as the problem of finding translations of sentences chosen from different sources. Unlike current approaches, which rely on pre-defined features and models, our algorithm employs features derived from the distributional properties of sentences and does not use any language-dependent knowledge. We make use of the context of sentences and introduce the notion of Zipfian word vectors, which effectively model the distributional properties of a given sentence. We take the context to be the frame in which the reasoning about sentence alignment is done. We examine alternatives for local context models and demonstrate that our context-based sentence alignment algorithm performs better than prominent sentence alignment techniques. Our system dynamically selects, for a pair of sets of sentences, the local context that maximizes the correlation. We evaluate the performance of our system with two measures: sentence alignment accuracy and sentence alignment coverage. Compared with commonly used sentence alignment systems, our system performs 1.1951 to 1.5404 times better in reducing the error rate in alignment accuracy and coverage.
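The abstract does not define Zipfian word vectors or the correlation measure, so the following Python sketch is only an illustration under stated assumptions, not the paper's implementation. It assumes that a sentence is represented as a histogram of its tokens over log2-scaled corpus-frequency bins, that correlation is Pearson correlation, and that a local context is a symmetric window of neighbouring sentences. The names `zipfian_vector`, `context_vector`, `best_local_context`, and the parameters `num_bins` and `max_window` are hypothetical.

```python
import math
from collections import Counter


def zipfian_vector(sentence, corpus_counts, num_bins=8):
    """Histogram of a sentence's tokens over log2-scaled frequency bins.

    Assumption: binning by log2 of corpus frequency; the abstract does not
    specify the binning scheme.
    """
    vec = [0] * num_bins
    for token in sentence.split():
        freq = max(corpus_counts.get(token, 1), 1)  # unseen tokens count as 1
        bin_index = min(int(math.log2(freq)), num_bins - 1)
        vec[bin_index] += 1
    return vec


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def context_vector(sentences, counts, center, window, num_bins=8):
    """Sum of the Zipfian vectors of the sentences within `window` of `center`."""
    start = max(0, center - window)
    end = min(len(sentences), center + window + 1)
    vec = [0] * num_bins
    for sentence in sentences[start:end]:
        for k, value in enumerate(zipfian_vector(sentence, counts, num_bins)):
            vec[k] += value
    return vec


def best_local_context(src, tgt, src_counts, tgt_counts, i, j, max_window=3):
    """Try every pair of window sizes and keep the one with the highest correlation."""
    best_windows, best_score = (0, 0), -2.0
    for w_src in range(max_window + 1):
        for w_tgt in range(max_window + 1):
            score = pearson(
                context_vector(src, src_counts, i, w_src),
                context_vector(tgt, tgt_counts, j, w_tgt),
            )
            if score > best_score:
                best_windows, best_score = (w_src, w_tgt), score
    return best_windows, best_score


# Toy usage with made-up sentences; the counts come from the toy corpora themselves.
src = ["the cat sat", "the dog ran fast", "a bird sang"]
tgt = ["le chat assis", "le chien a couru vite", "un oiseau a chanté"]
src_counts = Counter(w for s in src for w in s.split())
tgt_counts = Counter(w for s in tgt for w in s.split())
windows, score = best_local_context(src, tgt, src_counts, tgt_counts, 1, 1, max_window=1)
```

Because the vectors depend only on frequency bins and not on word identity, nothing in this sketch requires a dictionary or other language-specific resource, which is the property the abstract emphasizes.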
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Biçici, E. (2007). Local Context Selection for Aligning Sentences in Parallel Corpora. In: Kokinov, B., Richardson, D.C., Roth-Berghofer, T.R., Vieu, L. (eds) Modeling and Using Context. CONTEXT 2007. Lecture Notes in Computer Science, vol 4635. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74255-5_7
DOI: https://doi.org/10.1007/978-3-540-74255-5_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74254-8
Online ISBN: 978-3-540-74255-5
eBook Packages: Computer Science, Computer Science (R0)