Abstract
This chapter addresses the alignment of multilingual corpora, presenting schemes for alignment at the sentence, clause, noun phrase and word levels. Statistical inductive techniques are coupled with symbolic processing that analyses specific language phenomena. Sentence alignment combines statistical techniques with the notion of the semantic load of text units. Lexical equivalences are extracted on the basis of morphosyntactic tagging and noun phrase recognition on each side of the parallel corpus. A statistical score then filters the most likely translation candidates for single and multi-word units. Similarly, clause alignment couples surface linguistic analysis with a probabilistic model based on word occurrence and co-occurrence probabilities, and word lengths. The best clause alignment is approximated by feeding all possible alignments into a dynamic programming framework. Word and clause alignment have been tested on English-Greek parallel corpora from different domains, yielding results exploitable in knowledge acquisition applications. Sentence alignment has been tested on several languages and integrated into a computer-aided translation platform that maximizes translation reuse and consistency.
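The dynamic programming step mentioned in the abstract can be illustrated with a minimal sketch. It is not the chapter's model: the abstract describes a probabilistic score built from word occurrence and co-occurrence probabilities and word lengths, whereas the sketch below substitutes a simple character-length mismatch cost. The names `BEADS`, `length_cost` and `align`, as well as the penalty values, are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

# Allowed alignment shapes (source units, target units), each with an
# assumed fixed penalty. These values are illustrative, not from the chapter.
BEADS = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (2, 1): 1.0, (1, 2): 1.0}


def length_cost(src_len: int, tgt_len: int) -> float:
    """Stand-in cost: scaled relative difference of character lengths."""
    total = src_len + tgt_len
    if total == 0:
        return 0.0
    return 10.0 * abs(src_len - tgt_len) / total


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[int], List[int]]]:
    """Return the lowest-cost sequence of beads as (source indices, target indices)."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back: List[List[Optional[Tuple[int, int]]]] = [
        [None] * (m + 1) for _ in range(n + 1)
    ]
    cost[0][0] = 0.0

    # Forward pass: extend every reachable prefix pair by each allowed bead.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + penalty + length_cost(s_len, t_len)
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)

    # Backward pass: follow the stored predecessors from the end to the start.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


if __name__ == "__main__":
    print(align(["aaaa", "bbbb"], ["xxxxxxxx"]))
```

Replacing `length_cost` with a score derived from lexical evidence would move the sketch closer to the hybrid approach the abstract describes, while the dynamic programming search itself would stay the same.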
© 2000 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Piperidis, S., Papageorgiou, H., Boutsis, S. (2000). From sentences to words and clauses. In: Véronis, J. (ed.) Parallel Text Processing. Text, Speech and Language Technology, vol 13. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2535-4_6
DOI: https://doi.org/10.1007/978-94-017-2535-4_6
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-5555-2
Online ISBN: 978-94-017-2535-4