From sentences to words and clauses

Piperidis, Stelios; Papageorgiou, Harris; Boutsis, Sotiris

doi:10.1007/978-94-017-2535-4_6

Part of the book series: Text, Speech and Language Technology (TLTB, volume 13)

Abstract

This chapter addresses the issue of multilingual corpora alignment, presenting schemes which attempt alignment at sentence, clause, noun phrase and word level. Statistical inductive techniques are coupled with symbolic processing analysing specific language phenomena. Sentence alignment combines statistical techniques with the notion of semantic load of text units. Lexical equivalences are extracted based on morphosyntactic tagging and noun phrase recognition on each side of the parallel corpus. A statistical score then filters the most likely translation candidates of single and multi-word units. Similarly, clause alignment couples surface linguistic analysis with a probabilistic model based on word occurrence and cooccurrence probabilities, and word lengths. The best clause alignment is approximated by feeding all possible alignments into a dynamic programming framework. Word and clause alignment have been tested on English-Greek parallel corpora of different domains, yielding results exploitable in knowledge acquisition applications. Sentence alignment has been tested in several languages and integrated in a computer-aided translation platform maximizing translation reuse and consistency.
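
The dynamic programming step mentioned in the abstract can be illustrated with a minimal sketch. It is not the chapter's probabilistic model: the cost used here is a toy length-mismatch score with fixed penalties per alignment type, and the names `BEADS`, `bead_cost` and `align` are hypothetical, introduced only for illustration. Each text unit (for example a clause) is represented by its length alone.

```python
from typing import List, Tuple

# Allowed alignment types (source units, target units), each with an
# assumed fixed penalty; these values are illustrative, not from the chapter.
BEADS = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (2, 1): 1.0, (1, 2): 1.0}

Bead = Tuple[Tuple[int, int], Tuple[int, int]]


def bead_cost(src_len: int, tgt_len: int, penalty: float) -> float:
    """Toy cost: relative length mismatch plus the bead-type penalty."""
    total = src_len + tgt_len
    mismatch = abs(src_len - tgt_len) / total if total else 0.0
    return mismatch + penalty


def align(src: List[int], tgt: List[int]) -> List[Bead]:
    """Return the minimum-cost alignment as a list of index ranges.

    Each bead is ((i0, i1), (j0, j1)): source units i0..i1-1 are aligned
    with target units j0..j1-1; an empty range marks an unaligned unit.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward pass: extend every reachable prefix pair by one bead.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = bead_cost(sum(src[i:ni]), sum(tgt[j:nj]), penalty)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)

    # Backward pass: follow the stored predecessors from the end.
    beads: List[Bead] = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append(((pi, i), (pj, j)))
        i, j = pi, pj
    beads.reverse()
    return beads
```

With these assumed penalties, `align([10, 20, 30], [10, 50])` returns `[((0, 1), (0, 1)), ((1, 3), (1, 2))]`: the first units are paired one-to-one, and the second and third source units are merged against the second target unit. A model closer to the one the abstract describes would replace `bead_cost` with probabilities derived from word occurrence, co-occurrence and length.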

Author information

Authors and Affiliations

Institute for Language and Speech Processing, Greece
Stelios Piperidis, Harris Papageorgiou & Sotiris Boutsis
National Technical University of Athens, Greece
Stelios Piperidis & Sotiris Boutsis

Editor information

Editors and Affiliations

Université de Provence and CNRS, 29, Avenue Robert Schuman, 13100, Aix-en-Provence, France
Jean Véronis

About this chapter

Cite this chapter

Piperidis, S., Papageorgiou, H., Boutsis, S. (2000). From sentences to words and clauses. In: Véronis, J. (ed.) Parallel Text Processing. Text, Speech and Language Technology, vol 13. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2535-4_6

DOI: https://doi.org/10.1007/978-94-017-2535-4_6
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-5555-2
Online ISBN: 978-94-017-2535-4
eBook Packages: Springer Book Archive
