From sentences to words and clauses

Parallel Text Processing

Part of the book series: Text, Speech and Language Technology ((TLTB,volume 13))

Abstract

This chapter addresses the issue of multilingual corpora alignment, presenting schemes which attempt alignment at sentence, clause, noun phrase and word level. Statistical inductive techniques are coupled with symbolic processing analysing specific language phenomena. Sentence alignment combines statistical techniques with the notion of semantic load of text units. Lexical equivalences are extracted based on morphosyntactic tagging and noun phrase recognition on each side of the parallel corpus. A statistical score then filters the most likely translation candidates of single and multi-word units. Similarly, clause alignment couples surface linguistic analysis with a probabilistic model based on word occurrence and cooccurrence probabilities, and word lengths. The best clause alignment is approximated by feeding all possible alignments into a dynamic programming framework. Word and clause alignment have been tested on English-Greek parallel corpora of different domains, yielding results exploitable in knowledge acquisition applications. Sentence alignment has been tested in several languages and integrated in a computer-aided translation platform maximizing translation reuse and consistency.
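
The dynamic programming step mentioned above can be illustrated with a minimal sketch. This is not the chapter's model: the abstract describes a probabilistic score based on word occurrence and co-occurrence probabilities and word lengths, whereas the sketch below substitutes a toy cost based only on character lengths. The names `BEADS`, `length_cost` and `align`, and all penalty values, are hypothetical and chosen purely for illustration.

```python
from typing import List, Optional, Tuple

# Allowed alignment steps ("beads"): (source units consumed, target units consumed).
# The penalties are assumed illustrative values, not taken from the chapter.
BEADS = {(1, 1): 0.0, (1, 0): 3.0, (0, 1): 3.0, (2, 1): 1.0, (1, 2): 1.0}


def length_cost(src_len: int, tgt_len: int) -> float:
    """Toy mismatch cost: relative difference of the two character lengths."""
    total = src_len + tgt_len
    return abs(src_len - tgt_len) / total if total else 0.0


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[int], List[int]]]:
    """Return the cheapest monotone alignment as pairs of index lists."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back: List[List[Optional[Tuple[int, int]]]] = [
        [None] * (m + 1) for _ in range(n + 1)
    ]
    cost[0][0] = 0.0

    # Forward pass: cost[i][j] is the best cost of aligning src[:i] with tgt[:j].
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(u) for u in src[i:ni])
                t_len = sum(len(u) for u in tgt[j:nj])
                c = cost[i][j] + penalty + length_cost(s_len, t_len)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (di, dj)

    # Backward pass: follow the stored steps from the end to the start.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    beads.reverse()
    return beads
```

Calling `align(source_units, target_units)` on two lists of sentences or clauses returns a list of pairs of index lists, one pair per bead. Swapping `length_cost` for a probabilistic score would bring the sketch closer to the kind of model the abstract describes, while the table-filling and traceback would stay the same.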

Copyright information

© 2000 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Piperidis, S., Papageorgiou, H., Boutsis, S. (2000). From sentences to words and clauses. In: Véronis, J. (ed.) Parallel Text Processing. Text, Speech and Language Technology, vol 13. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2535-4_6

  • DOI: https://doi.org/10.1007/978-94-017-2535-4_6

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-90-481-5555-2

  • Online ISBN: 978-94-017-2535-4

  • eBook Packages: Springer Book Archive
