Abstract
Cross-lingual semantic interoperability has drawn significant research attention in recent years, as the number of digital libraries in non-English languages has grown exponentially. Cross-lingual information retrieval (CLIR) across European languages, such as English, Spanish, and French, has been widely explored, but CLIR between European and Oriental languages is still at an early stage. To cross the language boundary, a corpus-based approach shows promise in overcoming the limitations of knowledge-based and controlled-vocabulary approaches. However, collecting parallel corpora between European and Oriental languages is not an easy task. Length-based and text-based methods are the two major approaches to aligning parallel documents. In this paper, we investigate several techniques based on these approaches and compare their performance in aligning the English and Chinese titles of parallel documents available on the Web.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yang, C.C., Wing Li, K. (2002). Building Parallel Corpora by Automatic Title Alignment. In: Lim, E.P., et al. Digital Libraries: People, Knowledge, and Technology. ICADL 2002. Lecture Notes in Computer Science, vol 2555. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36227-4_38
DOI: https://doi.org/10.1007/3-540-36227-4_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00261-1
Online ISBN: 978-3-540-36227-2
eBook Packages: Springer Book Archive