Abstract
Parallel corpora are critical resources for building many NLP applications, ranging from machine translation (MT) to cross-lingual information retrieval. In this chapter, we explore a new but important area by investigating the potential of cultivating large-scale parallel corpora from comparable multilingual patents. Two major issues are investigated: (1) how to build large-scale corpora of comparable patents involving many languages, and (2) how to mine high-quality parallel sentences from these comparable patents. Four parallel corpora are presented as examples, and some preliminary statistical machine translation (SMT) experiments are reported. We further show the considerable potential of cultivating large-scale parallel corpora from multilingual patents for a wide variety of languages, such as English, Chinese, Japanese, Korean, and German, which would to some extent reduce the parallel data acquisition bottleneck in multilingual information processing.
This chapter is based on the authors’ previous work described in Lu et al. (2009, 2010a, 2010b, 2011)
Notes
- 1.
- 2.
Anyone interested in the corpus is invited to contact the authors for more details.
- 3.
Retrieved March 2010, from http://www.collinslanguage.com/.
- 4.
Retrieved April 2010, from http://www.wipo.int/pctdb/en/. The data below involving PCT patents come from the WIPO website.
- 5.
- 6.
Some contents are in image format; these images were OCRed, and the recognized characters were manually verified.
- 7.
Some contents of the English patents were OCRed by WIPO.
- 8.
- 9.
- 10.
- 11.
Correct means that the English sentence is exactly the literal translation of the Chinese one, or that the content overlap between them is above 80 %, with no need to consider phrasal reordering during translation. Partially correct means that the Chinese sentence and the English one are not literal translations of each other, but the content of each sentence covers more than 50 % of the other. Incorrect means that the contents of the Chinese sentence and the English one are not related, or that more than 50 % of the content of one sentence is not translated in the other. Please see [17] for more details.
- 12.
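The three-way judgment described in note 11 can be illustrated with a minimal sketch. This is not the authors' evaluation tool: the function name `judge_pair`, its parameters, and the idea of modelling "content overlap" as the smaller of two annotator-estimated coverage fractions are all assumptions made for illustration. The note also does not say how a value of exactly 50 % is treated; the sketch assumes it counts as incorrect.

```python
def judge_pair(zh_covered: float, en_covered: float, literal: bool = False) -> str:
    """Label a Chinese-English sentence pair using the thresholds from note 11.

    zh_covered: fraction (0..1) of the Chinese sentence's content found in the
                English sentence, as estimated by a human annotator (assumed input).
    en_covered: fraction (0..1) of the English sentence's content found in the
                Chinese sentence (assumed input).
    literal:    True if the annotator judges the pair an exact literal translation.
    """
    for value in (zh_covered, en_covered):
        if not 0.0 <= value <= 1.0:
            raise ValueError("coverage fractions must lie between 0 and 1")

    # Assumption: "content overlap" is the weaker of the two coverage directions.
    overlap = min(zh_covered, en_covered)

    if literal or overlap > 0.8:
        return "correct"
    if overlap > 0.5:
        # Each sentence covers more than 50 % of the other.
        return "partially correct"
    # Unrelated, or at least half of one sentence is left untranslated
    # (the exact-50 % case is an assumption, see above).
    return "incorrect"


if __name__ == "__main__":
    for zh, en in [(0.9, 0.85), (0.7, 0.6), (0.9, 0.3)]:
        print(zh, en, judge_pair(zh, en))
```

Taking the minimum of the two fractions means that a pair is penalised when either sentence contains a large amount of untranslated content, which matches the symmetric wording of the note.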
References
Adafre, S.F., de Rijke, M.: Finding similar sentences across multiple languages in Wikipedia. In: Proceedings of EACL, pp. 62–69 (2006)
Brown, P.F., Lai, J.C., Mercer, R.L.: Aligning sentences in parallel corpora. In: Proceedings of ACL, pp. 169–176 (1991)
Brown, P.F., Della Pietra, S.A., Della Pietra, V.J., Mercer, R.L.: The mathematics of statistical machine translation: parameter estimation. Comput. Linguist. 19(2), 263–311 (1993)
Cao, G., Gao, J., Nie, J.: A system to mine large-scale bilingual dictionaries from monolingual web pages. In: Proceedings of MT Summit, pp. 57–64 (2007)
Chen, S.F.: Aligning sentences in bilingual corpora using lexical information. In: Proceedings of ACL, pp. 9–16 (1993)
Chiang, D.: Hierarchical phrase-based translation. Comput. Linguist. 33(2), 201–228 (2007)
Fujii, A., Utiyama, M., Yamamoto, M., Utsuro, T.: Overview of the patent translation task at the NTCIR-7 workshop. In: Proceedings of the NTCIR-7 Workshop, pp. 389–400. Tokyo, Japan (2008)
Fujii, A., Utiyama, M., Yamamoto, M., Utsuro, T., Ehara, T., Echizen-ya, H., Shimohata, S.: Overview of the patent translation task at the NTCIR-8 workshop. In: Proceedings of the NTCIR-8 Workshop. Tokyo, Japan (2010)
Gale, W.A., Church, K.W.: A program for aligning sentences in bilingual corpora. In: Proceedings of ACL, pp. 79–85 (1991)
Ha, L.A., Fernandez, G., Mitkov, R., Corpas, G.: Mutual bilingual terminology extraction. In: Proceedings of the Sixth International Language Resources and Evaluation (LREC), pp. 28–30 (2008)
Higuchi, S., Fukui, M., Fujii, A., Ishikawa, T.: PRIME: a system for multi-lingual patent retrieval. In: Proceedings of MT Summit VIII, pp. 163–167 (2001)
Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: Proceedings of MT Summit X (2005)
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., et al.: Moses: open source toolkit for statistical machine translation. In: Proceedings of ACL Demo Session, pp. 177–180 (2007)
Kupiec, J.: An algorithm for finding noun phrase correspondences in bilingual corpora. In: Proceedings of ACL-93, pp. 17–22 (1993)
Lin, D., Zhao, S., Van Durme, B., Pasca, M.: Mining parenthetical translations from the web by word alignment. In: Proceedings of ACL-08, pp. 994–1002 (2008)
Jiang, L., Yang, S., Zhou, M., Liu, X., Zhu, Q.: Mining bilingual data from the web with adaptively learnt patterns. In: Proceedings of ACL-IJCNLP, pp. 870–878 (2009)
Lu, B., Tsou, B.K., Zhu, J., Jiang, T., Kwong, O.Y.: The construction of an English-Chinese patent parallel corpus. In: Proceedings of MT Summit XII 3rd Workshop on Patent Translation (2009)
Lu, B., Tsou, B.K.: Towards bilingual term extraction in comparable patents. In: Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation (PACLIC’23), pp. 755–762 (2009)
Lu, B., Tsou, B.K., Jiang, T., Kwong, O.Y., Zhu, J.: Mining large-scale parallel corpora from multilingual patents: an English-Chinese example and its application to SMT. In: Proceedings of the 1st CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP-2010). Beijing, China. August, 2010 (2010a)
Lu, B., Jiang, T., Chow, K., Tsou, B.K.: Building a large English-Chinese parallel corpus from comparable patents and its experimental application to SMT. In: Proceedings of Workshop on Building and Using Comparable Corpora. Malta (2010b)
Lu, B., Chow, K.P., Tsou, B.K.: The cultivation of a trilingual Chinese-English-Japanese parallel corpus from comparable patents. In: Proceedings of Machine Translation Summit XIII (MT Summit-XIII). Xiamen (2011a)
Lu, B., Tsou, B.K., Jiang, T., Zhu, J., Kwong, O.: Mining parallel knowledge from comparable patents. In: Ontology Learning and Knowledge Discovery Using the Web: Challenges and Recent Advances. IGI Global (2011b)
Ma, X.: Champollion: a robust parallel text sentence aligner. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC). Genova, Italy (2006)
Moore, R.C.: Fast and accurate sentence alignment of bilingual corpora. In: Proceedings of AMTA, pp. 135–144 (2002)
Munteanu, D.S., Marcu, D.: Improving machine translation performance by exploiting non-parallel corpora. Comput. Linguist. 31(4), 477–504 (2005)
Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1), 19–51 (2003)
Och, F.J., Ney, H.: The alignment template approach to machine translation. Comput. Linguist. 30(4), 417–449 (2004)
Resnik, P., Smith, N.A.: The web as a parallel corpus. Comput. Linguist. 29(3), 349–380 (2003)
Smith, J.R., Quirk, C., Toutanova, K.: Extracting parallel sentences from comparable corpora using document level alignment. In: Proceedings of NAACL-HLT, pp. 403–411 (2010)
Simard, M., Plamondon, P.: Bilingual sentence alignment: balancing robustness and accuracy. Mach. Transl. 13(1), 59–80 (1998)
Utiyama, M., Isahara, H.: A Japanese-English patent parallel corpus. In: Proceedings of MT Summit XI, pp. 475–482 (2007)
Wu, D., Fung, P.: Inversion transduction grammar constraints for mining parallel sentences from quasi-comparable corpora. In: Proceedings of IJCNLP2005 (2005)
Wu, D., Xia, X.: Learning an English-Chinese lexicon from a parallel corpus. In: Proceedings of the First Conference of the Association for Machine Translation in the Americas (1994)
Zhao, B., Vogel, S.: Adaptive parallel sentences mining from web bilingual news collection. In: Proceedings of Second IEEE International Conference on Data Mining (ICDM-02) (2002)
Acknowledgments
We wish to thank our colleagues, Dr. Kataoka S. and Mr. Wrong B. and others, for their help in evaluating the sampled sentence pairs and triplets.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Lu, B., Chow, K.P., Tsou, B.K. (2013). Comparable Multilingual Patents as Large-Scale Parallel Corpora. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_9
DOI: https://doi.org/10.1007/978-3-642-20128-8_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20127-1
Online ISBN: 978-3-642-20128-8
eBook Packages: Computer Science, Computer Science (R0)