Harvesting Regional Transliteration Variants with Guided Search

Kuo, Jin-Shea; Li, Haizhou; Lin, Chih-Lung

doi:10.1007/978-3-642-00831-3_13


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5459)

Included in the following conference series:

International Conference on Computer Processing of Oriental Languages


Abstract

This paper proposes a method for harvesting regional transliteration variants with guided search. We first study how to incorporate transliteration knowledge into query formulation so as to significantly increase the chance that a search returns the desired transliterations. We then study a cross-training algorithm, which draws on information across different regional corpora when learning transliteration models and thereby improves overall extraction performance. The experimental results show that the proposed method not only effectively harvests a lexicon of regional transliteration variants but also reduces the need for manual data labeling in transliteration modeling. We also examine the underlying characteristics of regional transliterations that motivate the cross-training algorithm.
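
The abstract does not spell out the cross-training procedure, so the following is only a minimal sketch of the general idea under stated assumptions: two regional models are trained on small seed sets, and each model labels candidate pairs from the other region's pool. The count-based model, the `score` function, the `threshold` value, and the toy unit names are all hypothetical and are not taken from the paper.

```python
from collections import Counter

def train(pairs):
    """Count how often each (source unit, target unit) alignment occurs."""
    counts = Counter()
    for src_units, tgt_units in pairs:
        counts.update(zip(src_units, tgt_units))
    return counts

def score(model, pair):
    """Fraction of aligned units in `pair` that the model has seen before."""
    src_units, tgt_units = pair
    aligned = list(zip(src_units, tgt_units))
    if not aligned:
        return 0.0
    return sum(1 for unit in aligned if model[unit] > 0) / len(aligned)

def cross_train(seed_a, seed_b, pool_a, pool_b, threshold=0.5, rounds=3):
    """Assumed cross-training loop: each region's model labels the other's pool."""
    labeled_a, labeled_b = list(seed_a), list(seed_b)
    pool_a, pool_b = list(pool_a), list(pool_b)
    for _ in range(rounds):
        model_a, model_b = train(labeled_a), train(labeled_b)
        # Region A's model accepts candidates from region B, and vice versa.
        new_b = [p for p in pool_b if score(model_a, p) >= threshold]
        new_a = [p for p in pool_a if score(model_b, p) >= threshold]
        if not new_a and not new_b:
            break  # nothing new was accepted, so further rounds change nothing
        labeled_a += new_a
        labeled_b += new_b
        pool_a = [p for p in pool_a if p not in new_a]
        pool_b = [p for p in pool_b if p not in new_b]
    return labeled_a, labeled_b

# Toy data: letters stand in for source-language and target-language units.
seed_a = [(("a", "b"), ("x", "y"))]
seed_b = [(("c", "d"), ("z", "w"))]
pool_a = [(("c", "b"), ("z", "q"))]
pool_b = [(("a", "e"), ("x", "v")), (("f", "g"), ("m", "n"))]
labeled_a, labeled_b = cross_train(seed_a, seed_b, pool_a, pool_b)
```

In this toy run each region gains one pair that shares an aligned unit with the other region's seed, while the unrelated pair in `pool_b` is never accepted. A real system would replace the count-based scorer with a proper transliteration model.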



Author information

Authors and Affiliations

Chung-Hwa Telecomm. Labs., Taoyuan, Taiwan
Jin-Shea Kuo
Institute for Infocomm Research, Singapore
Haizhou Li
Chung Yuan Christian University, Taoyuan, Taiwan
Chih-Lung Lin


Editor information

Editors and Affiliations

Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
Wenjie Li
Division of Information and Communication Sciences, Macquarie University, NSW 2109, Sydney, Australia
Diego Mollá-Aliod


About this paper

Cite this paper

Kuo, JS., Li, H., Lin, CL. (2009). Harvesting Regional Transliteration Variants with Guided Search. In: Li, W., Mollá-Aliod, D. (eds) Computer Processing of Oriental Languages. Language Technology for the Knowledge-based Economy. ICCPOL 2009. Lecture Notes in Computer Science, vol 5459. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00831-3_13


DOI: https://doi.org/10.1007/978-3-642-00831-3_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00830-6
Online ISBN: 978-3-642-00831-3
eBook Packages: Computer Science, Computer Science (R0)
