Statistical Khmer Name Romanization

Ding, Chenchen; Chea, Vichet; Utiyama, Masao; Sumita, Eiichiro; Sam, Sethserey; Seng, Sopheap

doi:10.1007/978-981-10-8438-6_15

Conference paper
First Online: 04 March 2018

Part of the book series: Communications in Computer and Information Science (CCIS, volume 781)

Abstract

We address the task of Khmer name Romanization. Although several standard Romanization systems exist for Khmer, conventional transcription practices prevail in use. These practices are inconsistent, and in some cases complicated, because they rest on unstable phonemic, orthographic, and etymological principles. Statistical approaches are therefore required for the task. We collect and manually align 7,658 Khmer name Romanization instances. The alignment scheme is designed to give a precise, consistent, and monotonic correspondence between the two writing systems at the grapheme level, which facilitates a range of machine learning approaches. Experimental results show that standard conditional random field and support vector machine approaches, supervised by the manual alignment, achieve a precision of 0.99 at the grapheme level, outperforming a state-of-the-art recurrent neural network approach applied in a pure sequence-to-sequence manner. The manually aligned data have been released under a CC BY-NC-SA license for the research community.
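A monotonic, one-to-one grapheme alignment turns Romanization into a sequence labeling problem: each Khmer grapheme receives one Latin chunk (possibly silent) as its label. The sketch below illustrates that framing only. It substitutes a most-frequent-label baseline for the CRF and SVM models named in the abstract, and the toy names, the `SILENT` symbol, and all function names are assumptions made for illustration, not items from the released data or the authors' code. The score is read here as the fraction of graphemes labeled correctly, which is one plausible interpretation of the grapheme-level precision.

```python
from collections import Counter, defaultdict

SILENT = "_"  # hypothetical symbol for a grapheme with no Latin output

# Toy aligned names (illustrative only, NOT taken from the released data).
# Each Khmer code point is paired with exactly one Latin chunk.
TRAIN = [
    # SA, VOWEL SIGN U, KHA
    [("\u179f", "S"), ("\u17bb", "O"), ("\u1781", "K")],
    # SA, VOWEL SIGN AA, NO
    [("\u179f", "S"), ("\u17b6", "A"), ("\u1793", "N")],
    # SA, COENG (stacking operator), RO, VOWEL SIGN II
    [("\u179f", "S"), ("\u17d2", SILENT), ("\u179a", "R"), ("\u17b8", "EY")],
]


def train_baseline(aligned_names):
    """Map each grapheme to its most frequent Latin chunk (context-free)."""
    counts = defaultdict(Counter)
    for name in aligned_names:
        for grapheme, chunk in name:
            counts[grapheme][chunk] += 1
    return {g: c.most_common(1)[0][0] for g, c in counts.items()}


def romanize(model, graphemes, unknown="?"):
    """Label every grapheme, then join the non-silent labels."""
    labels = [model.get(g, unknown) for g in graphemes]
    text = "".join(label for label in labels if label != SILENT)
    return text, labels


def grapheme_precision(gold_labels, predicted_labels):
    """Fraction of graphemes whose predicted label equals the gold label."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not gold_labels:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


model = train_baseline(TRAIN)
text, labels = romanize(model, ["\u179f", "\u17d2", "\u179a", "\u17b8"])
print(text, labels)
```

Because the labels stay in the same order as the input graphemes, any per-position classifier or sequence labeler can replace `train_baseline` without changing `romanize` or the scoring function.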

Notes

1. http://niptict.edu.kh/khmer-name-romanization-with-alignment-on-grapheme-level/.
2. Some diacritics always stand for one vowel. Other diacritics serve as "shifters" that switch the series of certain consonant letters, which in turn affects the vowel sound of other diacritics.
3. Khmer script does have this diacritic, called viriam, but it is obsolete in use.
4. The original names are western names in the Thai-to-English task, which may make the grapheme correspondence more varied and inconsistent.
5. Generally, the diacritics in an abugida system are regarded as vowel shifters that change the inherent vowel. From the viewpoint of this study, the diacritics are instead treated as vowel notes, and the inherent vowel is treated as a vowel with a zero-alternant form, which the insertion processing makes explicit.
6. and are diacritics for and , respectively, and and are followed by stacking operators. So these four consonant letters are not bare.
7. As the task is to transform Khmer script to Latin script, the graphemes are not guaranteed to be single letters on the Romanization side, e.g., the corresponds to CH here.
8. The stacking operator is always aligned to a silent placeholder.
9. Spaces are not counted because they are always preserved unchanged. Stacking operators, as well as other silent Khmer letters, are counted.
10. Spaces are taken as one character in the BLEU calculation.
11. A Python implementation for the rule-based transcription with different layers of rules is available at http://www2.nict.go.jp/astrec-att/member/mutiyama/software.html.
12. An open-sourced tool is available at https://github.com/lemaoliu/Agtarbidir.
13. http://taku910.github.io/crfpp/.
14. http://www.phontron.com/kytea/.
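The counting conventions in the notes on spaces and stacking operators can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation script: it treats each Khmer code point as one grapheme, uses a hypothetical `<sp>` token to make the space an ordinary character, and computes only the clipped n-gram precision that BLEU is built from, not a full BLEU score. The names `count_graphemes`, `char_tokens`, and `ngram_precision` and the toy strings are invented for the example.

```python
from collections import Counter

SPACE_TOKEN = "<sp>"  # hypothetical visible stand-in for a space character


def count_graphemes(khmer_name):
    """Count code points, skipping spaces; stacking operators are counted."""
    return sum(1 for ch in khmer_name if ch != " ")


def char_tokens(latin_name):
    """Split into characters, keeping each space as one token."""
    return [SPACE_TOKEN if ch == " " else ch for ch in latin_name]


def ngram_precision(hypothesis, reference, n):
    """Clipped character n-gram precision, the building block of BLEU."""
    hyp = char_tokens(hypothesis)
    ref = char_tokens(reference)
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs
    # in the reference.
    clipped = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    return clipped / total


# SA, COENG (stacking operator), RO, VOWEL SIGN II, space, SA, VOWEL SIGN U, KHA
toy_khmer = "\u179f\u17d2\u179a\u17b8 \u179f\u17bb\u1781"
print(count_graphemes(toy_khmer))           # the space is skipped
print(char_tokens("SOK SAN"))               # the space becomes one token
print(ngram_precision("SOK SAN", "SOK SON", 1))
```

Keeping the space as a token means a misplaced or missing word boundary lowers the n-gram precision, whereas the grapheme count ignores spaces entirely.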

Author information

Authors and Affiliations

Advanced Translation Technology Laboratory, ASTREC, National Institute of Information and Communications Technology, 3-5 Hikaridai, Seikacho, Sorakugun, Kyoto, 619-0289, Japan
Chenchen Ding, Masao Utiyama & Eiichiro Sumita
Research and Development Center, National Institute of Posts, Telecommunication and ICT, #41 Russian Federation Blvd., Phnom Penh, Cambodia
Vichet Chea, Sethserey Sam & Sopheap Seng

Corresponding author

Correspondence to Chenchen Ding.

Editor information

Editors and Affiliations

Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
Kôiti Hasida
Natural Language Processing Lab, University of Computer Studies, Yangon, Yangon, Myanmar
Win Pa Pa

About this paper

Cite this paper

Ding, C., Chea, V., Utiyama, M., Sumita, E., Sam, S., Seng, S. (2018). Statistical Khmer Name Romanization. In: Hasida, K., Pa, W. (eds) Computational Linguistics. PACLING 2017. Communications in Computer and Information Science, vol 781. Springer, Singapore. https://doi.org/10.1007/978-981-10-8438-6_15

DOI: https://doi.org/10.1007/978-981-10-8438-6_15
Published: 04 March 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-8437-9
Online ISBN: 978-981-10-8438-6
eBook Packages: Computer Science, Computer Science (R0)
