Tibetan Word Segmentation as Sub-syllable Tagging with Syllable’s Part-of-Speech Property

Liu, Huidan; Long, Congjun; Nuo, Minghua; Wu, Jian

doi:10.1007/978-3-319-25816-4_16

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9427)

Abstract

When Tibetan word segmentation is treated as a sequence labelling problem, machine learning models such as maximum entropy (ME) models and conditional random fields (CRFs) can be used to train the segmenter. The segmenter’s performance depends on many factors. This paper compares three of them: the strategy for handling abbreviated syllables, the tag set, and the syllable’s part-of-speech property. The experimental data show the following. First, if each abbreviated syllable is split into two units for labelling rather than kept as one, the F-measure improves by 0.06 % and 0.10 % with the 4-tag set and the 6-tag set, respectively. Second, if the 6-tag set is used instead of the 4-tag set, the F-measure improves by 0.10 % and 0.14 % under the two abbreviated-syllable strategies, respectively. Third, when the syllable’s part-of-speech property is taken into account, the F-measure improves by 0.47 % and 0.41 % over the two methods that do not use it with the 4-tag set, and by 0.45 % and 0.35 % with the 6-tag set; these gains are much higher than those from the first two factors. It is therefore a better choice to exploit the syllable’s part-of-speech property while using the sub-syllable as the tagging unit.
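To make the tag-set comparison concrete, the following is a minimal sketch of how a segmented sentence can be turned into a label sequence under a 4-tag and a 6-tag scheme. The abstract does not name the tags, so the labels used here (B, M, E, S for four tags; B, B2, B3, M, E, S for six) are an assumption borrowed from common practice in Chinese word segmentation. The function names and the placeholder units `u1` … `u6` are hypothetical and stand for syllables or sub-syllables.

```python
from typing import List, Tuple


def tag_word(n_units: int, tag_set: int = 4) -> List[str]:
    """Return one position tag per tagging unit of a single word.

    Assumed tag inventories (not taken from the paper):
      4-tag set: B (begin), M (middle), E (end), S (single-unit word)
      6-tag set: B, B2, B3 (first three units), M, E, S
    """
    if n_units < 1:
        raise ValueError("a word needs at least one unit")
    if tag_set not in (4, 6):
        raise ValueError("tag_set must be 4 or 6")
    if n_units == 1:
        return ["S"]
    if tag_set == 4:
        return ["B"] + ["M"] * (n_units - 2) + ["E"]
    # 6-tag set: label up to three leading units distinctly, then M, then E.
    leading = ["B", "B2", "B3"]
    before_end = n_units - 1  # number of units preceding the final E
    return leading[:before_end] + ["M"] * max(0, before_end - 3) + ["E"]


def tag_sentence(words: List[List[str]], tag_set: int = 4) -> List[Tuple[str, str]]:
    """Pair every unit in a segmented sentence with its position tag."""
    pairs: List[Tuple[str, str]] = []
    for word in words:
        pairs.extend(zip(word, tag_word(len(word), tag_set)))
    return pairs


if __name__ == "__main__":
    # One single-unit word followed by one five-unit word (placeholder units).
    sentence = [["u1"], ["u2", "u3", "u4", "u5", "u6"]]
    print(tag_sentence(sentence, tag_set=4))
    print(tag_sentence(sentence, tag_set=6))
```

Under this sketch the two schemes differ only for words of three or more units, where the 6-tag set gives the second and third units their own labels instead of the generic M.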

Notes

1. http://taku910.github.io/crfpp
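CRF++, the toolkit linked in this note, reads training data as whitespace-separated columns with one token per line, the gold tag in the last column, and an empty line between sentences. The sketch below writes data in that layout. The particular column choice (tagging unit, a part-of-speech property value, tag) is an assumption made for illustration; the features actually used in the paper are not given in this excerpt, and the function name `to_crfpp` and the sample values are hypothetical.

```python
from typing import List, Tuple

# (tagging unit, part-of-speech property, gold position tag): assumed columns
Row = Tuple[str, str, str]


def to_crfpp(sentences: List[List[Row]]) -> str:
    """Serialize tagged sentences into CRF++-style column text.

    Each row becomes one tab-separated line; sentences are separated
    by a blank line, and the gold tag is kept in the last column.
    """
    blocks = []
    for sentence in sentences:
        lines = ["\t".join(row) for row in sentence]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Placeholder units and property labels, not real corpus data.
    data = [
        [("u1", "N", "B"), ("u2", "N", "E")],
        [("u3", "V", "S")],
    ]
    print(to_crfpp(data), end="")
```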

Acknowledgements

We thank the reviewers for their critical and constructive comments and suggestions that helped us improve the quality of the paper. The research is partially supported by National Science Foundation (No. 61202219, No. 61202220, No. 61303165) and Informationization Project of the Chinese Academy of Sciences (No. XXH12504-1-10).

Author information

Authors and Affiliations

Institute of Software, Chinese Academy of Sciences, Beijing, 100190, China
Huidan Liu, Congjun Long, Minghua Nuo & Jian Wu
Institute of Ethnology and Anthropology, Chinese Academy of Social Sciences, Beijing, 100081, China
Congjun Long

Corresponding author

Correspondence to Huidan Liu.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Maosong Sun
Tsinghua University, Beijing, China
Zhiyuan Liu
Soochow University, Suzhou, Jiangsu, China
Min Zhang
Tsinghua University, Beijing, China
Yang Liu

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

About this paper

Cite this paper

Liu, H., Long, C., Nuo, M., Wu, J. (2015). Tibetan Word Segmentation as Sub-syllable Tagging with Syllable’s Part-of-Speech Property. In: Sun, M., Liu, Z., Zhang, M., Liu, Y. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. CCL NLP-NABD 2015. Lecture Notes in Computer Science, vol 9427. Springer, Cham. https://doi.org/10.1007/978-3-319-25816-4_16

DOI: https://doi.org/10.1007/978-3-319-25816-4_16
Published: 08 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25815-7
Online ISBN: 978-3-319-25816-4
eBook Packages: Computer Science, Computer Science (R0)
