Abstract
When Tibetan word segmentation is treated as a sequence labelling problem, machine learning models such as maximum entropy (ME) and conditional random fields (CRFs) can be used to train the segmenter. The segmenter's performance depends on many factors. This paper compares three of them: the strategy for handling abbreviated syllables, the tag set, and the syllable's part-of-speech property. The experimental data show the following. First, if each abbreviated syllable is separated into two units for labelling rather than treated as one, the F-measure improves by 0.06 % and 0.10 % on the 4-tag set and the 6-tag set respectively. Second, if the 6-tag set is used rather than the 4-tag set, the F-measure improves by 0.10 % and 0.14 % under the two strategies for abbreviated syllables respectively. Third, when the syllable's part-of-speech property is taken into account, the F-measure improves by 0.47 % and 0.41 % over the two methods that do not use it on the 4-tag set, and by 0.45 % and 0.35 % on the 6-tag set, gains that are considerably larger than those from the first two factors. It is therefore a better choice to exploit the syllable's part-of-speech property while using the sub-syllable as the tagging unit.
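To make the tag-set comparison concrete, the following minimal sketch converts an already segmented sentence into position tags for each tagging unit. The abstract does not list the tags themselves, so the inventories used here are assumptions borrowed from common practice in Chinese word segmentation: B, M, E, S for the 4-tag set and B, B2, B3, M, E, S for the 6-tag set. The function names `tag_word` and `tag_sentence` and the placeholder units `u1` to `u8` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def tag_word(length: int, scheme: str = "4") -> List[str]:
    """Return position tags for one word made of `length` tagging units.

    Assumed inventories (not confirmed by the paper):
      scheme "4": B, M, E, S
      scheme "6": B, B2, B3, M, E, S
    """
    if length < 1:
        raise ValueError("a word needs at least one tagging unit")
    if length == 1:
        return ["S"]  # single-unit word
    if scheme == "4":
        return ["B"] + ["M"] * (length - 2) + ["E"]
    if scheme == "6":
        head = ["B", "B2", "B3"]
        before_end = length - 1  # units that precede the final E
        tags = head[:before_end] + ["M"] * max(0, before_end - 3)
        return tags + ["E"]
    raise ValueError(f"unknown scheme: {scheme!r}")


def tag_sentence(words: List[List[str]], scheme: str = "4") -> List[Tuple[str, str]]:
    """Flatten segmented words into (unit, tag) pairs for sequence labelling."""
    pairs: List[Tuple[str, str]] = []
    for units in words:
        pairs.extend(zip(units, tag_word(len(units), scheme)))
    return pairs


if __name__ == "__main__":
    # Placeholder units stand in for Tibetan syllables or sub-syllables.
    sentence = [["u1"], ["u2", "u3"], ["u4", "u5", "u6", "u7", "u8"]]
    for scheme in ("4", "6"):
        print(scheme, tag_sentence(sentence, scheme))
```

The same functions apply whether a unit is a whole syllable or a sub-syllable; only the unit lists passed in change. The resulting (unit, tag) pairs are the kind of training data an ME or CRF sequence labeller consumes.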
Acknowledgements
We thank the reviewers for their critical and constructive comments and suggestions, which helped us improve the quality of the paper. The research is partially supported by the National Science Foundation (No. 61202219, No. 61202220, No. 61303165) and the Informationization Project of the Chinese Academy of Sciences (No. XXH12504-1-10).
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Liu, H., Long, C., Nuo, M., Wu, J. (2015). Tibetan Word Segmentation as Sub-syllable Tagging with Syllable’s Part-of-Speech Property. In: Sun, M., Liu, Z., Zhang, M., Liu, Y. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. CCL NLP-NABD 2015. Lecture Notes in Computer Science, vol 9427. Springer, Cham. https://doi.org/10.1007/978-3-319-25816-4_16
DOI: https://doi.org/10.1007/978-3-319-25816-4_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25815-7
Online ISBN: 978-3-319-25816-4
eBook Packages: Computer Science (R0)