Tibetan Word Segmentation as Sub-syllable Tagging with Syllable’s Part-of-Speech Property

  • Conference paper
Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data (CCL 2015, NLP-NABD 2015)

Abstract

When Tibetan word segmentation is treated as a sequence labelling problem, machine learning models such as maximum entropy (ME) and conditional random fields (CRFs) can be used to train the segmenter. The performance of the segmenter depends on many factors. This paper compares three of them: the strategy for handling abbreviated syllables, the tag set, and the syllable’s part-of-speech property. The experimental data show the following. First, if each abbreviated syllable is separated into two units for labelling rather than one, the F-measure improves by 0.06% and 0.10% with the 4-tag set and the 6-tag set respectively. Second, if the 6-tag set is used rather than the 4-tag set, the F-measure improves by 0.10% and 0.14% under the two abbreviated-syllable strategies respectively. Third, when the syllable’s part-of-speech property is taken into account, the F-measure improves by 0.47% and 0.41% over the two methods that do not use it with the 4-tag set, and by 0.45% and 0.35% with the 6-tag set, which is much higher than the former improvements. It is therefore a better choice to exploit the syllable’s part-of-speech information while using the sub-syllable as the tagging unit.
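To make the labelling setup concrete, the following is a minimal sketch of how a segmented sentence could be converted into position tags and how a word-level F-measure could be computed. The abstract does not list the tag inventories, so the sketch assumes the convention common in character-based Chinese word segmentation: B, M, E, S for the 4-tag set, and B, B2, B3, M, E, S for the 6-tag set. The function names and the placeholder units are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Set, Tuple


def tag_word(n_units: int, scheme: int = 4) -> List[str]:
    """Return position tags for one word made of n_units tagging units.

    Assumed inventories (not stated in the abstract):
      4-tag: B, M, E, S
      6-tag: B, B2, B3, M, E, S
    """
    if n_units < 1:
        raise ValueError("a word needs at least one unit")
    if n_units == 1:
        return ["S"]  # single-unit word
    if scheme == 4:
        return ["B"] + ["M"] * (n_units - 2) + ["E"]
    if scheme == 6:
        # B, B2, B3 mark up to the first three units before the final E;
        # any remaining inner units are tagged M.
        heads = ["B", "B2", "B3"][: n_units - 1]
        middle = ["M"] * (n_units - 1 - len(heads))
        return heads + middle + ["E"]
    raise ValueError("scheme must be 4 or 6")


def tag_sentence(words: Sequence[Sequence[str]], scheme: int = 4) -> List[Tuple[str, str]]:
    """Pair every unit with its tag. A unit may be a whole syllable or,
    under the two-unit strategy, one half of an abbreviated syllable."""
    pairs: List[Tuple[str, str]] = []
    for word in words:
        pairs.extend(zip(word, tag_word(len(word), scheme)))
    return pairs


def word_spans(word_lengths: Sequence[int]) -> Set[Tuple[int, int]]:
    """Turn word lengths (in units) into a set of (start, end) spans."""
    spans, start = set(), 0
    for n in word_lengths:
        spans.add((start, start + n))
        start += n
    return spans


def f_measure(gold_lengths: Sequence[int], pred_lengths: Sequence[int]) -> float:
    """Word-level F-measure: a predicted word counts as correct only if
    both of its boundaries match a gold word."""
    gold, pred = word_spans(gold_lengths), word_spans(pred_lengths)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(pred), correct / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Placeholder units u1..u8 stand in for syllables or sub-syllable pieces.
    sentence = [["u1"], ["u2", "u3"], ["u4", "u5", "u6", "u7", "u8"]]
    print(tag_sentence(sentence, scheme=4))
    print(tag_sentence(sentence, scheme=6))
    print(f_measure([1, 2, 5], [1, 2, 2, 3]))
```

Under these assumptions the two tag sets differ only for words of three or more units, where the 6-tag set distinguishes the second and third positions. The part-of-speech property studied in the paper would enter as an additional feature column per unit and is not modelled in this sketch.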

Notes

  1. http://taku910.github.io/crfpp

Acknowledgements

We thank the reviewers for their critical and constructive comments and suggestions that helped us improve the quality of the paper. The research is partially supported by National Science Foundation (No. 61202219, No. 61202220, No. 61303165) and Informationization Project of the Chinese Academy of Sciences (No. XXH12504-1-10).

Corresponding author

Correspondence to Huidan Liu.

Rights and permissions

Open Access. This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Liu, H., Long, C., Nuo, M., Wu, J. (2015). Tibetan Word Segmentation as Sub-syllable Tagging with Syllable’s Part-of-Speech Property. In: Sun, M., Liu, Z., Zhang, M., Liu, Y. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. CCL 2015, NLP-NABD 2015. Lecture Notes in Computer Science, vol 9427. Springer, Cham. https://doi.org/10.1007/978-3-319-25816-4_16

  • DOI: https://doi.org/10.1007/978-3-319-25816-4_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25815-7

  • Online ISBN: 978-3-319-25816-4

  • eBook Packages: Computer Science, Computer Science (R0)
