Character Tagging-Based Word Segmentation for Uyghur

Yang, Yating; Mi, Chenggang; Ma, Bo; Dong, Rui; Wang, Lei; Li, Xiao

doi:10.1007/978-3-662-45701-6_6

Part of the book series: Communications in Computer and Information Science (CCIS, volume 493)

Included in the following conference series:

China Workshop on Machine Translation

Abstract

To extract information from Uyghur words effectively, we present a novel word segmentation method based on character tagging. We propose five labels for the characters in a Uyghur word: Su, Bu, Iu, Eu, and Au. Under this scheme, Uyghur word segmentation is cast as a sequence labeling task, with Conditional Random Fields (CRFs) as the underlying labeling model. Experiments show that our method captures more features of Uyghur words and therefore significantly outperforms several traditional word segmentation models.
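
To make the idea of segmentation as character tagging concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not define the five labels, so the sketch assumes a generic positional scheme with four tags (B = first character of a unit, I = inside, E = last character, S = single-character unit) and leaves the fifth label unmodeled. The function names `tag_characters` and `segment_from_tags` are hypothetical, and the Latin-script word "kitab" + "lar" is used purely as an illustrative input. In a full system, a sequence model such as a CRF would predict the tags, and the second function would recover the segmentation from them.

```python
from typing import List, Sequence, Tuple


def tag_characters(units: Sequence[str]) -> List[Tuple[str, str]]:
    """Turn a segmented word (a list of units) into (character, tag) pairs.

    Tags are an assumed generic scheme, not the paper's label set:
    B = begin, I = inside, E = end, S = single-character unit.
    """
    tagged: List[Tuple[str, str]] = []
    for unit in units:
        if len(unit) == 1:
            tagged.append((unit, "S"))
            continue
        last = len(unit) - 1
        for i, ch in enumerate(unit):
            if i == 0:
                tag = "B"
            elif i == last:
                tag = "E"
            else:
                tag = "I"
            tagged.append((ch, tag))
    return tagged


def segment_from_tags(chars: Sequence[str], tags: Sequence[str]) -> List[str]:
    """Rebuild units from per-character tags: a unit closes on E or S."""
    units: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            units.append(current)
            current = ""
    if current:
        # Keep any trailing characters if the tag sequence ends mid-unit.
        units.append(current)
    return units


# Illustrative input only: a stem followed by one suffix.
pairs = tag_characters(["kitab", "lar"])
chars = [c for c, _ in pairs]
tags = [t for _, t in pairs]
recovered = segment_from_tags(chars, tags)
```

Here `tags` plays the role of the training target for the labeling model, and `recovered` shows that the tag sequence alone is enough to reconstruct the unit boundaries.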

Author information

Authors and Affiliations

Xinjiang Technical Institute of Physics & Chemistry of Chinese Academy of Sciences, Urumqi, Xinjiang, 830011, China
Yating Yang, Chenggang Mi, Bo Ma, Rui Dong, Lei Wang & Xiao Li
University of Chinese Academy of Sciences, Beijing, 100049, China
Chenggang Mi & Rui Dong

Editor information

Editors and Affiliations

Xiamen University, 361005, Fujian, China
Xiaodong Shi
Xiamen University, Fujian, China
Yidong Chen

About this paper

Cite this paper

Yang, Y., Mi, C., Ma, B., Dong, R., Wang, L., Li, X. (2014). Character Tagging-Based Word Segmentation for Uyghur. In: Shi, X., Chen, Y. (eds) Machine Translation. CWMT 2014. Communications in Computer and Information Science, vol 493. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-45701-6_6

DOI: https://doi.org/10.1007/978-3-662-45701-6_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-45700-9
Online ISBN: 978-3-662-45701-6
eBook Packages: Computer Science, Computer Science (R0)
