Tibetan Syllable-Based Functional Chunk Boundary Identification

Shi, Shumin; Liu, Yujian; Wang, Tianhang; Long, Congjun; Huang, Heyan

doi:10.1007/978-3-319-69005-6_36

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10565)

Abstract

Tibetan syntactic functional chunk parsing aims to identify the syntactic constituents of Tibetan sentences. Based on the Tibetan syntactic functional chunk description system, we propose a method that groups syllables directly, rather than relying on word segmentation and tagging, and uses Conditional Random Fields (CRFs) to identify the functional chunk boundaries of a sentence. In line with the characteristics of the Tibetan language, we first identify and extract syntactic markers in the text preprocessing stage and use them as features for identifying syntactic functional chunk boundaries; these markers occur in both a sticky written form and a non-sticky written form. We then identify the syntactic functional chunk boundaries with the CRF model. Experiments on a Tibetan corpus containing 46,783 syllables achieve a precision, recall, and F value of 75.70%, 82.54%, and 79.12%, respectively. The results show that the proposed method is effective when applied to a small-scale unlabeled corpus and can provide foundational support for natural language processing applications such as machine translation.
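The syllable-level setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the marker inventory (`FREE_MARKERS`, `STICKY_SUFFIXES`), the B/I tag scheme, the feature names, and the toy chunking of the example sentence are all assumptions made for illustration. The sketch splits text into syllables at the tsheg delimiter, flags syllables that look like a free-standing or attached ("sticky") marker, and writes one tab-separated row per syllable with the label in the last column, which is the general layout that CRF toolkits such as CRF++ read.

```python
TSHEG = "\u0f0b"  # Tibetan syllable delimiter (tsheg)
SHAD = "\u0f0d"   # Tibetan clause/sentence delimiter (shad)

# Hypothetical, deliberately tiny marker inventory (illustration only;
# the paper's actual marker list is not given in the abstract).
FREE_MARKERS = {"གི", "ཀྱི", "གྱི", "གིས", "ཀྱིས", "གྱིས", "ལ", "ནས"}
STICKY_SUFFIXES = ("འི",)  # marker written attached to the preceding syllable


def split_syllables(text):
    """Drop shad marks, split on tsheg, and discard empty pieces."""
    cleaned = text.replace(SHAD, "")
    return [s.strip() for s in cleaned.split(TSHEG) if s.strip()]


def marker_feature(syllable):
    """Return an assumed feature value describing marker status."""
    if syllable in FREE_MARKERS:
        return "FREE"
    if syllable.endswith(STICKY_SUFFIXES):
        return "STICKY"
    return "NONE"


def to_crf_rows(chunks):
    """Turn a list of chunk strings into 'syllable<TAB>feature<TAB>tag' rows.

    The first syllable of each chunk is tagged B (boundary), the rest I.
    """
    rows = []
    for chunk in chunks:
        for i, syl in enumerate(split_syllables(chunk)):
            tag = "B" if i == 0 else "I"
            rows.append(f"{syl}\t{marker_feature(syl)}\t{tag}")
    return rows


if __name__ == "__main__":
    # Toy segmentation of a short sentence into two chunks (assumed, not gold data).
    for row in to_crf_rows(["ངའི་དེབ་", "ཡིན།"]):
        print(row)
```

Only the unambiguous attached genitive ending is used for the sticky case here; endings such as a bare final letter would also match ordinary syllables, so a realistic system would need a fuller rule set or lexicon.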

Notes

1. http://taku910.github.io/crfpp/

Acknowledgement

This work is supported by the National Natural Science Foundation of China (61671064, 61201352, and 61132009), the National Key Basic Research Program of China (2013CB329303) and the Fundamental Research Fund of Beijing Institute of Technology (20130742010).

Author information

Authors and Affiliations

School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China
Shumin Shi, Yujian Liu, Tianhang Wang & Heyan Huang
Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications, Beijing, 100081, China
Shumin Shi & Heyan Huang
Institute of Ethnology and Anthropology, Chinese Academy of Social Sciences, Beijing, 100081, China
Congjun Long

Corresponding author

Correspondence to Shumin Shi.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Maosong Sun
Beijing University of Posts and Telecommunications, Beijing, China
Xiaojie Wang
Peking University, Beijing, China
Baobao Chang
Soochow University, Suzhou, China
Deyi Xiong

Cite this paper

Shi, S., Liu, Y., Wang, T., Long, C., Huang, H. (2017). Tibetan Syllable-Based Functional Chunk Boundary Identification. In: Sun, M., Wang, X., Chang, B., Xiong, D. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD 2017, CCL 2017. Lecture Notes in Computer Science, vol 10565. Springer, Cham. https://doi.org/10.1007/978-3-319-69005-6_36

DOI: https://doi.org/10.1007/978-3-319-69005-6_36
Published: 07 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69004-9
Online ISBN: 978-3-319-69005-6
eBook Packages: Computer Science, Computer Science (R0)
