Abstract
This paper presents a method for automatically inducing the parts of speech of the Vietnamese language from a large text corpus. We first build a class-based bigram language model using several statistical algorithms that assign words to classes based on their ability to combine with neighbouring words. We then show that this model extracts word classes that have the flavor of either syntactically based or semantically based groupings of Vietnamese words, two approaches that have long been disputed within the Vietnamese linguistic community. Finally, we evaluate the quality of the word clusters quantitatively by using word cluster features to improve the accuracy of a statistical part-of-speech tagger for Vietnamese.
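As a rough illustration of the class-based bigram idea mentioned in the abstract, the sketch below scores a word-to-class assignment with the standard factorization P(w_i | w_{i-1}) = P(c_i | c_{i-1}) · P(w_i | c_i), using maximum-likelihood counts. It is only a minimal sketch under stated assumptions: the function name, the toy corpus, and the two class assignments are hypothetical, and the paper's actual algorithms and data are not reproduced here.

```python
import math
from collections import Counter


def class_bigram_log_likelihood(tokens, word_to_class):
    """Log-likelihood of `tokens` under a class-based bigram model.

    Uses P(w_i | w_{i-1}) = P(c_i | c_{i-1}) * P(w_i | c_i) with
    maximum-likelihood estimates taken from the same token sequence.
    The first token is used only as context and is not scored.
    """
    classes = [word_to_class[w] for w in tokens]
    word_counts = Counter(tokens)
    class_counts = Counter(classes)
    # How often each class appears as a left context.
    context_counts = Counter(classes[:-1])
    # How often each ordered pair of classes appears.
    bigram_counts = Counter(zip(classes, classes[1:]))

    log_likelihood = 0.0
    for prev_c, w, c in zip(classes, tokens[1:], classes[1:]):
        p_transition = bigram_counts[(prev_c, c)] / context_counts[prev_c]
        p_emission = word_counts[w] / class_counts[c]
        log_likelihood += math.log(p_transition) + math.log(p_emission)
    return log_likelihood


# Hypothetical toy corpus: "tôi ăn cơm" (I eat rice), "tôi uống nước" (I drink water).
toy_tokens = ["tôi", "ăn", "cơm", "tôi", "uống", "nước"]

# Assumed assignment that groups words by distribution: pronoun / verb / noun.
grouped = {"tôi": 0, "ăn": 1, "uống": 1, "cơm": 2, "nước": 2}
# Baseline assignment that puts every word in a single class.
single_class = {w: 0 for w in toy_tokens}

grouped_ll = class_bigram_log_likelihood(toy_tokens, grouped)
single_ll = class_bigram_log_likelihood(toy_tokens, single_class)
```

On this toy input the distributional grouping gets a higher log-likelihood than the single-class baseline. That is the kind of signal a clustering algorithm can use when it assigns words to classes according to how they combine with their neighbours.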
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Le-Hong, P., Nguyen, T.M.H. (2014). Part-of-Speech Induction for Vietnamese. In: Huynh, V., Denoeux, T., Tran, D., Le, A., Pham, S. (eds) Knowledge and Systems Engineering. Advances in Intelligent Systems and Computing, vol 245. Springer, Cham. https://doi.org/10.1007/978-3-319-02821-7_24
DOI: https://doi.org/10.1007/978-3-319-02821-7_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-02820-0
Online ISBN: 978-3-319-02821-7
eBook Packages: Engineering, Engineering (R0)