Abstract
Social media has its own evergrowing language and distinct characteristics. Although social media is shown to be of great utility to research studies, varying quality of written texts degrades the performance of existing NLP tools. Normalization of texts, transforming from informal to well-written texts, appears to be a reasonable preprocessing step to adapt tools trained on different domains to social media. In this study, we compile the first Turkish normalization lexicon that sheds light to the kinds of observed lexical variations in social media texts. A graphical representation acquired from a text corpus is used to model contextual similarities between normalization equivalences and the lexicon is automatically generated by performing random walks on this graph. The underlying framework not only enables different lexicons to be generated from the same corpus but also produces lexicons that are tuned to specific genres. Evaluation studies demonstrated the effectiveness of induced lexicon in normalizing Turkish texts.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
Since the graph representation is for modeling contextual similarities between individual words, OOV words that contain more than one word due to omitted spaces (e.g., “şarkısözü” which is indeed “şarkı sözü{lyrics}”) are manually removed. Automatic handling of these cases is left as future work.
- 2.
- 3.
- 4.
The punctuation characters are omitted while identifying n-gram sequences.
- 5.
- 6.
Stop words (e.g., “ve”{and}) and very frequent words (e.g.,“bir”{one}) were observed to have higher degrees.
- 7.
In a bipartite graph, a step cannot be taken between the nodes of the same bipartite.
- 8.
More than two trials could be made in order to reduce the effect of randomness.
- 9.
How these cases can be handled is indeed in our future work.
- 10.
In our evaluations, an edit distance of 2 was used.
References
Hu, Y., Talamadupula, K., Kambhampati, S.: Dude, srsly?: the surprisingly formal nature of Twitter’s language. In: 7th International AAAI Conference on Weblogs and Social Media (ICWSM), pp. 244–253 (2013)
Eisenstein, J., O’Connor, B., Smith, N.A., Xing, E.P.: Diffusion of lexical change in social media. PLoS One 9 (2014)
Herdağdelen, A.: Twitter n-gram corpus with demographic metadata. Lang. Resour. Eval. 47, 1127–1147 (2013)
Schwartz, H.A., Eichstaedt, J.C., Kern, M.L., Dziurzynski, L., Ramones, S.M., Agrawal, M., Shah, A., Kosinski, M., Stillwell, D., Seligman, M.E.P., Ungar, L.H.: Personality, gender, and age in the language of social media: the open-vocabulary approach. PLoS One 8 (2013)
Foster, J., Çetinoğlu, Ö., Wagner, J., Roux, J.L., Hogan, S., Nivre, J., Hogan, D., van Genabith, J.: # hardtoparse: POS tagging and parsing the twitterverse. In: The Workshop on Analyzing Microtext (AAAI), pp. 20–25 (2011)
Kucuk, D., Steinberger, R.: Experiments to improve named entity recognition on Turkish tweets. In: 5th Workshop on Language Analysis for Social Media, pp. 71–78 (2014)
Hassan, H., Menezes, A.: Social text normalization using contextual graph random walks. In: 51st Annual Meeting of the Association for Computational Linguistics, pp. 1577–1586 (2013)
Brill, E., Moore, R.C.: An improved error model for noisy channel spelling correction. In: 38th Annual Meeting on Association for Computational Linguistics, pp. 286–293 (2000)
Tautanova, K., Moore, R.C.: A pronunciation modeling for improved spelling correction. In: 40th Annual Meeting on Association for Computational Linguistics, pp. 144–151 (2002)
Choudhury, M., Saraf, R., Jain, V., Mukherjee, A., Sarkar, S., Basu, A.: Investigation and modeling of the structure of texting language. Int. J. Doc. Anal. Recogn. 10, 157–174 (2007)
Cook, P., Stevenson, S.: An unsupervised model for text message normalization. In: 4th Workshop on Computational Approaches to Linguistic Creativity (CALC), pp. 71–78 (2009)
Aw, A., Zhang, M., Xiao, J., Su, J.: A phrase-based statistical model for SMS text normalization. In: 21st International Conference on Computational Linguistics/ACL, pp. 33–40 (2006)
Kaufmann, M., Kalita, J.: Syntactic normalization of Twitter messages. In: International Conference on Natural Language Processing (2010)
Liu, F., Weng, F., Wang, B., Liu, Y.: Insertion, deletion, or substitution?: normalizing text messages without pre-categorization nor supervision. In: 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (HLT), pp. 71–76 (2011)
Liu, F., Weng, F., Jiang, X.: A broad-coverage normalization system for social media language. In: 50th Annual Meeting of the Association for Computational Linguistics, pp. 1035–1044 (2012)
Han, B., Cook, P., Baldwin, T.: Lexical normalization for social media text. ACM Trans. Intell. Syst. Technol. (TIST) 4, 5(1)–5(27) (2013)
Sönmez, C., Özgür, A.: A graph-based approach for contextual text normalization. In: Conference on Empirical Methods on Natural Language Processing (EMNLP), pp. 313–324 (2014)
Torunoğlu, D., Eryiğit, G.: A cascaded approach for social media text normalization of Turkish. In: 5th Workshop on Language Analysis for Social Media (LASM), pp. 62–70 (2014)
Yıldırım, S., Yıldız, T.: An unsupervised text normalization architecture for Turkish language. In: 16th International Conference on Intelligent Text Processing and Computational Linguistics (CICLING) (2015)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Demir, S., Tan, M., Topcu, B. (2018). Turkish Normalization Lexicon for Social Media. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science(), vol 9624. Springer, Cham. https://doi.org/10.1007/978-3-319-75487-1_33
Download citation
DOI: https://doi.org/10.1007/978-3-319-75487-1_33
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75486-4
Online ISBN: 978-3-319-75487-1
eBook Packages: Computer ScienceComputer Science (R0)