Turkish Normalization Lexicon for Social Media

Demir, Seniz; Tan, Murat; Topcu, Berkay

doi:10.1007/978-3-319-75487-1_33

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9624)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

Social media has its own ever-growing language and distinct characteristics. Although social media has been shown to be of great utility to research studies, the varying quality of written texts degrades the performance of existing NLP tools. Normalization, the transformation of informal texts into well-written ones, appears to be a reasonable preprocessing step for adapting tools trained on other domains to social media. In this study, we compile the first Turkish normalization lexicon, which sheds light on the kinds of lexical variation observed in social media texts. A graph representation acquired from a text corpus is used to model contextual similarities between normalization equivalences, and the lexicon is generated automatically by performing random walks on this graph. The underlying framework not only enables different lexicons to be generated from the same corpus but also produces lexicons that are tuned to specific genres. Evaluation studies demonstrated the effectiveness of the induced lexicon in normalizing Turkish texts.
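The graph-and-random-walk idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the trigram context window, the masked center token "*", the in-vocabulary set, the number of walks, and the step limit are all assumptions made for illustration. Words and their n-gram contexts form the two sides of a bipartite graph, and each walk alternates between the two sides until it reaches an in-vocabulary word.

```python
import random
from collections import defaultdict


def build_bipartite_graph(sentences, n=3):
    """Link each word to the n-gram contexts it occurs in (center word masked)."""
    word_to_ctx = defaultdict(lambda: defaultdict(int))
    ctx_to_word = defaultdict(lambda: defaultdict(int))
    half = n // 2
    for tokens in sentences:
        for i in range(half, len(tokens) - half):
            window = tokens[i - half:i + half + 1]
            # The context is the window with the center word replaced by "*".
            context = tuple(window[:half] + ["*"] + window[half + 1:])
            word_to_ctx[tokens[i]][context] += 1
            ctx_to_word[context][tokens[i]] += 1
    return word_to_ctx, ctx_to_word


def weighted_choice(counts, rng):
    """Pick a key with probability proportional to its co-occurrence count."""
    keys = list(counts)
    weights = [counts[k] for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


def random_walk_candidates(noisy, word_to_ctx, ctx_to_word, vocab,
                           walks=100, max_steps=4, seed=0):
    """Count how often walks from a noisy word end at an in-vocabulary word."""
    if noisy not in word_to_ctx:
        return {}
    rng = random.Random(seed)
    hits = defaultdict(int)
    for _ in range(walks):
        word = noisy
        for _ in range(max_steps):
            # One step = word -> context -> word, so the walk never moves
            # between two nodes on the same side of the bipartite graph.
            context = weighted_choice(word_to_ctx[word], rng)
            word = weighted_choice(ctx_to_word[context], rng)
            if word in vocab:
                hits[word] += 1
                break
    return dict(hits)


# Toy corpus (hypothetical): "cok" is a noisy spelling of "çok".
sentences = [
    ["bugün", "hava", "çok", "güzel"],
    ["bugün", "hava", "cok", "güzel"],
]
vocab = {"bugün", "hava", "çok", "güzel"}
word_to_ctx, ctx_to_word = build_bipartite_graph(sentences)
candidates = random_walk_candidates("cok", word_to_ctx, ctx_to_word, vocab)
print(candidates)
```

In this toy corpus the only context of "cok" is shared with "çok", so every successful walk ends there. On a real corpus the hit counts would give a ranking over several candidate normalizations.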

Notes

1. Since the graph representation models contextual similarities between individual words, OOV words that consist of more than one word due to omitted spaces (e.g., “şarkısözü”, which is in fact “şarkı sözü” {lyrics}) are manually removed. Automatic handling of these cases is left as future work.
2. twitter4j.org.
3. http://www.kemik.yildiz.edu.tr/?id=28.
4. Punctuation characters are omitted while identifying n-gram sequences.
5. https://github.com/ahmetaa/zemberek-nlp.
6. Stop words (e.g., “ve” {and}) and very frequent words (e.g., “bir” {one}) were observed to have higher degrees.
7. In a bipartite graph, a step cannot be taken between two nodes on the same side of the partition.
8. More than two trials could be made in order to reduce the effect of randomness.
9. How these cases can be handled is left as future work.
10. In our evaluations, an edit distance of 2 was used.
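Note 10 mentions an edit distance of 2 without saying, in this excerpt, how the value is applied. The sketch below assumes it acts as an upper bound on the Levenshtein distance between a noisy word and a candidate normalization. The function names and the example words are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def filter_candidates(noisy, candidates, max_distance=2):
    """Keep candidates within max_distance edits of the noisy word."""
    return [c for c in candidates if edit_distance(noisy, c) <= max_distance]


# "gzl" drops the vowels of "güzel"; "gözlük" is three insertions away.
print(filter_candidates("gzl", ["güzel", "gazel", "gözlük"]))
```

The comparison is over Unicode code points, so Turkish letters such as "ç" and "ü" count as single characters provided the input is in composed (NFC) form.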

Author information

Authors and Affiliations

TUBITAK-BILGEM, Kocaeli, Turkey
Seniz Demir, Murat Tan & Berkay Topcu

Corresponding author

Correspondence to Seniz Demir.

Editor information

Editors and Affiliations

CIC, Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Demir, S., Tan, M., Topcu, B. (2018). Turkish Normalization Lexicon for Social Media. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol 9624. Springer, Cham. https://doi.org/10.1007/978-3-319-75487-1_33

DOI: https://doi.org/10.1007/978-3-319-75487-1_33
Published: 21 March 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75486-4
Online ISBN: 978-3-319-75487-1
eBook Packages: Computer Science, Computer Science (R0)