Abstract
In this article, we address the problem of identifying the age of Twitter users from their online text. We used a set of text mining, sociolinguistic-based and content-related text features, and evaluated a number of well-known, widely used machine learning classification algorithms to examine their suitability for this task. The experimental results showed that the Random Forest algorithm offered the best performance, achieving an accuracy of 61%. We ranked the classification features by their informativeness using the ReliefF algorithm, and analyzed the results in terms of the sociolinguistic principles of age-related linguistic variation.
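To make the feature-ranking step concrete, the following is a minimal sketch of a simplified ReliefF-style weighting in plain Python. It is not the authors' implementation: the two feature names, the toy values, the age labels and the choice of k are all hypothetical, and every instance is visited once instead of being sampled at random. The idea it illustrates is that a feature gains weight when it differs between an instance and its nearest neighbours from other classes, and loses weight when it differs between an instance and its nearest neighbours from the same class.

```python
from collections import Counter


def relieff(X, y, k=2):
    """Simplified ReliefF: one weight per feature (higher = more informative)."""
    n, d = len(X), len(X[0])
    # Per-feature value range, used to scale differences into [0, 1].
    spans = []
    for f in range(d):
        col = [row[f] for row in X]
        spans.append((max(col) - min(col)) or 1.0)
    priors = {c: cnt / n for c, cnt in Counter(y).items()}

    def diff(f, a, b):
        return abs(a[f] - b[f]) / spans[f]

    def dist(a, b):
        return sum(diff(f, a, b) for f in range(d))

    weights = [0.0] * d
    for i in range(n):  # assumption: use every instance once, no sampling
        for c in priors:
            # k nearest neighbours of instance i inside class c (excluding i)
            idx = [j for j in range(n) if j != i and y[j] == c]
            idx.sort(key=lambda j: dist(X[i], X[j]))
            near = idx[:k]
            if not near:
                continue
            if c == y[i]:
                scale = -1.0  # nearest hits lower the weight
            else:
                # nearest misses raise it, weighted by class prior
                scale = priors[c] / (1.0 - priors[y[i]])
            for f in range(d):
                avg = sum(diff(f, X[i], X[j]) for j in near) / len(near)
                weights[f] += scale * avg / n
    return weights


# Hypothetical toy data: two made-up features per user, two age groups.
feature_names = ["emoticon_rate", "avg_word_length_scaled"]
X = [[0.90, 0.2], [0.80, 0.7], [0.85, 0.5],
     [0.10, 0.3], [0.20, 0.6], [0.15, 0.8]]
y = ["younger", "younger", "younger", "older", "older", "older"]

weights = relieff(X, y, k=2)
ranking = sorted(feature_names, key=lambda name: -weights[feature_names.index(name)])
print(ranking)
```

In this toy set the first feature separates the two groups cleanly while the second does not, so the first should receive the larger weight. A full ReliefF implementation would additionally handle nominal attributes and missing values.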
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Simaki, V., Mporas, I., Megalooikonomou, V. (2018). Age Identification of Twitter Users: Classification Methods and Sociolinguistic Analysis. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol 9624. Springer, Cham. https://doi.org/10.1007/978-3-319-75487-1_30
DOI: https://doi.org/10.1007/978-3-319-75487-1_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75486-4
Online ISBN: 978-3-319-75487-1
eBook Packages: Computer Science, Computer Science (R0)