Abstract
This paper reports experiments on automatically detecting the gender of Twitter users based on unstructured information found in their Twitter profiles. A set of previously proposed features is evaluated on two datasets of English and Portuguese users, and their performance is assessed using several supervised and unsupervised approaches, including Naive Bayes variants, Logistic Regression, Support Vector Machines, Fuzzy c-Means clustering, and k-means. Results show that the features perform well in each language separately, but even better results were achieved when combining both languages. Supervised approaches reached 97.9% accuracy, but Fuzzy c-Means also proved suitable for this task, achieving 96.4% accuracy.
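To make the unsupervised side of the comparison concrete, the following is a minimal sketch of the standard Fuzzy c-Means update rules (Bezdek et al.) in plain Python. It is not the implementation used in the experiments: the function name `fuzzy_c_means`, the deterministic choice of initial centers, and the toy two-dimensional feature vectors are all assumptions made for illustration.

```python
import math


def fuzzy_c_means(points, c=2, m=2.0, iters=100, tol=1e-6):
    """Cluster `points` (equal-length tuples of floats) into `c` fuzzy clusters.

    Returns (centers, memberships). memberships[i][j] is the degree to which
    point i belongs to cluster j; each row sums to 1. The fuzzifier m must be > 1.
    Memberships are computed against the centers of the last iteration before
    the final center update, which differ by less than `tol` on convergence.
    """
    n, dim = len(points), len(points[0])
    # Illustrative assumption: deterministic start, centers taken from the
    # input at evenly spaced positions (real setups often use random starts).
    centers = [list(points[(j * (n - 1)) // max(c - 1, 1)]) for j in range(c)]
    u = [[0.0] * c for _ in range(n)]
    exponent = 2.0 / (m - 1.0)

    for _ in range(iters):
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1))
        for i, p in enumerate(points):
            d = [math.dist(p, ctr) for ctr in centers]
            zeros = [j for j, dist in enumerate(d) if dist == 0.0]
            if zeros:
                # The point sits exactly on one or more centers.
                u[i] = [1.0 / len(zeros) if j in zeros else 0.0 for j in range(c)]
            else:
                u[i] = [
                    1.0 / sum((d[j] / d[k]) ** exponent for k in range(c))
                    for j in range(c)
                ]

        # Center update: mean of the points weighted by u_ij^m
        new_centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            new_centers.append(
                [sum(w[i] * points[i][k] for i in range(n)) / total for k in range(dim)]
            )

        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break

    return centers, u


if __name__ == "__main__":
    # Hypothetical feature vectors, invented for this example only.
    toy = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (0.9, 1.0), (1.0, 0.9), (1.0, 1.0)]
    centers, memberships = fuzzy_c_means(toy, c=2)
    for point, row in zip(toy, memberships):
        hard_label = max(range(len(row)), key=lambda j: row[j])
        print(point, [round(x, 3) for x in row], "-> cluster", hard_label)
```

For a two-class task such as gender detection, one plausible way to turn the fuzzy output into a prediction is to take the cluster with the highest membership for each user and then map clusters to classes using a small labelled sample; that mapping step is also an assumption here, not something the abstract specifies.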
Notes
- 1.
- 2. Weka version 3-6-8. http://www.cs.waikato.ac.nz/ml/weka
Acknowledgements
This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) under project PTDC/IVC-ESCT/4919/2012 and funds with reference UID/CEC/50021/2013.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Vicente, M., Carvalho, J.P., Batista, F. (2015). Using Unstructured Profile Information for Gender Classification of Portuguese and English Twitter Users. In: Sierra-Rodríguez, JL., Leal, JP., Simões, A. (eds) Languages, Applications and Technologies. SLATE 2015. Communications in Computer and Information Science, vol 563. Springer, Cham. https://doi.org/10.1007/978-3-319-27653-3_6
DOI: https://doi.org/10.1007/978-3-319-27653-3_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27652-6
Online ISBN: 978-3-319-27653-3
eBook Packages: Computer Science (R0)