Using Unstructured Profile Information for Gender Classification of Portuguese and English Twitter Users

  • Conference paper
Languages, Applications and Technologies (SLATE 2015)

Abstract

This paper reports experiments on automatically detecting the gender of Twitter users based on unstructured information found in their Twitter profiles. A previously proposed set of features is evaluated on two datasets of English and Portuguese users, and its performance is assessed using several supervised and unsupervised approaches, including Naive Bayes variants, Logistic Regression, Support Vector Machines, Fuzzy c-Means clustering, and k-means. Results show that the features perform well in each language separately, but even better results were achieved when combining both languages. Supervised approaches reached 97.9% accuracy, but Fuzzy c-Means also proved suitable for this task, achieving 96.4% accuracy.
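To make the unsupervised side of the comparison concrete, the following is a minimal pure-Python sketch of the standard Fuzzy c-Means update loop. It is not the authors' implementation or feature set: the function name `fuzzy_c_means`, its parameters, and the two-dimensional toy points (imagined here as two hypothetical per-user scores) are all assumptions made for illustration.

```python
import random


def fuzzy_c_means(points, c=2, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Illustrative Fuzzy c-Means (hypothetical helper, not from the paper).

    points: list of equal-length lists of floats.
    Returns (centers, memberships), where memberships[i][k] is the degree
    to which point i belongs to cluster k (each row sums to 1).
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Random initial memberships, each row normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(max_iter):
        # Center update: mean of the points weighted by membership ** m.
        centers = []
        for k in range(c):
            w = [u[i][k] ** m for i in range(n)]
            total = sum(w)
            centers.append(
                [sum(w[i] * points[i][d] for i in range(n)) / total
                 for d in range(dim)]
            )

        # Membership update: proportional to distance ** (-2 / (m - 1)).
        new_u = []
        for p in points:
            dist = [
                max(sum((a - b) ** 2 for a, b in zip(p, ctr)) ** 0.5, 1e-12)
                for ctr in centers
            ]
            inv = [d ** (-2.0 / (m - 1.0)) for d in dist]
            total = sum(inv)
            new_u.append([v / total for v in inv])

        # Stop once memberships have effectively stopped changing.
        shift = max(abs(new_u[i][k] - u[i][k])
                    for i in range(n) for k in range(c))
        u = new_u
        if shift < tol:
            break

    return centers, u


# Toy data (assumed): two hypothetical scores per user, forming two groups.
points = [[0.90, 0.10], [0.80, 0.20], [0.85, 0.05],
          [0.10, 0.90], [0.20, 0.80], [0.05, 0.95]]
centers, memberships = fuzzy_c_means(points, c=2)

# Harden the fuzzy memberships into one cluster index per user.
labels = [max(range(2), key=lambda k: row[k]) for row in memberships]
```

Because clustering is unsupervised, the cluster indices carry no gender meaning on their own; in a sketch like this, each cluster would still have to be mapped to a class by some external criterion before accuracy could be computed.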

Notes

  1. https://github.com/scikit-fuzzy/scikit-fuzzy.

  2. Weka version 3-6-8. http://www.cs.waikato.ac.nz/ml/weka.

Acknowledgements

This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) under project PTDC/IVC-ESCT/4919/2012 and funds with reference UID/CEC/50021/2013.

Corresponding author

Correspondence to Joao P. Carvalho.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Vicente, M., Carvalho, J.P., Batista, F. (2015). Using Unstructured Profile Information for Gender Classification of Portuguese and English Twitter Users. In: Sierra-Rodríguez, JL., Leal, JP., Simões, A. (eds) Languages, Applications and Technologies. SLATE 2015. Communications in Computer and Information Science, vol 563. Springer, Cham. https://doi.org/10.1007/978-3-319-27653-3_6

  • DOI: https://doi.org/10.1007/978-3-319-27653-3_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27652-6

  • Online ISBN: 978-3-319-27653-3

  • eBook Packages: Computer Science, Computer Science (R0)
