Detecting Gender by Full Name: Experiments with the Russian Language

  • Conference paper
Analysis of Images, Social Networks and Texts (AIST 2014)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 436)

Abstract

This paper describes a method that detects the gender of a person from his or her full name. While several approaches have been proposed for English, little has been done so far for Russian. We fill this gap and present a large-scale experiment on a dataset of 100,000 Russian full names from Facebook. Our method is based on three types of features (word endings, character \(n\)-grams, and a dictionary of names) combined within a linear supervised model. Experiments show that this simple and computationally efficient approach achieves an accuracy of up to 96%.
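
The combination of the three feature types in a linear model can be illustrated with a minimal sketch. This is not the authors' implementation: the toy name dictionary, the six training names, the bigram order, the two-character ending length, and the plain stochastic-gradient logistic regression are all assumptions chosen only to keep the example self-contained.

```python
import math
from collections import defaultdict

# Hypothetical toy dictionary of first names (an assumption for illustration;
# the dictionary used in the paper is not reproduced here).
NAME_DICT = {"иван": "male", "сергей": "male", "алексей": "male",
             "анна": "female", "мария": "female", "ольга": "female"}

# Hypothetical training sample: (full name, label), 1 = female, 0 = male.
TRAIN = [("Иван Петров", 0), ("Анна Петрова", 1),
         ("Сергей Иванов", 0), ("Мария Иванова", 1),
         ("Алексей Смирнов", 0), ("Ольга Смирнова", 1)]


def extract_features(full_name, n=2, ending_len=2):
    """Map a full name to sparse counts of the three feature types."""
    feats = defaultdict(float)
    for token in full_name.lower().split():
        # 1) word ending: the last `ending_len` characters of the token
        feats["end=" + token[-ending_len:]] += 1.0
        # 2) character n-grams over the token padded with boundary marks
        padded = "^" + token + "$"
        for i in range(len(padded) - n + 1):
            feats["ng=" + padded[i:i + n]] += 1.0
        # 3) dictionary lookup of the token
        if token in NAME_DICT:
            feats["dict=" + NAME_DICT[token]] += 1.0
    return dict(feats)


def train(samples, epochs=50, lr=0.1):
    """Fit logistic regression weights with plain stochastic gradient steps."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for name, label in samples:
            feats = extract_features(name)
            z = sum(weights[f] * v for f, v in feats.items())
            p = 1.0 / (1.0 + math.exp(-z))
            for f, v in feats.items():
                weights[f] += lr * (label - p) * v
    return weights


def predict(weights, full_name):
    """Return 'female' if the linear score is positive, else 'male'."""
    feats = extract_features(full_name)
    z = sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return "female" if z > 0 else "male"


model = train(TRAIN)
for name in ("Елена Кузнецова", "Дмитрий Кузнецов"):
    print(name, "->", predict(model, name))
```

Neither test name appears in the toy dictionary or the training sample, so the predictions rely on the ending and \(n\)-gram features alone. A sample this small says nothing about the accuracy reported above.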

Notes

  1. http://www.yandex.com/
  2. http://www.mail.ru/
  3. http://research.digsolab.com/gender
  4. http://www.natcorp.ox.ac.uk/
  5. http://www.ruscorpora.ru/en/
  6. https://developers.facebook.com/tools/explorer
  7. http://scikit-learn.org/
  8. http://imena-list.ru/
  9. http://www.gramota.ru/slovari/info/ag/
  10. http://ru.wiktionary.org/wiki/
  11. http://ru.wikisource.org/wiki/
  12. Available at http://panchenko.me/gender/wiki-gender-dict.csv.

Acknowledgments

This research was supported by Digital Society Laboratory LLC. We thank Kirill Shileev, Segei Objedkov and three anonymous reviewers for their helpful comments, which significantly improved the quality of this paper.

Author information

Corresponding author

Correspondence to Alexander Panchenko.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Panchenko, A., Teterin, A. (2014). Detecting Gender by Full Name: Experiments with the Russian Language. In: Ignatov, D., Khachay, M., Panchenko, A., Konstantinova, N., Yavorsky, R. (eds) Analysis of Images, Social Networks and Texts. AIST 2014. Communications in Computer and Information Science, vol 436. Springer, Cham. https://doi.org/10.1007/978-3-319-12580-0_17

  • DOI: https://doi.org/10.1007/978-3-319-12580-0_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12579-4

  • Online ISBN: 978-3-319-12580-0

  • eBook Packages: Computer Science (R0)
