Gender Prediction Using Browsing History

  • Do Viet Phuong
  • Tu Minh Phuong
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 244)


Demographic attributes of Internet users, such as gender and age, provide important information for marketing, personalization, and user behavior research. This paper addresses the problem of predicting users’ gender from their browsing history. We take a classification-based approach and investigate a number of features derived from browsing log data. We show that high-level content features such as topics or categories are highly predictive of gender, and that combining them with features derived from access times and browsing patterns leads to significant improvements in prediction accuracy. We empirically verify the effectiveness of the method on real datasets from Vietnamese online media. The method substantially outperforms a baseline and achieves a macro-averaged F1 score of 0.805. Experimental results also demonstrate the value of combining different feature types: a combination of features achieves a 12% improvement in F1 score over the best-performing individual feature type.
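The headline result above is a macro-averaged F1 score. As a reminder of what that metric measures, the following minimal sketch computes it as the unweighted mean of the per-class F1 scores. The function name `macro_f1` and the toy labels are illustrative assumptions and are not taken from the paper; the paper's own evaluation code is not shown here.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Return the unweighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    # Every label seen in either sequence counts as a class.
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is never zero here
        # because each label occurs at least once in y_true or y_pred.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


# Hypothetical toy labels ("m" / "f"), for illustration only.
toy_true = ["m", "m", "m", "f"]
toy_pred = ["m", "m", "f", "f"]
toy_score = macro_f1(toy_true, toy_pred)
```

Because each class contributes equally regardless of its size, macro-averaging does not let a majority class dominate the score, which matters when the two genders are unevenly represented in a log.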


Keywords: Gender prediction, browsing history, classification




Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. R&D Lab, Vietnam Communication Corporation, Hanoi, Vietnam
  2. Department of Computer Science, Posts and Telecommunications Institute of Technology, Hanoi, Vietnam
