Age Detection for Chinese Users in Weibo

  • Li Chen
  • Tieyun Qian
  • Fei Wang
  • Zhenni You
  • Qingxi Peng
  • Ming Zhong
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9098)


Age is one of the most important attributes in a user's profile. Age detection has many applications, such as personalized search, targeted advertising, and recommendation. Existing research has uncovered, to some extent, the relationship between the use of Western languages and social identity. However, the age detection problem for Chinese users has so far been unexplored. Because of cultural and societal differences, some well-known features for English may not be applicable to Chinese users. For example, while the frequency of capitalized letters has proved to be a good feature in English, Chinese has no such pattern. Moreover, Chinese has its own characteristics, such as rich emoticons, complex syntax, and unique lexical structures. Hence, age detection for Chinese users is a new and significant challenge.

In this paper, we present an age detection study on a corpus of microblogs from 3200 users of Sina Weibo. We construct three types of Chinese language patterns (stylistic, lexical, and syntactic features) and investigate their effects on age prediction. We find a number of interesting language patterns: (1) topics diverge significantly among Chinese people in different age groups, (2) young people are open to new slang from the internet or foreign languages and adopt it readily, and (3) young adults exhibit syntactic structures distinct from those of all other groups. Our best result reaches an accuracy of 88% when classifying users into four age groups.
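To make the idea of combining feature types concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each user is represented by a list of post strings, that Weibo emoticons appear as bracketed names such as [哈哈], and that character bigrams can stand in for lexical features. The function names, the feature names, and the punctuation set are all hypothetical. Syntactic features are omitted because they would require a Chinese parser.

```python
import re
from collections import Counter

# Assumption: emoticons appear in raw text as short bracketed names, e.g. [哈哈].
EMOTICON = re.compile(r"\[[^\[\]\s]{1,8}\]")
# Assumption: a small, illustrative set of Chinese and ASCII punctuation marks.
PUNCT = set("，。！？、；：…~!?,.;:")

def stylistic_features(posts):
    """Per-user style statistics averaged over posts (hypothetical feature set)."""
    num_posts = max(len(posts), 1)
    total_chars = sum(len(p) for p in posts)
    emoticons = sum(len(EMOTICON.findall(p)) for p in posts)
    punct = sum(ch in PUNCT for p in posts for ch in p)
    return {
        "style:avg_len": total_chars / num_posts,
        "style:emoticons_per_post": emoticons / num_posts,
        "style:punct_per_post": punct / num_posts,
    }

def lexical_features(posts, n=2):
    """Relative frequencies of character n-grams, with emoticons stripped first."""
    counts = Counter()
    for p in posts:
        text = EMOTICON.sub("", p)
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values()) or 1
    return {f"lex:{gram}": c / total for gram, c in counts.items()}

def combine(posts):
    """Merge both feature groups into one dictionary; prefixes keep names distinct."""
    feats = stylistic_features(posts)
    feats.update(lexical_features(posts))
    return feats

feats = combine(["今天好开心[哈哈]！", "好开心"])
```

A dictionary like this could be passed to any standard classifier after vectorization. Prefixing the keys by feature group makes it easy to switch groups on or off when comparing feature combinations.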


Keywords: Age detection, Chinese users, Feature selection, Feature combination





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Li Chen (1)
  • Tieyun Qian (1)
  • Fei Wang (1)
  • Zhenni You (1)
  • Qingxi Peng (1)
  • Ming Zhong (1)

  1. State Key Laboratory of Software Engineering, Wuhan University, Wuhan, China
