Learning Age and Gender of Blogger from Stylistic Variation

  • Mayur Rustagi
  • R. Rajendra Prasath
  • Sumit Goswami
  • Sudeshna Sarkar
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5909)

Abstract

We report stylistic differences in blogging across gender and age groups. The results are based on two mutually independent features. The first is the use of slang words, a feature we propose as new for the stylistic study of bloggers. The second is the variation in average sentence length across age groups and genders. These features are combined with features from earlier stylistic-analysis studies reported in the literature. The combined feature set substantially improves the accuracy of age and gender prediction. The machine learning experiments were run on two separate demographically tagged blog corpora. Gender determination is more accurate than age-group detection when the data span all ages, but the accuracy of age prediction increases when the sampled age groups are widely separated.
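
As a rough illustration of the two features named in the abstract, the following Python sketch computes a slang-word rate and an average sentence length for a piece of text. It is not the authors' implementation: the slang lexicon `SLANG_WORDS`, the regex-based sentence splitter, and the function names are all assumptions made for this example, and the abstract does not specify how either feature was actually extracted.

```python
import re

# Hypothetical slang lexicon, for illustration only.
# The word list used in the paper is not given in the abstract.
SLANG_WORDS = {"lol", "omg", "gonna", "wanna", "dude"}


def split_sentences(text):
    """Naive sentence splitter: break on runs of '.', '!' or '?'."""
    parts = re.split(r"[.!?]+", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (apostrophes allowed)."""
    return re.findall(r"[a-z']+", text.lower())


def stylistic_features(text):
    """Return the two features as a dict:
    average sentence length in tokens, and fraction of tokens that are slang."""
    sentences = split_sentences(text)
    tokens = tokenize(text)
    slang_count = sum(1 for t in tokens if t in SLANG_WORDS)
    avg_len = len(tokens) / len(sentences) if sentences else 0.0
    slang_rate = slang_count / len(tokens) if tokens else 0.0
    return {"avg_sentence_length": avg_len, "slang_rate": slang_rate}


if __name__ == "__main__":
    post = "Omg that was great. I am gonna go home now!"
    print(stylistic_features(post))
```

In a setup like the one the abstract describes, a feature vector of this kind would presumably be joined with other stylistic features and passed to a classifier trained on demographically tagged posts; the choice of classifier is not stated in the abstract.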

Keywords

Confusion Matrix, Content Word, Sentence Length, Taboo Word, Stylistic Variation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Mayur Rustagi (1)
  • R. Rajendra Prasath (1)
  • Sumit Goswami (1)
  • Sudeshna Sarkar (1)
  1. Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India
