Experimental Study of Gender and Language Variety Identification in Social Media

  • Vineetha Rebecca Chacko
  • M. Anand Kumar
  • K. P. Soman
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 750)


Social media has become a central part of everyday life. With a global population communicating online, large amounts of social media data accumulate. Owing to its volume, this data can be categorized as “Big Data”. It contains valuable information about the demographics of authors on online platforms, the analysis of which is required in certain scenarios to maintain decorum in the online community. Here, we analyze Twitter data, the training data of the PAN@CLEF 2017 shared task, to identify the gender and the language variety of the author. The data is available in four languages: English, Spanish, Portuguese, and Arabic. Both Document-Term Matrix (DTM) and Term Frequency-Inverse Document Frequency (TF-IDF) are used for text representation. The classifiers used are Support Vector Machine (SVM), AdaBoost, Decision Tree, and Random Forest.
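
The two text representations named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the whitespace tokenizer, the sample tweets, the helper names `build_dtm` and `tfidf`, and the specific TF-IDF weighting (length-normalized term frequency multiplied by log(N/df)) are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math
from collections import Counter


def build_dtm(docs):
    """Return (vocab, matrix); matrix[i][j] is the count of term j in document i.

    Assumption: lowercase whitespace tokenization stands in for whatever
    preprocessing the paper actually used.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {term: j for j, term in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, toks in enumerate(tokenized):
        for term, count in Counter(toks).items():
            matrix[i][index[term]] = count
    return vocab, matrix


def tfidf(matrix):
    """Weight a document-term matrix by tf * idf.

    Assumed variant: tf = count / document length, idf = log(N / df).
    Other variants (smoothing, normalization) are equally common.
    """
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    weighted = []
    for row in matrix:
        total = sum(row) or 1  # guard against empty documents
        weighted.append([
            (row[j] / total) * math.log(n_docs / df[j]) if df[j] else 0.0
            for j in range(n_terms)
        ])
    return weighted


# Hypothetical sample tweets, not taken from the PAN@CLEF 2017 data.
tweets = ["good morning everyone", "good game tonight", "morning coffee"]
vocab, dtm = build_dtm(tweets)
weights = tfidf(dtm)
```

Each row of `dtm` or `weights` is a feature vector for one document; vectors of this kind are what a classifier such as an SVM or a Random Forest would be trained on.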


Document Term Matrix, Term Frequency-Inverse Document Frequency, n-grams, Support vector machines, AdaBoost, Random Forest, Decision Tree, Author Profiling



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  • Vineetha Rebecca Chacko (1)
  • M. Anand Kumar (1)
  • K. P. Soman (1)
  1. Centre for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India
