Twitter n-gram corpus with demographic metadata

  • Original Paper
  • Language Resources and Evaluation

Abstract

Social media is a natural laboratory for linguistic and sociological research. On micro-blogging platforms such as Twitter, people share hundreds of millions of short messages about their lives and experiences every day. These messages, coupled with metadata about their authors, provide an opportunity to study a wide variety of phenomena, ranging from political polarization to geographic and demographic lexical variation. The lack of publicly available micro-blogging datasets has hindered replicable research. In this paper, I introduce the Rovereto Twitter n-gram corpus, a publicly available n-gram dataset of Twitter messages, which contains gender-of-the-author and time-of-posting tags associated with the n-grams. I compare this dataset to a more traditional web-based corpus and present a case study that shows the potential of combining an n-gram corpus with demographic metadata.
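
To make the idea of n-grams carrying gender and time-of-posting tags concrete, the following is a minimal sketch and not the pipeline used to build the corpus. The input schema (text, gender tag, hour of day), the whitespace tokenization, and the function names `ngrams` and `aggregate` are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def aggregate(tweets, n=2):
    """Count n-grams keyed by (n-gram, gender tag, hour of posting).

    The key holds no user identifier, so only aggregate counts remain.
    """
    counts = Counter()
    for text, gender, hour in tweets:
        # Naive lowercasing and whitespace tokenization (an assumption).
        tokens = text.lower().split()
        for gram in ngrams(tokens, n):
            counts[(gram, gender, hour)] += 1
    return counts


# Hypothetical toy input: (text, gender tag, hour of day 0-23).
tweets = [
    ("good morning world", "F", 8),
    ("good morning everyone", "M", 8),
    ("Good morning world", "F", 9),
]
counts = aggregate(tweets, n=2)
```

Looking up `counts[(("good", "morning"), "F", 8)]` returns the number of times that bigram appeared with that gender tag in that hour. Counts can then be summed over either tag to compare usage by gender or by time of day.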

Notes

  1. Named as an homage to the author’s alma mater.

  2. http://clic.cimec.unitn.it/amac/twitter_ngram.

  3. http://www.twitter.com.

  4. http://dev.twitter.com/.

  5. http://creativecommons.org/licenses/by-nc-sa/3.0/.

  6. http://snap.stanford.edu/data/twitter7.html.

  7. http://dev.twitter.com/terms/api-terms.

  8. http://trec.nist.gov/data/tweets/.

  9. http://github.com/lintool/twitter-corpus-tools.

  10. http://www.ngrams.info/.

  11. http://dev.twitter.com/docs/streaming-api.

  12. http://code.google.com/p/language-detection/.

  13. English, Italian, German, Portuguese, Spanish, French, Turkish, Indonesian, Japanese, Dutch, Korean, Arabic, Russian, and Thai.

  14. http://alias-i.com/lingpipe/.

  15. http://blog.twitter.com/2010/03/state-of-twitter-spam.html.

  16. Note that the gender information of individual users is intentionally discarded during the n-gram aggregation process; thus the predicted genders of users are not disclosed in the dataset.

  17. http://github.com/amacinho/Name-Gender-Guesser.

  18. http://clic.cimec.unitn.it/amac/twitter_ngram.

  19. http://www.freebase.com/.

  20. http://github.com/amacinho/Gender-Expectations.

  21. Working with n-grams rather than full sentences may introduce errors in the lemmatization process, mainly because of part-of-speech ambiguities.

Acknowledgments

I thank Marco Baroni, Eser Aygün, and anonymous reviewers for their valuable comments on a draft of this manuscript.

Author information

Correspondence to Amaç Herdağdelen.

About this article

Cite this article

Herdağdelen, A. Twitter n-gram corpus with demographic metadata. Lang Resources & Evaluation 47, 1127–1147 (2013). https://doi.org/10.1007/s10579-013-9227-2

  • DOI: https://doi.org/10.1007/s10579-013-9227-2