Language Influences on Tweeter Geolocation

  • Conference paper
Advances in Information Retrieval (ECIR 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10193)

Abstract

We investigate the influence of language on the accuracy of geolocating Twitter users. Our analysis, using a large corpus of tweets written in thirteen languages, provides a new understanding of the reasons behind reported performance disparities between languages. The results show that data imbalance has a greater impact on accuracy than geographical coverage. A comparison between micro and macro averaging demonstrates that existing evaluation approaches are less appropriate than previously thought. Our results suggest both averaging approaches should be used to effectively evaluate geolocation.
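
To illustrate the distinction between the two averaging schemes, the following is a minimal sketch, not the evaluation code used in this work. It assumes each test user is reduced to a hypothetical `(group, is_correct)` pair, where the group could be a language or a location class; the function name `micro_macro_accuracy` and the toy data are likewise illustrative assumptions. Micro averaging pools all users, so large groups dominate; macro averaging computes accuracy per group and then gives each group equal weight.

```python
from collections import defaultdict


def micro_macro_accuracy(records):
    """Return (micro, macro) accuracy for (group, is_correct) pairs.

    `records` is a hypothetical input format: one pair per test user.
    """
    per_group = defaultdict(lambda: [0, 0])  # group -> [correct, total]
    for group, is_correct in records:
        per_group[group][0] += int(is_correct)
        per_group[group][1] += 1

    # Micro: pool every user, then divide once.
    total_correct = sum(c for c, _ in per_group.values())
    total = sum(n for _, n in per_group.values())
    micro = total_correct / total

    # Macro: accuracy per group, then an unweighted mean over groups.
    macro = sum(c / n for c, n in per_group.values()) / len(per_group)
    return micro, macro


# Toy, made-up data: a large group with high accuracy and a small one with low accuracy.
records = (
    [("large", True)] * 90 + [("large", False)] * 10
    + [("small", True)] * 2 + [("small", False)] * 8
)
micro, macro = micro_macro_accuracy(records)
# micro = 92 / 110, macro = (0.9 + 0.2) / 2 = 0.55
```

With imbalanced groups the two figures can diverge sharply, which is why reporting only one of them may hide poor performance on small groups.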

Notes

  1. https://archive.org/details/twitterstream&tab=collection.

  2. An open source language identification tool, trained over 97 languages and tested over six European languages with an accuracy of 0.94. We accepted only predictions with confidence \(\ge 0.5\).

  3. http://pythonhosted.org/Tashaphyne/.

  4. https://github.com/tq010or/acl2013.

  5. Although Cheng et al. [3] showed empirically that the percentage of tweeters within x miles increases as x increases (e.g., 30% of tweeters are placed within 16 km and 51% within 161 km), all subsequent research used an arbitrarily chosen threshold of 161 km. Note that Cheng et al. tested only on a US-based dataset, where the average distance between neighboring cities might differ from that in densely populated or small countries. Accuracy within 161 km might therefore not be an effective evaluation measure from a language comparison perspective; however, as it has been used in past work, we use it here.
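
The accuracy measure discussed in note 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: distances are great-circle distances computed with the haversine formula on a spherical Earth of radius 6371 km, and the names `haversine_km` and `accuracy_within` are hypothetical. The threshold is a parameter, so the sensitivity of the measure to the 161 km choice can be examined.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_within(pairs, threshold_km=161.0):
    """Fraction of (predicted, true) coordinate pairs at most threshold_km apart.

    `pairs` must be non-empty; each element is ((lat, lon), (lat, lon)).
    """
    hits = sum(
        1 for pred, true in pairs
        if haversine_km(pred[0], pred[1], true[0], true[1]) <= threshold_km
    )
    return hits / len(pairs)


# Illustrative coordinates only (approximate city centres), not data from the paper.
melbourne = (-37.8136, 144.9631)
geelong = (-38.1499, 144.3617)
sydney = (-33.8688, 151.2093)

pairs = [
    (geelong, melbourne),  # prediction close to the true location
    (sydney, melbourne),   # prediction several hundred km away
]
acc = accuracy_within(pairs)
```

Calling `accuracy_within(pairs, threshold_km=16.0)` on the same pairs shows how a stricter threshold changes the score.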

References

  1. Ahmed, A., Hong, L., Smola, A.J.: Hierarchical geographical modeling of user locations from social media posts. In: Proceedings of WWW, pp. 25–36 (2013)

  2. Backstrom, L., Sun, E., Marlow, C.: Find me if you can: improving geographical prediction with social and spatial proximity. In: Proceedings of WWW, pp. 61–70 (2010)

  3. Cheng, Z., Caverlee, J., Lee, K.: You are where you tweet: a content-based approach to geo-locating Twitter users. In: Proceedings of CIKM, pp. 759–768 (2010)

  4. Darwish, K., Magdy, W., Mourad, A.: Language processing for Arabic microblog retrieval. In: Proceedings of CIKM, pp. 2427–2430 (2012)

  5. Diakopoulos, N., De Choudhury, M., Naaman, M.: Finding and assessing social media information sources in the context of journalism. In: Proceedings of SIGCHI, pp. 2451–2460 (2012)

  6. Eisenstein, J., O’Connor, B., Smith, N.A., Xing, E.P.: A latent variable model for geographic lexical variation. In: Proceedings of EMNLP, pp. 1277–1287 (2010)

  7. Gonçalves, B., Sánchez, D.: Crowdsourcing dialect characterization through Twitter. PLoS ONE 9(11), e112074 (2014)

  8. Han, B., Baldwin, T.: Lexical normalisation of short text messages: Makn sens a #twitter. In: Proceedings of ACL, pp. 368–378 (2011)

  9. Han, B., Cook, P., Baldwin, T.: Text-based Twitter user geolocation prediction. J. Artif. Intell. Res. 49, 451–500 (2014)

  10. Hecht, B., Hong, L., Suh, B., Chi, E.H.: Tweets from Justin Bieber’s heart: the dynamics of the location field in user profiles. In: Proceedings of SIGCHI, pp. 237–246 (2011)

  11. Jurgens, D., Finethy, T., McCorriston, J., Xu, Y.T., Ruths, D.: Geolocation prediction in Twitter using social networks: a critical analysis and review of current practice. In: Proceedings of ICWSM (2015)

  12. Kinsella, S., Murdock, V., O’Hare, N.: I’m eating a sandwich in Glasgow: modeling locations with tweets. In: Proceedings of the 3rd International Workshop on Search and Mining User-generated Contents, pp. 61–68 (2011)

  13. Lui, M., Baldwin, T.: langid.py: an off-the-shelf language identification tool. In: Proceedings of ACL, pp. 25–30 (2012)

  14. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  15. Priedhorsky, R., Culotta, A., Del Valle, S.Y.: Inferring the origin locations of tweets with quantitative confidence. In: Proceedings of CSCW, pp. 1523–1536 (2014)

  16. Rahimi, A., Cohn, T., Baldwin, T.: pigeo: a Python geotagging tool. In: Proceedings of ACL-2016 System Demonstrations, pp. 127–132 (2016)

  17. Roller, S., Speriosu, M., Rallapalli, S., Wing, B., Baldridge, J.: Supervised text-based geolocation using language models on an adaptive grid. In: Proceedings of EMNLP, pp. 1500–1510 (2012)

  18. Sadilek, A., Kautz, H., Bigham, J.P.: Finding your friends and following them to where you are. In: Proceedings of WSDM, pp. 723–732 (2012)

  19. Sakaki, T., Okazaki, M., Matsuo, Y.: Earthquake shakes Twitter users: real-time event detection by social sensors. In: Proceedings of WWW, pp. 851–860 (2010)

  20. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. (CSUR) 34(1), 1–47 (2002)

  21. Starbird, K., Muzny, G., Palen, L.: Learning from the crowd: collaborative filtering techniques for identifying on-the-ground Twitterers during mass disruptions. In: Proceedings of ISCRAM (2012)

  22. Wing, B., Baldridge, J.: Hierarchical discriminative classification for text-based geolocation. In: Proceedings of EMNLP, pp. 336–348 (2014)

  23. Yang, Y.: An evaluation of statistical approaches to text categorization. Inf. Retr. 1(1–2), 69–90 (1999)

Acknowledgments

This work was made possible by NPRP grant # NPRP 6-1377-1-257 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors.

We thank the anonymous reviewers for their careful reading of our manuscript and their many insightful comments and suggestions.

Author information

Corresponding author

Correspondence to Ahmed Mourad.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Mourad, A., Scholer, F., Sanderson, M. (2017). Language Influences on Tweeter Geolocation. In: Jose, J., et al. Advances in Information Retrieval. ECIR 2017. Lecture Notes in Computer Science, vol 10193. Springer, Cham. https://doi.org/10.1007/978-3-319-56608-5_26

  • DOI: https://doi.org/10.1007/978-3-319-56608-5_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-56607-8

  • Online ISBN: 978-3-319-56608-5

  • eBook Packages: Computer Science (R0)
