Location detection and disambiguation from twitter messages

Journal of Intelligent Information Systems

Abstract

A remarkable number of Twitter messages are generated every second. Detecting the location entities mentioned in these messages is useful in text mining applications, so techniques for extracting location entities from Twitter text are needed. In this work, we approach this task in a manner similar to Named Entity Recognition (NER), but we focus only on locations, whereas NER systems detect names of persons, organizations, locations, and sometimes more (e.g., dates, times). Unlike NER systems, however, we address a deeper task: classifying the detected locations into names of cities, provinces/states, and countries in order to map them to physical locations. We approach the task in a novel way, consisting of two stages. In the first stage, we train Conditional Random Fields (CRF) models that detect the locations mentioned in the messages. We train three classifiers, one for cities, one for provinces/states, and one for countries, with various sets of features. Since a dataset annotated with this kind of information was not available, we collected and annotated our own dataset for training and testing. In the second stage, we resolve the remaining ambiguities, namely, cases in which more than one place has the same name. We propose a set of heuristics that choose the correct physical location in these cases. Our two-stage model will allow a social media monitoring system to visualize the places mentioned in Twitter messages on a map of the world or to compute statistics about locations. This kind of information can be of interest to business or marketing applications.
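
The two-stage idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the article: the BIO labeling scheme, the `Place` record, the toy `GAZETTEER` with rough population figures, and the two rules in `disambiguate` (prefer a candidate whose region or country is also mentioned, otherwise take the most populous one). The label sequence is written by hand where a trained CRF tagger would normally supply it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Place:
    name: str
    region: str      # province/state, or "" if none
    country: str
    population: int  # rough, illustrative figure


# Toy gazetteer (hypothetical); a real system could draw on a resource such as GeoNames.
GAZETTEER: Dict[str, List[Place]] = {
    "london": [
        Place("London", "England", "United Kingdom", 8_900_000),
        Place("London", "Ontario", "Canada", 400_000),
    ],
}


def spans_from_bio(tokens: List[str], labels: List[str]) -> List[str]:
    """Stage 1 output: join tokens labeled B/I by one detector into spans.

    The BIO scheme is an assumption; the labels would come from a trained tagger.
    """
    spans: List[str] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        if label == "B":
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


def disambiguate(city: str, context: Set[str]) -> Optional[Place]:
    """Stage 2: pick one candidate for `city` using other locations in the message."""
    candidates = GAZETTEER.get(city.lower(), [])
    if not candidates:
        return None
    context_lc = {c.lower() for c in context}
    # Assumed rule 1: prefer a candidate whose region or country is also mentioned.
    for place in candidates:
        if place.region.lower() in context_lc or place.country.lower() in context_lc:
            return place
    # Assumed rule 2: otherwise fall back to the most populous candidate.
    return max(candidates, key=lambda p: p.population)


tokens = "Heading to London , Ontario tomorrow".split()
cities = spans_from_bio(tokens, ["O", "O", "B", "O", "O", "O"])     # city detector
provinces = spans_from_bio(tokens, ["O", "O", "O", "O", "B", "O"])  # province/state detector
resolved = disambiguate(cities[0], set(provinces))
```

Keeping one label sequence per location type mirrors the three separate classifiers described in the abstract, and the resolved `Place` is the kind of record a monitoring system could then plot on a map.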

Notes

  1. https://github.com/rex911/locdet.

  2. CRFs are undirected graphical models, while HMMs and MEMMs are directed graphical models. MEMMs can use features similar to those used in CRFs, while HMMs use only the transition probabilities learnt from training data, without computing specific features.

  3. A stochastic process has the Markov property if the conditional probability distribution of future states of the process, conditional on both past and present values, depends only upon the present state; that is, given the present, the future does not depend on the past.

  4. https://dev.twitter.com

  5. We do not include states or provinces elsewhere because (1) not all countries have states or provinces and (2) our application focuses on North America.

  6. http://www.geonames.org.
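
The Markov property described in note 3 and the model contrast in note 2 can be written out in standard textbook notation. The symbols here are assumptions introduced for illustration and are not the article's own: $X_t$ is the state at step $t$, $\mathbf{x}$ the observed token sequence, $\mathbf{y}$ the label sequence, $f_k$ the feature functions with weights $\lambda_k$, and $Z(\mathbf{x})$ the normalizing constant.

```latex
% Markov property (first order): the next state depends only on the present state.
P(X_{t+1} = x \mid X_1 = x_1, \dots, X_t = x_t) = P(X_{t+1} = x \mid X_t = x_t)

% HMM (directed, generative): joint probability built from transition and emission terms.
p(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1}) \, p(x_t \mid y_t)

% Linear-chain CRF (undirected, discriminative): globally normalized over label sequences.
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
  \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```

Because each $f_k$ may inspect the whole input $\mathbf{x}$, a CRF can use arbitrary, overlapping features of a message, whereas the HMM factorization above only has transition and emission probabilities to work with.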

Author information

Corresponding author

Correspondence to Diana Inkpen.

About this article

Cite this article

Inkpen, D., Liu, J., Farzindar, A. et al. Location detection and disambiguation from twitter messages. J Intell Inf Syst 49, 237–253 (2017). https://doi.org/10.1007/s10844-017-0458-3

  • DOI: https://doi.org/10.1007/s10844-017-0458-3
