Location detection and disambiguation from twitter messages

Inkpen, Diana; Liu, Ji; Farzindar, Atefeh; Kazemi, Farzaneh; Ghazi, Diman

doi:10.1007/s10844-017-0458-3


Published: 31 March 2017

Volume 49, pages 237–253, (2017)

Journal of Intelligent Information Systems

Diana Inkpen ORCID: orcid.org/0000-0002-0202-2444¹,
Ji Liu¹,
Atefeh Farzindar²,
Farzaneh Kazemi² &
Diman Ghazi¹

1039 Accesses
25 Citations

Abstract

A remarkable number of Twitter messages are generated every second. Detecting the location entities mentioned in these messages is useful in text mining applications, so techniques for extracting location entities from Twitter text are needed. In this work, we approach the task in a manner similar to Named Entity Recognition (NER), but we focus only on locations, whereas NER systems detect names of persons, organizations, locations, and sometimes more (e.g., dates, times). Unlike NER systems, we also address a deeper task: classifying the detected locations into names of cities, provinces/states, and countries in order to map them to physical locations. We approach the task in a novel way that consists of two stages. In the first stage, we train Conditional Random Fields (CRF) models to detect the locations mentioned in the messages. We train three classifiers, one for cities, one for provinces/states, and one for countries, with various sets of features. Since no dataset annotated with this kind of information was available, we collected and annotated our own dataset for training and testing. In the second stage, we resolve the remaining ambiguities, namely cases in which more than one place has the same name. We propose a set of heuristics that choose the correct physical location in these cases. Our two-stage model allows a social media monitoring system to visualize the places mentioned in Twitter messages on a map of the world or to compute statistics about locations. This kind of information can be of interest to business or marketing applications.
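The second stage can be illustrated with a minimal Python sketch. The abstract does not list the heuristics themselves, so the two rules used here (prefer a candidate whose province/state or country is also mentioned in the message, otherwise fall back to the most populous candidate) are assumptions chosen for illustration, not the authors' method. The `Place` type, the `GAZETTEER` entries, the `disambiguate` function, and the population figures are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Place:
    """One gazetteer entry (hypothetical schema for this sketch)."""
    name: str
    admin: str       # province or state
    country: str
    population: int  # illustrative round number, not authoritative data


# Tiny hand-written gazetteer; a real system would load one from a resource
# such as GeoNames. All values below are placeholders for illustration.
GAZETTEER = [
    Place("London", "England", "United Kingdom", 8_900_000),
    Place("London", "Ontario", "Canada", 400_000),
    Place("Paris", "Ile-de-France", "France", 2_100_000),
    Place("Paris", "Texas", "United States", 25_000),
]


def disambiguate(city: str, context_locations: Iterable[str]) -> Optional[Place]:
    """Pick one physical place for an ambiguous city name.

    `context_locations` holds the province/state and country names that the
    first stage detected in the same message.
    """
    candidates = [p for p in GAZETTEER if p.name.lower() == city.lower()]
    if not candidates:
        return None

    context = {c.lower() for c in context_locations}

    # Assumed heuristic 1: keep candidates supported by a co-mentioned
    # province/state or country.
    supported = [
        p for p in candidates
        if p.admin.lower() in context or p.country.lower() in context
    ]

    # Assumed heuristic 2: among the remaining candidates (or all of them if
    # none is supported by context), prefer the most populous place.
    pool = supported or candidates
    return max(pool, key=lambda p: p.population)
```

With this sketch, `disambiguate("London", ["Ontario"])` selects the Canadian entry, while `disambiguate("London", [])` falls back to the more populous entry in the United Kingdom, and a name missing from the gazetteer yields `None`.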


Notes

https://github.com/rex911/locdet.
CRFs are undirected graphical models, while HMMs and MEMMs are directed graphical models. MEMMs can use features similar to those used in CRFs, while HMMs use only the transition probabilities learnt from training data, without computing specific features.
A stochastic process has the Markov property if the conditional probability distribution of future states of the process, conditional on both past and present values, depends only upon the present state; that is, given the present, the future does not depend on the past.
https://dev.twitter.com
We do not include states or provinces elsewhere because (1) not all countries have states or provinces, and (2) our application focuses on North America.
http://www.geonames.org.
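
The Markov property described in the notes above can be written compactly. The notation is an assumption introduced here for illustration: X_t denotes the state of the process at discrete time step t, and x_1, ..., x_{t+1} are particular values of those states.

```latex
P\left(X_{t+1} = x_{t+1} \mid X_1 = x_1, \dots, X_t = x_t\right)
  = P\left(X_{t+1} = x_{t+1} \mid X_t = x_t\right)
```

The left-hand side conditions on the whole history, while the right-hand side conditions only on the present state; the two being equal is exactly the statement that, given the present, the future does not depend on the past.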


Author information

Authors and Affiliations

School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON, K1N6N5, Canada
Diana Inkpen, Ji Liu & Diman Ghazi
NLP Technologies, Montreal, QC, H2Y1W7, Canada
Atefeh Farzindar & Farzaneh Kazemi


Corresponding author

Correspondence to Diana Inkpen.


About this article

Cite this article

Inkpen, D., Liu, J., Farzindar, A. et al. Location detection and disambiguation from twitter messages. J Intell Inf Syst 49, 237–253 (2017). https://doi.org/10.1007/s10844-017-0458-3


Received: 07 October 2014
Revised: 18 March 2017
Accepted: 20 March 2017
Published: 31 March 2017
Issue Date: October 2017
DOI: https://doi.org/10.1007/s10844-017-0458-3
