Abstract
We discuss techniques for information extraction from texts, and present two applications that use these techniques. We focus in particular on social media texts (Twitter messages), which present challenges for the information extraction techniques because they are noisy and short. The first application is extracting the locations mentioned in Twitter messages, and the second one is detecting the location of the users based on all the tweets written by each user. The same techniques can be used for extracting other kinds of information from social media texts, with the purpose of monitoring the topics, events, emotions, or locations of interest to security and defence applications.
Notes
- 1. Liu et al. [25] recently released a dataset of various kinds of social media data annotated with generic location expressions, but not with cities, states/provinces, and countries.
- 2.
- 3.
- 4. The number of countries is larger than 200 because alternative names are counted; the same holds for states/provinces and cities.
- 5.
- 6.
- 7.
- 8. Explained in Sect. 5.2.
- 9. Not all of these 5000 n-grams are necessarily good location indicators. We do not distinguish them manually; a machine learning model should be able to do so after training.
- 10. Alternatively, we also tried a loss function defined as the average squared error of the output numbers, which is equivalent to the average Euclidean distance between the estimated location and the true location; this alternative model did not perform well.
- 11.
- 12.
- 13. Our code is available at https://github.com/rex911/usrloc.
- 14. We are unable to conduct t-tests on the Eisenstein models, because the detailed results produced by these models are unavailable.
- 15. We are unable to conduct t-tests on the other models, because the detailed results produced by these models are unavailable.
- 16. Only this metric was reported by the author in the top 3% features configuration.
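Note 10 mentions an alternative loss based on the squared error of the output numbers. As a rough illustration only (this is not the chapter's code, and the function names and coordinates below are hypothetical), the following sketch assumes the model outputs a latitude and a longitude per user. It shows that, for a single example, the squared error over the two outputs is the square of the Euclidean distance between the estimated and true points. This treats degrees as planar coordinates and ignores the curvature of the Earth.

```python
import math


def squared_error(pred, true):
    """Sum of squared differences over the output numbers (latitude, longitude)."""
    return sum((p - t) ** 2 for p, t in zip(pred, true))


def euclidean_distance(pred, true):
    """Straight-line distance between two points in coordinate space."""
    return math.sqrt(squared_error(pred, true))


def mean_squared_error_loss(preds, trues):
    """Average squared error over a batch of (lat, lon) predictions."""
    return sum(squared_error(p, t) for p, t in zip(preds, trues)) / len(preds)


# Hypothetical coordinates in degrees, for illustration only.
preds = [(45.0, -75.0), (40.0, -74.0)]
trues = [(45.0, -72.0), (44.0, -74.0)]

loss = mean_squared_error_loss(preds, trues)
distances = [euclidean_distance(p, t) for p, t in zip(preds, trues)]
```

Here the two per-example distances are 3 and 4 degrees, their squares are 9 and 16, and the averaged squared-error loss is 12.5.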
References
Aggarwal, C., Zhai, C.: A survey of text classification algorithms. In: Aggarwal, C.C., Zhai, C. (eds.) Mining Text Data, pp. 163–222. Springer (2012). http://dx.doi.org/10.1007/978-1-4614-3223-4_6
Aman, S., Szpakowicz, S.: Identifying expressions of emotion in text. In: Text, Speech and Dialogue, pp. 196–205. Springer (2007)
Bengio, Y.: Practical recommendations for gradient-based training of deep architectures. Neural Networks: Tricks of the Trade 7700, pp. 437–478 (2012). http://link.springer.com/chapter/10.1007/978-3-642-35289-8_26
Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013). http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6472238
Bengio, Y., Lamblin, P.: Greedy layer-wise training of deep networks. Adv. Neural Inf. Process. Syst. 19(153) (2007). https://papers.nips.cc/paper/3048-greedy-layer-wise-training-of-deep-networks.pdf
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., Bengio, Y.: Theano: a CPU and GPU math expression compiler. In: Proceedings of the Python for Scientific Computing Conference (SciPy), vol. 4, p. 3 (2010)
Chang, C.H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. IEEE Trans. Knowl. Data Eng. 18(10), 1411–1428 (2006)
Cohen, W.W.: Minorthird: methods for identifying names and ontological relations in text using heuristics for inducing regularities from data (2004)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995). http://dx.doi.org/10.1007/BF00994018
Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: A framework and graphical development environment for robust NLP tools and applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02). Association for Computational Linguistics (2002)
Domingos, P.: A few useful things to know about machine learning. Commun. ACM 55(10), 78–87 (2012)
Eisenstein, J., Ahmed, A., Xing, E.P.: Sparse additive generative models of text. In: Proceedings of the 28th International Conference on Machine Learning (ICML’11), pp. 1041–1048 (2011)
Eisenstein, J., O’Connor, B., Smith, N.A., Xing, E.P.: A latent variable model for geographic lexical variation. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 1277–1287. ACL (2010)
Ghazi, D., Inkpen, D., Szpakowicz, S.: Hierarchical versus flat classification of emotions in text. In: Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pp. 140–146. Association for Computational Linguistics, Los Angeles (June 2010). http://www.aclweb.org/anthology/W10-0217
Ghazi, D., Inkpen, D., Szpakowicz, S.: Prior and contextual emotion of words in sentential context. Comput. Speech Lang. 28(1), 76–92 (2014). http://dx.doi.org/10.1016/j.csl.2013.04.009
Glorot, X., Bordes, A., Bengio, Y.: Domain adaptation for large-scale sentiment classification: a deep learning approach. In: Proceedings of the 28th International Conference on Machine Learning (ICML’11), pp. 513–520 (2011)
Han, B., Cook, P., Baldwin, T.: Text-based Twitter user geolocation prediction. J. Artif. Intell. Res. 49(1), 451–500 (Jan 2014). http://dl.acm.org/citation.cfm?id=2655713.2655726
Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (Jul 2006). http://dl.acm.org/citation.cfm?id=1161603.1161605
Huang, F., Yates, A.: Exploring representation-learning approaches to domain adaptation. In: Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, pp. 23–30 (2010). http://dl.acm.org/citation.cfm?id=1870530
Inkpen, D., Liu, J., Farzindar, A., Kazemi, F., Ghazi, D.: Detecting and disambiguating locations in Twitter messages. In: Proceedings of the 16th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2015). Cairo, Egypt (2015)
Keshtkar, F., Inkpen, D.: A hierarchical approach to mood classification in blogs. Nat. Lang. Eng. 18(1), 61–81 (2012)
Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pp. 282–289. Morgan Kaufmann Publishers Inc., San Francisco (2001). http://dl.acm.org/citation.cfm?id=645530.655813
Li, H., Srihari, R.K., Niu, C., Li, W.: Location normalization for information extraction. In: Proceedings of the 19th International Conference on Computational Linguistics, vol. 1, pp. 1–7. Association for Computational Linguistics, Morristown (Aug 2002). http://dl.acm.org/citation.cfm?id=1072228.1072355
Li, T., Zhang, Y., Sindhwani, V.: A non-negative matrix tri-factorization approach to sentiment classification with lexical prior knowledge. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 244–252. Association for Computational Linguistics (Aug 2009). http://dl.acm.org/citation.cfm?id=1687878.1687914
Liu, F., Vasardani, M., Baldwin, T.: Automatic identification of locative expressions from social media text: A comparative analysis. In: Proceedings of the 4th International Workshop on Location and the Web. LocWeb ’14, pp. 9–16. ACM, New York (2014). http://doi.acm.org/10.1145/2663713.2664426
Liu, J., Inkpen, D.: Estimating user locations on social media: a deep learning approach. Technical Report. University of Ottawa (2014)
Mani, I., Hitzeman, J., Richer, J., Harris, D., Quimby, R., Wellner, B.: SpatialML: annotation scheme, corpora, and tools. In: Proceedings of the 6th International Conference on Language Resources and Evaluation, p. 11 (2008). http://www.lrec-conf.org/proceedings/lrec2008/summaries/106.html
Melville, P., Gryc, W., Lawrence, R.D.: Sentiment analysis of blogs by combining lexical knowledge with text classification. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’09), p. 1275. ACM Press, New York (June 2009). http://dl.acm.org/citation.cfm?id=1557019.1557156
Owoputi, O., O’Connor, B., Dyer, C., Gimpel, K., Schneider, N., Smith, N.A.: Improved part-of-speech tagging for online conversational text with word clusters. In: Proceedings of NAACL-HLT, pp. 380–390 (2013)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Priedhorsky, R., Culotta, A., Del Valle, S.Y.: Inferring the origin locations of tweets with quantitative confidence. In: Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’14), pp. 1523–1536. ACM Press, New York (Feb 2014). http://dl.acm.org/citation.cfm?id=2531602.2531607
Razavi, A.H., Inkpen, D., Brusilovsky, D., Bogouslavski, L.: General topic annotation in social networks: A latent Dirichlet allocation approach. In: Advances in Artificial Intelligence, Lecture Notes in Computer Science, vol. 7884, pp. 293–300. Springer, Berlin (2013). http://dx.doi.org/10.1007/978-3-642-38457-8_29
Razavi, A.H., Inkpen, D., Falcon, R., Abielmona, R.: Textual risk mining for maritime situational awareness. In: 2014 IEEE International Inter-Disciplinary Conference on Cognitive Methods in Situation Awareness and Decision Support (CogSIMA), pp. 167–173. IEEE (2014)
Roller, S., Speriosu, M., Rallapalli, S., Wing, B., Baldridge, J.: Supervised text-based geolocation using language models on an adaptive grid. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 1500–1510. Association for Computational Linguistics (Jul 2012). http://dl.acm.org/citation.cfm?id=2390948.2391120
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. Technical Report. DTIC Document (1985)
Sarawagi, S., Cohen, W.W.: Semi-Markov conditional random fields for information extraction. NIPS 17, 1185–1192 (2004)
Sinnott, R.W.: Virtues of the haversine. Sky Telesc. 68, 158 (1984)
Tang, D., Qin, B., Liu, T., Li, Z.: Learning sentence representation for emotion classification on microblogs. Natural Language Processing and Chinese Computing, vol. 400, pp. 212–223 (2013). http://link.springer.com/chapter/10.1007/978-3-642-41644-6_20
Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th International Conference on Machine Learning (ICML’08), pp. 1096–1103 (2008). http://portal.acm.org/citation.cfm?doid=1390156.1390294
Wing, B.P., Baldridge, J.: Simple supervised document geolocation with geodesic grids. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL HLT ’11), pp. 955–964. Association for Computational Linguistics (June 2011). http://dl.acm.org/citation.cfm?id=2002472.2002593
Yao, Y., Rosasco, L., Caponnetto, A.: On early stopping in gradient descent learning. Constr. Approx. 26(2), 289–315 (2007)
Acknowledgments
I would like to thank my collaborators who helped with the location detection project: Ji Rex Liu, Diman Ghazi, Atefeh Farzindar and Farzaneh Kazemi for task 1 and Ji Rex Liu for task 2.
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Inkpen, D. (2016). Text Mining in Social Media for Security Threats. In: Abielmona, R., Falcon, R., Zincir-Heywood, N., Abbass, H. (eds) Recent Advances in Computational Intelligence in Defense and Security. Studies in Computational Intelligence, vol 621. Springer, Cham. https://doi.org/10.1007/978-3-319-26450-9_19
DOI: https://doi.org/10.1007/978-3-319-26450-9_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26448-6
Online ISBN: 978-3-319-26450-9
eBook Packages: Engineering, Engineering (R0)