Text Mining in Social Media for Security Threats

Inkpen, Diana

doi:10.1007/978-3-319-26450-9_19

Chapter
First Online: 20 December 2015

Part of the book series: Studies in Computational Intelligence (SCI, volume 621)

Abstract

We discuss techniques for information extraction from texts, and present two applications that use these techniques. We focus in particular on social media texts (Twitter messages), which present challenges for the information extraction techniques because they are noisy and short. The first application is extracting the locations mentioned in Twitter messages, and the second one is detecting the location of the users based on all the tweets written by each user. The same techniques can be used for extracting other kinds of information from social media texts, with the purpose of monitoring the topics, events, emotions, or locations of interest to security and defence applications.

Notes

1. Liu et al. [25] recently released a dataset of various kinds of social media data annotated with generic location expressions, but not with cities, states/provinces, and countries.
2. https://dev.twitter.com.
3. http://www.geonames.org.
4. The number of countries is larger than 200 because alternative names are counted; the same holds for states/provinces and cities.
5. https://github.com/rex911/locdet.
6. http://www.ark.cs.cmu.edu/GeoTwitter.
7. https://github.com/utcompling/textgrounder/wiki/RollerEtAl_EMNLP2012.
8. Explained in Sect. 5.2.
9. Not all of these 5000 n-grams are necessarily good location indicators. We do not distinguish them manually; a machine learning model should be able to do so after training.
10. Alternatively, we also tried a loss function defined as the average squared error of the output numbers, which corresponds to the average squared Euclidean distance between the estimated location and the true location; this alternative model did not perform well.
11. http://www.mapquest.com.
12. http://www.census.gov/geo/maps-data/maps/pdfs/reference/us_regdiv.pdf.
13. Our code is available at https://github.com/rex911/usrloc.
14. We are unable to conduct t-tests on the Eisenstein models, because the detailed results produced by these models are unavailable.
15. We are unable to conduct t-tests on the other models, because the detailed results produced by these models are unavailable.
16. Only this metric was reported by the author in the top 3% features configuration.
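
As an illustration of the distinction drawn in note 10, the sketch below compares a squared-error measure computed directly on latitude/longitude outputs with the great-circle (haversine) distance between the same points. It is a minimal sketch and not the chapter's implementation: the function names, the Earth radius constant, and the choice to sum the squared error over both coordinates are assumptions made here for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, for illustration only


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def mean_squared_error(pred, true):
    """Average squared Euclidean distance, in squared degrees, over (lat, lon) pairs."""
    total = sum((p[0] - t[0]) ** 2 + (p[1] - t[1]) ** 2
                for p, t in zip(pred, true))
    return total / len(pred)


def mean_haversine_km(pred, true):
    """Average great-circle error in km over (lat, lon) pairs."""
    return sum(haversine_km(*p, *t) for p, t in zip(pred, true)) / len(pred)


if __name__ == "__main__":
    # Two hypothetical predictions, each off by one degree of longitude.
    true = [(0.0, 0.0), (60.0, 0.0)]
    pred = [(0.0, 1.0), (60.0, 1.0)]
    print("MSE (deg^2):", mean_squared_error(pred, true))
    print("Mean haversine error (km):", mean_haversine_km(pred, true))
```

The two measures are not interchangeable: a one-degree longitude error contributes the same squared error at any latitude, while the corresponding great-circle distance shrinks toward the poles.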

References

Aggarwal, C., Zhai, C.: A survey of text classification algorithms. In: Aggarwal, C.C., Zhai, C. (eds.) Mining Text Data, pp. 163–222. Springer (2012). http://dx.doi.org/10.1007/978-1-4614-3223-4_6
Aman, S., Szpakowicz, S.: Identifying expressions of emotion in text. In: Text, Speech and Dialogue, pp. 196–205. Springer (2007)
Bengio, Y.: Practical recommendations for gradient-based training of deep architectures. Neural Networks: Tricks of the Trade 7700, pp. 437–478 (2012). http://link.springer.com/chapter/10.1007/978-3-642-35289-8_26
Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013). http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6472238
Bengio, Y., Lamblin, P.: Greedy layer-wise training of deep networks. Adv. Neural Inf. Process. Syst. 19(153) (2007). https://papers.nips.cc/paper/3048-greedy-layer-wise-training-of-deep-networks.pdf
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., Bengio, Y.: Theano: a CPU and GPU math expression compiler. In: Proceedings of the Python for Scientific Computing Conference (SciPy). vol. 4, p. 3 (2010)
Chang, C.H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. Knowl. Data Eng. IEEE Trans. 18(10), 1411–1428 (2006)
Cohen, W.W.: Minorthird: methods for identifying names and ontological relations in text using heuristics for inducing regularities from data (2004)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995). http://dx.doi.org/10.1007/BF00994018
Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: A framework and graphical development environment for robust NLP tools and applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02). Association for Computational Linguistics (2002)
Domingos, P.: A few useful things to know about machine learning. Commun. ACM 55(10), 78–87 (2012)
Eisenstein, J., Ahmed, A., Xing, E.P.: Sparse additive generative models of text. In: Proceedings of the 28th International Conference on Machine Learning (ICML’11), pp. 1041–1048 (2011)
Eisenstein, J., O’Connor, B., Smith, N.A., Xing, E.P.: A latent variable model for geographic lexical variation. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 1277–1287. ACL (2010)
Ghazi, D., Inkpen, D., Szpakowicz, S.: Hierarchical versus flat classification of emotions in text. In: Proceedings of the NAACL HLT 2010 workshop on computational approaches to analysis and generation of emotion in text, pp. 140–146. Association for Computational Linguistics, Los Angeles, (June 2010). http://www.aclweb.org/anthology/W10-0217
Ghazi, D., Inkpen, D., Szpakowicz, S.: Prior and contextual emotion of words in sentential context. Comput. Speech Lang. 28(1), 76–92 (2014). http://dx.doi.org/10.1016/j.csl.2013.04.009
Glorot, X., Bordes, A., Bengio, Y.: Domain adaptation for large-scale sentiment classification: a deep learning approach. In: Proceedings of the 28th International Conference on Machine Learning (ICML’11). pp. 513–520 (2011)
Han, B., Cook, P., Baldwin, T.: Text-based Twitter user geolocation prediction. J. Artif. Intell. Res. 49(1), 451–500 (Jan 2014). http://dl.acm.org/citation.cfm?id=2655713.2655726
Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (Jul 2006). http://dl.acm.org/citation.cfm?id=1161603.1161605
Huang, F., Yates, A.: Exploring representation-learning approaches to domain adaptation. In: Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, pp. 23–30 (2010). http://dl.acm.org/citation.cfm?id=1870530
Inkpen, D., Liu, J., Farzindar, A., Kazemi, F., Ghazi, D.: Detecting and disambiguating locations in Twitter messages. In: Proceedings of the 16th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2015). Cairo, Egypt (2014)
Keshtkar, F., Inkpen, D.: A hierarchical approach to mood classification in blogs. Nat. Lang. Eng. 18(1), 61–81 (2012)
Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pp. 282–289. Morgan Kaufmann Publishers Inc., San Francisco (2001). http://dl.acm.org/citation.cfm?id=645530.655813
Li, H., Srihari, R.K., Niu, C., Li, W.: Location normalization for information extraction. In: Proceedings of the 19th international conference on Computational linguistics, vol. 1, pp. 1–7. Association for Computational Linguistics, Morristown, (Aug 2002). http://dl.acm.org/citation.cfm?id=1072228.1072355
Li, T., Zhang, Y., Sindhwani, V.: A non-negative matrix tri-factorization approach to sentiment classification with lexical prior knowledge. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. pp. 244–252. Association for Computational Linguistics, (Aug 2009). http://dl.acm.org/citation.cfm?id=1687878.1687914
Liu, F., Vasardani, M., Baldwin, T.: Automatic identification of locative expressions from social media text: A comparative analysis. In: Proceedings of the 4th International Workshop on Location and the Web. LocWeb ’14, pp. 9–16. ACM, New York (2014). http://doi.acm.org/10.1145/2663713.2664426
Liu, J., Inkpen, D.: Estimating user locations on social media: a deep learning approach. Technical Report. University of Ottawa (2014)
Mani, I., Hitzeman, J., Richer, J., Harris, D., Quimby, R., Wellner, B.: SpatialML: annotation scheme, corpora, and tools. In: Proceedings of the 6th International Conference on Language Resources and Evaluation, p. 11 (2008). http://www.lrec-conf.org/proceedings/lrec2008/summaries/106.html
Melville, P., Gryc, W., Lawrence, R.D.: Sentiment analysis of blogs by combining lexical knowledge with text classification. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’09), p. 1275. ACM Press, New York (June 2009). http://dl.acm.org/citation.cfm?id=1557019.1557156
Owoputi, O., O’Connor, B., Dyer, C., Gimpel, K., Schneider, N., Smith, N.A.: Improved part-of-speech tagging for online conversational text with word clusters. In: Proceedings of NAACL-HLT, pp. 380–390 (2013)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Priedhorsky, R., Culotta, A., Del Valle, S.Y.: Inferring the origin locations of tweets with quantitative confidence. In: Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’14), pp. 1523–1536. ACM Press, New York (Feb 2014). http://dl.acm.org/citation.cfm?id=2531602.2531607
Razavi, A.H., Inkpen, D., Brusilovsky, D., Bogouslavski, L.: General topic annotation in social networks: A latent dirichlet allocation approach. In: Advances in Artificial Intelligence, Lecture Notes in Computer Science, vol. 7884, pp. 293–300. Springer, Berlin (2013). http://dx.doi.org/10.1007/978-3-642-38457-8_29
Razavi, A.H., Inkpen, D., Falcon, R., Abielmona, R.: Textual risk mining for maritime situational awareness. In: 2014 IEEE International Inter-Disciplinary Conference on Cognitive Methods in Situation Awareness and Decision Support (CogSIMA), pp. 167–173. IEEE (2014)
Roller, S., Speriosu, M., Rallapalli, S., Wing, B., Baldridge, J.: Supervised text-based geolocation using language models on an adaptive grid. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. pp. 1500–1510. Association for Computational Linguistics (Jul 2012). http://dl.acm.org/citation.cfm?id=2390948.2391120
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. Technical Report. DTIC Document (1985)
Sarawagi, S., Cohen, W.W.: Semi-markov conditional random fields for information extraction. NIPS 17, 1185–1192 (2004)
Sinnott, R.W.: Virtues of the haversine. Sky Telesc. 68, 158 (1984)
Tang, D., Qin, B., Liu, T., Li, Z.: Learning sentence representation for emotion classification on microblogs. Natural Language Processing and Chinese Computing, vol. 400, pp. 212–223 (2013). http://link.springer.com/chapter/10.1007/978-3-642-41644-6_20
Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th International Conference on Machine Learning (ICML’08), pp. 1096–1103 (2008). http://portal.acm.org/citation.cfm?doid=1390156.1390294
Wing, B.P., Baldridge, J.: Simple supervised document geolocation with geodesic grids. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL HLT ’11), pp. 955–964. Association for Computational Linguistics (June 2011). http://dl.acm.org/citation.cfm?id=2002472.2002593
Yao, Y., Rosasco, L., Caponnetto, A.: On early stopping in gradient descent learning. Constr. Approx. 26(2), 289–315 (2007)

Acknowledgments

I would like to thank my collaborators who helped with the location detection project: Ji Rex Liu, Diman Ghazi, Atefeh Farzindar and Farzaneh Kazemi for task 1 and Ji Rex Liu for task 2.

Author information

Authors and Affiliations

School of Electrical Engineering and Computer Science, University of Ottawa, 800 King Edward, Ottawa, ON, K1N 6N5, Canada
Diana Inkpen

Corresponding author

Correspondence to Diana Inkpen.

Editor information

Editors and Affiliations

Larus Technologies Corporation, Ottawa, Ontario, Canada
Rami Abielmona
Larus Technologies Corporation, Ottawa, Ontario, Canada
Rafael Falcon
Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada
Nur Zincir-Heywood
School of Engineering and Information Technology, University of New South Wales, Canberra, Australian Capital Territory, Australia
Hussein A. Abbass

About this chapter

Cite this chapter

Inkpen, D. (2016). Text Mining in Social Media for Security Threats. In: Abielmona, R., Falcon, R., Zincir-Heywood, N., Abbass, H. (eds) Recent Advances in Computational Intelligence in Defense and Security. Studies in Computational Intelligence, vol 621. Springer, Cham. https://doi.org/10.1007/978-3-319-26450-9_19

DOI: https://doi.org/10.1007/978-3-319-26450-9_19
Published: 20 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26448-6
Online ISBN: 978-3-319-26450-9
eBook Packages: Engineering, Engineering (R0)