Blog Data Mining for Cyber Security Threats

  • Flora S. Tsai
  • Kap Luk Chan

Blog data mining is a growing research area that addresses the domain-specific problem of extracting information from blog data. In our work, we analyzed blogs for various categories of cyber threats relevant to detecting security threats and cyber crime. We extended the Author-Topic model, which is based on Latent Dirichlet Allocation, to identify patterns of similarity in keywords and dates distributed across blog documents. From this model, we visualized the content and date similarities using the Isomap dimensionality reduction technique. Our findings support the theory that our probabilistic blog model can present the blogosphere in terms of topics with measurable keywords, thereby aiding investigative processes in understanding and responding to critical cyber security events and threats.
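
As a rough illustration of how content similarities of this kind might be quantified before visualization, the sketch below computes a symmetrized Kullback-Leibler distance between topic distributions and assembles a pairwise dissimilarity matrix, the sort of input a technique such as Isomap embeds in two dimensions. The choice of distance, the keyword names, and the probability values are all assumptions made for the example; they are not taken from the chapter's data or confirmed as its method.

```python
import math


def symmetric_kl(p, q, eps=1e-12):
    """Average of KL(p||q) and KL(q||p) for two discrete distributions.

    A small eps is added and the vectors are renormalized so that zero
    probabilities do not produce log(0). This smoothing is an assumption
    of the sketch.
    """
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    sum_p, sum_q = sum(p), sum(q)
    p = [x / sum_p for x in p]
    q = [x / sum_q for x in q]
    kl_pq = sum(a * math.log(a / b) for a, b in zip(p, q))
    kl_qp = sum(b * math.log(b / a) for a, b in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


def distance_matrix(distributions):
    """Pairwise symmetrized KL distances as a list of lists."""
    n = len(distributions)
    return [
        [symmetric_kl(distributions[i], distributions[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical topic distributions for three keywords (made-up values).
topic_dists = {
    "malware": [0.70, 0.20, 0.10],
    "phishing": [0.60, 0.30, 0.10],
    "election": [0.05, 0.15, 0.80],
}
names = list(topic_dists)
D = distance_matrix([topic_dists[name] for name in names])

for name, row in zip(names, D):
    print(name, [round(value, 3) for value in row])
```

In this toy input, "malware" and "phishing" have similar topic distributions, so their distance is smaller than either one's distance to "election"; a low-dimensional embedding of the matrix would be expected to place the first two close together.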

Keywords

Latent Dirichlet Allocation; Security Threat; Kullback Leibler; Probabilistic Latent Semantic Analysis; Nonlinear Dimensionality Reduction
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. Nanyang Technological University, Singapore
