Semantic Features from Web-Traffic Streams

  • Steve Hutchinson
Part of the Advances in Information Security book series (ADIS, volume 55)


We describe a method for converting textual web-traffic streams into a corpus of documents, so that established linguistic tools can be used to study semantics, topic evolution, and token-combination signatures. We also describe a novel web-document corpus that represents the semantic features of each batch for subsequent analysis. An American-English lexicon is used to create a canonical representation of each corpus, in which every TermID maps consistently to the corresponding lexicon word or token. Finally, each corpus member is represented as a ‘document’ by combining the HTTP request string with the concatenation of all responses to it. This representation associates the request-string tokens with the resulting content, for consumption by document classification and comparison algorithms.
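The document construction and lexicon mapping described above can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the six-word lexicon, the letters-only tokenizer, and the names `make_document` and `to_term_counts` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical toy lexicon standing in for a full American-English lexicon.
LEXICON = ["account", "login", "news", "password", "search", "weather"]

# Sorting the lexicon gives every word the same TermID in every corpus.
TERM_ID = {word: i for i, word in enumerate(sorted(LEXICON))}


def tokenize(text):
    """Lowercase and split on non-letters (an assumed, simplistic tokenizer)."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]


def make_document(request, responses):
    """Combine the request string with the concatenation of its responses."""
    return request + " " + " ".join(responses)


def to_term_counts(document):
    """Map in-lexicon tokens to TermIDs and count their occurrences."""
    counts = Counter(TERM_ID[t] for t in tokenize(document) if t in TERM_ID)
    return dict(sorted(counts.items()))


doc = make_document(
    "GET /search?q=weather HTTP/1.1",
    ["<html><title>Weather search</title>", "<p>Local weather news</p></html>"],
)
vector = to_term_counts(doc)
print(vector)  # {2: 1, 4: 2, 5: 3}
```

Because the TermIDs come from a fixed lexicon rather than from each batch's own vocabulary, the count vectors of documents from different batches can be compared directly. Tokens outside the lexicon (here markup names such as `html`) are simply dropped in this sketch.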


Keywords: Semantic Analysis, Latent Dirichlet Allocation, Document Classification, Corpus Member, Stopword List
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. ICF International, Fairfax, USA
