Abstract
We describe a method for converting web-traffic textual streams into a set of documents in a corpus, allowing established linguistic tools to be applied to the study of semantics, topic evolution, and token-combination signatures. A novel web-document corpus is also described, which represents semantic features from each batch for subsequent analysis. An American-English lexicon is used to create a canonical representation of each corpus, whereby each TermID maps consistently to the corresponding lexicon word or token. Finally, each corpus member is represented as a ‘document’ by combining the HTTP request string with the concatenation of all responses to it. This representation associates the request-string tokens with the resulting content, for consumption by document classification and comparison algorithms.
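As a minimal illustration of the document-forming step (a sketch, not the author's implementation; the function and field names are assumptions for exposition):

```python
# Minimal sketch of the 'document' representation described above:
# one HTTP request string concatenated with all responses observed for it.
# The delimiter and names here are illustrative assumptions.

def make_document(request: str, responses: list[str]) -> str:
    """Combine a request string with the concatenation of its responses."""
    return " ".join([request, *responses])

# Each (request, responses) pair contributes one document to the corpus.
corpus = [
    make_document("GET /index.html HTTP/1.1",
                  ["HTTP/1.1 200 OK", "<html><body>Hello</body></html>"]),
]
```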
Appendices
Appendix: Sample Representations
The following excerpt illustrates typical web traffic captured by Snort, TCPdump, and other string-oriented capture tools. These tools often insert line feeds after header records to enhance human readability. Binary data are rendered as ASCII characters, or as ‘.’ when the corresponding byte is not printable (Fig. 1).
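A minimal sketch of that ‘.’-substitution, assuming a simple printable-ASCII test rather than any particular tool's exact rendering rules:

```python
import string

PRINTABLE = set(string.ascii_letters + string.digits + string.punctuation + " ")

def render_ascii(payload: bytes) -> str:
    """Render a captured payload the way string-oriented tools display it:
    printable bytes as ASCII characters, all other bytes as '.'."""
    return "".join(chr(b) if chr(b) in PRINTABLE else "." for b in payload)

print(render_ascii(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n\x00\x1f\x8b"))
# GET / HTTP/1.1..Host: example.com.......
```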
Raw Data (from TCPdump)
Raw Traffic Strings After Tokenization
After Stop-Word Removal and Mapping to the Lexicon TermIDs
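This stage can be sketched as follows; the tiny lexicon and stop list are placeholders standing in for the chapter's American-English lexicon and its word-to-TermID mapping:

```python
import re

# Illustrative lexicon fragment; the real lexicon assigns a fixed
# TermID to every American-English word.
lexicon = {"semantic": 101, "topic": 102, "evolution": 103, "stream": 104}
stop_words = {"the", "of", "and", "a", "to"}

def to_term_ids(raw: str) -> list[tuple[int, str]]:
    """Tokenize, drop stop words, and map surviving lexicon words to TermIDs."""
    tokens = re.findall(r"[a-z]+", raw.lower())
    return [(lexicon[t], t) for t in tokens
            if t not in stop_words and t in lexicon]

print(to_term_ids("The evolution of a semantic topic stream"))
# [(103, 'evolution'), (101, 'semantic'), (102, 'topic'), (104, 'stream')]
```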
A validation display of token expansion from (T:F) back to lexicon word[T] per document. Documents #25 and #26 are shown:
WebStops
The webstops list contains word tokens that make up HTML markup and containers in web pages, such as tables, JavaScript functions, list structures, and style sheets. These words relate to the construction of containers common to all web pages and are therefore devoid of semantic content. Even though some of these tokens collide with lexicon words, they must be removed so that their high frequency does not dominate term-frequency (T:F)-sensitive semantic analysis algorithms such as latent Dirichlet allocation (a filtering sketch follows the list below).
academic, accent, accept, action, agent, align, alive, application, author, auto, background, banner, batch, before, begin, bind, blackboard, blank, block, body, bold, border, bottom, bounding, box, boxes, browse, browser, bundle, button, buttons, bytes, cache, cancel, center, char, character, characters, chars, check, class, click, clip, close, color, colorful, comma, common, compatibility, compatible, connection, console, content, continue, control, cookie, cookies, copy, . . .
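The filtering sketch referenced above, using a hypothetical fragment of the webstops set and applied before term-frequency counting:

```python
from collections import Counter

# Hypothetical fragment of the webstops set listed above.
webstops = {"body", "border", "color", "content", "class", "cookie"}

def term_frequencies(term_ids: list[int], id_to_word: dict[int, str]) -> Counter:
    """Build (T:F) term-frequency counts while excluding webstop tokens,
    so that markup vocabulary cannot dominate LDA's topic estimates."""
    return Counter(t for t in term_ids if id_to_word[t] not in webstops)

# Usage: counts = term_frequencies(doc_term_ids, id_to_word)
```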
CustomStopWords
Modern web pages contain other non-lexicon words associated with JavaScript code, variable names, and values. JavaScript markup attributes also typically contain numerous key-value pairs (KVPs). We observe that the values of KVPs most often carry the semantically interesting content. The current process tries to retain these values, as well as other named entities, while removing non-lexicon keys and variable names. The following custom stop-word list was formed after examination of a small set of web pages and was labeled by the author for subsequent use on this dataset. We suggest that a more accurate and dynamic result would process the KVPs early in tokenization by splitting on the equals character (“=”). HTML and JavaScript keywords can be formally enumerated and removed. Variable names are more difficult to determine precisely; partial-word stemming of segments may nevertheless yield satisfactory performance, since variables most often consist of concatenated words, sometimes camel-cased, for self-documentation. Entropy measures of the discovered components could further improve recognition of variable names (a sketch of these heuristics follows the list below).
bbnj, puvq, carin, pbtpid, panose, callout, errorh, btngradientopacity, imcspan, yvlq, abpay, unexpectedtype, validatedelete, pubi, headerbgcolor, logout, rssheadlinecell, colheader, classe, brea, codebase, iptg, serv, privacypolicy, sfri, offborder, jrskl, emihidden, regexpmatch, pollcometinterval, pickname, fieldcaption, reqrevision, headgrade, playlists, baccentmedium, mathfont, getmenubyname, substr, nbsp, clickimage, bord, sethttpmethod, nprmodpipe, active, . . .
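The heuristics sketch referenced above: KVP splitting on “=”, camel-case segmentation, and a character-entropy score. All names here are illustrative, and the entropy score is only a rough discriminator on short tokens:

```python
import math
import re
from collections import Counter

def split_kvp(token: str) -> tuple[str | None, str]:
    """Split a key=value pair, keeping the (often semantic) value side."""
    key, sep, value = token.partition("=")
    return (key, value) if sep else (None, token)

def split_camel(name: str) -> list[str]:
    """Break a camel-cased variable name into candidate word segments."""
    return [s.lower() for s in re.findall(r"[A-Z]?[a-z]+", name)]

def entropy(token: str) -> float:
    """Shannon entropy of a token's characters; machine-generated variable
    names tend to score somewhat higher than natural-language words."""
    counts = Counter(token)
    n = len(token)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

print(split_kvp("headerBgColor=blue"))   # ('headerBgColor', 'blue')
print(split_camel("headerBgColor"))      # ['header', 'bg', 'color']
print(round(entropy("jrskl"), 2), round(entropy("cookie"), 2))  # 2.32 2.25
```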
Semantic Analysis After LDA (Latent Dirichlet Allocation)
Treating such topic vectors in their pre-converted (TermID) form would preserve anonymity while allowing trending, differentiation, anomaly analysis, and comparison. The following output has been expanded back to the original words as a validation step, to show correspondence with the original corpus documents.
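A minimal sketch of that expansion step, assuming an `id_to_word` table that inverts the lexicon's word-to-TermID mapping (names are illustrative):

```python
# Illustrative inverse-lexicon fragment.
id_to_word = {101: "semantic", 102: "topic", 104: "stream"}

def expand(doc_tf: dict[int, int]) -> list[str]:
    """Replace each TermID in a (T:F) vector with its lexicon word,
    preserving the frequency, for human validation of the corpus."""
    return [f"{id_to_word[t]}:{f}" for t, f in sorted(doc_tf.items())]

print(expand({102: 3, 101: 1}))  # ['semantic:1', 'topic:3']
```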