Abstract
We describe a method for converting web-traffic textual streams into a set of documents in a corpus, allowing established linguistic tools to be applied to the study of semantics, topic evolution, and token-combination signatures. A novel web-document corpus is also described, which represents semantic features from each batch for subsequent analysis. An American-English lexicon is used to create a canonical representation of each corpus, whereby each TermID maps consistently to the corresponding lexicon word or token. Finally, each corpus member is represented as a ‘document’ by combining the HTTP request string with the concatenation of all responses to it. This representation associates the request-string tokens with the resulting content, for consumption by document classification and comparison algorithms.
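As a minimal illustration of the document-forming step (a sketch, not the author's implementation; the function and field names are assumptions for exposition):

```python
# Minimal sketch of the 'document' representation described above:
# one HTTP request string concatenated with all responses observed for it.
# The delimiter and names here are illustrative assumptions.

def make_document(request: str, responses: list[str]) -> str:
    """Combine a request string with the concatenation of its responses."""
    return " ".join([request, *responses])

# Each (request, responses) pair contributes one document to the corpus.
corpus = [
    make_document("GET /index.html HTTP/1.1",
                  ["HTTP/1.1 200 OK", "<html><body>Hello</body></html>"]),
]
```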
Appendices
Appendix: Sample Representations
The following excerpt illustrates typical web traffic captured by Snort, TCPdump, and other string-oriented capture tools. These tools often insert line feeds after header records to enhance human readability. Binary data are rendered as ASCII characters, or as ‘.’ when the corresponding byte is not printable (Fig. 1).
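A minimal sketch of that ‘.’-substitution, assuming a simple printable-ASCII test rather than any particular tool's exact rendering rules:

```python
import string

PRINTABLE = set(string.ascii_letters + string.digits + string.punctuation + " ")

def render_ascii(payload: bytes) -> str:
    """Render a captured payload the way string-oriented tools display it:
    printable bytes as ASCII characters, all other bytes as '.'."""
    return "".join(chr(b) if chr(b) in PRINTABLE else "." for b in payload)

print(render_ascii(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n\x00\x1f\x8b"))
# GET / HTTP/1.1..Host: example.com.......
```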
Raw Data (from TCPdump)
Raw Traffic Strings After Tokenization
After Stop-Word Removal and Mapping to the Lexicon TermIDs
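This stage can be sketched as follows; the tiny lexicon and stop list are placeholders standing in for the chapter's American-English lexicon and its word-to-TermID mapping:

```python
import re

# Illustrative lexicon fragment; the real lexicon assigns a fixed
# TermID to every American-English word.
lexicon = {"semantic": 101, "topic": 102, "evolution": 103, "stream": 104}
stop_words = {"the", "of", "and", "a", "to"}

def to_term_ids(raw: str) -> list[tuple[int, str]]:
    """Tokenize, drop stop words, and map surviving lexicon words to TermIDs."""
    tokens = re.findall(r"[a-z]+", raw.lower())
    return [(lexicon[t], t) for t in tokens
            if t not in stop_words and t in lexicon]

print(to_term_ids("The evolution of a semantic topic stream"))
# [(103, 'evolution'), (101, 'semantic'), (102, 'topic'), (104, 'stream')]
```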
A validation display of token expansion from (T:F) back to lexicon word[T] per document. Documents #25 and #26 are shown:
WebStops
The webstops list contains word tokens that make up HTML markup and containers in web pages, such as tables, JavaScript functions, list structures, and style sheets. These words relate to the construction of containers common to all web pages and are therefore devoid of semantic content. Even though some of these tokens collide with lexicon words, they must be removed so that their high frequency does not dominate term-frequency (T:F)-sensitive semantic analysis algorithms such as latent Dirichlet allocation (a filtering sketch follows the list below).
academic, accent, accept, action, agent, align, alive, application, author, auto, background, banner, batch, before, begin, bind, blackboard, blank, block, body, bold, border, bottom, bounding, box, boxes, browse, browser, bundle, button, buttons, bytes, cache, cancel, center, char, character, characters, chars, check, class, click, clip, close, color, colorful, comma, common, compatibility, compatible, connection, console, content, continue, control, cookie, cookies, copy, . . .
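The filtering sketch referenced above, using a hypothetical fragment of the webstops set and applied before term-frequency counting:

```python
from collections import Counter

# Hypothetical fragment of the webstops set listed above.
webstops = {"body", "border", "color", "content", "class", "cookie"}

def term_frequencies(term_ids: list[int], id_to_word: dict[int, str]) -> Counter:
    """Build (T:F) term-frequency counts while excluding webstop tokens,
    so that markup vocabulary cannot dominate LDA's topic estimates."""
    return Counter(t for t in term_ids if id_to_word[t] not in webstops)

# Usage: counts = term_frequencies(doc_term_ids, id_to_word)
```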
CustomStopWords
Modern web pages contain other non-lexicon words associated with JavaScript code, variable names, and values. JavaScript markup attributes also typically contain numerous key-value pairs (KVPs). We observe that the values of KVPs most often carry the semantically interesting content. The current process tries to retain these values, as well as other named entities, while removing non-lexicon keys and variable names. The following custom stop-word list was formed after examination of a small set of web pages and was labeled by the author for subsequent use on this dataset. We suggest that a more accurate and dynamic result would process the KVPs early in tokenization by splitting on the equals character (“=”). HTML and JavaScript keywords can be formally enumerated and removed. Variable names are more difficult to determine precisely; partial-word stemming of segments may nevertheless yield satisfactory performance, since variables most often consist of concatenated words, sometimes camel-cased, for self-documentation. Entropy measures of the discovered components could further improve recognition of variable names (a sketch of these heuristics follows the list below).
bbnj, puvq, carin, pbtpid, panose, callout, errorh, btngradientopacity, imcspan, yvlq, abpay, unexpectedtype, validatedelete, pubi, headerbgcolor, logout, rssheadlinecell, colheader, classe, brea, codebase, iptg, serv, privacypolicy, sfri, offborder, jrskl, emihidden, regexpmatch, pollcometinterval, pickname, fieldcaption, reqrevision, headgrade, playlists, baccentmedium, mathfont, getmenubyname, substr, nbsp, clickimage, bord, sethttpmethod, nprmodpipe, active, . . .
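The heuristics sketch referenced above: KVP splitting on “=”, camel-case segmentation, and a character-entropy score. All names here are illustrative, and the entropy score is only a rough discriminator on short tokens:

```python
import math
import re
from collections import Counter

def split_kvp(token: str) -> tuple[str | None, str]:
    """Split a key=value pair, keeping the (often semantic) value side."""
    key, sep, value = token.partition("=")
    return (key, value) if sep else (None, token)

def split_camel(name: str) -> list[str]:
    """Break a camel-cased variable name into candidate word segments."""
    return [s.lower() for s in re.findall(r"[A-Z]?[a-z]+", name)]

def entropy(token: str) -> float:
    """Shannon entropy of a token's characters; machine-generated variable
    names tend to score somewhat higher than natural-language words."""
    counts = Counter(token)
    n = len(token)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

print(split_kvp("headerBgColor=blue"))   # ('headerBgColor', 'blue')
print(split_camel("headerBgColor"))      # ['header', 'bg', 'color']
print(round(entropy("jrskl"), 2), round(entropy("cookie"), 2))  # 2.32 2.25
```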
Semantic Analysis After LDA (Latent Dirichlet Allocation)
Treating such topic vectors in their pre-converted (TermID) form would preserve anonymity while allowing trending, differentiation, anomaly analysis, and comparison. The following output has been expanded back to the original words as a validation step, to show correspondence with the original corpus documents.
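A minimal sketch of that expansion step, assuming an `id_to_word` table that inverts the lexicon's word-to-TermID mapping (names are illustrative):

```python
# Illustrative inverse-lexicon fragment.
id_to_word = {101: "semantic", 102: "topic", 104: "stream"}

def expand(doc_tf: dict[int, int]) -> list[str]:
    """Replace each TermID in a (T:F) vector with its lexicon word,
    preserving the frequency, for human validation of the corpus."""
    return [f"{id_to_word[t]}:{f}" for t, f in sorted(doc_tf.items())]

print(expand({102: 3, 101: 1}))  # ['semantic:1', 'topic:3']
```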