Abstract
We study the combination of symbol frequency analysis and negative selection for anomaly detection in discrete sequences where conventional negative selection algorithms are impractical due to data sparsity. Theoretical analysis of ergodic Markov chains is used to outline the properties of the presented anomaly detection algorithm and to predict the probability of successful detection. Simulations are used to evaluate the detection sensitivity and the resolution of the analysis on both artificially generated data and real-world language data, including the English Wikipedia. Simulation results on large reference corpora are used to study how the assumptions made in the theoretical model hold up against real-world data.
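The abstract does not spell out the algorithm, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a fixed window length `w`, represents each window by its character multiset (symbol counts, ignoring order), and treats any window whose multiset never occurs in the reference text as anomalous. The function names, the window length, and the use of a set membership test in place of explicitly generated detectors are all assumptions made for illustration.

```python
def window_multisets(text, w):
    """Yield the character multiset of every length-w window.

    The multiset is encoded as a sorted tuple so that it is hashable
    and independent of the order of symbols inside the window.
    """
    for i in range(len(text) - w + 1):
        yield tuple(sorted(text[i:i + w]))


def build_self_set(reference, w):
    """Collect every multiset observed in the reference ("self") text."""
    return set(window_multisets(reference, w))


def anomalous_positions(sample, self_set, w):
    """Return start indices of windows whose multiset is absent from self.

    Assumption: testing membership in the self set stands in for matching
    against an explicit detector set covering its complement.
    """
    return [i for i, m in enumerate(window_multisets(sample, w))
            if m not in self_set]


# Hypothetical toy data, chosen only to exercise the functions.
reference = "the quick brown fox jumps over the lazy dog"
self_set = build_self_set(reference, 3)
hits = anomalous_positions("the quick brown fax", self_set, 3)
print(hits)
```

Because only symbol counts are compared, a change that merely reorders characters inside a window (for example "teh" instead of "the") is not flagged in this sketch; that is the price of the coarser multiset representation.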
Additional information
This work was funded by the Academy of Finland under Grant No. 214144.
About this article
Cite this article
Pöllä, M., Honkela, T. Negative Selection of Written Language Using Character Multiset Statistics. J. Comput. Sci. Technol. 25, 1256–1266 (2010). https://doi.org/10.1007/s11390-010-9403-4
DOI: https://doi.org/10.1007/s11390-010-9403-4