Negative Selection of Written Language Using Character Multiset Statistics

Pöllä, Matti; Honkela, Timo

doi:10.1007/s11390-010-9403-4

Regular Paper
Published: 03 November 2010

Volume 25, pages 1256–1266 (2010)
Journal of Computer Science and Technology

Matti Pöllä¹ &
Timo Honkela¹

Abstract

We study the combination of symbol frequency analysis and negative selection for anomaly detection in discrete sequences where conventional negative selection algorithms are not practical due to data sparsity. Theoretical analysis of ergodic Markov chains is used to outline the properties of the presented anomaly detection algorithm and to predict the probability of successful detection. Simulations are used to evaluate the detection sensitivity and the resolution of the analysis on both generated artificial data and real-world language data, including the English Wikipedia. Simulation results on large reference corpora are used to study how the assumptions made in the theoretical model hold up against real-world data.
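The abstract does not spell out the algorithm, so the following is only a minimal sketch of how symbol counts and negative selection could be combined, not the authors' method. It assumes that detectors are (symbol, count) pairs over fixed-length windows, and that a pair becomes a detector when it is never observed in the reference ("self") text. The names `window_counts`, `train_detectors`, and `detect`, the window length `n`, and the toy alphabet are all hypothetical.

```python
from collections import Counter
from itertools import product


def window_counts(text, n):
    """Yield the character multiset (as a Counter) of every length-n window."""
    for i in range(len(text) - n + 1):
        yield Counter(text[i:i + n])


def train_detectors(self_text, alphabet, n):
    """Negative selection over (symbol, count) pairs (an assumed detector form).

    Candidates are all pairs (c, k) with c in the alphabet and 0 <= k <= n.
    Every pair observed in some window of the self text is discarded;
    the pairs that remain are kept as detectors.
    """
    candidates = set(product(alphabet, range(n + 1)))
    for counts in window_counts(self_text, n):
        for c in alphabet:
            # Counter returns 0 for symbols absent from the window.
            candidates.discard((c, counts[c]))
    return candidates


def detect(text, detectors, alphabet, n):
    """Return start indices of windows matched by at least one detector."""
    hits = []
    for i, counts in enumerate(window_counts(text, n)):
        if any((c, counts[c]) in detectors for c in alphabet):
            hits.append(i)
    return hits


if __name__ == "__main__":
    alphabet = "ab"   # toy alphabet, chosen only for illustration
    n = 4             # window length, an arbitrary choice
    detectors = train_detectors("abababab", alphabet, n)
    print(detect("abababab", detectors, alphabet, n))  # self text: no hits
    print(detect("ababaaab", detectors, alphabet, n))  # altered text
```

Because only per-window counts are stored, the candidate set has size |alphabet| × (n + 1) rather than |alphabet|^n, which is one way a count-based representation can sidestep the sparsity that affects string-matching detectors. Symbols outside the chosen alphabet are simply ignored in this sketch.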


Author information

Authors and Affiliations

Department of Information and Computer Science, School of Science and Technology, Aalto University, P.O. Box 15400, FI-00076, Aalto, Finland
Matti Pöllä & Timo Honkela

Corresponding author

Correspondence to Matti Pöllä.

Additional information

This work was funded by the Academy of Finland under Grant No. 214144.

About this article

Pöllä, M., Honkela, T. Negative Selection of Written Language Using Character Multiset Statistics. J. Comput. Sci. Technol. 25, 1256–1266 (2010). https://doi.org/10.1007/s11390-010-9403-4

Received: 15 July 2008
Revised: 06 August 2010
Published: 03 November 2010
Issue Date: November 2010
DOI: https://doi.org/10.1007/s11390-010-9403-4
