Negative Selection of Written Language Using Character Multiset Statistics

  • Regular Paper
  • Journal of Computer Science and Technology

Abstract

We study the combination of symbol frequency analysis and negative selection for anomaly detection in discrete sequences, where conventional negative selection algorithms are impractical due to data sparsity. Theoretical analysis of ergodic Markov chains is used to outline the properties of the presented anomaly detection algorithm and to predict the probability of successful detection. Simulations are used to evaluate the detection sensitivity and the resolution of the analysis on both artificially generated data and real-world language data, including the English Wikipedia. Simulation results on large reference corpora are used to study how the assumptions made in the theoretical model hold up against real-world data.
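
The abstract does not spell out the algorithm, so the following is only a rough sketch of how character multiset statistics might be combined with negative selection, under several assumptions: text is scanned with a fixed-length sliding window, each window is reduced to its multiset of characters (order discarded), and a window counts as anomalous when its multiset never occurs in the reference ("self") text. The function names and the window length `n` are hypothetical and not taken from the paper. Explicit detectors are not generated here; testing membership in the complement of the self set stands in for them.

```python
def window_multisets(text, n):
    """Yield a canonical key for the character multiset of each length-n window.

    Sorting the characters discards their order but keeps their counts,
    so "cat" and "tac" map to the same key "act".
    """
    for i in range(len(text) - n + 1):
        yield "".join(sorted(text[i:i + n]))


def build_self_set(reference, n):
    """Collect every character multiset observed in the reference text."""
    return set(window_multisets(reference, n))


def anomalous_windows(sample, self_set, n):
    """Return start indices of windows whose multiset is absent from self."""
    return [
        i
        for i, key in enumerate(window_multisets(sample, n))
        if key not in self_set
    ]


if __name__ == "__main__":
    n = 3  # hypothetical window length
    self_set = build_self_set("the cat sat on the mat", n)
    print(anomalous_windows("tac", self_set, n))      # same multiset as "cat"
    print(anomalous_windows("the dog", self_set, n))  # windows touching "dog"
```

Discarding character order shrinks the space of possible patterns, which is one plausible way to reduce the data sparsity the abstract mentions, at the cost of missing changes that only permute characters within a window.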

Author information

Correspondence to Matti Pöllä.

Additional information

This work was funded by the Academy of Finland under Grant No. 214144.

About this article

Cite this article

Pöllä, M., Honkela, T. Negative Selection of Written Language Using Character Multiset Statistics. J. Comput. Sci. Technol. 25, 1256–1266 (2010). https://doi.org/10.1007/s11390-010-9403-4

  • DOI: https://doi.org/10.1007/s11390-010-9403-4
