
Run-Length-Encoded Compression Scheme

  • T. Ravindra Babu
  • M. Narasimha Murty
  • S. V. Subrahmanya
Part of the Advances in Computer Vision and Pattern Recognition book series (ACVPR)

Abstract

Mining large datasets in a compressed domain is an interesting direction in data mining. In this chapter, we propose a lossless compression scheme based on run-length encoding of binary-valued features, or of floating-point-valued features that are appropriately quantized into binary-valued data. The proposed algorithm compresses a given dataset in terms of runs and computes the dissimilarity directly in the compressed domain, which results in significant gains in computation time and storage. We provide a detailed discussion of the relevant terms, the algorithms, and their performance. We demonstrate the efficiency of the scheme on the classification of unseen compressed patterns, and we discuss its applicability to genetic algorithms in which classification serves as the fitness function. We also provide a few application scenarios in data mining and theoretical discussions of the scheme. Bibliographic notes briefly discuss important relevant references, and a list of references is provided at the end.
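As an illustration of the idea summarized above, the following is a minimal sketch, not the chapter's own algorithm, of run-length encoding a binary pattern and computing the Hamming distance (which equals the Manhattan distance for binary data) directly on the runs, without decompressing. The function names `rle_encode` and `hamming_rle`, and the convention that the first run always counts 0s (with length zero if the pattern starts with 1), are assumptions made for this example only.

```python
from typing import List


def rle_encode(bits: List[int]) -> List[int]:
    """Encode a binary pattern as alternating run lengths.

    Assumed convention (hypothetical, for this sketch): runs alternate
    0s, 1s, 0s, ... and the first run counts 0s, so a pattern that
    starts with 1 gets a leading run of length 0.
    """
    runs: List[int] = []
    current = 0  # value whose run is currently being counted
    count = 0
    for b in bits:
        if b not in (0, 1):
            raise ValueError("pattern must be binary")
        if b == current:
            count += 1
        else:
            runs.append(count)
            current = b
            count = 1
    runs.append(count)
    return runs


def hamming_rle(a: List[int], b: List[int]) -> int:
    """Hamming distance between two run-length-encoded patterns.

    Walks both run lists in step; the value of run k is k % 2 under the
    convention above, so two overlapping runs disagree exactly when
    their indices have different parity.
    """
    if not a or not b or sum(a) != sum(b):
        raise ValueError("patterns must be non-empty encodings of equal length")
    i = j = 0
    rem_a, rem_b = a[0], b[0]  # unconsumed length of the current runs
    dist = 0
    while True:
        # Advance past exhausted (or zero-length) runs.
        while rem_a == 0 and i + 1 < len(a):
            i += 1
            rem_a = a[i]
        while rem_b == 0 and j + 1 < len(b):
            j += 1
            rem_b = b[j]
        if rem_a == 0 or rem_b == 0:
            break  # both patterns fully consumed
        step = min(rem_a, rem_b)
        if i % 2 != j % 2:
            dist += step  # the overlapping stretch differs in every position
        rem_a -= step
        rem_b -= step
    return dist


x = [1, 1, 0, 0, 0, 1]
y = [1, 0, 0, 0, 1, 1]
cx, cy = rle_encode(x), rle_encode(y)
d = hamming_rle(cx, cy)
```

The loop does work proportional to the number of runs rather than the number of features, which is where a saving in computation time would come from when patterns contain long runs.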

Keywords

Genetic Algorithm, Boolean Function, Anomaly Detection, Conjunctive Normal Form, Minimum Description Length
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’93) (1993), pp. 266–271
  2. A. Blumer, A. Ehrenfeucht, D. Haussler, M.K. Warmuth, Learnability and the Vapnik–Chervonenkis dimension. J. Assoc. Comput. Mach. 36(4), 929–965 (1989)
  3. P. Bradley, U.M. Fayyad, C. Reina, Scaling clustering algorithms to large databases, in Proceedings of 4th Intl. Conf. on Knowledge Discovery and Data Mining (AAAI Press, New York, 1998), pp. 9–15
  4. M.M. Breunig, H.P. Kriegel, J. Sander, Fast hierarchical clustering based on compressed data and OPTICS, in Proc. 4th European Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD), vol. 1910 (2000)
  5. G.J. Chaitin, On the length of programs for computing finite binary sequences. J. Assoc. Comput. Mach. 13, 547–569 (1966)
  6. V. Cherkassky, F. Mulier, Learning from Data (Wiley, New York, 1998)
  7. R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis (Wiley, New York, 1973)
  8. W. DuMouchel, C. Volinsky, T. Johnson, C. Cortes, D. Pregibon, Squashing flat files flatter, in Proc. 5th Intl. Conf. on Knowledge Discovery and Data Mining, San Diego, CA (AAAI Press, New York, 2002)
  9. B.C.M. Fung, Hierarchical document clustering using frequent itemsets. M.Sc. Thesis, Simon Fraser University (2002)
  10. M. Girolami, C. He, Probability density estimation from optimally condensed data samples. IEEE Trans. Pattern Anal. Mach. Intell. 25(10), 1253–1264 (2003)
  11. D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning (Addison-Wesley, Reading, 1989)
  12. J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques (Morgan Kaufmann, New York, 2012)
  13. T. Hastie, R. Tibshirani, Classification by pairwise coupling. Ann. Stat. 26(2) (1998)
  14. A.K. Jain, M.N. Murty, P. Flynn, Data clustering: a review. ACM Comput. Surv. 31(3) (1999)
  15. A.N. Kolmogorov, Three approaches to the quantitative definitions of information. Probl. Inf. Transm. 1(1), 1–7 (1965)
  16. V. Mäkinen, G. Navarro, E. Ukkonen, Approximate matching of run-length compressed strings. Algorithmica 35(4), 347–369 (2003)
  17. J.P. Marques de Sa, Pattern Recognition: Concepts, Methods and Applications (Springer, Berlin, 2001)
  18. P. Mitra, C.A. Murthy, S.K. Pal, Data condensation in large databases by incremental learning with support vector machines, in Proc. 15th International Conference on Pattern Recognition (ICPR’00), vol. 2 (2000)
  19. T. Ravindra Babu, M. Narasimha Murty, V.K. Agrawal, Classification of run-length encoded binary strings. Pattern Recognit. 40(1), 321–323 (2007)
  20. J. Rissanen, Modeling by shortest data description. Automatica 14, 465–471 (1978)
  21. T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large databases, in Proceedings of ACM SIGMOD International Conference of Management of Data (1996)
  22. V. Vapnik, Statistical Learning Theory, 2nd edn. (Wiley, New York, 1999)
  23. V. Vapnik, A.Ya. Chervonenkis, On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Dokl. Akad. Nauk, vol. 181 (1968) (Engl. Transl.: Sov. Math. Dokl.)
  24. V. Vapnik, A.Ya. Chervonenkis, The necessary and sufficient conditions for the consistency of the method of empirical risk minimization. Pattern Recognit. Image Anal. 1, 284–305 (1991) (Engl. Transl.)
  25. M. Vidyasagar, A Theory of Learning and Generalization (Springer, Berlin, 1997)

Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  • T. Ravindra Babu (1)
  • M. Narasimha Murty (2)
  • S. V. Subrahmanya (1)
  1. Infosys Technologies Ltd., Bangalore, India
  2. Indian Institute of Science, Bangalore, India
