
Data Compaction Through Simultaneous Selection of Prototypes and Features

Chapter in Compression Schemes for Mining Large Datasets

Abstract

Efficiency in data-mining algorithms can be achieved by identifying representative prototypes or representative features and basing exploratory study only on those subsets. It is interesting to examine whether both can be obtained simultaneously through lossy compression and efficient clustering algorithms on large datasets; we study this aspect in the present chapter. We further examine whether one sequencing of these two activities is preferable: specifically, we compare clustering followed by compression with compression followed by clustering. We provide a detailed discussion of background material, including definitions of the various terms and parameters and the choice of thresholds for reducing the number of patterns and features. We study eight lossy compression scenarios and demonstrate that the compressed information provides better classification accuracy than the original dataset. To this end, we implement the proposed scheme on two large datasets, one with binary-valued features and the other with floating-point-valued features. The chapter ends with bibliographic notes and a list of references.
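The two orderings examined in the chapter, clustering followed by compression and compression followed by clustering, can be sketched in miniature. The sketch below is an illustrative assumption rather than the chapter's exact scheme: it uses single-pass leader clustering for prototype selection and a simple frequency threshold for feature selection on a tiny binary dataset, and the function names, thresholds, and data are all invented for illustration.

```python
import numpy as np

def leader_clustering(X, threshold):
    """Single-pass leader clustering: a pattern joins the first leader
    within `threshold` (Euclidean distance); otherwise it becomes a new
    leader.  The leaders serve as the selected prototypes."""
    leaders = []
    for x in X:
        if not any(np.linalg.norm(x - l) <= threshold for l in leaders):
            leaders.append(x)
    return np.array(leaders)

def frequent_feature_mask(X, min_frequency):
    """Keep binary features set in at least `min_frequency` of the
    patterns; returns a boolean column mask (a crude lossy compression)."""
    return X.mean(axis=0) >= min_frequency

# Tiny binary dataset: 6 patterns, 4 features (invented for illustration).
X = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
], dtype=float)

# Ordering 1: clustering followed by compression.
prototypes = leader_clustering(X, threshold=1.0)
mask1 = frequent_feature_mask(prototypes, min_frequency=0.5)
compact1 = prototypes[:, mask1]

# Ordering 2: compression followed by clustering.
mask2 = frequent_feature_mask(X, min_frequency=0.5)
compact2 = leader_clustering(X[:, mask2], threshold=1.0)

# Both orderings reduce the 6x4 dataset to 2 prototypes x 3 features here,
# but in general the two orderings need not agree.
print(compact1.shape, compact2.shape)  # (2, 3) (2, 3)
```

Either ordering yields a compact representation on which a classifier can then be trained; the chapter's question is which ordering, and which thresholds, preserve (or improve) accuracy.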


References

  • R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in Proceedings of International Conference on VLDB (1994)

  • P. Bradley, U.M. Fayyad, C. Reina, Scaling clustering algorithms to large databases, in Proceedings of 4th Intl. Conf. on Knowledge Discovery and Data Mining (AAAI Press, New York, 1998), pp. 9–15

  • C.J.C. Burges, A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2(2), 121–167 (1998).

  • P. Domingos, Occam’s two razors: the sharp and the blunt, in Proc. of 4th Intl. Conference on Knowledge Discovery and Data Mining (KDD’98), ed. by R. Agrawal, P. Stolorz (AAAI Press, New York, 1998), pp. 37–43

  • W. DuMouchel, C. Volinsky, T. Johnson, C. Cortes, D. Pregibon, Squashing flat files flatter, in Proc. 5th Intl. Conf. on Knowledge Discovery and Data Mining, San Diego, CA (AAAI Press, New York, 2002)

  • R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification (Wiley-Interscience, New York, 2000)

  • J. Han, M. Kamber, J. Pei, Data Mining—Concepts and Techniques (Morgan Kaufmann, New York, 2012)

  • P.E. Hart, The condensed nearest neighbor rule. IEEE Trans. Inf. Theory IT-14, 515–516 (1968)

  • A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review. ACM Comput. Surv. 31(3), 264–323 (1999)

  • J. Kittler, Feature selection and extraction, in Handbook of Pattern Recognition and Image Proc., ed. by T.Y. Young, K.S. Fu. (Academic Press, San Diego, 1986), pp. 59–83

  • L. Kaufman, P.J. Rousseeuw, Finding Groups in Data—An Introduction to Cluster Analysis (Wiley, New York, 1989)

  • S.K. Pal, P. Mitra, Pattern Recognition Algorithms for Data Mining (Chapman & Hall/CRC, London/Boca Raton, 2004)

  • T. Ravindra Babu, M. Narasimha Murty, V.K. Agrawal, Hybrid learning scheme for data mining applications, in Proc. of Fourth Intl. Conf. on Hybrid Intelligent Systems (IEEE Computer Society, Los Alamitos, 2004), pp. 266–271. doi:10.1109/ICHIS.2004.56

  • T. Ravindra Babu, M. Narasimha Murty, V.K. Agrawal, On simultaneous selection of prototypes and features in large data, in Proceedings of the First International Conference on Pattern Recognition and Machine Intelligence. Lecture Notes in Computer Science, vol. 3776 (Springer, Berlin, 2005), pp. 595–600

  • T. Ravindra Babu, M. Narasimha Murty, Comparison of genetic algorithm based prototype selection schemes. Pattern Recognit. 34(2), 523–525 (2001)

  • H. Spath, Cluster Analysis Algorithms for Data Reduction and Classification (Ellis Horwood, Chichester, 1980)

  • T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large databases, in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’96) (1996), pp. 103–114

  • Iris dataset (2013) http://archive.ics.uci.edu/ml/datasets/Iris. Accessed on 18 April 2013


Copyright information

© 2013 Springer-Verlag London

Cite this chapter

Ravindra Babu, T., Narasimha Murty, M., Subrahmanya, S.V. (2013). Data Compaction Through Simultaneous Selection of Prototypes and Features. In: Compression Schemes for Mining Large Datasets. Advances in Computer Vision and Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-4471-5607-9_5

  • DOI: https://doi.org/10.1007/978-1-4471-5607-9_5

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-5606-2

  • Online ISBN: 978-1-4471-5607-9

  • eBook Packages: Computer Science (R0)
