Abstract
Efficiency in data-mining algorithms can be achieved by identifying representative prototypes or representative features and basing exploratory study only on those subsets. It is interesting to examine whether both can be achieved simultaneously through lossy compression and efficient clustering algorithms on large datasets; we study this aspect in the present chapter. We further examine whether there is a preferred ordering of the two activities; specifically, we compare clustering followed by compression with compression followed by clustering. We provide a detailed discussion of background material, including definitions of various terms and parameters and the choice of thresholds for reducing the number of patterns and features. We study eight lossy-compression scenarios and demonstrate that the compressed information they yield provides better classification accuracy than the original dataset. To this end, we implement the proposed scheme on two large datasets, one with binary-valued features and the other with floating-point-valued features. At the end of the chapter, we provide bibliographic notes and a list of references.
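As a rough illustration of the kind of scheme the abstract describes, the sketch below applies single-pass leader clustering to reduce the number of patterns (prototype selection) and a frequency threshold to reduce the number of features, in the order "clustering followed by feature compression". The tiny binary dataset, the distance and support thresholds, and the leader algorithm itself are illustrative assumptions, not the chapter's actual procedure or data.

```python
# Illustrative sketch only: the dataset and thresholds below are invented
# for demonstration; the chapter evaluates eight such compression scenarios.

def hamming(a, b):
    """Hamming distance between two equal-length binary patterns."""
    return sum(x != y for x, y in zip(a, b))

def leader_prototypes(patterns, dist_threshold):
    """Single-pass leader clustering: a pattern becomes a new prototype
    (leader) only if it is farther than dist_threshold from every
    current leader; otherwise it is absorbed by an existing cluster."""
    leaders = []
    for p in patterns:
        if all(hamming(p, lead) > dist_threshold for lead in leaders):
            leaders.append(p)
    return leaders

def frequent_features(patterns, min_support):
    """Lossy feature compression: keep only the feature positions whose
    value is 1 in at least min_support patterns."""
    counts = [sum(p[j] for p in patterns) for j in range(len(patterns[0]))]
    return [j for j, c in enumerate(counts) if c >= min_support]

# Toy binary-valued dataset (4 patterns, 5 features) -- an assumption.
data = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 0, 0],
]

protos = leader_prototypes(data, dist_threshold=1)   # pattern reduction
feats = frequent_features(data, min_support=2)       # feature reduction
compressed = [[p[j] for j in feats] for p in protos]

print(len(protos), feats, compressed)
```

With these thresholds the four patterns collapse to two prototypes and the five features to three, and any subsequent classifier (e.g. nearest neighbour) would operate only on the compressed representation; reversing the order of the two steps gives the alternative "compression followed by clustering" scenario discussed in the chapter.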
Copyright information
© 2013 Springer-Verlag London
Cite this chapter
Ravindra Babu, T., Narasimha Murty, M., Subrahmanya, S.V. (2013). Data Compaction Through Simultaneous Selection of Prototypes and Features. In: Compression Schemes for Mining Large Datasets. Advances in Computer Vision and Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-4471-5607-9_5
Publisher Name: Springer, London
Print ISBN: 978-1-4471-5606-2
Online ISBN: 978-1-4471-5607-9