Abstract
Instance selection is becoming increasingly relevant due to the huge amounts of data that are constantly being produced. However, although current algorithms are useful for fairly large datasets, they run into serious scaling problems when the number of instances reaches hundreds of thousands or millions. Most instance selection algorithms have a complexity of at least O(n²), n being the number of instances. When we face huge problems, scalability becomes an issue, and most of these algorithms are no longer applicable.
This paper presents a way of removing this difficulty by means of a parallel algorithm that performs several rounds of instance selection on subsets of the original dataset. The results of these rounds are combined using a voting scheme, achieving very good performance in terms of testing error and storage reduction while decreasing the execution time of the process very significantly. The method is especially efficient when the underlying instance selection algorithm has a high computational cost.
An extensive comparison on 35 medium- and large-sized datasets from the UCI Machine Learning Repository shows the usefulness of our method. Additionally, the method is applied to 6 huge datasets (from three hundred thousand to more than four million instances) with very good results and fast execution times.
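The scheme the abstract describes can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function names, default parameters, and the choice of Wilson editing (ENN) as the base selector are all assumptions made for the example. The key points it shows are that the per-subset runs are mutually independent (hence trivially parallelisable) and that removal decisions are made by accumulating votes across rounds.

```python
import random
from collections import Counter

def enn_votes(subset, data, labels, k=3):
    """Base selector: a simplified Wilson editing (ENN), used here as a
    stand-in for any instance selection algorithm. An instance receives a
    removal vote when the majority class of its k nearest neighbours
    within the subset differs from its own class. Assumes each subset
    holds more than k instances."""
    removals = []
    for i in subset:
        # sort the other subset members by squared Euclidean distance to i
        others = sorted((j for j in subset if j != i),
                        key=lambda j: sum((a - b) ** 2
                                          for a, b in zip(data[i], data[j])))
        neighbour_labels = [labels[j] for j in others[:k]]
        majority = Counter(neighbour_labels).most_common(1)[0][0]
        if majority != labels[i]:
            removals.append(i)
    return removals

def voting_instance_selection(data, labels, rounds=5, n_subsets=2,
                              threshold=3, seed=0):
    """Hypothetical sketch of the voting scheme: each round shuffles the
    indices, partitions them into disjoint subsets, runs the base selector
    on every subset (independent tasks that could run in parallel), and
    accumulates removal votes. Instances voted against at least
    `threshold` times are discarded; the rest are kept."""
    rng = random.Random(seed)
    votes = Counter()
    indices = list(range(len(data)))
    for _ in range(rounds):
        rng.shuffle(indices)
        subsets = [indices[s::n_subsets] for s in range(n_subsets)]
        for subset in subsets:  # independent tasks in a real parallel run
            votes.update(enn_votes(subset, data, labels))
    return [i for i in range(len(data)) if votes[i] < threshold]
```

Because each subset is much smaller than the full dataset, an O(n²) base selector runs on fragments of size n/s instead of n, which is where the reported speed-up comes from; the voting step then restores robustness that a single small-subset run would lack.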
This work was supported in part by the Project TIN2008-03151 of the Spanish Ministry of Science and Innovation and the Project of Excelence in Research P09-TIC-04623 of the Junta de Andalucía.
© 2010 Springer-Verlag Berlin Heidelberg
Cite this paper
de Haro-García, A., del Castillo, J.A.R., García-Pedrajas, N. (2010). Large Scale Instance Selection by Means of a Parallel Algorithm. In: Fyfe, C., Tino, P., Charles, D., Garcia-Osorio, C., Yin, H. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2010. IDEAL 2010. Lecture Notes in Computer Science, vol 6283. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15381-5_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15380-8
Online ISBN: 978-3-642-15381-5