Abstract
Fast and efficient data processing is essential in machine learning, especially as the volume of stored data keeps growing. This exponential growth in data size has hampered traditional techniques for data analysis and processing, giving rise to a new set of methodologies grouped under the term Big Data. Many efficient machine learning algorithms have been proposed to cope with time and main-memory requirements. Nevertheless, processing can still become intractable when the number of features or records is extremely high. The goal of this paper is not to propose new efficient algorithms but a new data structure that can be used by a variety of existing algorithms without modifying their original schemata. Moreover, the proposed data structure enables sparse datasets to be massively reduced, efficiently transforming the data input into a new compact representation. The results demonstrate that the proposed data structure is highly promising, reducing the amount of storage required and improving query performance.
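The abstract does not detail the proposed structure itself, but as a hedged illustration of the general idea, a sparse dataset can be stored compactly by keeping only its non-zero entries, in the style of a compressed sparse row (CSR) layout. All names below (`SparseTable`, `from_dense`, `row`) are illustrative assumptions, not taken from the paper:

```python
# Hedged sketch: a CSR-style compressed layout for a sparse dataset.
# Only non-zero entries are stored, yet any record can be rebuilt on demand.

class SparseTable:
    """Stores only the non-zero entries of a dense record matrix."""

    def __init__(self, values, cols, row_ptr, n_cols):
        self.values = values    # non-zero values, row by row
        self.cols = cols        # column index of each stored value
        self.row_ptr = row_ptr  # row i spans values[row_ptr[i]:row_ptr[i+1]]
        self.n_cols = n_cols

    @classmethod
    def from_dense(cls, rows):
        values, cols, row_ptr = [], [], [0]
        for r in rows:
            for j, v in enumerate(r):
                if v != 0:
                    values.append(v)
                    cols.append(j)
            row_ptr.append(len(values))
        return cls(values, cols, row_ptr, len(rows[0]) if rows else 0)

    def row(self, i):
        """Reconstruct dense row i on demand."""
        dense = [0] * self.n_cols
        for k in range(self.row_ptr[i], self.row_ptr[i + 1]):
            dense[self.cols[k]] = self.values[k]
        return dense

# A 3x4 dataset with mostly zeros stores only 3 values instead of 12.
t = SparseTable.from_dense([[0, 5, 0, 0], [0, 0, 0, 0], [7, 0, 0, 2]])
```

Because existing algorithms only need row access, a layout like this can be swapped in underneath them without changing their schemata, which is the property the abstract emphasizes.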
Notes
1. Information about the API is available at http://www.uco.es/grupos/kdis/kdiswiki/SpeedingUpML.
2. Due to space limitations, a table with the results for each dataset is available at http://www.uco.es/grupos/kdis/kdiswiki/SpeedingUpML.
Acknowledgments
This research was supported by the Spanish Ministry of Economy and Competitiveness, project TIN-2014-55252-P, and by FEDER funds.
Copyright information
© 2016 Springer International Publishing Switzerland
Cite this paper
Padillo, F., Luna, J.M., Cano, A., Ventura, S. (2016). A Data Structure to Speed-Up Machine Learning Algorithms on Massive Datasets. In: Martínez-Álvarez, F., Troncoso, A., Quintián, H., Corchado, E. (eds) Hybrid Artificial Intelligent Systems. HAIS 2016. Lecture Notes in Computer Science(), vol 9648. Springer, Cham. https://doi.org/10.1007/978-3-319-32034-2_31
DOI: https://doi.org/10.1007/978-3-319-32034-2_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-32033-5
Online ISBN: 978-3-319-32034-2