Abstract
Fast and efficient data processing is essential in machine learning, especially as the volume of stored data keeps growing. This exponential growth in data size has hampered traditional techniques for data analysis and processing, giving rise to a new set of methodologies grouped under the term Big Data. Many efficient machine learning algorithms have been proposed to cope with time and main-memory requirements. Nevertheless, processing can still become intractable when the number of features or records is extremely high. The goal of this paper is not to propose new efficient algorithms but a new data structure that can be used by a variety of existing algorithms without modifying their original schemata. Moreover, the proposed data structure enables sparse datasets to be massively reduced, efficiently transforming the data input into a new compact representation. The results demonstrate that the proposed data structure is highly promising, reducing the amount of storage required and improving query performance.
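The abstract does not detail the proposed structure itself, but as a hedged illustration of the general idea, a sparse dataset can be stored compactly by keeping only its non-zero entries, in the style of a compressed sparse row (CSR) layout. All names below (`SparseTable`, `from_dense`, `row`) are illustrative assumptions, not taken from the paper:

```python
# Hedged sketch: a CSR-style compressed layout for a sparse dataset.
# Only non-zero entries are stored, yet any record can be rebuilt on demand.

class SparseTable:
    """Stores only the non-zero entries of a dense record matrix."""

    def __init__(self, values, cols, row_ptr, n_cols):
        self.values = values    # non-zero values, row by row
        self.cols = cols        # column index of each stored value
        self.row_ptr = row_ptr  # row i spans values[row_ptr[i]:row_ptr[i+1]]
        self.n_cols = n_cols

    @classmethod
    def from_dense(cls, rows):
        values, cols, row_ptr = [], [], [0]
        for r in rows:
            for j, v in enumerate(r):
                if v != 0:
                    values.append(v)
                    cols.append(j)
            row_ptr.append(len(values))
        return cls(values, cols, row_ptr, len(rows[0]) if rows else 0)

    def row(self, i):
        """Reconstruct dense row i on demand."""
        dense = [0] * self.n_cols
        for k in range(self.row_ptr[i], self.row_ptr[i + 1]):
            dense[self.cols[k]] = self.values[k]
        return dense

# A 3x4 dataset with mostly zeros stores only 3 values instead of 12.
t = SparseTable.from_dense([[0, 5, 0, 0], [0, 0, 0, 0], [7, 0, 0, 2]])
```

Because existing algorithms only need row access, a layout like this can be swapped in underneath them without changing their schemata, which is the property the abstract emphasizes.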
Notes
1. Information about the API is available at http://www.uco.es/grupos/kdis/kdiswiki/SpeedingUpML.
2. Due to space limitations, a table with the results for each dataset is available at http://www.uco.es/grupos/kdis/kdiswiki/SpeedingUpML.
Acknowledgments
This research was supported by the Spanish Ministry of Economy and Competitiveness, project TIN-2014-55252-P, and by FEDER funds.
Copyright information
© 2016 Springer International Publishing Switzerland
Cite this paper
Padillo, F., Luna, J.M., Cano, A., Ventura, S. (2016). A Data Structure to Speed-Up Machine Learning Algorithms on Massive Datasets. In: Martínez-Álvarez, F., Troncoso, A., Quintián, H., Corchado, E. (eds) Hybrid Artificial Intelligent Systems. HAIS 2016. Lecture Notes in Computer Science(), vol 9648. Springer, Cham. https://doi.org/10.1007/978-3-319-32034-2_31
DOI: https://doi.org/10.1007/978-3-319-32034-2_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-32033-5
Online ISBN: 978-3-319-32034-2