Learned Data Structures

Ferragina, Paolo; Vinciguerra, Giorgio

doi:10.1007/978-3-030-43883-8_2

Paolo Ferragina⁶ &
Giorgio Vinciguerra⁶

Part of the book series: Studies in Computational Intelligence ((SCI,volume 896))

1444 Accesses
16 Citations

Abstract

Very recently, the unexpected combination of data structures and machine learning has led to the development of a new area of research, called learned data structures. Their distinguishing trait is the ability to reveal and exploit patterns and trends in the input data for achieving more efficiency in time and space, compared to previously known data structures. The goal of this chapter is to provide the first comprehensive survey of these results and to stimulate further research in this promising area.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 119.00; Price excludes VAT (USA)

Softcover Book: USD 159.99; Price excludes VAT (USA)

Hardcover Book: USD 159.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
We assume that universe \(\mathcal {U}\) is a range of reals because of the arithmetic operations required by the models. This works for any kind of keys that can be mapped to reals by preserving their order, such as integers or strings.
2.
Private correspondence with the authors clarified that only linear models were used in the hierarchy due to their superior performance.
3.
It has to be noted that a B⁺-tree supports also insertions and deletions, which require a more complex structure that is both slower and more space-inefficient than other static tree-based indexes.
4.
The authors suggest organising the buffer “like a hash table”. Even though hashing simplifies buffer modifications, we lose the ability to do predecessor or range queries efficiently.
5.
The authors do not discuss the details of this step, but we assume that it is performed in \(O(\log \tilde{m})\) time via a binary search on the \(\tilde{m}\) sorted key-model pairs.
6.
We remark that MurmurHash is a known function which can be implemented in a few dozen lines of C code.
7.
It has to be noted that for 5K elements a Bloom filter of \({\approx }6\) KB and \(k=7\) hash functions would have sufficed to achieve 1% false positives. Similarly, a Bloom filter of \({\approx }9\) KB and \(k=10\) hash functions would have sufficed to achieve 0.1% false positives.

References

Ao, N., Zhang, F., Wu, D., Stones, D.S., Wang, G., Liu, X., Liu, J., Lin, S.: Efficient parallel lists intersection and index compression algorithms using graphics processing units. Proc. VLDB Endow. 4(8), 470–481 (2011)
Article Google Scholar
Bender, M.A., Farach-Colton, M., Mosteiro, M.A.: Insertion sort is \(o(n \log n)\). Theory Comput. Syst. 39(3), 391–397 (2006)
Article MathSciNet Google Scholar
Bender, M.A., Hu, H.: An adaptive packed-memory array. ACM Trans. Database Syst. 32(4) (2007)
Google Scholar
Bentley, J.L., Yao, A.C.C.: An almost optimal algorithm for unbounded searching. Inf. Process. Lett. 5(3), 82–87 (1976)
Article MathSciNet Google Scholar
Broder, A., Mitzenmacher, M.: Network applications of Bloom filters: a survey. Internet Math. 1(4), 485–509 (2004)
Article MathSciNet Google Scholar
Coraddu, A., Oneto, L., Baldi, F., Anguita, D.: Vessels fuel consumption forecast and trim optimisation: a data analytics perspective. Ocean. Eng. 130, 351–370 (2017)
Article Google Scholar
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 3rd edn. MIT Press, Cambridge (2009)
MATH Google Scholar
Cormode, G., Garofalakis, M., Haas, P.J., Jermaine, C.: Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches, vol. 4, No. 1–3, pp. 1–294. Foundations and Trends in Databases (2011)
Google Scholar
Cormode, G., Muthukrishnan, S.: Summarizing and mining skewed data streams. In: Proceedings of the International Conference on Data Mining, SDM, pp. 44–55. SIAM (2005)
Google Scholar
Dai, Z., Shrivastava, A.: Adaptive learned Bloom filter (Ada-BF): Efficient utilization of the classifier. arXiv:1910.09131 (2019)
Ding, J., Minhas, U.F., Zhang, H., Li, Y., Wang, C., Chandramouli, B., Gehrke, J., Kossmann, D., Lomet, D.: ALEX: An updatable adaptive learned index. arXiv:1905.08898 (2019)
Ferragina, P., Vinciguerra, G.: The PGM-index: a fully dynamic compressed learned index with provable worst-case bounds. PVLDB J. 13(8) (2020). ISSN 2150-8097
Google Scholar
Gaede, V., Günther, O.: Multidimensional access methods. ACM Comput. Surv. 30(2), 170–231 (1998)
Article Google Scholar
Galakatos, A., Markovitch, M., Binnig, C., Fonseca, R., Kraska, T.: FITing-Tree: a data-aware index structure. In: Proceedings of the International Conference on Management of Data, SIGMOD, pp. 1189–1206. ACM, New York, NY, USA (2019)
Google Scholar
Hadian, A., Heinis, T.: Considerations for handling updates in learned index structures. In: Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM, pp. 3:1–3:4. ACM, New York, NY, USA (2019)
Google Scholar
Hashemi, M., Swersky, K., Smith, J.A., Ayers, G., Litz, H., Chang, J., Kozyrakis, C., Ranganathan, P.: Learning memory access patterns. In: ICML, Proceedings of Machine Learning Research, vol. 80, pp. 1924–1933. PMLR (2018)
Google Scholar
Hastie, T., Tibshirani, R., Friedman, J.H.: The elements of statistical learning: data mining, inference, and prediction, 2nd edn., Springer Series in Statistics, Springer (2009)
Google Scholar
Hsu, C.Y., Indyk, P., Katabi, D., Vakilian, A.: Learning-based frequency estimation algorithms. In: International Conference on Learning Representations, ICLR (2019)
Google Scholar
Idreos, S., Zoumpatianos, K., Chatterjee, S., Qin, W., Wasay, A., Hentschel, B., Kester, M., Dayan, N., Guo, D., Kang, M., Sun, Y.: Learning data structure alchemy. Bull. IEEE Comput. Soc. Tech. Comm. Data Eng. 42(2), 46–57 (2019)
Google Scholar
Idreos, S., Zoumpatianos, K., Hentschel, B., Kester, M.S., Guo, D.: The data calculator: Data structure design and cost synthesis from first principles and learned cost models. In: Proceedings of the International Conference on Management of Data, SIGMOD, pp. 535–550. ACM, New York, NY, USA (2018)
Google Scholar
Kipf, A., Kipf, T., Radke, B., Leis, V., Boncz, P.A., Kemper, A.: Learned cardinalities: estimating correlated joins with deep learning. In: CIDR. www.cidrdb.org (2019)
Kraska, T., Alizadeh, M., Beutel, A., Chi, E.H., Kristo, A., Leclerc, G., Madden, S., Mao, H., Nathan, V.: SageDB: a learned database system. In: CIDR. www.cidrdb.org (2019)
Kraska, T., Beutel, A., Chi, E.H., Dean, J., Polyzotis, N.: The case for learned index structures. In: Proceedings of the International Conference on Management of Data, SIGMOD, pp. 489–504. ACM, New York, NY, USA (2018)
Google Scholar
Li, P., Hua, Y., Zuo, P., Jia, J.: A scalable learned index scheme in storage systems. arXiv:1905.06256 (2019)
Li, X., Li, J., Wang, X.: ASLM: Adaptive single layer model for learned index. In: Li, G., Yang, J., Gama, J., Natwichai, J., Tong, Y. (eds.) Database Systems for Advanced Applications, pp. 80–95. Springer International Publishing, Cham (2019)
Google Scholar
Macke, S., Beutel, A., Kraska, T., Sathiamoorthy, M., Cheng, D.Z., Chi, E.H.: Lifting the curse of multidimensional data with learned existence indexes. In: Workshop on ML for Systems at NeurIPS (2018)
Google Scholar
Mehta, D.P., Sahni, S. (eds.): Handbook of Data Structures and Applications, 2nd edn. Chapman and Hall/CRC, Boca Raton (2018)
MATH Google Scholar
Mitzenmacher, M.: A model for learned Bloom filters and optimizing by sandwiching. In: 32nd Conference on Neural Information Processing Systems, NeurIPS (2018)
Google Scholar
Navarro, G.: Compact data structures: a practical approach. Cambridge University Press, New York (2016)
Book Google Scholar
Oneto, L., Ridella, S., Anguita, D.: Tikhonov, ivanov and morozov regularization for support vector machine learning. Mach. Learn. 103(1), 103–136 (2015)
Article MathSciNet Google Scholar
O’Rourke, J.: An on-line algorithm for fitting straight lines between data ranges. Commun. ACM 24(9), 574–578 (1981)
Article Google Scholar
Petrov, A.: Algorithms behind modern storage systems. Commun. ACM 61(8), 38–44 (2018)
Article Google Scholar
Qu, W., Wang, X., Li, J., Li, X.: Hybrid indexes by exploring traditional B-tree and linear regression. In: Proceedings of the 16th International Conference on Web Information Systems and Applications, WISA, pp. 601–613. Springer International Publishing, Cham (2019)
Google Scholar
Rae, J., Bartunov, S., Lillicrap, T.: Meta-learning neural Bloom filters. In: Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 97, pp. 5271–5280. PMLR, Long Beach, California, USA (2019)
Google Scholar
Vahdat, M., Oneto, L., Anguita, D., Funk, M., Rauterberg, M.: A learning analytics approach to correlate the academic achievements of students with interaction data from an educational simulator. In: European Conference on Technology Enhanced Learning (2015)
Google Scholar
Vinciguerra, G., Ferragina, P., Miccinesi, M.: Superseding traditional indexes by orchestrating learning and geometry. arXiv:1903.00507 (2019)
Vitter, J.S.: External memory algorithms and data structures: dealing with massive data. ACM Comput. Surv. 33(2), 209–271 (2001)
Article Google Scholar
Wang, H., Fu, X., Xu, J., Lu, H.: Learned index for spatial queries. In: Proceedings of the 20th International Conference on Mobile Data Management, MDM, pp. 569–574. IEEE (2019)
Google Scholar
Wu, Y., Yu, J., Tian, Y., Sidle, R., Barber, R.: Designing succinct secondary indexing mechanism by exploiting column correlations. In: Proceedings of the International Conference on Management of Data, SIGMOD, pp. 1223–1240. ACM, New York, NY, USA (2019)
Google Scholar
Xiang, W., Zhang, H., Cui, R., Chu, X., Li, K., Zhou, W.: Pavo: A RNN-based learned inverted index, supervised or unsupervised? IEEE Access 7, 293–303 (2019)
Article Google Scholar
Zhang, M., Wang, H., Li, J., Gao, H.: Learned sketches for frequency estimation. Inf. Sci. 507, 365–385 (2020)
Article MathSciNet Google Scholar
Zhang, Y., Huang, Y.: “Learned” operating systems. SIGOPS Oper. Syst. Rev. 53(1), 40–45 (2019)
Article Google Scholar

Download references

Acknowledgements

Part of this work has been supported by the Italian MIUR PRIN project “Multicriteria Data Structures and Algorithms: from compressed to learned indexes, and beyond” (Prot. 2017WR7SHH), by Regione Toscana (under POR FSE 2014/2020), and by the European Integrated Infrastructure for Social Mining and Big Data Analytics (SoBigData++, Grant Agreement # 871042).

Author information

Authors and Affiliations

Università di Pisa, Pisa, Italy
Paolo Ferragina & Giorgio Vinciguerra

Authors

Paolo Ferragina
View author publications
You can also search for this author in PubMed Google Scholar
Giorgio Vinciguerra
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Giorgio Vinciguerra .

Editor information

Editors and Affiliations

DIBRIS, University of Genoa, Genova, Italy
Luca Oneto
Department of Mathematics “Tullio Levi-Civita”, University of Padua, Padova, Italy
Nicolò Navarin
Department of Mathematics “Tullio Levi-Civita”, University of Padua, Padova, Italy
Alessandro Sperduti
DIBRIS, Università degli Studi di Genova, Genova, Italy
Davide Anguita

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Ferragina, P., Vinciguerra, G. (2020). Learned Data Structures. In: Oneto, L., Navarin, N., Sperduti, A., Anguita, D. (eds) Recent Trends in Learning From Data. Studies in Computational Intelligence, vol 896. Springer, Cham. https://doi.org/10.1007/978-3-030-43883-8_2

Download citation

DOI: https://doi.org/10.1007/978-3-030-43883-8_2
Published: 04 April 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-43882-1
Online ISBN: 978-3-030-43883-8
eBook Packages: Intelligent Technologies and RoboticsIntelligent Technologies and Robotics (R0)

Publish with us

Policies and ethics