Storage and Recreation Trade-Off for Multi-version Data Management

  • Yin Zhang
  • Huiping Liu
  • Cheqing Jin
  • Ye Guo
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10988)


With the rapid development of data acquisition technology, massive amounts of observation data have accumulated in scientific disciplines. Because successive observations differ only slightly, multi-version data management technology is critical for compressing the data so as to reduce both storage and recreation costs. However, existing work in this field optimizes only the total storage cost and the total recreation cost, and ignores the recreation cost of individual versions. In this paper, we therefore investigate the trade-off among three metrics: total storage cost, total recreation cost, and the maximum recreation cost of any individual version. We formulate two problems: (1) find a storage plan that lowers the total recreation cost and the individual recreation cost when the total storage is limited; (2) find a storage plan that minimizes the total storage subject to bounds on the total recreation cost and the individual recreation cost. To solve these problems, we model all versions as a directed graph and devise two efficient algorithms based on spanning trees. A series of experiments indicates that our proposals are effective and efficient in dealing with these problems.
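To make the three metrics concrete, the following is a minimal sketch, not the algorithms proposed in the paper. It assumes a storage plan is a tree rooted at a hypothetical dummy node `ROOT`, where an edge from `ROOT` to a version means that version is stored in full and any other edge means the child is stored as a delta against its parent. The function name `plan_costs`, the edge encoding, and all cost values are illustrative assumptions.

```python
from typing import Dict, Tuple

ROOT = 0  # hypothetical dummy node: edge ROOT -> v means "store v in full"

Edge = Tuple[int, int]  # (parent, child)


def plan_costs(parent: Dict[int, int],
               costs: Dict[Edge, Tuple[float, float]]) -> Tuple[float, float, float]:
    """Return (total storage, total recreation, max recreation) of a plan.

    parent maps each version to the node it is derived from; it is
    assumed to describe a tree rooted at ROOT (no cycles).
    costs maps an edge to (storage cost, recreation cost).
    """
    recreation = {ROOT: 0.0}

    def recreate(v: int) -> float:
        # Recreating v means recreating its parent, then applying the edge.
        if v not in recreation:
            p = parent[v]
            recreation[v] = recreate(p) + costs[(p, v)][1]
        return recreation[v]

    total_storage = sum(costs[(p, v)][0] for v, p in parent.items())
    per_version = [recreate(v) for v in parent]
    return total_storage, sum(per_version), max(per_version)


# Made-up costs for three versions; deltas are cheap, full copies are not.
costs = {
    (ROOT, 1): (100.0, 100.0),
    (ROOT, 2): (105.0, 105.0),
    (ROOT, 3): (110.0, 110.0),
    (1, 2): (10.0, 10.0),
    (2, 3): (12.0, 12.0),
}

materialize_all = {1: ROOT, 2: ROOT, 3: ROOT}  # every version stored in full
delta_chain = {1: ROOT, 2: 1, 3: 2}            # only version 1 stored in full

print(plan_costs(materialize_all, costs))
print(plan_costs(delta_chain, costs))
```

With these illustrative numbers, storing every version in full costs 315 units of storage with a maximum recreation cost of 110, while the delta chain cuts storage to 122 but raises both the total recreation cost (332 versus 315) and the maximum recreation cost (122). This is the tension that the two formulated problems constrain from opposite sides.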


Multi-version data management · Storage and recreation trade-off · Scientific data management



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. School of Data Science and Engineering, East China Normal University, Shanghai, China
  2. Tongji University, Shanghai, China
