Abstract
A pervasive problem in Data Science is that the knowledge generated by possibly expensive analytics processes decays over time, as the data and algorithms used to compute it change and as the external knowledge embodied by reference datasets evolves. Deciding when such knowledge outcomes should be refreshed, following a sequence of data change events, requires problem-specific functions to quantify their value and its decay over time, as well as models for estimating the cost of their re-computation. The challenge is to develop a decision support system, both generic and customisable, for informing re-computation decisions over time. With the help of a case study from genomics, in this paper we offer an initial formalisation of this problem, highlight research challenges, and outline a possible approach based on the analysis of metadata from a history of past computations.
Notes
Analysis of the specific “backwards” cases will appear in a separate contribution.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Missier, P., Cała, J., Rathi, M. (2017). Preserving the Value of Large Scale Data Analytics over Time Through Selective Re-computation. In: Calì, A., Wood, P., Martin, N., Poulovassilis, A. (eds) Data Analytics. BICOD 2017. Lecture Notes in Computer Science, vol 10365. Springer, Cham. https://doi.org/10.1007/978-3-319-60795-5_6
DOI: https://doi.org/10.1007/978-3-319-60795-5_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-60794-8
Online ISBN: 978-3-319-60795-5
eBook Packages: Computer Science, Computer Science (R0)