Preserving the Value of Large Scale Data Analytics over Time Through Selective Re-computation

Missier, Paolo; Cała, Jacek; Rathi, Manisha

doi:10.1007/978-3-319-60795-5_6

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10365)

Included in the following conference series:

British International Conference on Databases

Abstract

A pervasive problem in Data Science is that the knowledge generated by possibly expensive analytics processes is subject to decay over time, as the data and algorithms used to compute it change and the external knowledge embodied by reference datasets evolves. Deciding when such knowledge outcomes should be refreshed, following a sequence of data change events, requires problem-specific functions to quantify their value and its decay over time, as well as models for estimating the cost of their re-computation. The challenge is to develop a decision support system for informing re-computation decisions over time that is both generic and customisable. In this paper, with the help of a case study from genomics, we offer an initial formalisation of this problem, highlight research challenges, and outline a possible approach based on the analysis of metadata from a history of past computations.
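
As a rough illustration of the decision problem the abstract describes, and not the formalisation given in the paper itself, the sketch below selects which past outcomes to re-compute by comparing an estimated loss of value against an estimated re-computation cost. The `Outcome` record, its fields, the fixed `budget`, and the greedy selection rule are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """A past analytics result that may be stale (hypothetical record)."""
    name: str
    value_loss: float      # assumed estimate of value lost if NOT refreshed
    recompute_cost: float  # assumed estimate of the cost of re-running


def select_for_recomputation(outcomes, budget):
    """Greedy sketch: keep only outcomes whose estimated loss exceeds their
    estimated cost, then pick them in order of increasing cost-to-loss ratio
    while the (assumed) budget allows."""
    worthwhile = [o for o in outcomes if o.value_loss > o.recompute_cost]
    # value_loss > recompute_cost >= 0 here, so the division is safe.
    worthwhile.sort(key=lambda o: o.recompute_cost / o.value_loss)
    selected, spent = [], 0.0
    for o in worthwhile:
        if spent + o.recompute_cost <= budget:
            selected.append(o.name)
            spent += o.recompute_cost
    return selected
```

In practice the two estimates would come from the problem-specific value functions and cost models the abstract mentions; here they are simply supplied as numbers.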

Notes

1. https://www.ncbi.nlm.nih.gov/clinvar
2. https://www.omim.org
3. Analysis of the specific “backwards” cases will appear in a separate contribution.

Author information

Authors and Affiliations

School of Computing Science, Newcastle University, Newcastle upon Tyne, UK
Paolo Missier, Jacek Cała & Manisha Rathi

Corresponding author

Correspondence to Paolo Missier.

Editor information

Editors and Affiliations

Birkbeck, University of London, London, United Kingdom
Andrea Calì, Peter Wood, Nigel Martin & Alexandra Poulovassilis

About this paper

Cite this paper

Missier, P., Cała, J., Rathi, M. (2017). Preserving the Value of Large Scale Data Analytics over Time Through Selective Re-computation. In: Calì, A., Wood, P., Martin, N., Poulovassilis, A. (eds) Data Analytics. BICOD 2017. Lecture Notes in Computer Science, vol 10365. Springer, Cham. https://doi.org/10.1007/978-3-319-60795-5_6

DOI: https://doi.org/10.1007/978-3-319-60795-5_6
Published: 14 June 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-60794-8
Online ISBN: 978-3-319-60795-5
eBook Packages: Computer Science, Computer Science (R0)
