Incremental Inference of Provenance Types

  • Conference paper
Provenance and Annotation of Data and Processes (IPAW 2020, IPAW 2021)

Abstract

Long-running applications are increasingly instrumented to log provenance continuously. This creates a need to process fragments of provenance incrementally, while the application is still running, rather than batch-processing a complete provenance dataset that becomes available only after the application has completed. A type of processing of particular interest is the summarisation of provenance graphs, which has been proposed as an effective way of extracting key features of provenance and storing them efficiently. To that end, summarisation makes use of provenance types, which, in loose terms, are an encoding of the neighbourhood of nodes.

This paper shows that creating provenance summaries of continuously provided data can benefit from incremental processing of provenance types. We also introduce the concept of a library of types, which avoids storing the same string representation of a type multiple times. Further, we show that the computational complexity of inferring types is, in the most common cases, the best possible: only new nodes have to be processed. We also identify and analyse the exceptional scenarios. Finally, although our library of types can, in theory, be exponentially large, we present empirical results showing that it is quite compact in practice.
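The ideas in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a simplified notion of type, in which the depth-k type of a node is its label together with the set of (edge label, depth-(k-1) type) pairs of the nodes its outgoing edges point to. All names (`TypeLibrary`, `IncrementalTyper`, `add_node`) are hypothetical. New nodes are assumed to arrive with edges pointing only to nodes already seen, so that existing nodes never need to be retyped; a new edge leaving an existing node would be one of the exception scenarios and is not handled here.

```python
from typing import Dict, FrozenSet, Hashable, List, Tuple

# A type is (node label, set of (edge label, id of the target's shallower type)).
Type = Tuple[str, FrozenSet[Tuple[str, int]]]


class TypeLibrary:
    """Stores each distinct type once and hands out a small integer id for it."""

    def __init__(self) -> None:
        self._ids: Dict[Type, int] = {}

    def intern(self, t: Type) -> int:
        # setdefault returns the existing id, or assigns the next free one.
        return self._ids.setdefault(t, len(self._ids))

    def __len__(self) -> int:
        return len(self._ids)


class IncrementalTyper:
    """Assigns types up to a fixed depth as nodes arrive (simplified assumption)."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.library = TypeLibrary()
        # node -> type ids for depths 0..depth
        self.types: Dict[Hashable, List[int]] = {}

    def add_node(self, node: Hashable, label: str,
                 edges: List[Tuple[str, Hashable]]) -> int:
        """Type a new node; `edges` are (edge label, target) to known nodes."""
        for _, target in edges:
            if target not in self.types:
                raise KeyError(f"unknown target node: {target!r}")
        # Depth 0 encodes the label only.
        ids = [self.library.intern((label, frozenset()))]
        for k in range(1, self.depth + 1):
            # Depth k reuses the targets' stored depth k-1 ids, so no
            # existing node is revisited.
            neighbourhood = frozenset(
                (edge_label, self.types[target][k - 1])
                for edge_label, target in edges
            )
            ids.append(self.library.intern((label, neighbourhood)))
        self.types[node] = ids
        return ids[self.depth]


typer = IncrementalTyper(depth=2)
typer.add_node("a1", "act", [])
t1 = typer.add_node("e1", "ent", [("wasGeneratedBy", "a1")])
typer.add_node("a2", "act", [])
t2 = typer.add_node("e2", "ent", [("wasGeneratedBy", "a2")])
print(t1 == t2, len(typer.library))
```

In this sketch, two entities generated by structurally identical activities receive the same type id, and the library holds only the distinct types rather than one representation per node.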

This work is supported by a Department of Navy award (Award No. N62909-18-1-2079) issued by the Office of Naval Research. The United States Government has a royalty-free license throughout the world in all copyrightable material contained herein.

Notes

  1. Note that when application types are included, a provenance expression may have more than one label, e.g. \(lab(v) = \{\text {ag}, \text {Prov:Operator}\}\). When \(lab(v)\) is a singleton set, we abuse notation and omit the set brackets.

  2. Note that removing a node automatically removes all edges connected to it.

  3. For readability, we index elements of map \(\mathcal {T}_{k}\) with k.

  4. Recall that nodes may have more than one label.

Author information

Corresponding author

Correspondence to David Kohan Marzagão.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Kohan Marzagão, D., Huynh, T.D., Moreau, L. (2021). Incremental Inference of Provenance Types. In: Glavic, B., Braganholo, V., Koop, D. (eds) Provenance and Annotation of Data and Processes. IPAW 2020, IPAW 2021. Lecture Notes in Computer Science, vol 12839. Springer, Cham. https://doi.org/10.1007/978-3-030-80960-7_9

  • DOI: https://doi.org/10.1007/978-3-030-80960-7_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-80959-1

  • Online ISBN: 978-3-030-80960-7

  • eBook Packages: Computer Science, Computer Science (R0)
