Anomaly Detection in MapReduce Using Transformation Provenance

  • Anu Mary Chacko
  • Jayendra Sreekar Medicherla
  • S. D. Madhu Kumar
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 645)


Data provenance is metadata that captures where data originated and how it was manipulated and updated over time. It is significant for big data applications because it provides mechanisms for verifying results. This paper discusses an approach to detecting anomalies in a Hadoop cluster or MapReduce job by reviewing the transformation provenance captured by mining the MapReduce logs. A rule-based framework identifies the patterns used to extract provenance information. The derived provenance information is converted into a provenance profile, which is used to detect anomalies in cluster and job execution.
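The pipeline described above (rules that match log lines, a provenance profile built from the matches, and a check of that profile for anomalies) can be illustrated with a minimal sketch. This is not the authors' framework: the rule names, the regular expressions, the log line shapes, and the count-based baseline comparison are all assumptions chosen for illustration.

```python
import re

# Hypothetical rules: event name -> pattern applied to each log line.
# The patterns are illustrative only and are not taken from the paper.
RULES = {
    "map_task_done": re.compile(r"Task 'attempt_\w+_m_\d+_\d+' done"),
    "reduce_task_done": re.compile(r"Task 'attempt_\w+_r_\d+_\d+' done"),
    "task_failed": re.compile(r"Task attempt_\w+ failed"),
}


def build_profile(log_lines):
    """Count how often each rule matches; the counts form the profile."""
    profile = {name: 0 for name in RULES}
    for line in log_lines:
        for name, pattern in RULES.items():
            if pattern.search(line):
                profile[name] += 1
    return profile


def find_anomalies(profile, baseline, tolerance=0):
    """Return {event: (expected, observed)} for counts that deviate
    from the baseline by more than `tolerance`."""
    return {
        name: (baseline.get(name, 0), count)
        for name, count in profile.items()
        if abs(count - baseline.get(name, 0)) > tolerance
    }


# Hypothetical log excerpt and an assumed baseline from a known-good run.
sample_log = [
    "INFO Task 'attempt_job1_m_000000_0' done.",
    "INFO Task 'attempt_job1_m_000001_0' done.",
    "WARN Task attempt_job1_m_000002_0 failed",
    "INFO Task 'attempt_job1_r_000000_0' done.",
]
baseline = {"map_task_done": 3, "reduce_task_done": 1, "task_failed": 0}

anomalies = find_anomalies(build_profile(sample_log), baseline)
```

In this sketch an anomaly is simply an event count that differs from a known-good run. A profile in practice could also carry timings, node assignments, or input/output sizes, which the same rule-to-profile-to-comparison structure would accommodate.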


Provenance, Transformation provenance, Big data, Hadoop security, Anomaly detection



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  • Anu Mary Chacko (1)
  • Jayendra Sreekar Medicherla (1)
  • S. D. Madhu Kumar (1)
  1. National Institute of Technology, Calicut, India
