Online Anomaly Detection over Big Data Streams

  • Laura Rettig
  • Mourad Khayati
  • Philippe Cudré-Mauroux
  • Michał Piorkówski


In many domains, high-quality data are the foundation for decision-making, and anomaly detection is an essential component of assessing data quality. We describe the design and implementation of a framework for data quality testing over real-world streams in a large-scale telecommunication network. The approach is general, because it relies on general-purpose measures borrowed from information theory and statistics, and scalable, because its anomaly detection pipelines run in a distributed setting over state-of-the-art big data streaming and batch processing infrastructures. We empirically evaluate our system and discuss its merits and limitations by comparing it to existing anomaly detection techniques, showing its high accuracy and efficiency, as well as its scalability when parallelizing operations across a large number of nodes.
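
To illustrate the kind of general-purpose information-theoretic measure the abstract refers to, the sketch below computes relative entropy (Kullback-Leibler divergence) between the value distributions of two stream windows and flags the current window when the divergence exceeds a threshold. The choice of this particular measure, the function names, the smoothing constant, and the threshold are all assumptions made for illustration; this is not the pipeline described in the chapter.

```python
import math
from collections import Counter


def relative_entropy(reference, current, smoothing=1e-9):
    """Return D(current || reference) in bits over discrete values.

    `smoothing` is a hypothetical additive constant that keeps the
    ratio finite when a value occurs in only one of the two windows.
    """
    ref_counts = Counter(reference)
    cur_counts = Counter(current)
    support = set(ref_counts) | set(cur_counts)

    # Normalizers for the smoothed empirical distributions.
    ref_total = len(reference) + smoothing * len(support)
    cur_total = len(current) + smoothing * len(support)

    divergence = 0.0
    for value in support:
        p = (cur_counts[value] + smoothing) / cur_total
        q = (ref_counts[value] + smoothing) / ref_total
        divergence += p * math.log2(p / q)
    return divergence


def is_anomalous(reference, current, threshold=0.5):
    """Flag `current` if it diverges from `reference` by more than
    `threshold` bits (an illustrative value, not a tuned one)."""
    return relative_entropy(reference, current) > threshold
```

In a streaming setting, `reference` would typically be a window of past events and `current` the most recent batch; both are plain Python sequences here to keep the sketch self-contained.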





Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Laura Rettig (1)
  • Mourad Khayati (1)
  • Philippe Cudré-Mauroux (1), corresponding author
  • Michał Piorkówski (2)

  1. University of Fribourg, Fribourg, Switzerland
  2. Philip Morris International, Lausanne, Switzerland
