Cross-Correlation as Tool to Determine the Similarity of Series of Measurements for Big-Data Analysis Tasks

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 9106)

Abstract

One aspect of the so-called Big Data challenge is the rising quantity of data in almost all scientific, social, governmental and commercial disciplines. As a result, many analysis techniques are under development that replace manual processes with automatic or semi-automatic algorithms. This means that the knowledge of data analysts has to be transferred to algorithms that can be executed simultaneously on many data sets. In this way, the rising amount of data can be analysed with constant quality and in less time. Even though the number of existing algorithms is enormous, a ready-to-use solution does not exist for every problem. In particular, we found a lack of options for analysing and comparing series of measurements, e.g. for analysing data from activity trackers or for monitoring service execution infrastructures. We therefore explain the basics of an algorithm that uses the cross-correlation function to determine a meaningful similarity value for two or more series of measurements. We used the new method to analyse and categorise job-centric monitoring data.
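As a rough illustration of the idea named in the abstract, the following minimal sketch computes a normalised cross-correlation coefficient for two equally spaced series over a range of lags and reports the lag with the largest absolute coefficient. It is not the authors' algorithm, which the abstract does not spell out. The function names `normalized_cross_correlation` and `similarity`, the `max_lag` parameter, and the choice of a Pearson-style normalisation over the overlapping samples are all assumptions made for this example.

```python
import math


def normalized_cross_correlation(x, y, lag):
    """Correlate x[i] with y[i + lag] over the overlapping samples.

    Hypothetical helper: returns a Pearson-style coefficient in [-1, 1].
    """
    if abs(lag) >= min(len(x), len(y)):
        return 0.0  # no usable overlap at this lag
    if lag >= 0:
        a, b = x[:len(x) - lag], y[lag:]
    else:
        a, b = x[-lag:], y[:len(y) + lag]
    n = min(len(a), len(b))
    if n < 2:
        return 0.0
    a, b = a[:n], b[:n]
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    if denom == 0.0:
        return 0.0  # a constant segment carries no shape information
    return sum(p * q for p, q in zip(da, db)) / denom


def similarity(x, y, max_lag):
    """Return (coefficient, lag) for the lag with the largest absolute coefficient."""
    best = (0.0, 0)
    for lag in range(-max_lag, max_lag + 1):
        c = normalized_cross_correlation(x, y, lag)
        if abs(c) > abs(best[0]):
            best = (c, lag)
    return best


# Example: y is x delayed by one sample, so the best lag should be 1.
x = [0, 0, 1, 3, 1, 0, 0, 0, 2, 0, 0, 0]
y = [0] + x[:-1]
coefficient, lag = similarity(x, y, max_lag=3)
```

In this sketch a coefficient close to 1 indicates series of similar shape, and a coefficient close to -1 indicates similar shape with the sign reversed. Keeping the sign, rather than returning only the absolute value, lets the caller distinguish the two cases.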

Notes

  1. http://procps.sourceforge.net/index.html
  2. http://library.gnome.org/users/gnome-system-monitor/stable/index.html
  3. http://ganglia.info/
  4. http://sourceware.org/binutils/docs/gprof/
  5. Non-discrete sequences are series of measurements which are not guaranteed to be taken at constant time intervals.
  6. This lowering has to respect the sign. A cross-correlation coefficient less than zero represents similar functions with reversed signs.
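
The note on non-discrete sequences suggests that measurements taken at irregular times need some preparation before a sample-by-sample comparison. One common option, shown here only as a sketch and not necessarily the approach taken in the paper, is to resample such a series onto a grid with constant spacing by linear interpolation. The function name `resample_uniform` and its parameters are hypothetical.

```python
from bisect import bisect_right


def resample_uniform(times, values, step):
    """Linearly interpolate (times, values) onto a grid with constant spacing `step`.

    Assumes `times` is strictly increasing and has at least two entries.
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need two or more (time, value) pairs of equal length")
    if step <= 0:
        raise ValueError("step must be positive")
    grid, out = [], []
    k = 0
    t = times[0]
    while t <= times[-1]:
        # Index of the sample interval [times[i], times[i + 1]] that contains t.
        i = min(bisect_right(times, t) - 1, len(times) - 2)
        frac = (t - times[i]) / (times[i + 1] - times[i])
        grid.append(t)
        out.append(values[i] + frac * (values[i + 1] - values[i]))
        k += 1
        t = times[0] + k * step  # multiply instead of accumulating to limit drift
    return grid, out


# Example: three irregularly spaced measurements resampled at a spacing of 1.0.
grid, resampled = resample_uniform([0.0, 1.0, 4.0], [0.0, 2.0, 8.0], step=1.0)
```

Linear interpolation is the simplest choice. Whether it is appropriate depends on how quickly the measured quantity changes between samples.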

Author information

Corresponding author

Correspondence to Marcus Hilbrich.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Hilbrich, M., Müller-Pfefferkorn, R. (2015). Cross-Correlation as Tool to Determine the Similarity of Series of Measurements for Big-Data Analysis Tasks. In: Qiang, W., Zheng, X., Hsu, CH. (eds) Cloud Computing and Big Data. CloudCom-Asia 2015. Lecture Notes in Computer Science, vol 9106. Springer, Cham. https://doi.org/10.1007/978-3-319-28430-9_20

  • DOI: https://doi.org/10.1007/978-3-319-28430-9_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-28429-3

  • Online ISBN: 978-3-319-28430-9

  • eBook Packages: Computer Science, Computer Science (R0)
