Cross-Correlation as Tool to Determine the Similarity of Series of Measurements for Big-Data Analysis Tasks

Hilbrich, Marcus; Müller-Pfefferkorn, Ralph

doi:10.1007/978-3-319-28430-9_20

Conference paper
First Online: 10 January 2016

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 9106)

Abstract

One aspect of the so-called Big Data challenge is the rising quantity of data in almost all scientific, social, governmental and commercial disciplines. As a result, there are many ongoing developments of analysis techniques that replace manual processes with automatic or semi-automatic algorithms. This means the knowledge of data analysts has to be transferred to algorithms which can be executed simultaneously on many data sets. In this way, the rising amount of data can be analysed at a constant quality and in a shorter time. Even though the number of existing algorithms is enormous, a ready-to-use solution does not exist for every problem. Especially for analysing and comparing series of measurements, e.g. for analysing data from activity trackers or for monitoring service execution infrastructures, we discovered a lack of options. Thus we explain the basics of an algorithm that uses the cross-correlation function to determine a meaningful similarity value for two or more series of measurements. We used the new method to analyse and categorise job-centric monitoring data.

Notes

1. http://procps.sourceforge.net/index.html
2. http://library.gnome.org/users/gnome-system-monitor/stable/index.html
3. http://ganglia.info/
4. http://sourceware.org/binutils/docs/gprof/
5. Non-discrete sequences are series of measurements which are not guaranteed to be taken at constant time intervals.
6. This lowering has to respect the leading sign. A cross-correlation coefficient less than zero represents similar functions with reversed signs.
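
Notes 5 and 6 can be illustrated with a small sketch. It is not the authors' algorithm, only a generic normalised cross-correlation under stated assumptions: irregularly timed measurements are first resampled to a uniform grid by linear interpolation, and the similarity value is the coefficient with the largest magnitude over all lags, with its sign kept. The function names resample_uniform, normalized_cross_correlation and similarity are hypothetical and chosen for this example.

```python
import numpy as np


def resample_uniform(times, values, step):
    """Linearly interpolate irregularly timed samples onto a uniform grid.

    Assumption: linear interpolation is an acceptable way to obtain
    constant time intervals; the paper may handle this differently.
    """
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    grid = np.arange(times[0], times[-1] + step / 2, step)
    return np.interp(grid, times, values)


def normalized_cross_correlation(a, b):
    """Return one coefficient in [-1, 1] for every lag between a and b."""
    a = np.asarray(a, dtype=float) - np.mean(a)
    b = np.asarray(b, dtype=float) - np.mean(b)
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
    if denom == 0.0:
        raise ValueError("a constant series has no defined correlation")
    return np.correlate(a, b, mode="full") / denom


def similarity(a, b):
    """Pick the coefficient with the largest magnitude, keeping its sign."""
    cc = normalized_cross_correlation(a, b)
    return cc[np.argmax(np.abs(cc))]


if __name__ == "__main__":
    series = np.sin(np.linspace(0.0, 6.0, 50))
    print(similarity(series, series))    # identical series
    print(similarity(series, -series))   # same shape, reversed sign
```

Comparing a series with itself should give a value close to 1, and comparing it with its negation a value close to -1, which matches the reading of a negative coefficient in note 6.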

Author information

Authors and Affiliations

Software Quality Lab (s-lab), Universität Paderborn, Paderborn, Germany
Marcus Hilbrich
Center for Information Services and High Performance Computing (ZIH), Technische Universität Dresden, Dresden, Germany
Ralph Müller-Pfefferkorn

Corresponding author

Correspondence to Marcus Hilbrich.

Editor information

Editors and Affiliations

School of Computer Science and Tech., Huazhong Univ. of Science and Technology, Wuhan, China
Weizhong Qiang
College of Mathematics and Computer Sci., Fuzhou University, Fuzhou, China
Xianghan Zheng
Dept. of Computer Scie and Informat. Eng, Chung Hua University, Hsinchu, Taiwan
Ching-Hsien Hsu

About this paper

Cite this paper

Hilbrich, M., Müller-Pfefferkorn, R. (2015). Cross-Correlation as Tool to Determine the Similarity of Series of Measurements for Big-Data Analysis Tasks. In: Qiang, W., Zheng, X., Hsu, CH. (eds) Cloud Computing and Big Data. CloudCom-Asia 2015. Lecture Notes in Computer Science, vol 9106. Springer, Cham. https://doi.org/10.1007/978-3-319-28430-9_20

DOI: https://doi.org/10.1007/978-3-319-28430-9_20
Published: 10 January 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28429-3
Online ISBN: 978-3-319-28430-9
eBook Packages: Computer Science, Computer Science (R0)