Abstract
One aspect of the so-called Big Data challenge is the rising quantity of data in almost all scientific, social, governmental and commercial disciplines. As a result, many analysis techniques are being developed to replace manual processes with automatic or semi-automatic algorithms. This means that the knowledge of data analysts has to be transferred to algorithms which can be executed simultaneously on many data sets. In this way, the growing amount of data can be analysed at a constant quality and in less time. Although the number of existing algorithms is enormous, a ready-to-use solution does not exist for every problem. We discovered a lack of options especially for analysing and comparing series of measurements, e.g. for analysing the data of activity trackers or for monitoring service execution infrastructures. We therefore explain the basics of an algorithm that uses the cross-correlation function to determine a meaningful similarity value for two or more series of measurements. We used the new method to analyse and categorise job-centric monitoring data.
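The abstract does not spell out the algorithm itself, so the following is only a minimal sketch of how a cross-correlation based similarity value for two series could be computed. It assumes equidistant samples, mean-centred series and a normalisation by the product of the two series norms; the function names `_centre` and `cross_correlation_similarity` and the `max_lag` parameter are hypothetical and not taken from the paper.

```python
import math


def _centre(series):
    """Subtract the mean so that constant offsets do not affect the result."""
    mean = sum(series) / len(series)
    return [v - mean for v in series]


def cross_correlation_similarity(x, y, max_lag=None):
    """Return (coefficient, lag) of the strongest normalised cross-correlation.

    The coefficient lies in [-1, 1]; the lag is the shift of y relative to x
    (in samples) at which the strongest correlation occurs.
    Assumption: both series are sampled at the same constant interval.
    """
    if not x or not y:
        raise ValueError("both series must be non-empty")
    xc, yc = _centre(x), _centre(y)
    norm = math.sqrt(sum(v * v for v in xc) * sum(v * v for v in yc))
    if norm == 0.0:
        raise ValueError("a constant series has no defined correlation")
    if max_lag is None:
        max_lag = max(len(x), len(y)) - 1

    best_coeff, best_lag = 0.0, 0
    for lag in range(-max_lag, max_lag + 1):
        # Sum products over the indices where x[i] and y[i + lag] both exist.
        start = max(0, -lag)
        stop = min(len(xc), len(yc) - lag)
        total = sum(xc[i] * yc[i + lag] for i in range(start, stop))
        coeff = total / norm
        # Keep the coefficient with the largest magnitude, preserving its sign.
        if abs(coeff) > abs(best_coeff):
            best_coeff, best_lag = coeff, lag
    return best_coeff, best_lag


if __name__ == "__main__":
    pulse = [0, 1, 2, 1, 0, 0, 0, 0]
    delayed = [0, 0, 0, 1, 2, 1, 0, 0]  # the same pulse, two samples later
    print(cross_correlation_similarity(pulse, delayed))
```

In this sketch, a series compared with itself yields a coefficient close to 1 at lag 0, and a series compared with its sign-inverted copy yields a coefficient close to -1. Keeping the sign while ranking by magnitude is one possible design choice; whether the paper's method ranks candidates this way is not stated in the abstract.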
Notes
- 1.
- 2.
- 3.
- 4.
- 5. Non-discrete sequences are series of measurements which are not guaranteed to be taken at constant time intervals.
- 6. This lowering has to respect the leading sign. A cross-correlation coefficient less than zero represents similar functions with opposite signs.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Hilbrich, M., Müller-Pfefferkorn, R. (2015). Cross-Correlation as Tool to Determine the Similarity of Series of Measurements for Big-Data Analysis Tasks. In: Qiang, W., Zheng, X., Hsu, CH. (eds) Cloud Computing and Big Data. CloudCom-Asia 2015. Lecture Notes in Computer Science, vol 9106. Springer, Cham. https://doi.org/10.1007/978-3-319-28430-9_20
DOI: https://doi.org/10.1007/978-3-319-28430-9_20
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28429-3
Online ISBN: 978-3-319-28430-9
eBook Packages: Computer Science, Computer Science (R0)