Abstract
As workflow systems become more widely used, the number of workflows and the volume of provenance they generate have grown considerably. New tools and infrastructure are needed to allow users to interact with, reason about, and re-use this information. In this paper, we explore the use of clustering techniques to organize large collections of workflow and provenance graphs. We propose two different representations for these graphs and present an experimental evaluation, using a collection of 1,700 workflow graphs, in which we study the trade-offs of these representations and the effectiveness of alternative clustering techniques.
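To make the idea of clustering workflow graphs concrete, the following is a minimal sketch and not the method evaluated in the paper. The abstract does not say which two representations are proposed, so the sketch assumes one plausible choice: each workflow is reduced to a binary bag of its module labels, compared by cosine similarity, and grouped with a simple greedy threshold rule. The workflow names, module names, the `min_sim` threshold, and the helper functions are all hypothetical.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy workflows: each is a list of (source_module, target_module) edges.
workflows = {
    "wf_a": [("ReadFile", "Isosurface"), ("Isosurface", "Render")],
    "wf_b": [("ReadFile", "Isosurface"), ("Isosurface", "Smooth"), ("Smooth", "Render")],
    "wf_c": [("FetchFeed", "Filter"), ("Filter", "Sort")],
}


def module_vector(edges):
    """Assumed representation: a binary bag of the module labels in the graph.

    Graph structure (the edges themselves) is deliberately discarded here.
    """
    nodes = {module for edge in edges for module in edge}
    return Counter(nodes)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def threshold_clusters(vectors, min_sim):
    """Greedy one-pass grouping: a workflow joins the first cluster whose
    seed vector is at least `min_sim` similar, otherwise it starts a new one."""
    clusters = []  # list of (seed_vector, member_names)
    for name, vec in vectors.items():
        for seed, members in clusters:
            if cosine(seed, vec) >= min_sim:
                members.append(name)
                break
        else:
            clusters.append((vec, [name]))
    return [members for _, members in clusters]


vectors = {name: module_vector(edges) for name, edges in workflows.items()}
clusters = threshold_clusters(vectors, min_sim=0.5)
print(clusters)
```

A label-only vector like this is cheap to compute but ignores how modules are connected. A structure-aware representation would cost more to compare, which is the kind of trade-off the abstract refers to.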
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Santos, E., Lins, L., Ahrens, J.P., Freire, J., Silva, C.T. (2008). A First Study on Clustering Collections of Workflow Graphs. In: Freire, J., Koop, D., Moreau, L. (eds) Provenance and Annotation of Data and Processes. IPAW 2008. Lecture Notes in Computer Science, vol 5272. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89965-5_18
DOI: https://doi.org/10.1007/978-3-540-89965-5_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89964-8
Online ISBN: 978-3-540-89965-5
eBook Packages: Computer Science, Computer Science (R0)