Abstract
Scientific workflows are abstractions used to model and execute in silico scientific experiments. They represent key resources for scientists and are enacted and managed by engines called Scientific Workflow Management Systems (SWfMS). Each SWfMS has its own workflow language. This heterogeneity of languages and formats makes it difficult for scientists to search for or discover workflows in distributed repositories for reuse. The existing workflows in these repositories can be leveraged to identify and construct families of workflows (clusters) that share a particular goal. However, it is hard to compare the structure of these workflows because they are modeled in different formats. An alternative is to compare workflow metadata, such as natural language descriptions (usually found in workflow repositories), instead of comparing workflow structure. In this scenario, we expect that the effective use of classical text mining techniques can cluster a set of workflows into families, offering scientists the possibility of finding and reusing existing workflows, which may decrease the complexity of modeling a new experiment. This paper presents Athena, a cloud-based approach that supports workflow clustering across disperse repositories using their natural language descriptions, thus integrating these repositories and providing an easier way to search for and reuse workflows.
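The idea of grouping workflows by their natural language descriptions rather than by structure can be illustrated with a minimal sketch. The abstract does not specify Athena's actual pipeline, so everything below is an assumption chosen for illustration: the tiny stop-word list, the TF-IDF weighting, the cosine similarity measure, the single-link grouping, the 0.3 threshold, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "for", "from", "with"}


def tokenize(text):
    """Lowercase the text, keep alphabetic runs, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(descriptions):
    """Build one sparse TF-IDF vector (term -> weight) per description."""
    docs = [Counter(tokenize(d)) for d in descriptions]
    n = len(docs)
    # Document frequency: in how many descriptions each term appears.
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values()) or 1
        vectors.append(
            {t: (c / total) * (math.log(n / df[t]) + 1.0) for t, c in doc.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_workflows(descriptions, threshold=0.3):
    """Group descriptions whose similarity reaches the threshold (single link)."""
    vectors = tfidf_vectors(descriptions)
    parent = list(range(len(vectors)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Made-up descriptions standing in for metadata from different repositories.
descriptions = [
    "Aligns protein sequences with BLAST and builds a phylogenetic tree",
    "Builds a phylogenetic tree from aligned protein sequences",
    "Renders a 3D visualization of ocean temperature simulation data",
    "Visualization of simulation data for ocean temperature fields",
]
families = cluster_workflows(descriptions)
print(families)
```

Because the comparison uses only the descriptions, the same code applies regardless of which SWfMS language each workflow was written in, which is the property the abstract relies on. A fuller pipeline would likely add stemming and a more complete stop-word list.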
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Costa, F., de Oliveira, D., Ogasawara, E., Lima, A.A.B., Mattoso, M. (2012). Athena: Text Mining Based Discovery of Scientific Workflows in Disperse Repositories. In: Lacroix, Z., Vidal, M.E. (eds) Resource Discovery. RED 2010. Lecture Notes in Computer Science, vol 6799. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27392-6_8
DOI: https://doi.org/10.1007/978-3-642-27392-6_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-27391-9
Online ISBN: 978-3-642-27392-6
eBook Packages: Computer Science (R0)