Abstract
Accelerated growth in the field of bioinformatics has resulted in large data sets being produced and analyzed. This rapid growth has created a need to analyze these data in a quick, easy, scalable, and reliable manner on a variety of computing infrastructures, including desktops, clusters, grids, and clouds. This paper presents the application of workflow technologies, specifically Pegasus WMS, a robust scientific workflow management system, to a variety of bioinformatics projects spanning RNA sequencing, proteomics, and data quality control in population studies using GWAS data.
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mehta, G. et al. (2012). Enabling Data and Compute Intensive Workflows in Bioinformatics. In: Alexander, M., et al. Euro-Par 2011: Parallel Processing Workshops. Euro-Par 2011. Lecture Notes in Computer Science, vol 7156. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29740-3_4
DOI: https://doi.org/10.1007/978-3-642-29740-3_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29739-7
Online ISBN: 978-3-642-29740-3
eBook Packages: Computer Science (R0)