Building Portable and Reproducible Cancer Informatics Workflows: An RNA Sequencing Case Study
The Seven Bridges Cancer Genomics Cloud (CGC) is part of the National Cancer Institute Cloud Resource project, which was created to explore the paradigm of co-locating massive datasets with the computational resources needed to analyze them. The CGC was designed to let researchers easily find the data they need and analyze it with robust applications in a scalable and reproducible fashion. To enable this, individual tools are packaged within Docker containers and described with the Common Workflow Language (CWL), an emerging standard for reproducible data analysis. On the CGC, researchers can deploy individual tools and build large custom workflows by chaining tools together. Here, we discuss a case study in which RNA sequencing data are analyzed with different methods on the Seven Bridges CGC and the results are compared. We highlight best practices for designing command line tools, Docker containers, and CWL descriptions to enable massively parallelized and reproducible biomedical computation with cloud resources.
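To make the idea of a containerized tool described in CWL and chained into a workflow more concrete, the following is a minimal sketch and not the descriptions used in the case study. It builds two CWL v1.0 `CommandLineTool` documents as Python dictionaries and wires them into a two-step `Workflow`. The helper names `make_tool` and `chain`, the `sort` and `uniq` commands, and the `debian:stable-slim` image are placeholders chosen for illustration; they are assumptions and do not come from the paper.

```python
import json


def make_tool(name, base_command, docker_image):
    """Return a minimal CWL v1.0 CommandLineTool description (hypothetical helper)."""
    return {
        "class": "CommandLineTool",
        "label": name,
        "baseCommand": base_command,
        # The container image the tool runs in (placeholder image name).
        "requirements": [
            {"class": "DockerRequirement", "dockerPull": docker_image}
        ],
        # One input file, passed as the first positional argument.
        "inputs": {
            "infile": {"type": "File", "inputBinding": {"position": 1}}
        },
        # Whatever the command writes to standard output is captured as a file.
        "outputs": {"outfile": {"type": "stdout"}},
        "stdout": name + ".out",
    }


def chain(first, second):
    """Return a CWL v1.0 Workflow that feeds the output of `first` into `second`."""
    return {
        "cwlVersion": "v1.0",
        "class": "Workflow",
        "inputs": {"reads": "File"},
        "outputs": {
            "result": {"type": "File", "outputSource": "step2/outfile"}
        },
        "steps": {
            "step1": {
                "run": first,
                "in": {"infile": "reads"},
                "out": ["outfile"],
            },
            "step2": {
                "run": second,
                # The link between steps: step2 consumes step1's output.
                "in": {"infile": "step1/outfile"},
                "out": ["outfile"],
            },
        },
    }


# Placeholder tools standing in for real analysis steps.
sort_tool = make_tool("sort_lines", ["sort"], "debian:stable-slim")
uniq_tool = make_tool("unique_lines", ["uniq"], "debian:stable-slim")
workflow = chain(sort_tool, uniq_tool)

# CWL documents are YAML, and JSON is a subset of YAML, so JSON output is acceptable.
workflow_json = json.dumps(workflow, indent=2)
print(workflow_json)
```

Because each step names its container image and declares its inputs and outputs explicitly, the same description can in principle be run by any CWL-compliant executor, which is the portability argument made in the abstract.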
Key words: Cloud, Bioinformatics, Cancer informatics, TCGA, AWS, Docker, Reproducibility, Software design
The Cancer Genomics Cloud is powered by Seven Bridges and has been funded in whole or in part with federal funds from the NCI, NIH, Department of Health and Human Services, under contract nos. HHSN261201400008C and HHSN261200800001E. We thank the entire Seven Bridges team, the Cancer Genomics Cloud Pilot teams from the NCI, the Broad Institute, and the Institute for Systems Biology, the Genomic Data Commons team, countless early users, and data donors. We also wish to acknowledge the sources of two datasets that are available to authorized users through the CGC and that were central to its development: The Cancer Genome Atlas (TCGA, phs000178) and Therapeutically Applicable Research to Generate Effective Treatments (TARGET, phs000218). The resources described here were developed in part based upon data generated by The Cancer Genome Atlas, managed by the NCI and NHGRI; information about TCGA can be found at https://cancergenome.nih.gov/. They were also developed in part based on data generated by the TARGET initiative, managed by the NCI.