Application of Array-Oriented Scientific Data Formats (NetCDF) to Genotype Data, GWASpi as an Example
Over the last three decades, the power, resolution and sophistication of scientific experiments have increased greatly, generating vast volumes of biological data that need to be stored and processed. Array-oriented Scientific Data Formats are part of an effort by diverse scientific communities to solve the growing problems of data storage and manipulation. Genome-wide Association Studies (GWAS) based on Single Nucleotide Polymorphism (SNP) arrays are one of the technologies that produce large volumes of data, particularly information on genomic variability. Because the available methods and software packages are complex, each with its own intricate formats and workflows, the analysis of GWAS confronts scientists with difficult hardware and software problems. To help ease these issues, we have introduced the use of Array-oriented Scientific Data Format databases (NetCDF) in the GWASpi application, a user-friendly, multi-platform desktop software for the management and analysis of GWAS data. The resulting leap in performance makes it possible to get the most out of commonly available desktop hardware, on which GWASpi now enables "start-to-end" GWAS management, from raw data to end results and charts. Not only does NetCDF allow the data to be stored efficiently, it also reduces the time needed to obtain the basic results of a GWAS by up to two orders of magnitude. Additionally, the same principles can be used to store and analyze variability data generated by means of ultrasequencing technologies. Available at http://www.gwaspi.org.
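The efficiency claim rests on a property shared by array-oriented formats: values of fixed width are laid out in a regular grid, so any cell can be located by arithmetic on its indices instead of by parsing text. The following is a minimal sketch of that idea using only the Python standard library. It is not the NetCDF API and not GWASpi's actual schema; the file layout, the two-character genotype encoding, and the names `write_matrix` and `read_snp` are assumptions made for illustration.

```python
import struct

# Hypothetical layout (an assumption, not GWASpi's format):
#   header : two little-endian unsigned 32-bit ints (n_samples, n_snps)
#   body   : samples x SNPs matrix, row-major, 2 ASCII bytes per genotype
#            e.g. "AA", "AG", and "00" for a missing call
GENOTYPE_WIDTH = 2
HEADER_FORMAT = "<II"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def write_matrix(path, genotypes):
    """Write a list of per-sample genotype lists as one fixed-width block."""
    n_samples = len(genotypes)
    n_snps = len(genotypes[0]) if genotypes else 0
    with open(path, "wb") as fh:
        fh.write(struct.pack(HEADER_FORMAT, n_samples, n_snps))
        for row in genotypes:
            if len(row) != n_snps:
                raise ValueError("every sample needs the same number of SNPs")
            for call in row:
                encoded = call.encode("ascii")
                if len(encoded) != GENOTYPE_WIDTH:
                    raise ValueError("genotype must be exactly 2 characters")
                fh.write(encoded)


def read_snp(path, snp_index):
    """Return every sample's genotype at one SNP by seeking to each cell."""
    with open(path, "rb") as fh:
        n_samples, n_snps = struct.unpack(HEADER_FORMAT, fh.read(HEADER_SIZE))
        if not 0 <= snp_index < n_snps:
            raise IndexError("SNP index out of range")
        calls = []
        for sample in range(n_samples):
            # Cell position follows directly from the indices; no scanning.
            cell = sample * n_snps + snp_index
            fh.seek(HEADER_SIZE + cell * GENOTYPE_WIDTH)
            calls.append(fh.read(GENOTYPE_WIDTH).decode("ascii"))
        return calls
```

Reading one SNP across all samples touches only `n_samples` cells, however many SNPs the file holds, whereas a line-oriented text format would have to be scanned from the start. A real NetCDF file adds named dimensions, typed variables and metadata on top of this kind of layout.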
Keywords: Array-oriented Scientific Data Formats, NetCDF, HDF, GWAS, Genome-wide association studies, single nucleotide polymorphisms, SNP