Application of Array-Oriented Scientific Data Formats (NetCDF) to Genotype Data, GWASpi as an Example

  • Fernando Muñiz Fernandez
  • Angel Carreño Torres
  • Carlos Morcillo-Suarez
  • Arcadi Navarro
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6620)


Over the last three decades, the power, resolution and sophistication of scientific experiments have increased enormously, generating vast volumes of biological data that need to be stored and processed. Array-oriented Scientific Data Formats are part of an effort by diverse scientific communities to solve the growing problems of data storage and manipulation. Genome-wide Association Studies (GWAS) based on Single Nucleotide Polymorphism (SNP) arrays are one of the technologies that produce large volumes of data, particularly information on genomic variability. Because the available methods and software packages are complex, each with its own intricate formats and workflows, the analysis of GWAS confronts scientists with difficult hardware and software problems. To help ease these issues, we have introduced the use of Array-oriented Scientific Data Format databases (NetCDF) in the GWASpi application, a user-friendly, multi-platform desktop software package for the management and analysis of GWAS data. The resulting leap in performance makes it possible to get the most out of commonly available desktop hardware, on which GWASpi now enables "start-to-end" GWAS management, from raw data to end results and charts. Not only does NetCDF store the data efficiently, but it also reduces the time needed to obtain the basic results of a GWAS by up to two orders of magnitude. Additionally, the same principles can be used to store and analyze variability data generated by ultrasequencing technologies. Available at .
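
The benefit of an array-oriented layout for genotype data can be illustrated with a minimal sketch. It is not GWASpi's implementation and does not use the NetCDF library. It assumes a hypothetical flat binary file holding a samples × markers matrix of fixed-width, two-character genotype calls, and the names `write_matrix`, `read_marker`, `read_sample` and `demo` are invented for illustration. Because every call has the same width, the position of any call can be computed from its indices, so a single marker or a single sample can be read without parsing the whole file.

```python
import os
import struct
import tempfile

GENOTYPE_BYTES = 2  # assumed encoding: one ASCII byte per allele, "00" for a missing call
HEADER_BYTES = 8    # assumed header: two little-endian unsigned 32-bit dimensions


def write_matrix(path, genotypes):
    """Write a samples x markers matrix as fixed-width calls, one sample row after another."""
    n_samples = len(genotypes)
    n_markers = len(genotypes[0])
    with open(path, "wb") as fh:
        fh.write(struct.pack("<II", n_samples, n_markers))
        for row in genotypes:
            for call in row:
                fh.write(call.encode("ascii"))


def read_marker(path, marker_index):
    """Read one marker across all samples by seeking to each computed offset."""
    with open(path, "rb") as fh:
        n_samples, n_markers = struct.unpack("<II", fh.read(HEADER_BYTES))
        calls = []
        for sample in range(n_samples):
            offset = HEADER_BYTES + (sample * n_markers + marker_index) * GENOTYPE_BYTES
            fh.seek(offset)
            calls.append(fh.read(GENOTYPE_BYTES).decode("ascii"))
        return calls


def read_sample(path, sample_index):
    """Read all markers of one sample with a single contiguous read."""
    with open(path, "rb") as fh:
        _, n_markers = struct.unpack("<II", fh.read(HEADER_BYTES))
        fh.seek(HEADER_BYTES + sample_index * n_markers * GENOTYPE_BYTES)
        raw = fh.read(n_markers * GENOTYPE_BYTES).decode("ascii")
        return [raw[i:i + GENOTYPE_BYTES] for i in range(0, len(raw), GENOTYPE_BYTES)]


def demo():
    """Round-trip a tiny made-up matrix through a temporary file."""
    genotypes = [
        ["AA", "AG", "CC"],  # sample 0
        ["AG", "GG", "CT"],  # sample 1
        ["00", "AG", "TT"],  # sample 2, first call missing
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "genotypes.bin")
        write_matrix(path, genotypes)
        return read_marker(path, 1), read_sample(path, 2)
```

A real array-oriented format such as NetCDF adds named dimensions, typed variables and metadata on top of this idea. The sketch only shows why indexed access to a fixed-width matrix avoids re-reading text files for each analysis step.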


Keywords: Array-oriented Scientific Data Formats, NetCDF, HDF, GWAS, Genome-wide association studies, single nucleotide polymorphisms, SNP





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Fernando Muñiz Fernandez (1, 2)
  • Angel Carreño Torres (1, 2)
  • Carlos Morcillo-Suarez (1, 2, 3)
  • Arcadi Navarro (1, 2, 3, 4, 5)

  1. Institut de Biología Evolutiva (UPF-CSIC), Biomedical Research Park (PRBB), Universitat Pompeu Fabra, Barcelona, Spain
  2. Population Genomics Node (GNV8), National Institute for Bioinformatics (INB), Universitat Pompeu Fabra, Barcelona, Spain
  3. National Genotyping Centre (CeGen), Universitat Pompeu Fabra, Barcelona, Spain
  4. Institució Catalana de Recerca i Estudis Avançats (ICREA), Universitat Pompeu Fabra, Barcelona, Spain
  5. Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, Barcelona, Spain
