Big data challenges in genome informatics

  • Ka-Chun Wong
Letter to the Editor


In recent years, we have witnessed a big data explosion in genomics, driven by improvements in high-throughput technologies and drastically decreasing costs. We are entering an era in which millions of genomes are available. Notably, a single genome can comprise billions of nucleotides and occupy gigabytes (GB) when stored as plain text. These genomic data undeniably pose unprecedented challenges. In this article, we briefly discuss the big data challenges that have arisen in genomics in recent years.


Funding information

The literature review and writing in this paper were substantially supported by three grants from the Research Grants Council of the Hong Kong Special Administrative Region [CityU 21200816], [CityU 11203217], and [CityU 11200218]. The donation support of the Titan Xp GPU from the NVIDIA Corporation is appreciated.

Compliance with ethical standards

Conflict of interest

Ka-Chun Wong declares that he has no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by the author.



Copyright information

© International Union for Pure and Applied Biophysics (IUPAB) and Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. City University of Hong Kong, Kowloon, Hong Kong
