Big Biological Data Management

  • Edvard Pedersen
  • Lars Ailo Bongo
Part of the Computer Communications and Networks book series (CCN)


With the deluge of omics data, the life sciences have become a big data science. The management and analysis of omics data share many of the challenges and technical solutions of other big data fields. However, there are also unique challenges. In particular, there is a need for data management solutions that are backward compatible with unmodified tools, that at the same time scale to large datasets, and that also manage the intermediate data, metadata, and provenance data of analysis pipelines. In this chapter, we present and discuss challenges and approaches for such big biological data management.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. University of Tromsø - The Arctic University of Norway, Tromsø, Norway
