Bioinformatics from a Big Data Perspective: Meeting the Challenge

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 10209)

Abstract

Recently, the rise of the Big Data paradigm has had a great impact on several fields, Bioinformatics among them. In fact, Bioinformatics has had to evolve in order to adapt to this phenomenon. The exponential increase in available biological information has forced researchers to find new solutions to handle these new challenges.

In this paper we present our point of view on the problems intrinsic to Big Data (volume, velocity, variety and veracity), how they affect the Bioinformatics field, and some solutions that can help Bioinformatics practitioners to deal with the difficulties presented by Big Data.

Acknowledgement

This work has been funded by the Spanish Ministry of Science and Innovation under grant TIN2015-64776-C3-2-R.

Author information

Corresponding author

Correspondence to Francisco Gomez-Vela.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Gomez-Vela, F. et al. (2017). Bioinformatics from a Big Data Perspective: Meeting the Challenge. In: Rojas, I., Ortuño, F. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2017. Lecture Notes in Computer Science, vol 10209. Springer, Cham. https://doi.org/10.1007/978-3-319-56154-7_32

  • DOI: https://doi.org/10.1007/978-3-319-56154-7_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-56153-0

  • Online ISBN: 978-3-319-56154-7

  • eBook Packages: Computer Science, Computer Science (R0)
