The Use of Distributed Data Storage and Processing Systems in Bioinformatic Data Analysis

  • Michał Bochenek
  • Kamil Folkert
  • Roman Jaksik
  • Michał Krzesiak
  • Marcin Michalak (Email author)
  • Marek Sikora
  • Tomasz Stęclik
  • Łukasz Wróbel
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 928)


Cancer and cancer mortality may seem to be a sign of the present times, which leads hundreds of scientists to search for significant premises of cancer occurrence. This paper defines a set of data mining tasks that link observed gene mutations with the occurrence of specific cancer types. Because analysing this kind of data is computationally complex, a Hadoop ecosystem cluster was developed to perform the required calculations. The results may be satisfactory in two domains: distributed data storage and processing, and the interpretation of gene mutation occurrence.


Hadoop ecosystem; Biomedical data; Distributed computing; TCGA data analysis; Gene mutations



This work was partially supported by the Polish National Centre for Research and Development (NCBiR) within the programme Prevention and Treatment of Civilization Diseases - STRATEGMED III, Grant No. STRATEGMED3/304586/5/NCBR/2017 (PersonALL). The work was carried out in part (especially the participation of the fifth author) within the statutory research project of the Institute of Informatics, BK-213/RAU2/2018.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Michał Bochenek (4)
  • Kamil Folkert (4)
  • Roman Jaksik (3)
  • Michał Krzesiak (4)
  • Marcin Michalak (1, Email author)
  • Marek Sikora (2)
  • Tomasz Stęclik (1)
  • Łukasz Wróbel (2)

  1. Institute of Innovative Technologies EMAG, Katowice, Poland
  2. Institute of Informatics, Silesian University of Technology, Gliwice, Poland
  3. Institute of Automatic Control, Silesian University of Technology, Gliwice, Poland
  4. 3 Soft S.A., Katowice, Poland
