Parallel and Distributed Computing Methodologies in Bioinformatics

  • Giuseppe Agapito
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11874)

Abstract

A significant advantage of experimental techniques such as microarrays, mass spectrometry (MS), and next-generation sequencing (NGS) is that they produce an overwhelming amount of experimental omics data. All of these technologies raise the challenge of determining how the raw omics data should be efficiently processed or normalized and, subsequently, how the data can be adequately summarised or integrated so that they can be stored and shared and can support machine learning and/or statistical analysis. Omics data analysis involves several steps, each implemented through different algorithms that demand considerable computational power. The main problem is automating the overall analysis process to increase throughput and reduce manual intervention (e.g., users have to manually supervise some steps of the analysis process). In this scenario, parallel and distributed computing technologies, such as the Message Passing Interface (MPI), GPU computing, and Hadoop MapReduce, are essential to speed up and automate the whole omics data analysis workflow. Parallel and distributed computing enables the development of bioinformatics pipelines that achieve scalable, efficient, and reliable computing performance on clusters as well as on cloud platforms.

Keywords

Bioinformatics · High performance computing · Cloud computing · Distributed computing · Parallel computing

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Medical and Surgical Sciences, Magna Graecia University, Catanzaro, Italy