Application of Parallel Vector Space Model for Large-Scale DNA Sequence Analysis

  • Abdul Majid
  • Mukhtaj Khan (corresponding author)
  • Nadeem Iqbal
  • Mian Ahmad Jan
  • Mushtaq Khan
  • Salman


With rapid advances in bioinformatics and computational biology, the volume of collected DNA data is growing exponentially, doubling every 18 months. Because DNA datasets are large and structurally complex, DNA sequence analysis has become a computationally challenging problem in these fields, and fast algorithms capable of analyzing large-scale DNA sequences are now required. This paper presents a novel Parallel Vector Space Model (PVSM) approach that supports the analysis of large-scale DNA sequences by taking advantage of multi-core systems. The proposed approach is built on top of a modified Vector Space Model (VSM). PVSM is evaluated extensively on DNA sequences of varying size in terms of computational efficiency and accuracy, and its performance is compared with that of the sequential modified VSM. The sequential VSM is implemented on a single processor, whereas the proposed method is parallelized first on 4 processors and subsequently on 12 processors. The experimental results show that PVSM outperforms the sequential VSM, achieving approximately 2× speedup over the sequential approach without affecting accuracy. Moreover, the proposed PVSM is highly scalable as the number of processing cores increases and supports the analysis of large-scale DNA sequences.
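The abstract does not specify how sequences are vectorized or compared, so the following is only a minimal sketch of the general idea of a parallel vector space model, not the authors' implementation. It assumes overlapping k-mers as the vector terms and cosine similarity as the comparison measure; both are assumptions, and all function names (`count_kmers`, `split_with_overlap`, `kmer_vector`, `cosine_similarity`) are hypothetical. Threads are used only to keep the sketch self-contained; on CPython, true multi-core execution would need a process pool instead.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_kmers(chunk, k):
    """Count overlapping k-mers in one chunk of the sequence."""
    return Counter(chunk[i:i + k] for i in range(len(chunk) - k + 1))


def split_with_overlap(seq, parts, k):
    """Split seq into about `parts` chunks that overlap by k-1 bases, so that
    k-mers spanning a chunk boundary are counted exactly once."""
    step = max(1, math.ceil(len(seq) / parts))
    return [seq[start:start + step + k - 1] for start in range(0, len(seq), step)]


def kmer_vector(seq, k=3, workers=4):
    """Build a k-mer count vector by counting chunks concurrently and merging.

    Assumption: k-mer counts stand in for the term weights of the VSM.
    """
    chunks = split_with_overlap(seq, workers, k)
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_kmers, chunks, [k] * len(chunks)):
            total.update(partial)  # Counter.update adds counts together
    return total


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    query = kmer_vector("ACGTACGTAC", k=3, workers=3)
    target = kmer_vector("ACGTTCGTAC", k=3, workers=3)
    print(cosine_similarity(query, target))
```

Because the chunks overlap by k-1 bases, the merged vector is identical to the one a single sequential pass would produce, which is the sense in which such a parallelization leaves accuracy unchanged.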


Vector space model; Parallel computation; Large-scale DNA analysis; Multi-core system; Bioinformatics





Copyright information

© Springer Nature B.V. 2018

Authors and Affiliations

  1. Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan
  2. COMSATS Institute of Information Technology, Wah Cantt, Pakistan
