A performance analysis of genome search by matching whole targeted reads on different environments

  • Focus
  • Soft Computing

Abstract

The growth of next-generation sequencing (NGS) data, driven by advances in computing power, has made automated analysis systems increasingly desirable. Automatically predicting genes in unknown sequences requires several pipeline steps. The first step is the acquisition of NGS fragment reads, followed by assembly of fragment reads of 100 bp to 10 Kbp. When the NGS fragment reads are accurate and sufficiently long, a de novo assembler is used to construct the whole genome. When the reads are inaccurate or too short, however, they are assembled on the basis of overlaps with reference sequences instead of with a de novo assembler. The next step is the prediction of genes in the assembled contigs. Genes can then be annotated by matching candidate sequences against reference sequences. Each processing step requires inputs and outputs in a different format, so data files of several formats must be managed. To reduce these redundant processes, we herein propose an approach referred to as the genome search system. This system automatically identifies genes from assembled sequences and reference amino acid sequences. The challenge is that running BLAST and analyzing the results for each gene are computationally intensive; the system therefore reduces the hardware resources needed to process whole assembled reads. This improves performance and shortens the execution time required to identify genes. Based on this result, this study reviews this approach to identifying genes and compares its performance across different system environments.
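The abstract does not describe how the genome search system distributes its BLAST work. Purely as an illustration of why the matching step can be spread over hardware resources, here is a minimal Python sketch under stated assumptions: the reference amino acid sequences are split into balanced chunks, and each chunk becomes one `tblastn` run against the assembled contigs. The function names, the file names, and the greedy balancing strategy are hypothetical and are not taken from the article; only the `tblastn` options shown (`-query`, `-subject`, `-outfmt`, `-evalue`, `-out`) are standard BLAST+ flags.

```python
from typing import Dict, List, Optional


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into a mapping of record id -> sequence."""
    records: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def split_balanced(records: Dict[str, str], n_chunks: int) -> List[Dict[str, str]]:
    """Assign the longest sequences first to the currently lightest chunk.

    This greedy heuristic is an assumption made for the sketch, not the
    authors' method.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks: List[Dict[str, str]] = [{} for _ in range(n_chunks)]
    loads = [0] * n_chunks
    by_length = sorted(records.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, seq in by_length:
        lightest = loads.index(min(loads))
        chunks[lightest][name] = seq
        loads[lightest] += len(seq)
    return chunks


def tblastn_command(query_faa: str, subject_fna: str, out_tsv: str,
                    evalue: float = 1e-5) -> List[str]:
    """Build (but do not run) a tblastn call: protein queries vs. nucleotide contigs."""
    return [
        "tblastn",
        "-query", query_faa,      # chunk of reference amino acid sequences
        "-subject", subject_fna,  # assembled contigs (hypothetical file name)
        "-outfmt", "6",           # tabular output, one hit per line
        "-evalue", str(evalue),
        "-out", out_tsv,
    ]


if __name__ == "__main__":
    references = parse_fasta(">psbA\nMTAILERR\n>rbcL\nMSPQTETKASVGFKAG\n>atpB\nMRINPTTS\n")
    for i, chunk in enumerate(split_balanced(references, 2)):
        print(sorted(chunk), tblastn_command(f"refs_{i}.faa", "contigs.fna", f"hits_{i}.tsv"))
```

Each generated command is independent of the others, so the chunks could be dispatched to separate cores or nodes; how many chunks pay off depends on the system environment being compared.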

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MSIP) NRF - 2016R1C1B1007929, NRF - 2016R1D1A1A09919318, Hongik University Research Fund of 2018 and Dongguk University Research Fund of 2016.

Author information

Corresponding author

Correspondence to Gangman Yi.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by X. Wang, A.K. Sangaiah, M. Pelillo.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jung, J., Yi, G. A performance analysis of genome search by matching whole targeted reads on different environments. Soft Comput 23, 9153–9160 (2019). https://doi.org/10.1007/s00500-018-3573-3

  • DOI: https://doi.org/10.1007/s00500-018-3573-3
