Nucleotide Sequence Alignment and Compression via Shortest Unique Substring

Adaş, Boran; Bayraktar, Ersin; Faro, Simone; Moustafa, Ibraheem Elsayed; Külekci, M. Oguzhan

doi:10.1007/978-3-319-16480-9_36


Conference paper


Part of the book series: Lecture Notes in Computer Science (LNBI, volume 9044)

Abstract

Aligning short reads produced by high-throughput sequencing equipment onto a reference genome is the fundamental step of sequence analysis. Since the sequencing machinery generates massive volumes of data, it is becoming increasingly vital to keep those data compressed as well. In this study we present the initial results of an ongoing research project that aims to combine the alignment and compression of short reads through a novel preprocessing technique based on shortest unique substring identifiers. We observe that clustering the short reads according to the set of unique identifiers they contain provides an opportunity to combine compression and alignment. We therefore propose an alternative path in the high-throughput sequence analysis pipeline: instead of immediately applying a whole alignment, a preprocessing step is performed first that clusters the reads according to the set of shortest unique substring identifiers extracted from the reference genome. We also present an analysis of the shortest unique substring identifiers on the human reference genome and examine how labeling each short read with those identifiers helps in alignment and compression.
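
The clustering idea can be illustrated with a minimal sketch on a toy reference. It is not the authors' implementation: the abstract does not define the identifiers precisely, so the sketch assumes they are minimal unique substrings (substrings that occur exactly once in the reference and contain no shorter unique substring). The function names, the brute-force search, and the toy sequences are all hypothetical.

```python
from collections import Counter, defaultdict

def minimal_unique_substrings(ref, max_len):
    """Map each minimal unique substring of ref (length <= max_len) to its position.

    Assumption: "identifier" means a substring that occurs once in ref and has
    no unique proper substring. Brute force, suitable for toy inputs only.
    """
    found = {}
    prev_counts = Counter()  # counts of (k-1)-mers; empty while k == 1
    for k in range(1, max_len + 1):
        kmers = [ref[i:i + k] for i in range(len(ref) - k + 1)]
        counts = Counter(kmers)
        for i, kmer in enumerate(kmers):
            if counts[kmer] != 1:
                continue  # not unique in the reference
            # Any unique proper substring would make the (k-1)-prefix or
            # (k-1)-suffix unique too, so checking those two is enough.
            if k > 1 and (prev_counts[kmer[:-1]] == 1 or prev_counts[kmer[1:]] == 1):
                continue
            found[kmer] = i
        prev_counts = counts
    return found

def cluster_reads(reads, identifiers):
    """Group reads by the set of identifiers each read contains."""
    clusters = defaultdict(list)
    for read in reads:
        label = frozenset(s for s in identifiers if s in read)
        clusters[label].append(read)
    return dict(clusters)

def candidate_starts(read, identifiers):
    """Reference start positions implied by the identifiers found in a read."""
    return sorted({pos - read.find(s) for s, pos in identifiers.items() if s in read})

reference = "ACGTACGGA"                    # toy reference, not real data
reads = ["CGTA", "GTAC", "ACGG", "CGGA"]   # error-free toy reads

ids = minimal_unique_substrings(reference, max_len=3)
clusters = cluster_reads(reads, ids)
```

In this toy setting, reads that share a label come from the same neighbourhood of the reference, which is what would make them candidates for joint compression, and `candidate_starts` shows how an identifier occurrence could narrow down where a read aligns. Real reads contain sequencing errors and variants, which this sketch ignores.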

This work has been supported by the Scientific & Technological Research Council of Turkey (TÜBİTAK) under the BİDEB–2221 Fellowship Program and under TÜBİTAK-ARDEB-1005 grant number 114E293.

Author information

Authors and Affiliations

Department of Computer Engineering, İstanbul Technical University, Turkey
Boran Adaş & Ersin Bayraktar
Department of Mathematics and Computer Science, University of Catania, Italy
Simone Faro
Department of Biomedical Engineering, İstanbul Medipol University, Turkey
Ibraheem Elsayed Moustafa & M. Oguzhan Külekci

Editor information

Editors and Affiliations

Universidad de Granada, Dpto. de Arquitectura y Tecnología de Computadores (ATC), E.T.S. de Ingenierías en Informática y Telecomunicación, CITIC-UGR, c/ Periodista Daniel Saucedo Aranda s/n, 18071, Granada, Spain
Francisco Ortuño
Universidad de Granada, E.T.S. Ingenierías Informática y de Telecomunicación, Dpto. Arquitectura y Tecnología de Computadores, CITIC-UGR, C Periodista Rafael Gómez Montero, 18071, Granada, Spain
Ignacio Rojas

About this paper

Cite this paper

Adaş, B., Bayraktar, E., Faro, S., Moustafa, I.E., Külekci, M.O. (2015). Nucleotide Sequence Alignment and Compression via Shortest Unique Substring. In: Ortuño, F., Rojas, I. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2015. Lecture Notes in Computer Science, vol 9044. Springer, Cham. https://doi.org/10.1007/978-3-319-16480-9_36

DOI: https://doi.org/10.1007/978-3-319-16480-9_36
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16479-3
Online ISBN: 978-3-319-16480-9
eBook Packages: Computer Science, Computer Science (R0)