Overview of Sequence Data Formats

Zhang, Hongen

doi:10.1007/978-1-4939-3578-9_1

Protocol
First Online: 24 March 2016

Part of the book series: Methods in Molecular Biology (MIMB, volume 1418)

Abstract

A next-generation sequencing experiment can generate billions of short reads per sample, and processing the raw reads adds further information. Various file formats have been developed to store and manipulate this information. This chapter presents an overview of the file formats commonly used in the analysis of next-generation sequencing data, including FASTQ, FASTA, SAM/BAM, GFF/GTF, BED, and VCF.

Author information

Authors and Affiliations

Center for Cancer Research, National Cancer Institute, National Institutes of Health, 37 Convent Drive, Room 6138, Bethesda, MD, 20892, USA
Hongen Zhang

Corresponding author

Correspondence to Hongen Zhang.

Editor information

Editors and Affiliations

Ohio State University, Biomed Informatics, College of Medicine, Columbus, Ohio, USA
Ewy Mathé
National Cancer Institute, National Institutes of Health, Columbia, Maryland, USA
Sean Davis

About this protocol

Cite this protocol

Zhang, H. (2016). Overview of Sequence Data Formats. In: Mathé, E., Davis, S. (eds) Statistical Genomics. Methods in Molecular Biology, vol 1418. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3578-9_1

DOI: https://doi.org/10.1007/978-1-4939-3578-9_1
Published: 24 March 2016
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-3576-5
Online ISBN: 978-1-4939-3578-9
eBook Packages: Springer Protocols
