Overview of Sequence Data Formats

  • Protocol

Part of the book series: Methods in Molecular Biology (MIMB, volume 1418)

Abstract

A next-generation sequencing experiment can generate billions of short reads per sample, and processing the raw reads adds further information. Various file formats have been developed to store and manipulate this information. This chapter presents an overview of the file formats commonly used in the analysis of next-generation sequencing data, including FASTQ, FASTA, SAM/BAM, GFF/GTF, BED, and VCF.
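
As a rough illustration of the simplest of these formats, the sketch below reads FASTQ records in Python. It is not taken from the chapter. It assumes the common four-lines-per-record layout (no wrapped sequence lines) and Phred+33 quality encoding (Sanger / Illumina 1.8+); the names read_fastq and FastqRecord are hypothetical and chosen only for this example.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def read_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream.

    Assumption: qualities are ASCII-encoded with the given offset
    (33 for Sanger / Illumina 1.8+); older Illumina files used 64.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input (or a blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        scores = [ord(char) - offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Usage on a tiny in-memory example instead of a real file.
example = "@read1\nACGT\n+\nII5!\n"
records = list(read_fastq(io.StringIO(example)))
print(records[0].name, records[0].sequence, records[0].qualities)
```

For real data, an established parser (for example the one in Biopython) is usually a better choice than a hand-written reader, since it handles format variants that this sketch ignores.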

Author information

Corresponding author

Correspondence to Hongen Zhang.

Copyright information

© 2016 Springer Science+Business Media New York

About this protocol

Cite this protocol

Zhang, H. (2016). Overview of Sequence Data Formats. In: Mathé, E., Davis, S. (eds) Statistical Genomics. Methods in Molecular Biology, vol 1418. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3578-9_1

  • DOI: https://doi.org/10.1007/978-1-4939-3578-9_1

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-3576-5

  • Online ISBN: 978-1-4939-3578-9

  • eBook Packages: Springer Protocols
