Alignment-Free Sequence Comparison Based on Next Generation Sequencing Reads: Extended Abstract

  • Kai Song
  • Jie Ren
  • Zhiyuan Zhai
  • Xuemei Liu
  • Minghua Deng
  • Fengzhu Sun
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7262)


Next generation sequencing (NGS) technologies have generated enormous amounts of shotgun read data, and assembly of the reads can be challenging, especially for organisms without template sequences. We study the power of genome comparison based on shotgun read data without assembly using three alignment-free sequence comparison statistics, \(D_2\), \(D_2^*\), and \(D_2^S\), both theoretically and by simulations. Theoretical formulas are derived for the power of detecting the relationship between two sequences related through a common motif model. It is shown that both \(D_2^*\) and \(D_2^S\) outperform \(D_2\) for detecting the relationship between two sequences based on NGS data. We then study the effects of tuple length, read length, coverage, and sequencing error on the power of \(D_2^*\) and \(D_2^S\). Finally, variations of these statistics, \(d_2\), \(d_2^*\), and \(d_2^S\), respectively, are used first to cluster 5 mammalian species with known phylogenetic relationships and then to cluster 13 tree species whose complete genome sequences are not available, using NGS shotgun reads. The clustering results using \(d_2^S\) are consistent with biological knowledge for both the 5 mammalian and the 13 tree species. Thus, the statistic \(d_2^S\) provides a powerful alignment-free comparison tool to study the relationships among different organisms based on NGS read data without assembly.
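The abstract names the three statistics without defining them. As an illustration only, the following Python sketch computes them from two sets of reads using the definitions commonly given in the alignment-free comparison literature: \(D_2 = \sum_w X_w Y_w\), \(D_2^* = \sum_w \tilde{X}_w \tilde{Y}_w / \sqrt{n p_w^X \, m p_w^Y}\), and \(D_2^S = \sum_w \tilde{X}_w \tilde{Y}_w / \sqrt{\tilde{X}_w^2 + \tilde{Y}_w^2}\), where \(X_w\) and \(Y_w\) are the counts of tuple \(w\) and \(\tilde{X}_w = X_w - n p_w^X\) is the centred count. Several choices here are assumptions, not taken from the abstract: the background word probabilities \(p_w\) come from an i.i.d. model estimated from each sample's nucleotide frequencies, reverse complements are not counted, and the function names (`kmer_counts`, `base_freqs`, `word_prob`, `d2_statistics`) are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def kmer_counts(reads, k):
    """Count every k-tuple over A/C/G/T across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            # Skip tuples containing ambiguous bases such as N.
            if all(ch in ALPHABET for ch in word):
                counts[word] += 1
    return counts


def base_freqs(reads):
    """Nucleotide frequencies for an i.i.d. background model (an assumption)."""
    c = Counter(ch for read in reads for ch in read if ch in ALPHABET)
    total = sum(c.values())
    if total == 0:
        raise ValueError("reads contain no A/C/G/T characters")
    return {b: c[b] / total for b in ALPHABET}


def word_prob(word, freqs):
    """Probability of a tuple under the i.i.d. background model."""
    p = 1.0
    for ch in word:
        p *= freqs[ch]
    return p


def d2_statistics(reads_x, reads_y, k):
    """Return (D2, D2*, D2S) for two read sets and tuple length k."""
    cx, cy = kmer_counts(reads_x, k), kmer_counts(reads_y, k)
    fx, fy = base_freqs(reads_x), base_freqs(reads_y)
    nx, ny = sum(cx.values()), sum(cy.values())  # total tuple occurrences
    d2 = d2_star = d2_s = 0.0
    for tup in product(ALPHABET, repeat=k):
        w = "".join(tup)
        px, py = word_prob(w, fx), word_prob(w, fy)
        x, y = cx[w], cy[w]
        xt, yt = x - nx * px, y - ny * py  # centred counts
        d2 += x * y
        if px > 0 and py > 0:
            d2_star += xt * yt / sqrt(nx * px * ny * py)
        denom = sqrt(xt * xt + yt * yt)
        if denom > 0:
            d2_s += xt * yt / denom
    return d2, d2_star, d2_s
```

For example, `d2_statistics(["ACGT"], ["ACGT"], 2)` yields \(D_2 = 3\), because the tuples AC, CG, and GT each occur once in both inputs. The loop visits all \(4^k\) tuples, which is practical only for small \(k\); a larger \(k\) would call for iterating over observed tuples and handling the unobserved ones in closed form.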


Keywords: NGS, HMM, statistical power, normal approximation, word count statistics





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Kai Song (1)
  • Jie Ren (1)
  • Zhiyuan Zhai (2)
  • Xuemei Liu (3)
  • Minghua Deng (1)
  • Fengzhu Sun (4, 5)
  1. School of Mathematics, Peking University, Beijing, P.R. China
  2. School of Mathematics, Shandong University, P.R. China
  3. School of Physics, South China University of Technology, Guangzhou, P.R. China
  4. TNLIST/Department of Automation, Tsinghua University, Beijing, P.R. China
  5. Molecular and Computational Biology Program, University of Southern California, Los Angeles, USA
