Statistical Significance for NGS Reads Similarities

  • Antonio Muñoz-Mérida
  • Javier Ríos
  • Hicham Benzekri
  • Oswaldo Trelles
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6620)

Abstract

In this work we present a significance curve to segregate random alignments from true matches in identity-based sequence comparison, especially suitable for sequencing data produced by NGS technologies. The experimental approach reproduces the distribution of random local ungapped similarities by score and length, from which it is possible to assess the statistical significance of any particular ungapped similarity. This work includes a study of how the distribution behaves as a function of the experimental technology used to produce the raw sequences, as well as of the scoring system used in the comparison. Our approach reproduces the expected behaviour and completes the proposal of Rost and Sander for homology-based sequence comparisons. Results can be exploited by computational applications to reduce computational cost and memory usage.
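
As a rough illustration of the idea, the following Python sketch builds an empirical curve from random ungapped similarities. It is not the authors' procedure: the uniform base composition, the +1/-1 scoring, the 99% quantile and all function names (`random_read`, `best_ungapped_segment`, `random_significance_curve`) are assumptions made only for this example. For each pair of random reads it keeps the best-scoring ungapped local segment and records its percent identity by length; an observed similarity of a given length whose identity exceeds the curve value would then be unlikely to arise by chance under these assumptions.

```python
import random
from collections import defaultdict


def random_read(length, rng):
    """Generate a random DNA read with uniform base composition (an assumption)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def best_ungapped_segment(a, b, match=1, mismatch=-1):
    """Return (score, length, identities) of the best-scoring ungapped
    local segment over all diagonals of the comparison of a and b."""
    best = (0, 0, 0)
    for offset in range(-(len(b) - 1), len(a)):
        # Starting cell of this diagonal.
        i, j = (offset, 0) if offset >= 0 else (0, -offset)
        score = length = ident = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                score += match
                ident += 1
            else:
                score += mismatch
            length += 1
            if score <= 0:
                # Segment is no longer worth extending: restart after this cell.
                score = length = ident = 0
            elif score > best[0]:
                best = (score, length, ident)
            i += 1
            j += 1
    return best


def random_significance_curve(n_pairs=200, read_len=100, quantile=0.99, seed=0):
    """Map segment length -> percent identity at the given quantile of the
    random (background) distribution. All parameter values are illustrative."""
    rng = random.Random(seed)
    by_length = defaultdict(list)
    for _ in range(n_pairs):
        a = random_read(read_len, rng)
        b = random_read(read_len, rng)
        score, length, ident = best_ungapped_segment(a, b)
        if length:
            by_length[length].append(100.0 * ident / length)
    curve = {}
    for length, idents in sorted(by_length.items()):
        idents.sort()
        k = min(len(idents) - 1, int(quantile * len(idents)))
        curve[length] = idents[k]
    return curve


if __name__ == "__main__":
    for length, identity in random_significance_curve().items():
        print(f"length {length:3d}: identity threshold {identity:5.1f}%")
```

A realistic background model would use reads and error profiles matching the sequencing technology, and would record all random similarities rather than only the best one per pair.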

Keywords

assembly, reads similarity, NGS

References

  1. Swindell, S.R., Plasterer, T.N.: SEQMAN. Contig assembly. Methods Mol. Biol. 70, 75–89 (1997)
  2. Miller, J.R., et al.: Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24, 2818–2824 (2008)
  3. Pevzner, P.A., Tang, H., Waterman, M.S.: An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. USA 98, 9748–9753 (2001)
  4. Chevreux, B., Wetter, T., Suhai, S.: Genome sequence assembly using trace signals and additional sequence information. In: Comput. Sci. Biol.: Proc. German Conference on Bioinformatics GCB 1999, pp. 45–56 (1999)
  5.
  6.
  7.
  8. Rost, B.: Twilight zone of protein sequence alignments. Protein Eng. 12(2), 85–94 (1999)
  9. Karlin, S., Altschul, S.F.: Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87, 2264–2268 (1990)
  10. Altschul, S.F., Gish, W.: Local alignment statistics. Methods Enzymol. 266, 460–480 (1996)
  11. Collins, J.F., Coulson, A.: Significance of protein sequence similarities. Methods Enzymol. 183, 474–487 (1990)
  12. Sander, C., Schneider, R.: Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9(1), 56–68 (1991)
  13. Altschul, S.F., Bundschuh, R., Olsen, R., Hwa, T.: The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Res. 29(2), 351–361 (2001)
  14. Trelles, O., Andrade, M.A., Valencia, A., Zapata, E.L., Carazo, J.M.: Computational Space Reduction and Parallelization of a new Clustering Approach for Large Groups of Sequences. Bioinformatics 14(5), 439–451 (1998)

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Antonio Muñoz-Mérida (1)
  • Javier Ríos (1)
  • Hicham Benzekri (1)
  • Oswaldo Trelles (1)
  1. Computer Architecture Department, University of Malaga, Spain
