# Statistical Identification of Uniformly Mutated Segments within Repeats

## Abstract

Given a long string of characters from a constant-size (w.l.o.g. binary) alphabet, we present an algorithm to determine whether its characters have been generated by a single i.i.d. random source. More specifically, consider all possible *k*-coin models for generating a binary string *S*, where each bit of *S* is generated via an independent toss of one of the *k* coins in the model. The choice of which coin to toss is decided by a random walk on the set of coins, where the probability of a coin change is much lower than the probability of using the same coin repeatedly. We present a statistical test procedure which, for any given *S*, determines whether the *a posteriori* probability for *k* = 1 is higher than for any other *k* > 1. Our algorithm runs in time *O*(*l*⁴ log *l*), where *l* is the length of *S*, through a dynamic programming approach which exploits the convexity of the *a posteriori* probability for *k*.
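To make the *k*-coin model concrete, the following is a minimal Python sketch of the generative process described above, together with the maximized log-likelihood of a string under a single coin. It is an illustration under stated assumptions, not the paper's test procedure or its dynamic program: the function names, the `switch_prob` parameter, and the rule that a coin change moves to a uniformly chosen other coin are all hypothetical choices, since the abstract only says that the coin is selected by a random walk with a low change probability.

```python
import math
import random


def generate_k_coin_string(biases, length, switch_prob, seed=None):
    """Sample a binary string from an illustrative k-coin model.

    biases[i] is the probability that coin i shows 1.
    switch_prob is the (assumed small) probability of changing coins
    before a toss; the uniform choice among the other coins is an
    assumption made for this sketch.
    """
    rng = random.Random(seed)
    k = len(biases)
    coin = rng.randrange(k)  # starting coin, chosen uniformly (assumption)
    bits = []
    for _ in range(length):
        # Rarely move to a different coin; otherwise keep the current one.
        if k > 1 and rng.random() < switch_prob:
            coin = rng.choice([c for c in range(k) if c != coin])
        # Independent toss of the current coin.
        bits.append(1 if rng.random() < biases[coin] else 0)
    return bits


def single_coin_log_likelihood(bits):
    """Log-likelihood of bits under one coin with its bias set to the
    maximum-likelihood estimate (the fraction of ones)."""
    n = len(bits)
    ones = sum(bits)
    if ones == 0 or ones == n:
        return 0.0  # a constant string has probability 1 under bias 0 or 1
    p = ones / n
    return ones * math.log(p) + (n - ones) * math.log(1 - p)
```

For example, `generate_k_coin_string([0.1, 0.6], 1000, 0.01, seed=0)` yields a string whose density of ones shifts between long segments, which is the kind of heterogeneity a test for *k* = 1 versus *k* > 1 is meant to detect.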

The problem we consider arises from two critical applications in analyzing long alignments between pairs of genomic sequences. A high alignment score between two DNA sequences usually indicates an evolutionary relationship, i.e., that the sequences have been generated as a result of one or more copy events followed by random point mutations. Such sequences may include functional regions (e.g. exons) as well as non-functional ones (e.g. introns). Functional regions with critical importance exhibit much lower mutation rates than non-functional DNA (or DNA

## Keywords

Genome Segment, Posteriori Probability, Locality Sensitive Hash, Random Source, High Similarity Score
