Abstract
Whole-genome comparisons among mammalian and other eukaryotic organisms have revealed that they contain large quantities of conserved non—protein-coding sequence. Although some of the functions of this non-coding DNA have been identified, there remains a large quantity of conserved genomic sequence that is of no known function. Moreover, the task of delineating the conserved sequences is non-trivial, particularly when some sequences are conserved in only a small number of lineages. Sequence segmentation is a statistical technique for identifying putative functional elements in genomes based on atypical sequence characteristics, such as conservation levels relative to other genomes, GC content, SNP frequency, and potentially many others. The publicly available program changept and associated programs use Bayesian multiple change-point analysis to delineate classes of genomic segments with similar characteristics, potentially representing new classes of non-coding RNAs (contact web site: http://silmaril.math.sci.qut.edu.au/~keith/).
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Lander, E. S., Linton, L. M., Birren, B., et al. (2001) Initial sequencing and analysis of the human genome. Nature 409, 860–921.
Venter, J. C., Adams, M. D., Myers, E. W., et al. (2001) The sequence of the human genome. Science 291, 1304–1351.
Waterston, R. H., Lindblad-Toh, K., Bir-ney, E., et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562.
Mikkelsen, T. S., Hillier, L. W., Eichler, E. E., et al. (2005) Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69–87.
Sandelin, A., Wasserman, W. W., Lenhard, B. (2004) ConSite: web-based prediction of regulatory elements using cross-species comparison. Nucleic Acids Res 32, W249–W52.
Loots, G. G., Ovcharenko, I., Pachter, L., et al. (2002) rVista for comparative sequence-based discovery of functional transcription factor binding sites. Genome Res 12, 832–839.
Cooper, G. M., Stone, E. A., Asimenos, G., et al. (2005) Distribution and intensity of constraint in mammalian genomic sequence. Genome Res 15, 901–913.
Gibbs, R. A., Weinstock, G. M., Metzker, M. L., et al. (2004) Genome sequence of the Brown Norway Rat yields insights into mammalian evolution. Nature 428, 493–521.
Siepel, A. C., Bejerano, G., Pedersen, J. S., et al. (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15, 1034–1050.
Siepel, A. C., Haussler, D. (2004) Combining phylogenetic and hidden Markov models in biosequence analysis. J Com Biol 11, 413–428.
Bernaola-Galvan, P., Grosse, I., Carpena, P., et al. (2000) Finding borders between coding and non-coding regions by an entropic segmentation method. Phys Rev Letts 85, 1342–1345.
Bernaola-Galvan, P., Roman-Roldan, R., Oliver, J. (1996) Compositional segmentation and long-range fractal correlations in DNA sequences. Phys Rev E 53, 5181–5189.
Braun, J. V., Braun, R. K., Muller, H.-G. (2000) Multiple changepoint fitting via quasilikelihood, with application to DNA sequence segmentation. Biometrika 87, 301–314.
Braun, J. V., Muller, H.-G. (1998) Statistical methods for DNA sequence segmentation. Stat Sci 13, 142–162.
Gionis, A., Mannila, H. (2003) Finding recurrent sources in sequences. In Proceedings of the Seventh Annual International Conference on Research in Computational Molecular Biology, 123–130.
Li, W. (2001) DNA segmentation as a model selection process. In Proceedings of the Fifth Annual International Conference on Research in Computational Molecular Biology, 204–210.
Li, W., Bernaola-Galvan, P., Haghighi, F., et al. (2002) Applications of recursive segmentation to the analysis of DNA sequences. Comput Chem 26, 491–510.
Oliver, J. L., Bernaola-Galvan, P., Carpena, P., et al. (2001) Isochore chromosome maps of eukaryotic genomes. Gene 276, 47–56.
Oliver, J. L., Carpena, P., Roman-Roldan, R., et al. (2002) Isochore chromosome maps of the human genome. Gene 300, 117–127.
Oliver, J. L., Roman-Roldan, R., Perez, J., et al. (1999) SEGMENT: identifying compositional domains in DNA sequences. Bio-informatics 15, 974–979.
Szpankowski, W., Ren, W., Szpankowski, L. (2005) An optimal DNA segmentation based on the MDL principle. Int J Bioin-format Res Appl 1, 3–17.
Boys, R. J., Henderson, D. A. (2002) On determining the order of Markov dependence of an observed process governed by a hidden Markov model. Sci Prog 10, 241–251.
Boys, R. J., Henderson, D. A. (2004) A Bayesian approach to DNA sequence segmentation. Biometrics 60, 573–588.
Boys, R. J., Henderson, D. A., Wilkinson, D. J. (2000) Depicting homogenous segments in DNA sequences by using hidden Markov models. Appl Stat 49, 269–285.
Keith, J. M. (2006) Segmenting eukaryotic genomes with the generalized Gibbs sampler. J Comput Biol 13, 1369–1383.
Keith, J. M., Kroese, D. P., Bryant, D. (2004) A Generalized Markov Sampler. Methodol Comput Appl Prob 6, 29–53.
Minin, V. N., Dorman, K. S., Fang, F., et al. (2005) Dual multiple change-point model leads to more accurate recombination detection. Bioinformatics 21, 3034–3042.
Husmeier, D., Wright, F. (2002) A Baye-sian approach to discriminate between alternative DNA sequence segmentations. Bioinformatics 18, 226–234.
Liu, J. S., Lawrence, C. E. (1999) Bayesian inference on biopolymer models. Bioinformatics 15, 38–52.
Ramensky, V. E., Makeev, V. J., Toytberg, M. A., et al. (2000) DNA segmentation through the Bayesian approach. J Comput Biol 7, 215–231.
Salmenkivi, M., Kere, J., Mannila, H. (2002) Genome segmentation using piecewise constant intensity models and reversible jump MCMC. Bioinformatics 18, S211–S218.
Keith, J. M., Adams, P., Stephen, S., et al. Delineating slowly and rapidly evolving fractions of the Drosophila genome, submitted.
Russo, C. A. M., Takezaki, N., Nei, M. (1995) Molecular phylogeny and divergence times of Drosopholid species. Mol Biol Evol 12, 391–404.
Tamura, K., Subramanian, S., Kumar, S. (2004) Temporal patterns of fruit fly (Drosophila) evolution revealed by mutation clocks. Mol Biol Evol 21, 36–44.
Geyer, C. J. (1991) Markov chain Monte Carlo maximum likelihood, in (Keramidas, E. M., ed.), Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, pp. 156–163. Interface Foundation, Fairfax Station, VA.
Acknowledgments
The author thanks Peter Adams for assistance in running simulations; Stuart Stephen for assisting in the development of much of the code; Benjamin Goursaud and Rachel Crehange for assisting in the generalization of the code for multiple data types; and John Mattick, Kerrie Mengersen, Chris Ponting, and Mark Borodovski for helpful discussions. This work was partially funded by Australian Research Council (ARC) Discovery Grants DP0452412 and DP0556631 and a National Health and Medical Research Council (NHMRC) grant entitled “Statistical methods and algorithms for analysis of high-throughput genetics and genomics platforms” (389892).
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2008 Humana Press, a part of Springer Science+Business Media, LLC
About this protocol
Cite this protocol
Keith, J.M. (2008). Sequence Segmentation. In: Keith, J.M. (eds) Bioinformatics. Methods in Molecular Biology™, vol 452. Humana Press. https://doi.org/10.1007/978-1-60327-159-2_11
Download citation
DOI: https://doi.org/10.1007/978-1-60327-159-2_11
Publisher Name: Humana Press
Print ISBN: 978-1-58829-707-5
Online ISBN: 978-1-60327-159-2
eBook Packages: Springer Protocols