Sequence Segmentation

Keith, Jonathan M.

doi:10.1007/978-1-60327-159-2_11


Part of the book series: Methods in Molecular Biology™ (MIMB, volume 452)


Abstract

Whole-genome comparisons among mammalian and other eukaryotic organisms have revealed that they contain large quantities of conserved non-protein-coding sequence. Although some of the functions of this non-coding DNA have been identified, there remains a large quantity of conserved genomic sequence that is of no known function. Moreover, the task of delineating the conserved sequences is non-trivial, particularly when some sequences are conserved in only a small number of lineages. Sequence segmentation is a statistical technique for identifying putative functional elements in genomes based on atypical sequence characteristics, such as conservation levels relative to other genomes, GC content, SNP frequency, and potentially many others. The publicly available program changept and associated programs use Bayesian multiple change-point analysis to delineate classes of genomic segments with similar characteristics, potentially representing new classes of non-coding RNAs (contact web site: http://silmaril.math.sci.qut.edu.au/~keith/).
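As a rough illustration of the idea behind change-point segmentation, the following toy sketch computes a posterior distribution over the position of a single change-point in a binary sequence (1 = aligned match, 0 = mismatch). It is not the changept algorithm, which handles multiple change-points and segment classes by sampling; the function names, the Beta(1, 1) prior, the uniform prior on the change-point position, and the toy data are all assumptions made for this example.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal(segment, a=1.0, b=1.0):
    """Log marginal likelihood of a 0/1 segment whose match probability
    has a Beta(a, b) prior (the probability is integrated out)."""
    ones = sum(segment)
    zeros = len(segment) - ones
    return log_beta(a + ones, b + zeros) - log_beta(a, b)

def change_point_posterior(seq):
    """Posterior over the position k of one change-point, assuming a
    uniform prior on k; seq[:k] and seq[k:] are modelled independently."""
    n = len(seq)
    logs = [log_marginal(seq[:k]) + log_marginal(seq[k:]) for k in range(1, n)]
    top = max(logs)                      # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in logs]
    total = sum(weights)
    return {k: w / total for k, w in zip(range(1, n), weights)}

# Hypothetical data: a highly conserved stretch followed by a divergent one.
seq = [1] * 20 + [0, 1, 0, 0, 1, 0, 0, 0, 1, 0] * 2
post = change_point_posterior(seq)
best = max(post, key=post.get)           # most probable boundary position
print(best, round(post[best], 3))
```

In this sketch the posterior mass concentrates near the boundary between the conserved and divergent stretches. Extending the same marginal-likelihood calculation to an unknown number of change-points is what makes sampling-based methods necessary in practice.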

Acknowledgments

The author thanks Peter Adams for assistance in running simulations; Stuart Stephen for assisting in the development of much of the code; Benjamin Goursaud and Rachel Crehange for assisting in the generalization of the code for multiple data types; and John Mattick, Kerrie Mengersen, Chris Ponting, and Mark Borodovski for helpful discussions. This work was partially funded by Australian Research Council (ARC) Discovery Grants DP0452412 and DP0556631 and a National Health and Medical Research Council (NHMRC) grant entitled “Statistical methods and algorithms for analysis of high-throughput genetics and genomics platforms” (389892).

Author information

Authors and Affiliations

School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia
Jonathan M. Keith

Editor information

Editors and Affiliations

School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia
Jonathan M. Keith PhD

Cite this protocol

Keith, J.M. (2008). Sequence Segmentation. In: Keith, J.M. (eds) Bioinformatics. Methods in Molecular Biology™, vol 452. Humana Press. https://doi.org/10.1007/978-1-60327-159-2_11

DOI: https://doi.org/10.1007/978-1-60327-159-2_11
Publisher Name: Humana Press
Print ISBN: 978-1-58829-707-5
Online ISBN: 978-1-60327-159-2
eBook Packages: Springer Protocols
