Species Sampling Priors for Modeling Dependence: An Application to the Detection of Chromosomal Aberrations

Bassetti, Federico; Leisen, Fabrizio; Airoldi, Edoardo; Guindani, Michele

doi:10.1007/978-3-319-19518-6_5

Part of the book series: Frontiers in Probability and the Statistical Sciences (FROPROSTAS)

Abstract

We discuss a class of Bayesian nonparametric priors that can be used to model local dependence in a sequence of observations. Many popular Bayesian nonparametric priors can be characterized in terms of exchangeable species sampling sequences. However, in some applications, common exchangeability assumptions may not be appropriate. We discuss a generalization of species sampling sequences, where the weights in the predictive probability functions are allowed to depend on a sequence of independent (not necessarily identically distributed) latent random variables. More specifically, we consider conditionally identically distributed (CID) Pitman-Yor sequences and the Beta-GOS sequences recently introduced by Airoldi et al. (Journal of the American Statistical Association, 109, 1466–1480, 2014). We show how those processes can be used as a prior distribution in a hierarchical Bayes modeling framework, and, in particular, how the Beta-GOS can provide a reasonable alternative to the use of non-homogeneous hidden Markov models, further allowing unsupervised clustering of the observations in an unknown number of states. The usefulness of the approach in biostatistical applications is discussed and explicitly shown for the detection of chromosomal aberrations in breast cancer.
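
As a rough illustration of the generalized species sampling idea summarized above, the following Python sketch simulates a sequence whose predictive weights depend on independent latent Beta variables. The abstract does not state the weights explicitly. The form used here, p_{n,j} = (1 - W_j) W_{j+1} ... W_n for repeating X_j and r_n = W_1 ... W_n for a fresh draw from a base distribution, is an assumption based on the construction in Airoldi et al. (2014). The function names, the Gaussian base distribution, and the Beta parameters are hypothetical choices for illustration only.

```python
import random
from typing import Callable, List, Sequence, Tuple


def gos_weights(w: Sequence[float]) -> Tuple[List[float], float]:
    """Return (p, r) for latent weights w = [W_1, ..., W_n].

    Assumed form (not given in the abstract):
      p[j-1] = (1 - W_j) * W_{j+1} * ... * W_n   probability of repeating X_j
      r      = W_1 * ... * W_n                   probability of a fresh draw
    The entries of p and r sum to one by a telescoping argument.
    """
    n = len(w)
    p = [0.0] * n
    tail = 1.0  # running product W_{j+1} * ... * W_n
    for j in range(n - 1, -1, -1):
        p[j] = (1.0 - w[j]) * tail
        tail *= w[j]
    return p, tail  # after the loop, tail is the product of all W_i


def simulate_sequence(
    n: int,
    a: Sequence[float],
    b: Sequence[float],
    base_draw: Callable[[random.Random], float],
    rng: random.Random,
) -> List[float]:
    """Simulate X_1..X_n with W_i ~ Beta(a[i-1], b[i-1]) drawn independently."""
    if n <= 0:
        return []
    w = [rng.betavariate(a[i], b[i]) for i in range(n)]
    x = [base_draw(rng)]  # X_1 always comes from the base distribution
    for m in range(1, n):
        p, r = gos_weights(w[:m])
        # Index k < m repeats x[k]; index k == m means a new base draw.
        k = rng.choices(range(m + 1), weights=p + [r])[0]
        x.append(base_draw(rng) if k == m else x[k])
    return x


# Hypothetical usage: 20 observations, Beta(1, 1) latent weights,
# standard normal base distribution rounded for readability.
rng = random.Random(0)
n_obs = 20
sample = simulate_sequence(
    n_obs,
    [1.0] * n_obs,
    [1.0] * n_obs,
    lambda r: round(r.gauss(0.0, 1.0), 2),
    rng,
)
print(sample)
```

Under this assumed form, a small value of W_n places most of the predictive mass on repeating the most recent observation, which is what produces runs of identical values and hence local dependence. Because the W_i need not share the same Beta parameters, the tendency to repeat can vary along the sequence.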

References

Airoldi, E., Costa, T., Bassetti, F., Leisen, F., and Guindani, M. (2014). Generalized Species Sampling Priors With Latent Beta Reinforcements. Journal of the American Statistical Association, 109, 1466–1480.
Airoldi, E. M., Anderson, A., Fienberg, S., and Skinner, K. (2006). Who wrote Ronald Reagan’s radio addresses? Bayesian Anal., 1, 289–320.
Baladandayuthapani, V., Ji, Y., Talluri, R., Nieto-Barajas, L. E., and Morris, J. S. (2010). Bayesian random segmentation models to identify shared copy number aberrations for array cgh data. Journal of the American Statistical Association, 105(492), 1358–1375.
Bassetti, F., Crimaldi, I., and Leisen, F. (2010). Conditionally identically distributed species sampling sequences. Adv. in Appl. Probab., 42, 433–459.
Berti, P., Pratelli, L., and Rigo, P. (2004). Limit Theorems for a Class of Identically Distributed Random Variables. Ann. Probab., 32(3), 2029–2052.
Blackwell, D. and MacQueen, J. (1973). Ferguson distributions via Pólya urn schemes. Ann. Statist., 1, 353–355.
Blei, D. and Frazier, P. (2011). Distance dependent Chinese restaurant processes. Journal of Machine Learning Research, 12, 2461–2488.
Cardin, N., Holmes, C., The Wellcome Trust Case Control Consortium, Donnelly, P., and Marchini, J. (2011). Bayesian hierarchical mixture modeling to assign copy number from a targeted CNV array. Genetic Epidemiology, 35(6), 536–548.
Chin, K., DeVries, S., Fridlyand, J., Spellman, P. T., Roydasgupta, R., Kuo, W.-L., Lapuk, A., Neve, R. M., Qian, Z., Ryder, T., Chen, F., Feiler, H., Tokuyasu, T., Kingsley, C., Dairkee, S., Meng, Z., Chew, K., Pinkel, D., Jain, A., Ljung, B. M., Esserman, L., Albertson, D. G., Waldman, F. M., and Gray, J. W. (2006). Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer Cell, 10(6), 529–541.
DeSantis, S. M., Houseman, E. A., Coull, B. A., Louis, D. N., Mohapatra, G., and Betensky, R. A. (2009). A latent class model with hidden markov dependence for array cgh data. Biometrics, 65(4), 1296–1305.
Dewar, M., Wiggins, C., and Wood, F. (2012). Inference in Hidden Markov Models with Explicit State Duration Distributions. Signal Processing Letters, IEEE, 19(4), 235–238.
Du, L., Chen, M., Lucas, J., and Carin, L. (2010). Sticky hidden Markov modelling of comparative genomic hybridization. IEEE Transactions on Signal Processing, 58(10), 5353–5368.
Escobar, M. D. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90, 577–588.
Ferguson, J. D. (1980). Variable duration models for speech. In Proceedings of the Symposium on the Applications of Hidden Markov Models to Text and Speech, pages 143–179.
Fortini, S., Ladelli, L., and Regazzini, E. (2000). Exchangeability, predictive distributions and parametric models. Sankhya, 62(1), 86–109.
Fox, E., Sudderth, E., Jordan, M., and Willsky, A. (2011). A sticky HDP-HMM with application to speaker diarization. Annals of Applied Statistics, 5(2A), 1020–1056.
Guha, S., Li, Y., and Neuberg, D. (2008). Bayesian hidden Markov modelling of array CGH data. Journal of the American Statistical Association, 103, 485–497.
Guindani, M., Müller, P., and Zhang, S. (2009). A Bayesian discovery procedure. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5), 905–925.
Hansen, B. and Pitman, J. (2000). Prediction rules for exchangeable sequences related to species sampling. Statist. Probab. Lett., 46, 251–256.
Heller, R., Stanley, D., Yekutieli, D., Rubin, N., and Benjamini, Y. (2006). Cluster-based analysis of fMRI data. Neuroimage, 33, 599–608.
Hilbe, J. M. (2011). Negative Binomial Regression. Cambridge University Press.
Ishwaran, H. and Zarepour, M. (2003). Random probability measures via Pólya sequences: revisiting the Blackwell-MacQueen urn scheme. Technical report, Arxiv.org.
Ji, Y., Lu, Y., and Mills, G. (2008). Bayesian models based on test statistics for multiple hypothesis testing problems. Bioinformatics, 24, 943–949.
Johnson, M. J. and Willsky, A. S. (2013). Bayesian nonparametric hidden semi-Markov models. J. Mach. Learn. Res., 14(1), 673–701.
Kim, S., Tadesse, M. G., and Vannucci, M. (2006). Variable selection in clustering via dirichlet process mixture models. Biometrika, 93(4), 877–893.
Lee, J., Quintana, F., Müller, P., and Trippa, L. (2008). Defining Predictive Probability Functions for Species Sampling Models. Statist. Sci., 2(209–222).
Lee, J., Müller, P., Zhu, Y., and Ji, Y. (2013). A nonparametric Bayesian model for local clustering with application to proteomics. Journal of the American Statistical Association, 108(503), 775–788.
Lo, A. (1984). On a class of Bayesian nonparametric estimates: I density estimates. Ann. Statist., 12(1), 351–357.
MacEachern, S. N. (1999). Dependent nonparametric processes. In Proceedings of the Section on Bayesian Statistical Science.
MacEachern, S. N. and Müller, P. (1998). Estimating mixtures of Dirichlet process models. Journal of Computational and Graphical Statistics, 7, 223–238.
Marioni, J. C., Thorne, N. P., and Tavaré, S. (2006). Biohmm: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics, 22(9), 1144–1146.
Mitchell, C., Harper, M., and Jamieson, L. (1995). On the complexity of explicit duration hmm’s. Speech and Audio Processing, IEEE Transactions on, 3(3), 213–217.
Müller, P. and Quintana, F. (2010). Random partition models with regression on covariates. Journal of Statistical Planning and Inference, 140(10), 2801–2808.
Müller, P., Parmigiani, G., and Rice, K. (2007). FDR and Bayesian multiple comparisons rules. In J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith, and M. West, editors, Bayesian Statistics 8. Oxford, UK: Oxford University Press.
Neal, R. M. (2000). Markov Chain Sampling Methods for Dirichlet Process Mixture Models. Journal of Computational and Graphical Statistics, 9, 249–265.
Newton, M. A., Noueiry, A., Sarkar, D., and Ahlquist, P. (2004). Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics, 5, 155–176.
Park, J. and Dunson, D. (2010). Bayesian generalized product partition model. Statistica Sinica, 20, 1203–1226.
Pitman, J. (1996). Some developments of the Blackwell-MacQueen urn scheme, volume 30, pages 245–267. Lecture Notes-Monograph Series, Institute of Mathematical Statistics, Hayward, California.
Pitman, J. (2006). Combinatorial Stochastic Processes. Lecture Notes in Mathematics. Springer:Berlin / Heidelberg.
Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286.
Redon, R., Fitzgerald, T., and Carter, N. (2009). Comparative genomic hybridization: DNA labeling, hybridization and detection. In M. Dufva, editor, DNA Microarrays for Biomedical Research, volume 529 of Methods in Molecular Biology, pages 267–278. Humana Press.
Storey, J. D. (2003). The positive false discovery rate: a Bayesian interpretation and the q-value. The Annals of Statistics, 31, 2013–2035.
Storey, J. D. (2007). The optimal discovery procedure for large-scale significance testing, with applications to comparative microarray experiments. Biostatistics, 8, 414–432.
Sun, W., Reich, B. J., Tony Cai, T., Guindani, M., and Schwartzman, A. (2015). False discovery control in large-scale spatial multiple testing. Journal of the Royal Statistical Society Series B, 77, 59–83.
Taramasco, O. and Bauer, S. (2012). RHMM: Hidden Markov models simulations and estimations. Technical report, CRAN.
Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 1566–1581.
Yau, C., Papaspiliopoulos, O., Roberts, G. O., and Holmes, C. (2011). Bayesian non-parametric hidden Markov models with applications in genomics. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(1), 37–57.
Yu, S.-Z. (2010). Hidden semi-markov models. Artificial Intelligence, 174(2), 215–243. Special Review Issue.

Author information

Authors and Affiliations

Dipartimento di Matematica, Università di Pavia, via Ferrata, 1, 27100, Pavia, Italy
Federico Bassetti
School of Mathematics, Statistics and Actuarial Sciences, University of Kent, Cornwallis Building, CT2 7NF, Canterbury, Kent, UK
Fabrizio Leisen
Department of Statistics, Harvard University, 1 Oxford Street, Cambridge, MA, 02138, USA
Edoardo Airoldi
Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, TX, USA
Michele Guindani

Corresponding author

Correspondence to Michele Guindani.

Editor information

Editors and Affiliations

Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, Kentucky, USA
Riten Mitra
Department of Mathematics, University of Texas, Austin, Texas, USA
Peter Müller

About this chapter

Cite this chapter

Bassetti, F., Leisen, F., Airoldi, E., Guindani, M. (2015). Species Sampling Priors for Modeling Dependence: An Application to the Detection of Chromosomal Aberrations. In: Mitra, R., Müller, P. (eds) Nonparametric Bayesian Inference in Biostatistics. Frontiers in Probability and the Statistical Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-19518-6_5

DOI: https://doi.org/10.1007/978-3-319-19518-6_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19517-9
Online ISBN: 978-3-319-19518-6
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)