Abstract
We discuss a class of Bayesian nonparametric priors that can be used to model local dependence in a sequence of observations. Many popular Bayesian nonparametric priors can be characterized in terms of exchangeable species sampling sequences. In some applications, however, the usual exchangeability assumption may not be appropriate. We discuss a generalization of species sampling sequences in which the weights in the predictive probability functions are allowed to depend on a sequence of independent (not necessarily identically distributed) latent random variables. More specifically, we consider conditionally identically distributed (CID) Pitman-Yor sequences and the Beta-GOS sequences recently introduced by Airoldi et al. (Journal of the American Statistical Association, 109, 1466–1480, 2014). We show how these processes can be used as prior distributions in a hierarchical Bayes modeling framework and, in particular, how the Beta-GOS can provide a reasonable alternative to non-homogeneous hidden Markov models, while also allowing unsupervised clustering of the observations into an unknown number of states. The usefulness of the approach in biostatistical applications is discussed and illustrated with the detection of chromosomal aberrations in breast cancer.
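To make the idea of latent-variable-driven predictive weights concrete, the following is a minimal simulation sketch, not the chapter's implementation. It assumes the predictive rule takes the product form p_{n,j} = (1 - W_j) W_{j+1} ... W_n for repeating the j-th past value and r_n = W_1 ... W_n for drawing a new value from a base measure, with independent W_i ~ Beta(alpha, beta). The constant Beta parameters, the standard normal base measure, and the function names `gos_weights` and `simulate_beta_gos` are illustrative assumptions and should be checked against Airoldi et al. (2014).

```python
import random


def gos_weights(w):
    """Return ([p_1, ..., p_n], r_n) for latent variables w = [W_1, ..., W_n].

    Assumed form: p_j = (1 - W_j) * W_{j+1} * ... * W_n and r_n = W_1 * ... * W_n.
    The weights telescope, so sum(p) + r_n equals 1.
    """
    n = len(w)
    p = [0.0] * n
    tail = 1.0  # running product W_{j+1} * ... * W_n
    for j in range(n - 1, -1, -1):
        p[j] = (1.0 - w[j]) * tail
        tail *= w[j]
    return p, tail  # tail is now W_1 * ... * W_n


def simulate_beta_gos(n, alpha=1.0, beta=1.0, seed=0):
    """Simulate n observations and their cluster labels (illustrative only)."""
    rng = random.Random(seed)
    values, labels, w = [], [], []
    for _ in range(n):
        p, r = gos_weights(w)
        # Index k < len(w) repeats past observation k; index len(w) means "new value".
        choice = rng.choices(range(len(w) + 1), weights=p + [r])[0]
        if choice == len(w):
            values.append(rng.gauss(0.0, 1.0))  # assumed base measure: N(0, 1)
            labels.append(max(labels, default=-1) + 1)
        else:
            values.append(values[choice])
            labels.append(labels[choice])
        # Latent reinforcement attached to the observation just generated.
        w.append(rng.betavariate(alpha, beta))
    return values, labels


if __name__ == "__main__":
    vals, labs = simulate_beta_gos(20, alpha=5.0, beta=1.0, seed=1)
    print(labs)
```

Because recent observations receive weights that are discounted by fewer W factors, large values of W_i favor new clusters while small values favor repeating the most recent observations, which is one way to see how such sequences can induce local dependence.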
References
Airoldi, E., Costa, T., Bassetti, F., Leisen, F., and Guindani, M. (2014). Generalized Species Sampling Priors With Latent Beta Reinforcements. Journal of the American Statistical Association, 109, 1466–1480.
Airoldi, E. M., Anderson, A., Fienberg, S., and Skinner, K. (2006). Who wrote Ronald Reagan’s radio addresses? Bayesian Anal., 1, 289–320.
Baladandayuthapani, V., Ji, Y., Talluri, R., Nieto-Barajas, L. E., and Morris, J. S. (2010). Bayesian random segmentation models to identify shared copy number aberrations for array CGH data. Journal of the American Statistical Association, 105(492), 1358–1375.
Bassetti, F., Crimaldi, I., and Leisen, F. (2010). Conditionally identically distributed species sampling sequences. Adv. in Appl. Probab., 42, 433–459.
Berti, P., Pratelli, L., and Rigo, P. (2004). Limit Theorems for a Class of Identically Distributed Random Variables. Ann. Probab., 32(3), 2029–2052.
Blackwell, D. and MacQueen, J. (1973). Ferguson distributions via Pólya urn schemes. Ann. Statist., 1, 353–355.
Blei, D. and Frazier, P. (2011). Distance dependent Chinese restaurant processes. Journal of Machine Learning Research, 12, 2461–2488.
Cardin, N., Holmes, C., The Wellcome Trust Case Control Consortium, Donnelly, P., and Marchini, J. (2011). Bayesian hierarchical mixture modeling to assign copy number from a targeted CNV array. Genetic Epidemiology, 35(6), 536–548.
Chin, K., DeVries, S., Fridlyand, J., Spellman, P. T., Roydasgupta, R., Kuo, W.-L., Lapuk, A., Neve, R. M., Qian, Z., Ryder, T., Chen, F., Feiler, H., Tokuyasu, T., Kingsley, C., Dairkee, S., Meng, Z., Chew, K., Pinkel, D., Jain, A., Ljung, B. M., Esserman, L., Albertson, D. G., Waldman, F. M., and Gray, J. W. (2006). Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer Cell, 10(6), 529–541.
DeSantis, S. M., Houseman, E. A., Coull, B. A., Louis, D. N., Mohapatra, G., and Betensky, R. A. (2009). A latent class model with hidden Markov dependence for array CGH data. Biometrics, 65(4), 1296–1305.
Dewar, M., Wiggins, C., and Wood, F. (2012). Inference in Hidden Markov Models with Explicit State Duration Distributions. IEEE Signal Processing Letters, 19(4), 235–238.
Du, L., Chen, M., Lucas, J., and Carlin, L. (2010). Sticky hidden Markov modelling of comparative genomic hybridization. IEEE Transactions on Signal Processing, 58(10), 5353–5368.
Escobar, M. D. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90, 577–588.
Ferguson, J. D. (1980). Variable duration models for speech. In Proceedings of the Symposium on the Applications of Hidden Markov Models to Text and Speech, pages 143–179.
Fortini, S., Ladelli, L., and Regazzini, E. (2000). Exchangeability, predictive distributions and parametric models. Sankhya, 62(1), 86–109.
Fox, E., Sudderth, E., Jordan, M., and Willsky, A. (2011). A sticky HDP-HMM with application to speaker diarization. Annals of Applied Statistics, 5(2A), 1020–1056.
Guha, S., Li, Y., and Neuberg, D. (2008). Bayesian hidden Markov modelling of array CGH data. Journal of the American Statistical Association, 103, 485–497.
Guindani, M., Müller, P., and Zhang, S. (2009). A Bayesian discovery procedure. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5), 905–925.
Hansen, B. and Pitman, J. (2000). Prediction rules for exchangeable sequences related to species sampling. Statist. Probab. Lett., 46, 251–256.
Heller, R., Stanley, D., Yekutieli, D., Rubin, N., and Benjamini, Y. (2006). Cluster-based analysis of fMRI data. Neuroimage, 33, 599–608.
Hilbe, J. M. (2011). Negative Binomial Regression. Cambridge University Press.
Ishwaran, H. and Zarepour, M. (2003). Random probability measures via Pólya sequences: revisiting the Blackwell-MacQueen urn scheme. Technical report, Arxiv.org.
Ji, Y., Lu, Y., and Mills, G. (2008). Bayesian models based on test statistics for multiple hypothesis testing problems. Bioinformatics, 24, 943–949.
Johnson, M. J. and Willsky, A. S. (2013). Bayesian nonparametric hidden semi-Markov models. J. Mach. Learn. Res., 14(1), 673–701.
Kim, S., Tadesse, M. G., and Vannucci, M. (2006). Variable selection in clustering via Dirichlet process mixture models. Biometrika, 93(4), 877–893.
Lee, J., Quintana, F., Müller, P., and Trippa, L. (2008). Defining Predictive Probability Functions for Species Sampling Models. Statist. Sci., 2, 209–222.
Lee, J., Müller, P., Zhu, Y., and Ji, Y. (2013). A nonparametric Bayesian model for local clustering with application to proteomics. Journal of the American Statistical Association, 108(503), 775–788.
Lo, A. (1984). On a class of Bayesian nonparametric estimates: I density estimates. Ann. Statist., 12(1), 351–357.
MacEachern, S. N. (1999). Dependent nonparametric processes. In Proceedings of the Section on Bayesian Statistical Science.
MacEachern, S. N. and Müller, P. (1998). Estimating mixtures of Dirichlet process models. Journal of Computational and Graphical Statistics, 7, 223–238.
Marioni, J. C., Thorne, N. P., and Tavaré, S. (2006). Biohmm: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics, 22(9), 1144–1146.
Mitchell, C., Harper, M., and Jamieson, L. (1995). On the complexity of explicit duration HMM’s. IEEE Transactions on Speech and Audio Processing, 3(3), 213–217.
Müller, P. and Quintana, F. (2010). Random partition models with regression on covariates. Journal of Statistical Planning and Inference, 140(10), 2801–2808.
Müller, P., Parmigiani, G., and Rice, K. (2007). FDR and Bayesian multiple comparisons rules. In J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith, and M. West, editors, Bayesian Statistics 8. Oxford, UK: Oxford University Press.
Neal, R. M. (2000). Markov Chain Sampling Methods for Dirichlet Process Mixture Models. Journal of Computational and Graphical Statistics, 9, 249–265.
Newton, M. A., Noueiry, A., Sarkar, D., and Ahlquist, P. (2004). Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics, 5, 155–176.
Park, J. and Dunson, D. (2010). Bayesian generalized product partition model. Statistica Sinica, 20, 1203–1226.
Pitman, J. (1996). Some developments of the Blackwell-MacQueen urn scheme, volume 30, pages 245–267. Lecture Notes-Monograph Series, Institute of Mathematical Statistics, Hayward, California.
Pitman, J. (2006). Combinatorial Stochastic Processes. Lecture Notes in Mathematics. Springer, Berlin/Heidelberg.
Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286.
Redon, R., Fitzgerald, T., and Carter, N. (2009). Comparative genomic hybridization: DNA labeling, hybridization and detection. In M. Dufva, editor, DNA Microarrays for Biomedical Research, volume 529 of Methods in Molecular Biology, pages 267–278. Humana Press.
Storey, J. D. (2003). The positive false discovery rate: a Bayesian interpretation and the q-value. The Annals of Statistics, 31, 2013–2035.
Storey, J. D. (2007). The optimal discovery procedure for large-scale significance testing, with applications to comparative microarray experiments. Biostatistics, 8, 414–432.
Sun, W., Reich, B. J., Tony Cai, T., Guindani, M., and Schwartzman, A. (2015). False discovery control in large-scale spatial multiple testing. Journal of the Royal Statistical Society Series B, 77, 59–83.
Taramasco, O. and Bauer, S. (2012). RHMM: Hidden Markov models simulations and estimations. Technical report, CRAN.
Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 1566–1581.
Yau, C., Papaspiliopoulos, O., Roberts, G. O., and Holmes, C. (2011). Bayesian non-parametric hidden Markov models with applications in genomics. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(1), 37–57.
Yu, S.-Z. (2010). Hidden semi-Markov models. Artificial Intelligence, 174(2), 215–243. Special Review Issue.
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Bassetti, F., Leisen, F., Airoldi, E., Guindani, M. (2015). Species Sampling Priors for Modeling Dependence: An Application to the Detection of Chromosomal Aberrations. In: Mitra, R., Müller, P. (eds) Nonparametric Bayesian Inference in Biostatistics. Frontiers in Probability and the Statistical Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-19518-6_5
DOI: https://doi.org/10.1007/978-3-319-19518-6_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19517-9
Online ISBN: 978-3-319-19518-6
eBook Packages: Mathematics and Statistics (R0)