Science China Mathematics, Volume 62, Issue 5, pp 979–998

A theoretic study of a distance-based regression model

  • Jialu Li
  • Wei Zhang
  • Sanguo Zhang
  • Qizhai Li (corresponding author)


The distance-based regression model is widely used to analyze multivariate responses in fields such as ecology, genomics, genetics, human microbiomics, and neuroimaging. It yields a pseudo F test statistic that assesses the relation between the distance (dissimilarity) among subjects and the predictors of interest. Despite the model's popularity in recent decades, the statistical properties of the pseudo F test statistic have not, to our knowledge, been established. This study derives the asymptotic properties of the pseudo F test statistic using spectral decomposition under the matrix normal assumption, when the dissimilarity measure is the Euclidean or Mahalanobis distance. With the Euclidean distance, the pseudo F test statistic has the same distribution as the quotient of two Chi-squared-type mixtures. The numerator and denominator of the quotient are each approximated by a random variable of the form \(\xi\chi_d^2+\eta\), and a bound on the approximation error is given. With the Mahalanobis distance, the pseudo F test statistic follows an F distribution. In simulation studies, the approximate distribution closely matched the “exact” distribution obtained by the permutation procedure. The derived distribution was further validated on H1N1 influenza data, aging human brain data, and embryonic imprint data.
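For readers unfamiliar with the statistic, the following is a minimal sketch of how a pseudo F test statistic and its permutation p-value are commonly computed. It assumes the standard Gower-centering formulation of McArdle and Anderson, which the abstract does not spell out and which may differ in detail from the paper's own. The function names (euclidean_distance_matrix, gower_center, pseudo_f, permutation_pvalue) and the parameters n_perm and seed are illustrative choices, not taken from the paper.

```python
import numpy as np


def euclidean_distance_matrix(Y):
    """Pairwise Euclidean distances between the rows of Y (n subjects)."""
    Y = np.asarray(Y, dtype=float).reshape(len(Y), -1)
    return np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)


def gower_center(D):
    """Gower-centered matrix G = J A J with A = -D^2 / 2 and J = I - 11'/n."""
    n = D.shape[0]
    A = -0.5 * D ** 2
    J = np.eye(n) - np.ones((n, n)) / n
    return J @ A @ J


def pseudo_f(D, X):
    """Pseudo F statistic for distance matrix D and predictors X.

    Assumed form: [tr(HGH) / (m - 1)] / [tr((I-H)G(I-H)) / (n - m)],
    where H projects onto the span of an intercept plus X and m is the
    number of design columns including the intercept.
    """
    n = D.shape[0]
    X = np.asarray(X, dtype=float).reshape(n, -1)
    X1 = np.column_stack([np.ones(n), X])  # add intercept column
    m = X1.shape[1]
    H = X1 @ np.linalg.pinv(X1)            # projection (hat) matrix
    R = np.eye(n) - H                      # residual projection
    G = gower_center(D)
    numerator = np.trace(H @ G @ H) / (m - 1)
    denominator = np.trace(R @ G @ R) / (n - m)
    return numerator / denominator


def permutation_pvalue(D, X, n_perm=999, seed=0):
    """Permutation p-value: shuffle the rows of X and recompute pseudo F."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = np.asarray(X, dtype=float).reshape(n, -1)
    f_obs = pseudo_f(D, X)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(n)
        if pseudo_f(D, X[perm]) >= f_obs:
            exceed += 1
    # The +1 terms count the observed arrangement as one permutation.
    return f_obs, (exceed + 1) / (n_perm + 1)
```

With the Euclidean distance, tr(HGH) and tr((I-H)G(I-H)) reduce to the regression and residual sums of squares of the centered responses, so for a single predictor this pseudo F coincides with the classical F ratio summed over response columns. The permutation loop is the costly step that an approximate null distribution of the kind described in the abstract would replace.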


distance-based regression, Euclidean, pseudo F test statistic, Mahalanobis


62H15 62J99 





This work was supported by the National Natural Science Foundation of China (Grant No. 11722113). The authors thank the anonymous reviewers for their insightful comments, which improved the manuscript substantially.



Copyright information

© Science China Press and Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. School of Mathematics and Statistics, Beijing Institute of Technology, Beijing, China
  2. Biostatistics and Bioinformatics Branch, National Institute of Child Health and Human Development, Bethesda, USA
  3. School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
  4. Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, China
  5. LSC, NCMIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
