Consistency of test-based method for selection of variables in high-dimensional two-group discriminant analysis

Abstract

This paper is concerned with the selection of variables in two-group discriminant analysis with a common covariance matrix. We propose a test-based method (TM) that draws on the significance of each variable. Sufficient conditions for the test-based method to be consistent are provided when the dimension and the sample size are large. For the case in which the dimension is larger than the sample size, a ridge-type method is proposed. These results and the tendencies therein are explored numerically through a Monte Carlo simulation. It is pointed out that our selection method can be applied to high-dimensional data.

References

  1. Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csáki (Eds.), 2nd International Symposium on Information Theory (pp. 267–281). Budapest: Akadémiai Kiadó.

  2. Clemmensen, L., Hastie, T., Witten, D. M., & Ersbøll, B. (2011). Sparse discriminant analysis. Technometrics, 53, 406–413.

  3. Fujikoshi, Y. (1985). Selection of variables in two-group discriminant analysis by error rate and Akaike’s information criteria. Journal of Multivariate Analysis, 17, 27–37.

  4. Fujikoshi, Y. (2000). Error bounds for asymptotic approximations of the linear discriminant function when the sample size and dimensionality are large. Journal of Multivariate Analysis, 73, 1–17.

  5. Fujikoshi, Y., & Sakurai, T. (2016). High-dimensional consistency of rank estimation criteria in multivariate linear model. Journal of Multivariate Analysis, 149, 199–212.

  6. Fujikoshi, Y., Ulyanov, V. V., & Shimizu, R. (2010). Multivariate statistics: high-dimensional and large-sample approximations. Hoboken, NJ: Wiley.

  7. Fujikoshi, Y., Sakurai, T., & Yanagihara, H. (2014). Consistency of high-dimensional AIC-type and \(\text{ C }_p\)-type criteria in multivariate linear regression. Journal of Multivariate Analysis, 144, 184–200.

  8. Hao, N., Dong, B., & Fan, J. (2015). Sparsifying the Fisher linear discriminant by rotation. Journal of the Royal Statistical Society: Series B, 77, 827–851.

  9. Hyodo, M., & Kubokawa, T. (2014). A variable selection criterion for linear discriminant rule and its optimality in high dimensional and large sample data. Journal of Multivariate Analysis, 123, 364–379.

  10. Ito, T., & Kubokawa, T. (2015). Linear ridge estimator of high-dimensional precision matrix using random matrix theory. Discussion Paper Series, CIRJE-F-995.

  11. Kubokawa, T., & Srivastava, M. S. (2012). Selection of variables in multivariate regression models for large dimensions. Communications in Statistics - Theory and Methods, 41, 2465–2489.

  12. McLachlan, G. J. (1976). A criterion for selecting variables for the linear discriminant function. Biometrics, 32, 529–534.

  13. Nishii, R., Bai, Z. D., & Krishnaiah, P. R. (1988). Strong consistency of the information criterion for model selection in multivariate analysis. Hiroshima Mathematical Journal, 18, 451–462.

  14. Rao, C. R. (1973). Linear statistical inference and its applications (2nd ed.). New York: Wiley.

  15. Sakurai, T., Nakada, T., & Fujikoshi, Y. (2013). High-dimensional AICs for selection of variables in discriminant analysis. Sankhya, Series A, 75, 1–25.

  16. Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.

  17. Tiku, M. (1985). Noncentral chi-square distribution. In S. Kotz & N. L. Johnson (Eds.), Encyclopedia of Statistical Sciences, vol. 6 (pp. 276–280). New York: Wiley.

  18. Van Wieringen, W. N., & Peeters, C. F. (2016). Ridge estimation of inverse covariance matrices from high-dimensional data. Computational Statistics & Data Analysis, 103, 284–303.

  19. Witten, D. M., & Tibshirani, R. (2011). Penalized classification using Fisher’s linear discriminant. Journal of the Royal Statistical Society: Series B, 73, 753–772.

  20. Yamada, T., Sakurai, T., & Fujikoshi, Y. (2017). High-dimensional asymptotic results for EPMCs of W- and Z-rules. Hiroshima Statistical Research Group, 17–12.

  21. Yanagihara, H., Wakaki, H., & Fujikoshi, Y. (2015). A consistency property of the AIC for multivariate linear models when the dimension and the sample size are large. Electronic Journal of Statistics, 9, 869–897.

  22. Zhao, L. C., Krishnaiah, P. R., & Bai, Z. D. (1986). On determination of the number of signals in presence of white noise. Journal of Multivariate Analysis, 20, 1–25.

Acknowledgements

We thank the two referees for their careful reading of our manuscript and for many helpful comments that improved the presentation of this paper. The first author’s research was partially supported by the Ministry of Education, Science, Sports, and Culture through a Grant-in-Aid for Scientific Research (C), 16K00047, 2016–2018.

Author information

Corresponding author

Correspondence to Yasunori Fujikoshi.

Appendix: Proofs of Theorems 1, 2 and 3

Preliminary lemmas

First, we study distributional results related to the test statistics \(\mathrm{T}_{d,i}\) in (5). For notational simplicity, consider a decomposition of \({\varvec{y}}=({\varvec{y}}_1', {\varvec{y}}_2')', \ {\varvec{y}}_1; \ p_1\times 1, \ {\varvec{y}}_2; \ p_2 \times 1\). Similarly, decompose \(\varvec{\beta }=(\varvec{\beta }_1', \varvec{\beta }_2')'\), and

$$\begin{aligned} {\mathsf {S}}= \left( \begin{array}{cc} {\mathsf {S}}_{11} &{} {\mathsf {S}}_{12} \\ {\mathsf {S}}_{21} &{} {\mathsf {S}}_{22} \end{array} \right) , \quad {\mathsf {S}}_{12}; \ p_1 \times p_2. \end{aligned}$$

Let \(\lambda\) be the likelihood ratio criterion for testing the hypothesis \(\varvec{\beta }_2={\varvec{0}}\); then

$$\begin{aligned} -2 \log \lambda = n \log \left\{ 1 + \frac{g^2 (D^2 - D_1^2)}{n-2 + g^2 D_1^2} \right\} , \end{aligned}$$
(12)

where \(g=\left\{ (n_1n_2)/n\right\} ^{1/2}\). The following lemma (see, e.g., Fujikoshi et al. 2010) is used.

Lemma 1

Let \(D_1\) and \(D\) be the sample Mahalanobis distances based on \({\varvec{y}}_1\) and \({\varvec{y}}\), respectively, and let \(D_{2\cdot 1}^2=D^2-D_1^2\). Similarly, the corresponding population quantities are expressed as \(\varDelta _1\), \(\varDelta\), and \(\varDelta _{2\cdot 1}^2\). Then, it holds that

$$\begin{aligned}&\mathrm{(1)} \ D_1^2=(n-2)g^{-2}R, \quad R=\chi _{p_1}^2(g^2\varDelta _1^2)\left\{ \chi _{n-p_1-1}^2\right\} ^{-1}. \\&\mathrm{(2)} \ D_{2\cdot 1}^2 = (n-2) g^{-2} \chi _{p_2}^2 \left( g^2 \varDelta _{2\cdot 1}^2 \cdot \frac{1}{1+R} \right) \left\{ \chi _{n-p-1}^2\right\} ^{-1} (1+R).\\&\mathrm{(3)} \ \frac{g^2 (D^2 - D_1^2)}{n-2 + g^2 D_1^2} = \chi _{p_2}^2 ( g^2 \varDelta _{2\cdot 1}^2 (1+R)^{-1})\{\chi _{n-p-1}^2\}^{-1} \end{aligned}$$

Here, \(\chi _{p_1}^2(\cdot )\), \(\chi _{n-p_1-1}^2\), \(\chi _{p_2}^2(\cdot )\), and \(\chi _{n-p-1}^2\) are independent chi-square variates.
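As an illustrative check of Lemma 1(3), the following Python sketch compares Monte Carlo draws of the left-hand side, computed from simulated two-group normal data, with independent draws from the chi-square representation on the right-hand side. All parameter values are illustrative, and \(\varSigma =\mathrm{I}_p\) is taken for simplicity.

```python
# Monte Carlo check of Lemma 1(3); illustrative settings, Sigma = I_p for simplicity.
import numpy as np

rng = np.random.default_rng(0)
n1, n2, p, p1 = 40, 40, 6, 3
n, p2 = n1 + n2, p - p1
g2 = n1 * n2 / n                                   # g^2 = (n1 n2)/n
mu = np.array([0.8, 0.5, 0.3, 0.4, 0.0, 0.0])      # mean difference (group 1 - group 2)

reps = 5000
lhs = np.empty(reps)
for r in range(reps):
    y1 = rng.standard_normal((n1, p)) + mu         # group 1: N(mu, I_p)
    y2 = rng.standard_normal((n2, p))              # group 2: N(0,  I_p)
    d = y1.mean(axis=0) - y2.mean(axis=0)
    S = ((n1 - 1) * np.cov(y1.T) + (n2 - 1) * np.cov(y2.T)) / (n - 2)  # pooled covariance
    D2 = d @ np.linalg.solve(S, d)                 # sample Mahalanobis D^2
    d1, S11 = d[:p1], S[:p1, :p1]
    D12 = d1 @ np.linalg.solve(S11, d1)            # D_1^2, based on y_1 only
    lhs[r] = g2 * (D2 - D12) / (n - 2 + g2 * D12)

# Right-hand side of Lemma 1(3), drawn from its chi-square representation.
Delta12 = mu[:p1] @ mu[:p1]                        # Delta_1^2 (since Sigma = I_p)
Delta2_1 = mu @ mu - Delta12                       # Delta_{2.1}^2
R = rng.noncentral_chisquare(p1, g2 * Delta12, reps) / rng.chisquare(n - p1 - 1, reps)
rhs = (rng.noncentral_chisquare(p2, g2 * Delta2_1 / (1 + R), reps)
       / rng.chisquare(n - p - 1, reps))

print(lhs.mean(), rhs.mean())                      # agree up to Monte Carlo error
```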

Related to the conditional distribution of the right-hand side of part (3) of Lemma 1 with \(p_2=1\) and \(m=n-p-1\), consider the random variable defined by

$$\begin{aligned} V=\frac{\chi _1^2(\lambda ^2)}{\chi _m^2}-\frac{1+\lambda ^2}{m-2}, \end{aligned}$$
(13)

where \(\chi _1^2(\lambda ^2)\) and \(\chi _m^2\) are independent. We can express V as

$$\begin{aligned} V=U_1U_2+(m-2)^{-1}U_1+(1+\lambda ^2)U_2, \end{aligned}$$
(14)

in terms of the centered variables \(U_1\) and \(U_2\) defined by

$$\begin{aligned} U_1=\chi _1^2(\lambda ^2) -(1+\lambda ^2), \quad U_2=\frac{1}{\chi _m^2}-\frac{1}{m-2}. \end{aligned}$$
(15)
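The decomposition (14) can be confirmed pathwise; a minimal Python sketch (with arbitrary illustrative values of \(m\) and \(\lambda ^2\)) is as follows.

```python
# Pathwise check that the decomposition (14) reproduces the definition (13).
import numpy as np

rng = np.random.default_rng(1)
m, lam2, reps = 50, 3.0, 100_000                  # illustrative m and lambda^2
x = rng.noncentral_chisquare(1, lam2, reps)       # chi_1^2(lambda^2)
y = rng.chisquare(m, reps)                        # chi_m^2, independent of x

V13 = x / y - (1 + lam2) / (m - 2)                # definition (13)
U1 = x - (1 + lam2)                               # centered numerator, as in (15)
U2 = 1 / y - 1 / (m - 2)                          # centered inverse chi-square, as in (15)
V14 = U1 * U2 + U1 / (m - 2) + (1 + lam2) * U2    # decomposition (14)

assert np.allclose(V13, V14)                      # identical draw by draw
```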

It is well known (see, e.g., Tiku 1985) that

$$\begin{aligned}&\mathrm{E}(U_1)=0,\\&\mathrm{E}(U_1^2)=2(1+2\lambda ^2), \\&\mathrm{E}(U_1^3)=8(1+3\lambda ^2), \\&\mathrm{E}(U_1^4)=48(1+4\lambda ^2)+12(1+2\lambda ^2)^2. \end{aligned}$$

Furthermore,

$$\begin{aligned} \mathrm{E}\left( U_2^k\right) =&\sum _{i=0}^k {}_k C_i \mathrm{E}\left\{ \left( \frac{1}{\chi _m^2}\right) ^i\right\} \left( -\frac{1}{m-2}\right) ^{k-i}\\ =&\sum _{i=1}^k {}_kC_i\frac{1}{(m-2) \cdots (m-2i)}\left( -\frac{1}{m-2}\right) ^{k-i} +\left( -\frac{1}{m-2}\right) ^k. \end{aligned}$$

These give the first four moments of V. In particular, we use the following results.

Lemma 2

Let \(V\) be the random variable defined by (13). Suppose that \(\lambda ^2=\mathrm{O}(m)\). Then

$$\begin{aligned}&\mathrm{E}(V)=0,\quad \mathrm{E}(V^2)=\frac{2\left\{ (m-1)(1+2\lambda ^2)+\lambda ^4\right\} }{(m-2)^2(m-4)}=\mathrm{O}(m^{-1}), \\&\mathrm{E}(V^3)=\mathrm{O}(m^{-2}), \quad \mathrm{E}(V^4)=\mathrm{O}(m^{-2}). \end{aligned}$$
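As a numerical check of Lemma 2, the exact expression for \(\mathrm{E}(V^2)\) can be compared with a Monte Carlo estimate; the sketch below uses illustrative values of \(m\) and \(\lambda ^2\).

```python
# Monte Carlo check of E(V) = 0 and the exact E(V^2) in Lemma 2.
import numpy as np

rng = np.random.default_rng(2)
m, lam2, reps = 40, 5.0, 10**6                    # illustrative m and lambda^2
V = (rng.noncentral_chisquare(1, lam2, reps) / rng.chisquare(m, reps)
     - (1 + lam2) / (m - 2))

exact = 2 * ((m - 1) * (1 + 2 * lam2) + lam2**2) / ((m - 2)**2 * (m - 4))
print(V.mean())                                   # approximately 0
print((V**2).mean(), exact)                       # agree up to Monte Carlo error
```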

Proof of Theorem 1

First, we show “\(\mathrm{[F1]} \rightarrow 0\)”. Let \(i \in j_*\). Then, \((-i) \notin {{{\mathcal {F}}}}_+\), and hence

$$\begin{aligned} \varDelta _{(-i)}^2 < \varDelta ^2, \quad \varDelta _{ \{i\} \cdot (-i)}^2 > 0. \end{aligned}$$

Using (12) and Lemma 1(3), we have

$$\begin{aligned} \mathrm{T}_{d, i}= n \log \left\{ 1 + \frac{\chi _1^2 ( g^2 \varDelta _{\{i\} \cdot (-i)}^2 (1+R_i)^{-1})}{\chi _{n-p-1}^2} \right\} - d, \end{aligned}$$

where \(R_i = \chi _{p-1}^2 (g^2 \varDelta _{(-i)}^2)\left\{ \chi _{n-p}^2\right\} ^{-1}\). Here, since \(j_*\) is finite, by showing

$$\begin{aligned} \mathrm{T}_{d, i} \overset{p}{\rightarrow } t_i > 0 \quad \text {or} \quad \mathrm{T}_{d, i} \overset{p}{\rightarrow } \infty , \end{aligned}$$

we obtain \(P (\mathrm{T}_{d, i} \le 0) \rightarrow 0\), and hence, “\(\mathrm{[F1]} \rightarrow 0\)”. It is easily seen that

$$\begin{aligned} R_i \sim \frac{p+g^2 \varDelta _{(-i)}^2}{n-p}, \end{aligned}$$

where “\(\sim\)” means “asymptotically equivalent”, and hence

$$\begin{aligned} (1+R_i)^{-1} \sim \frac{n-p}{n+g^2\varDelta _{(-i)}^2}. \end{aligned}$$

Therefore, we obtain

$$\begin{aligned} \frac{1}{n} \mathrm{T}_{d, i} \rightarrow \lim \log \left( 1 + \frac{g^2 \varDelta _{\{i\} \cdot (-i)}^2}{n + g^2 \varDelta _{(-i)}^2} \right) > 0, \end{aligned}$$

which implies our assertion.
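To illustrate this limit numerically, the following sketch simulates \(\mathrm{T}_{d, i}/n\) from the chi-square representation above in a frame with \(p/n \rightarrow 1/4\); the values of \(\varDelta _{(-i)}^2\), \(\varDelta _{\{i\} \cdot (-i)}^2\), and the penalty \(d\) are illustrative assumptions.

```python
# T_{d,i}/n, simulated from the representation above, settles near the positive
# limit log(1 + g^2 D_cond / (n + g^2 D_part)); all settings are illustrative.
import numpy as np

rng = np.random.default_rng(3)
reps = 10**5
D_part, D_cond = 1.5, 0.5        # illustrative Delta_{(-i)}^2 and Delta_{{i}.(-i)}^2
for n1 in (50, 200, 800):
    n = 2 * n1
    p = n // 4                   # p/n -> c = 1/4
    g2 = n1 * n1 / n             # g^2 with n1 = n2
    d = n ** 0.3                 # an illustrative penalty with d/n -> 0
    R = (rng.noncentral_chisquare(p - 1, g2 * D_part, reps)
         / rng.chisquare(n - p, reps))
    F = (rng.noncentral_chisquare(1, g2 * D_cond / (1 + R), reps)
         / rng.chisquare(n - p - 1, reps))
    T_over_n = np.log1p(F) - d / n
    print(n, T_over_n.mean(), np.log1p(g2 * D_cond / (n + g2 * D_part)))
```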

Next, we show “\(\mathrm{[F2]} \rightarrow 0\)”. For any \(i \notin j_*\), \(\varDelta ^2=\varDelta _{(-i)}^2\). Therefore, using Lemma 1(3), we have

$$\begin{aligned} \mathrm{T}_{d, i}= n \log \left( 1 + \frac{\chi _1^2}{\chi _{n-p-1}^2} \right) - d, \end{aligned}$$
(16)

whose distribution does not depend on i. Here, \(\chi _1^2\) and \(\chi _{n-p-1}^2\) are independent chi-square variates with 1 and \(n-p-1\) degrees of freedom, respectively. This implies that

$$\begin{aligned} \mathrm{T}_{d,i}> 0 \Leftrightarrow \frac{\chi _1^2}{\chi _{n-p-1}^2} > e^{d/n} - 1. \end{aligned}$$

Noting that \(\mathrm{E}[ \chi _1^2/ \chi _{n-p-1}^2 ] = (n-p-3)^{-1}\), let

$$\begin{aligned} U = \frac{\chi _1^2}{\chi _{n-p-1}^2} - \frac{1}{n-p-3}. \end{aligned}$$

Then, since \(e^{d/n} - 1 - \frac{1}{n-p-3}>h\),

$$\begin{aligned} P ( \mathrm{T}_{d,i}> 0 )&= P \left( U> e^{d/n} - 1 - \frac{1}{n-p-3} \right) \\&\le P \left( U > h \right) . \end{aligned}$$

Furthermore, using the Markov inequality, we have

$$\begin{aligned} P( \mathrm{T}_{d,i}> 0 )&\le P(|U| > h)\\&\le h^{-2\ell } \mathrm{E}(U^{2\ell }), \quad \ell = 1, 2, \ldots \end{aligned}$$

Moreover, it is easily seen that

$$\begin{aligned} \mathrm{E}( U^{2\ell } ) = \mathrm{O}(n^{-2\ell }), \end{aligned}$$

using, e.g., Theorem 16.2.2 in Fujikoshi et al. (2010). When \(h = \mathrm{O}(n^{-a})\),

$$\begin{aligned} h^{-2\ell } \mathrm{E}( U^{2\ell } ) = \mathrm{O}(n^{-2(1-a)\ell }). \end{aligned}$$

Choosing \(\ell\) such that \(\ell > (1-a)^{-1}\), we have “\(\mathrm{[F2]} \rightarrow 0\)”.
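The decay of the noise-variable selection error can also be seen numerically; the sketch below estimates \(P(\mathrm{T}_{d,i} > 0)\) directly from (16), with the illustrative choices \(d = n^{1/2}\) and \(p = n/4\).

```python
# P(T_{d,i} > 0) for a noise variable i, estimated directly from (16);
# d = n^{1/2} and p = n/4 are illustrative choices.
import numpy as np

rng = np.random.default_rng(4)
reps = 10**6
for n in (100, 400, 1600):
    p = n // 4
    d = n ** 0.5
    F = rng.chisquare(1, reps) / rng.chisquare(n - p - 1, reps)
    T = n * np.log1p(F) - d
    print(n, (T > 0).mean())                      # shrinks rapidly with n
```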

Proof of Theorem 2

First, note that Assumption A3 is not used in the proof of “\(\mathrm{[F2]} \rightarrow 0\)” in Theorem 1. This implies the assertion “\(\mathrm{[F2]} \rightarrow 0\)” in Theorem 2.

Now, we show “\(\mathrm{[F1]} \rightarrow 0\)” when \(p_*=\mathrm{O}(p)\) and \(\varDelta ^2=\mathrm{O}(p)\). In this case, \(p_*\) tends to \(\infty\). Based on the proof of Theorem 1, we can express \(\mathrm{T}_{d, i}\) for \(i \in j_*\) as

$$\begin{aligned} \mathrm{T}_{d, i}= n \log \left\{ 1 + \frac{\chi _1^2 ({\widehat{\lambda }}_i^2)}{\chi _{n-p-1}^2} \right\} - d, \end{aligned}$$

where \({\widehat{\lambda }}_i^2=g^2\varDelta _{\{i\} \cdot (-i)}^2 (1+R_i)^{-1}\) and \(R_i = \chi _{p-1}^2 (g^2 \varDelta _{(-i)}^2)\left\{ \chi _{n-p}^2\right\} ^{-1}\). Note that \(\chi _1^2\) and \(\chi _{n-p-1}^2\) are independent of \(R_i\), and hence of \({\widehat{\lambda }}_i^2\). Then, we have

$$\begin{aligned} P(T_{d, i} \le 0)=P({\widehat{V}} \le {\widehat{h}}), \end{aligned}$$
(17)

where

$$\begin{aligned} {\widehat{V}}&= \frac{\chi _1^2 ({\widehat{\lambda }}_i^2)}{\chi _{n-p-1}^2}- \frac{1+{\widehat{\lambda }}_i^2}{n-p-3}, \\ {\widehat{h}}&=e^{d/n}-1-(1+{\widehat{\lambda }}_i^2)/(n-p-3). \end{aligned}$$

Considering the conditional distribution of the right-hand side in (17), we have

$$\begin{aligned} P({\widehat{V}} \le {\widehat{h}})= \mathrm{E}_{{\widehat{\lambda }}^2_i}\left\{ Q({\widehat{\lambda }}^2_i)\right\} , \end{aligned}$$
(18)

where

$$\begin{aligned} Q(\lambda ^2_i)&=P({\widehat{V}} \le {\widehat{h}} \ | \ {\widehat{\lambda }}_i^2=\lambda _i^2) \\&=P({\widetilde{V}} \le {\widetilde{h}}). \end{aligned}$$

Here

$$\begin{aligned} {\widetilde{V}}&= \frac{\chi _1^2 (\lambda _i^2)}{\chi _{n-p-1}^2}- \frac{1+\lambda _i^2}{n-p-3}, \\ {\widetilde{h}}&=e^{d/n}-1-(1+\lambda _i^2)/(n-p-3). \end{aligned}$$

Using Assumption A6, it can be seen that

$$\begin{aligned} {\widehat{\lambda }}_i^2 \sim (1-c)c^{-1}\theta _i^2p \equiv \lambda _{i0}^2,\quad \mathrm{and} \quad {\widehat{\lambda }}_i^2 = \mathrm{O}(p^b). \end{aligned}$$

Now, we consider the probability \(P({\widetilde{V}} \le {\widetilde{h}})\) when \(\lambda _i^2=\lambda _{i0}^2\). From the assumption \(r < b\), \({\widetilde{h}} < 0\) for large n. Therefore, for large n, we have

$$\begin{aligned} P({\widetilde{V}} \le {\widetilde{h}})&\le P(|{\widetilde{V}}| \ge |{\widetilde{h}}|) \\&\le |{\widetilde{h}}|^{-4} \mathrm{E}({\widetilde{V}}^4). \end{aligned}$$

From Lemma 2, \(\mathrm{E}({\widetilde{V}}^4)=\mathrm{O}(n^{-2})\). Noting that \({\widetilde{h}}=\mathrm{O}(n^{-(1-b)})\), we have

$$\begin{aligned} |{\widetilde{h}}|^{-4} \mathrm{E}({\widetilde{V}}^4)=\mathrm{O}(n^{4(1-b)-2}), \end{aligned}$$

whose order is \(\mathrm{O}(n^{-(1+3\delta )})\) if we choose b as \(b > (3/4)(1+\delta )\). Therefore, we have \(P(\mathrm{T}_{d,i} \le 0)=\mathrm{O}(n^{-(1+3\delta )})\), which implies “\(\mathrm{[F1]} \rightarrow 0\)”.
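The order of this bound can be illustrated by simulation; the sketch below estimates \(P(\mathrm{T}_{d,i} \le 0)\) conditionally on \(\lambda _i^2 = n^b\), with the illustrative choices \(b=0.9\), \(r=1/2\), and \(p=n/4\).

```python
# P(T_{d,i} <= 0) for i in j_*, conditionally on lambda_i^2 = n^b;
# b = 0.9, r = 0.5, and p = n/4 are illustrative choices.
import numpy as np

rng = np.random.default_rng(5)
reps = 10**5
b, r = 0.9, 0.5
for n in (50, 200, 800):
    p = n // 4
    lam2, d = n ** b, n ** r
    F = (rng.noncentral_chisquare(1, lam2, reps)
         / rng.chisquare(n - p - 1, reps))
    T = n * np.log1p(F) - d
    print(n, (T <= 0).mean())                     # becomes rare as n grows
```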

Proof of Theorem 3

The assertion “\(\mathrm{[F1]} \rightarrow 0\)” follows from the proof of “\(\mathrm{[F1]} \rightarrow 0\)” in Theorem 1. To prove “\(\mathrm{[F2]} \rightarrow 0\)”, since \(p\) is fixed, it is enough to show that

$$\begin{aligned} \mathrm{T}_{d,i} \rightarrow -\infty \quad \mathrm{for} \ i \notin j_*. \end{aligned}$$

From (16), the limiting distribution of \(\mathrm{T}_{d,i}\) is “\(\chi _1^2-d\)”, and \(d \rightarrow \infty\) under the conditions on \(d\); this implies “\(\mathrm{[F2]} \rightarrow 0\)”.

Cite this article

Fujikoshi, Y., Sakurai, T. Consistency of test-based method for selection of variables in high-dimensional two-group discriminant analysis. Jpn J Stat Data Sci 2, 155–171 (2019). https://doi.org/10.1007/s42081-019-00032-4

Keywords

  • Consistency
  • Discriminant analysis
  • High-dimensional framework
  • Selection of variables
  • Test-based method