1 Introduction

In the field of machine learning, multi-modal learning, which models relationships among multiple modalities, has been actively studied. One challenge of multi-modal learning is to construct a predictive model from a set of multiple modalities to a certain modality. We call the modality to be predicted the target modality and the modalities used to predict it the source modalities. Canonical correlation analysis (CCA) [9] is a representative model for estimating the relationships between different modalities, and prior studies have actually used CCA for prediction [7]. Also, note that we call the set of modalities collected from a single object an element, and we refer to a group of elements as a sample.

Fig. 1.

An example of a real sample missing modalities. A check mark (\(\checkmark \)) means that the modality is provided and \(\times \) means that it is not provided. In this figure, only Element 3 provides all modalities and the others have missing modalities in various patterns. GBCCA-M2 can utilize all elements in learning and predict a target modality from source modalities.

There are two challenges in building such a model. First, some modalities may be missing for various reasons. For example, in purchaser behavior prediction, people often refuse to provide some modalities out of privacy concerns. As Fig. 1 shows, there are various patterns of missing modalities, which makes the problem more difficult. Second, the features of modalities are likely to be high dimensional and noisy, a situation that arises when we collect a large amount of information. To deal with these challenges, we propose a method called Generalized Bayesian Canonical Correlation Analysis with Missing Modalities (GBCCA-M2). This method can learn relationships among different modalities while utilizing incomplete sets of modalities by including them in the likelihood function. This study is motivated by previous works [13, 26] that utilized incomplete sets of modalities. These works learn the relationships between two modalities, whereas our method can deal with more than two. In addition, our method works well on high dimensional and noisy modalities thanks to the prior knowledge incorporated into the parameters of the model: a prior that controls the sparsity of the weight parameters linking each latent variable to the modalities, which makes the model robust to high dimensional and noisy features. The main contributions of this paper are as follows:

  • We propose Generalized Bayesian Canonical Correlation Analysis with Missing Modalities (GBCCA-M2) which is a learning model that can account for elements with missing modalities in the likelihood function.

  • Through an experiment using artificial data, we demonstrate that GBCCA-M2 improves prediction performance when using elements with missing modalities, and it is effective for high dimensional and noisy modalities.

  • Through an experiment using real data, we demonstrate that GBCCA-M2 is more effective for predicting purchaser behavior and retrieving images from English and Japanese sentences than existing methods.

2 Related Work

Much research on multi-modal learning has shown, across various tasks, that performance can be improved by using multiple source modalities rather than only one [10, 25]. CCA [9] is a method that learns relationships between two modalities. Given pairs of modalities, the model learns to project them into a latent space where they are maximally correlated. CCA has many variants, such as Generalized CCA [5], kernel extensions [1, 15], probabilistic CCA (PCCA) [3], and Bayesian CCA (BCCA) [14, 23]. Generalized CCA extends CCA to capture relationships among more than two modalities. Probabilistic models such as PCCA and BCCA incorporate prior knowledge into their parameters and can learn relationships between high dimensional and noisy features of modalities. We explain the details of BCCA in Sect. 3. A difficulty in learning relationships between paired modalities is that it is often expensive to collect a large number of pairs. In reality, the number of paired modalities is limited, whereas unpaired modalities may be accessible. To overcome this problem, extensions of CCA for semi-supervised learning have been proposed (e.g., Semi CCA [13], Semi PCCA [11, 26]). Semi PCCA can deal with elements that are missing modalities by describing a likelihood for them. However, it can only deal with the case where one of two modalities is missing. Therefore, we next introduce methods used for general missing-data analysis.

Statistical analysis of missing data is roughly classified into three types [17]: (1) complete case analysis (CC) [12, 20], (2) imputation of missing values, and (3) describing a likelihood for use with missing data. CC is simple, but elements with missing values are not utilized. Imputation of missing values includes mean imputation, regression imputation, and multiple imputation. Methods for completing missing values with an autoencoder [8] have also been developed, and extensions such as the Cascaded Residual Autoencoder [22] address the case where whole modalities are missing. Since imputation methods mainly assume that missing values occur at random, they are not suitable for the case of missing modalities. As for studies on describing a likelihood for use with missing data [6, 18, 26], these methods are known to rest on looser assumptions than CC and imputation of missing values. However, they merely estimate the distribution of the data or regress missing values. Although a regression can be performed using parameters learned by Semi PCCA, this is not suitable for multi-modal learning with missing modalities. In our experiments, we use CC and mean imputation to construct spuriously complete data for comparison of methods.

3 Generalized Bayesian Canonical Correlation Analysis with Missing Modalities

Since our proposed method is motivated by Bayesian CCA (BCCA) [14, 23] and Semi CCA [13], we will first review these two methods separately.

Fig. 2.

Graphical illustration of the BCCA model as a plate diagram. The shaded nodes indicate the two observed variables, and the other nodes indicate the model parameters to be estimated. The latent variable \(\varvec{z}\) captures the correlation between \(\varvec{x}_1\) and \(\varvec{x}_2\).

3.1 Bayesian Canonical Correlation Analysis

BCCA [14, 23] adapts a hierarchical Bayesian model to CCA [9]. Fujiwara et al. [7] proposed a new BCCA model to reconstruct images from human brain activity. As shown in Fig. 2, the model captures the relationships between two modalities. In the model, modalities \({\varvec{x}}_i\in \mathbb {R}^{d_i}\,(i=1,2)\) are generated by common latent variables \({\varvec{z}}\in \mathbb {R}^{d_z}\), \(d_z\le \min _i(d_i)\), and weight matrices \({\varvec{W}}_i\in \mathbb {R}^{d_i\times d_z}\), where \(d_i\) and \(d_z\) represent the dimensions of the modalities and latent variables, respectively. In addition, the weight matrices are controlled by parameters \(\varvec{\alpha }_i\in \mathbb {R}^{d_i\times d_z}\). The likelihood of the modalities is

$$\begin{aligned} P(\varvec{x}_i|\varvec{W}_i,\varvec{z})\propto \exp \Biggl (-\frac{1}{2}\sum _{n=1}^{N}\left( \varvec{x}_i(n)-\varvec{W}_i\varvec{z}(n)\right) ^{\mathrm {T}}\beta _{i}\left( \varvec{x}_i(n)-\varvec{W}_i\varvec{z}(n)\right) \Biggr ), \end{aligned}$$
(1)

where \(\beta _i\in \mathbb {R}\) is a scalar precision, so that \(\beta _i^{-1}\mathrm{\varvec{I}}_{d_i}\) represents the covariance of the Gaussian distribution, \(\mathrm{\varvec{I}}_d\) represents a \(d \times d\) identity matrix, and N represents the sample size. The prior distribution of the latent variables is

$$\begin{aligned} P_0(\varvec{z})\propto \exp \left( -\frac{1}{2}\sum _{n=1}^{N}\Vert \varvec{z}(n)\Vert ^{2}\right) . \end{aligned}$$
(2)

Latent variables are generated from the Gaussian distribution whose mean is \(\varvec{0}\) and whose covariance is \(\mathrm{\varvec{I}}\). The prior distribution of weight matrices is

$$\begin{aligned} P_0(\varvec{W}_i|\varvec{\alpha }_i)\propto \exp \left( -\frac{1}{2}\sum _{s=1}^{d_i}\sum _{t=1}^{d_z}\alpha _{i_{(s,t)}}W_{i_{(s,t)}}^2\right) . \end{aligned}$$
(3)

The (s, t) element of each weight matrix is generated from a Gaussian distribution with mean zero and variance \(\alpha _{i_{(s,t)}}^{-1}\). The weight matrices are controlled by the hyper-parameters \(\varvec{\alpha }_i\), whose hyper-prior distribution is

$$\begin{aligned} P_0(\varvec{\alpha }_i)=\prod _{s=1}^{d_i}\prod _{t=1}^{d_z}\mathcal {G}(\alpha _{i_{(s,t)}}|\overline{\alpha }_{i_{(s,t)}},\gamma _{i_{(s,t)}}), \end{aligned}$$
(4)

where \(\mathcal {G}(\alpha |\overline{\alpha },\gamma )\) is the Gamma distribution with mean \(\overline{\alpha }\) and confidence parameter \(\gamma \). This probability model (Eqs. (3) and (4)) is known as automatic relevance determination (ARD) [19], which drives unnecessary components to zero. The prior distribution of the observation noise \(\beta _i\) is

$$\begin{aligned} P_0(\beta _i)=\frac{1}{\beta _i}, \end{aligned}$$
(5)

which is a non-informative prior. The parameters are estimated by variational Bayesian inference [2], and the predictive distribution of the target modality is derived using these estimated parameters (Fig. 3).
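To make the generative structure concrete, the following minimal NumPy sketch samples data from the BCCA model in Fig. 2. All variable names are ours, and for brevity the ARD precisions \(\varvec{\alpha }_i\) are fixed constants rather than draws from the Gamma hyper-prior of Eq. (4).

```python
import numpy as np

rng = np.random.default_rng(0)

# Two modalities x1, x2 generated from a shared latent variable z.
d1, d2, d_z, N = 30, 40, 5, 200

# Fixed ARD precisions (in the full model they follow Eq. (4)).
alpha1 = np.full((d1, d_z), 2.0)
alpha2 = np.full((d2, d_z), 2.0)

# W_i(s, t) ~ N(0, alpha_i(s, t)^{-1})  -- Eq. (3)
W1 = rng.normal(0.0, 1.0 / np.sqrt(alpha1))
W2 = rng.normal(0.0, 1.0 / np.sqrt(alpha2))

# z(n) ~ N(0, I)  -- Eq. (2)
Z = rng.normal(size=(N, d_z))

# x_i(n) ~ N(W_i z(n), beta_i^{-1} I)  -- Eq. (1)
beta1, beta2 = 10.0, 10.0
X1 = Z @ W1.T + rng.normal(scale=1.0 / np.sqrt(beta1), size=(N, d1))
X2 = Z @ W2.T + rng.normal(scale=1.0 / np.sqrt(beta2), size=(N, d2))
```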

Fig. 3.

An example of spatial estimation by Semi CCA. By using unpaired elements, we can estimate a direction closer to the one that would be estimated if all the unpaired elements had been paired than by using only paired elements.

3.2 Semi Canonical Correlation Analysis

Semi CCA [13] extends CCA to a semi-supervised method by combining CCA and principal component analysis (PCA). We denote the group of elements whose modalities are paired by P, the group whose modalities are not paired by U, and the sample covariance matrices by \(\varvec{\varSigma }\)s. The solution of Semi CCA is obtained by solving the following generalized eigenvalue problem:

$$\begin{aligned} B \begin{pmatrix} \varvec{w}_1\\ \varvec{w}_2 \end{pmatrix}&=\lambda C \begin{pmatrix} \varvec{w}_1\\ \varvec{w}_2 \end{pmatrix},\end{aligned}$$
(6)
$$\begin{aligned} B = \beta \begin{pmatrix} \varvec{0}&{}\varvec{\varSigma }_{12}^{(P)}\\ \varvec{\varSigma }_{21}^{(P)}&{}\varvec{0} \end{pmatrix}&+\left( 1 -\beta \right) \begin{pmatrix} \varvec{\varSigma }_{11}^{(P+U)}&{}\varvec{0}\\ \varvec{0}&{}\varvec{\varSigma }_{22}^{(P+U)} \end{pmatrix},\end{aligned}$$
(7)
$$\begin{aligned} C =\beta \begin{pmatrix} \varvec{\varSigma }_{11}^{(P)} &{} \varvec{0}\\ \varvec{0} &{} \varvec{\varSigma }_{22}^{(P)} \end{pmatrix}&+\left( 1 -\beta \right) \begin{pmatrix} \mathrm{\varvec{I}}_{d_1}&{}\varvec{0}\\ \varvec{0}&{}\mathrm{\varvec{I}}_{d_2} \end{pmatrix}. \end{aligned}$$
(8)

Here \(\beta \) represents the contribution ratio of CCA to PCA. In a similar spirit, we introduce contribution rates for elements with missing modalities into GBCCA-M2. Semi CCA has also been extended to probabilistic models, such as Probabilistic Semi CCA [11, 26]. However, these are not suitable for high dimensional and noisy features of modalities because they do not assume sparsity of the weight matrices during learning, which causes overfitting.
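For reference, the following Python sketch solves the generalized eigenvalue problem of Eqs. (6)-(8) with SciPy. The covariance estimators, variable names, and the choice of k are our assumptions, not the original implementation.

```python
import numpy as np
from scipy.linalg import eigh

def semi_cca(X1p, X2p, X1u, X2u, beta=0.5, k=2):
    """Semi CCA sketch: X1p/X2p are paired (rows aligned), X1u/X2u unpaired."""
    d1, d2 = X1p.shape[1], X2p.shape[1]
    n_p = X1p.shape[0]
    # Cross-covariance on paired elements only.
    S12 = (X1p - X1p.mean(0)).T @ (X2p - X2p.mean(0)) / n_p
    # Within-covariances on all elements (P + U) for the PCA term,
    # and on paired elements only (P) for the CCA term.
    S11_all = np.cov(np.vstack([X1p, X1u]), rowvar=False)
    S22_all = np.cov(np.vstack([X2p, X2u]), rowvar=False)
    S11_p = np.cov(X1p, rowvar=False)
    S22_p = np.cov(X2p, rowvar=False)
    Z = np.zeros((d1, d2))
    B = beta * np.block([[np.zeros((d1, d1)), S12], [S12.T, np.zeros((d2, d2))]]) \
        + (1 - beta) * np.block([[S11_all, Z], [Z.T, S22_all]])       # Eq. (7)
    C = beta * np.block([[S11_p, Z], [Z.T, S22_p]]) \
        + (1 - beta) * np.eye(d1 + d2)                                # Eq. (8)
    vals, vecs = eigh(B, C)            # generalized eigenproblem, Eq. (6)
    top = np.argsort(vals)[::-1][:k]   # k leading directions
    return vecs[:d1, top], vecs[d1:, top]
```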

3.3 Generalized Bayesian Canonical Correlation Analysis with Missing Modalities

As mentioned in Sect. 1, some modalities are missing and their features are high dimensional and noisy. Considering these characteristics, our method needs the following functions: (F1) dealing with various patterns of missing modalities, (F2) dealing with more than two different modalities, and (F3) highly accurate prediction for high dimensional and noisy features of modalities. BCCA meets F3, so we extend it to meet F1 and F2 in the proposed GBCCA-M2. We construct the model of GBCCA-M2 with the following considerations: (1) the number of modalities can exceed two, and all modalities are generated from common latent variables, and (2) the contribution rates to the likelihood are changed according to how many modalities are missing. The graphical model of GBCCA-M2 is shown in Fig. 4. We now introduce the likelihood and prior distributions of GBCCA-M2, parameter estimation by variational Bayesian inference, and the prediction of the target modality using source modalities and estimated parameters.

Fig. 4.

Graphical illustration of the GBCCA-M2 model as a plate diagram. Each element used in learning has two or more modalities, with various missing patterns. The shaded nodes \(\varvec{x}_i\) indicate the observed variables. The latent variable \(\varvec{z}\) captures the correlations among the \(\varvec{x}_i\)s.

The Likelihood and Prior Distribution: The likelihood of modalities is

$$\begin{aligned} P\left( \varvec{x}_i|\varvec{W}_i, \varvec{z}\right)&=\prod _{m=1}^{M}P\left( \varvec{x}^{(m)}_i|\varvec{W}_i,\varvec{z}^{(m)}\right) ^{\eta _m}\end{aligned}$$
(9)
$$\begin{aligned} P\left( \varvec{x}_i^{(m)}|\varvec{W}_i,\varvec{z}^{(m)}\right)&\propto \exp \Biggl (-\frac{1}{2}\sum _{n=1}^{N^{(m)}_i}\left( \varvec{x}_i^{(m)}(n)-\varvec{W}_i\varvec{z}^{(m)}(n)\right) ^{\mathrm {T}}\beta _i\left( \varvec{x}_i^{(m)}(n)-\varvec{W}_i\varvec{z}^{(m)}(n)\right) \Biggr ), \end{aligned}$$
(10)

where \(\varvec{x}_i^{(m)}\) represents the i-th modality of an element that has m observed modalities, M represents the number of modalities, and \(N_i^{(m)}\) represents the number of elements that have m observed modalities and whose i-th modality is observed. Moreover, we introduce contribution rates \(\eta _m\) for elements with missing modalities into the likelihood function and change them according to the degree of missingness. In particular, the more modalities are missing, the smaller the contribution rate should be (\(\eta _1<\eta _2<\eta _3<\cdots \)), and the more elements with missing modalities there are, the smaller the contribution rates should be, as reflected in Fig. 5. These rates allow us to utilize elements with missing modalities properly. As with BCCA, the prior distributions and the hyper-prior distribution of each parameter are as follows:

$$\begin{aligned} P_0\left( \varvec{z}^{(m)}\right)&\propto \exp \left( -\frac{1}{2}\sum _{n=1}^{N^{(m)}}\Vert \varvec{z}^{(m)}(n)\Vert ^{2}\right) ,\end{aligned}$$
(11)
$$\begin{aligned} P_0\left( \varvec{W}_i|\varvec{\alpha }_i\right)&\propto \exp \left( -\frac{1}{2}\sum _{s=1}^{d_i}\sum _{t=1}^{d_z}\alpha _{i_{(s,t)}}W_{i_{(s,t)}}^2\right) ,\end{aligned}$$
(12)
$$\begin{aligned} P_0(\varvec{\alpha }_i)&=\prod _{s=1}^{d_i}\prod _{t=1}^{d_z}\mathcal {G}\left( \alpha _{i_{(s,t)}}|\overline{\alpha }_{i_{(s,t)}},\gamma _{i_{(s,t)}}\right) ,\end{aligned}$$
(13)
$$\begin{aligned} P_0\left( \beta _i\right)&=\frac{1}{\beta _i}. \end{aligned}$$
(14)
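As a concrete illustration of the \(\eta _m\)-weighted likelihood of Eqs. (9) and (10), the following sketch evaluates its logarithm (up to additive constants) for one modality i; the data layout and names are our assumptions.

```python
import numpy as np

def weighted_loglik(X, Z, W, beta, eta):
    """Log of Eqs. (9)-(10) for modality i, up to constants.

    X[m]: (N_i^{(m)}, d_i) observations of modality i among elements with m
          observed modalities (None if modality i is absent in that group);
    Z[m]: matching latent means, shape (N_i^{(m)}, d_z);
    beta: scalar noise precision beta_i; eta[m]: contribution rate eta_m.
    """
    ll = 0.0
    for m, X_m in X.items():
        if X_m is None:
            continue
        resid = X_m - Z[m] @ W.T            # x_i^{(m)}(n) - W_i z^{(m)}(n)
        ll -= 0.5 * eta[m] * beta * np.sum(resid ** 2)
    return ll
```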

Parameter Estimation by Variational Bayesian Inference: Given the likelihood (Eqs. (9) and (10)), the prior distributions (Eqs. (11), (12) and (14)), and the hyper-prior distribution (Eq. (13)), the weight matrices are estimated as the posterior distribution \(P(\varvec{W}_1,\cdots ,\varvec{W}_M|\varvec{x}_1,\cdots ,\varvec{x}_M)\). This posterior distribution is obtained by marginalizing the joint posterior distribution over the latent variables and the variance parameters \(\varvec{\alpha }_i,\beta _i\) as follows:

$$\begin{aligned} \begin{aligned} P\left( \varvec{W}_1,\cdots ,\varvec{W}_M|\varvec{x}_1,\cdots ,\varvec{x}_M\right)&=\int \mathrm{{d}}\varvec{z}\mathrm{{d}}\varvec{\alpha }_1\cdots \mathrm{{d}}\varvec{\alpha }_M\mathrm{{d}}\beta _1\cdots \mathrm{{d}} \beta _M\\ P(\varvec{W}_1,\cdots ,\varvec{W}_M,&\varvec{z},\varvec{\alpha }_1,\cdots ,\varvec{\alpha }_M,\beta _1,\cdots ,\beta _M|\varvec{x}_1,\cdots ,\varvec{x}_M). \end{aligned} \end{aligned}$$
(15)

This joint posterior distribution cannot be calculated analytically, so it is approximated by a trial distribution with the following factorization, based on variational Bayesian inference:

$$\begin{aligned} \begin{aligned}&Q\left( \varvec{W}_1,\cdots ,\varvec{W}_M,\varvec{z},\varvec{\alpha }_1,\cdots ,\varvec{\alpha }_M,\beta _1,\cdots ,\beta _M\right) \\ =&\; Q_W(\varvec{W}_1)\cdots Q_W(\varvec{W}_M)Q_z(\varvec{z})Q_{\alpha }(\varvec{\alpha }_1,\cdots ,\varvec{\alpha }_M,\beta _1,\cdots ,\beta _M). \end{aligned} \end{aligned}$$
(16)

The trial distribution of weight matrices \(Q_W\left( \varvec{W}_i\right) \) is

$$\begin{aligned} Q_W\left( \varvec{W}_i\right)&=\prod _{s=1}^{d_i}\prod _{t=1}^{d_z}\mathcal {N}\left( W_{i_{(s,t)}}|\overline{W}_{i_{(s,t)}},\sigma _{i_{(s,t)}}^{-1}\right) ,\end{aligned}$$
(17)
$$\begin{aligned} \overline{W}_{i_{(s,t)}}&=\overline{\beta }_i\sigma _{i_{(s,t)}}^{-1}\sum _{m=1}^{M}\left( \eta _m\cdot \sum _{n=1}^{N_i^{(m)}} x_{i_s}^{(m)}(n)z_t^{(m)}(n)\right) ,\end{aligned}$$
(18)
$$\begin{aligned} \sigma _{i_{(s,t)}}&=\overline{\beta }_i\sum _{m=1}^M\left( \eta _m\cdot \sum _{n=1}^{N_i^{(m)}} {z_t^{(m)}}^2(n)+N_i^{(m)}\varSigma _{z^{(m)}(t,t)}^{-1}\right) +\overline{\alpha }_{i_{(s,t)}}. \end{aligned}$$
(19)

The trial distribution of latent variable \(Q_z\left( \varvec{z}^{(m)}\right) \) is

$$\begin{aligned} Q_z\left( \varvec{z}^{(m)}\right)&=\prod _{n=1}^{N^{(m)}}\mathcal {N}\left( \varvec{z}^{(m)}(n)|\overline{\varvec{z}}^{(m)}(n),\varvec{\varSigma }_{z^{(m)}}^{-1}\right) ,\end{aligned}$$
(20)
$$\begin{aligned} \overline{\varvec{z}}^{(m)}(n)&=\varvec{\varSigma }_{z^{(m)}}^{-1}\sum _{i=1}^M{\eta _m}\overline{\beta }_i\overline{\varvec{W}}_i^{\mathrm {T}}\varvec{x}_i^{(m)}(n),\end{aligned}$$
(21)
$$\begin{aligned} \varvec{\varSigma }_{z^{(m)}}&=\sum _{i=1}^M\left[ {\eta _m}\overline{\beta }_i\left( \overline{\varvec{W}}_i^{\mathrm {T}}\overline{\varvec{W}}_i+\varvec{\varSigma }_{W_i}^{-1}\right) \right] +\mathrm{\varvec{I}},\end{aligned}$$
(22)
$$\begin{aligned} \varvec{\varSigma }_{W_i}&=\mathrm{{diag}}\left( \left[ \sum _{s=1}^{d_i}{\sigma }_{i_{(s,1)}},\cdots ,\sum _{s=1}^{d_i}{\sigma }_{i_{(s,d_z)}}\right] \right) . \end{aligned}$$
(23)

Finally, the trial distribution of the inverse variances \(Q_\alpha (\varvec{\alpha }_1,\cdots ,\varvec{\alpha }_M,\beta _1,\cdots ,\beta _M)\) is further factorized to \(Q_{\alpha }(\varvec{\alpha }_1)\cdots Q_\alpha (\varvec{\alpha }_M)Q_\alpha (\beta _1)\cdots Q_\alpha (\beta _M)\). The expected values of \(\varvec{\alpha }_i\) and \(\beta _i\) are

$$\begin{aligned} \overline{\alpha }_{i_{(s,t)}}&=\left( \frac{1}{2}+\gamma _{i0_{(s,t)}}\right) \left( \frac{1}{2}\overline{W}_{i_{(s,t)}}^2+\frac{1}{2}\sigma _{i_{(s,t)}}^{-1}+\gamma _{i0_{(s,t)}}\alpha _{i0_{(s,t)}}^{-1}\right) ^{-1}, \end{aligned}$$
(24)
$$\begin{aligned} \overline{\beta }_i&=d_i N_i^{(M)}\Biggl \{\sum _{n=1}^{N_i^{(M)}}\Vert \varvec{x}_i(n)-\overline{\varvec{W}}_i\overline{\varvec{z}}(n)\Vert ^2 \\&+\mathrm{{Tr}}\Biggl [\!\varvec{\varSigma }_{W_i}^{-1}\Biggl (\sum _{n=1}^{N_i^{(M)}}\varvec{z}(n)\varvec{z}^{\mathrm {T}}(n) + N_i^{(M)}\varvec{\varSigma }_z^{-1}\Biggr ) + N_i^{(M)}\varvec{\varSigma }_z^{-1}\overline{\varvec{W}}_i^{\mathrm {T}}\overline{\varvec{W}}_i\Biggr ]\!\Biggr \}^{-1},\nonumber \end{aligned}$$
(25)

where \(\gamma _{i0_{(s,t)}}\) and \(\alpha _{i0_{(s,t)}}\) are constants (zero in our study). For estimating \(\beta _i\), only elements having all modalities are used. By calculating \(Q_W(\varvec{W}_i)\), \(Q_z(\varvec{z})\), and \(Q_\alpha (\varvec{\alpha }_1,\cdots ,\varvec{\alpha }_M,\beta _1,\cdots ,\beta _M)\) successively, the parameters are estimated.
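The following simplified sketch performs one sweep of these coordinate updates for the complete-data case, where every element has all modalities, so the \(\eta _m\) weights and the per-pattern sums collapse to a single group. The \(\beta _i\) update of Eq. (25) is omitted for brevity, the summed posterior variances of the weights follow our reading of Eq. (23), and all names are ours.

```python
import numpy as np

def vb_sweep(Xs, Zbar, Sigma_z, Wbars, alphas, betas):
    """One VB sweep over Eqs. (17)-(24), complete-data case.

    Xs[i]: (N, d_i) data of modality i; Zbar: (N, d_z) posterior means of z;
    Sigma_z: (d_z, d_z) posterior precision of z; Wbars[i]: (d_i, d_z) weight
    means; alphas[i]: (d_i, d_z) ARD precisions; betas[i]: noise precisions.
    """
    N, d_z = Zbar.shape
    Sz_inv = np.linalg.inv(Sigma_z)
    # Q_W update (Eqs. (18)-(19)): per-entry Gaussian posteriors.
    zz = np.sum(Zbar ** 2, axis=0) + N * np.diag(Sz_inv)   # E[z_t^2] summed over n
    W_vars = []
    for i, X in enumerate(Xs):
        prec = betas[i] * zz[None, :] + alphas[i]          # posterior precision
        var = 1.0 / prec                                   # posterior variance
        Wbars[i] = betas[i] * var * (X.T @ Zbar)           # posterior mean
        W_vars.append(var)
    # Q_z update (Eqs. (21)-(22)).
    Sigma_z = np.eye(d_z)
    rhs = np.zeros((d_z, N))
    for i, X in enumerate(Xs):
        Sig_W = np.diag(W_vars[i].sum(axis=0))             # cf. Eq. (23)
        Sigma_z += betas[i] * (Wbars[i].T @ Wbars[i] + Sig_W)
        rhs += betas[i] * (Wbars[i].T @ X.T)
    Zbar = np.linalg.solve(Sigma_z, rhs).T
    # Q_alpha update (Eq. (24) with gamma_0 = 0).
    for i in range(len(Xs)):
        alphas[i] = 1.0 / (Wbars[i] ** 2 + W_vars[i])
    return Zbar, Sigma_z, Wbars, alphas
```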

Predictive Distribution: When a new set of source modalities \(\varvec{X}_\mathrm{{new}}\in \mathfrak {P}(\{\varvec{x}_1,\cdots ,\varvec{x}_{M-1}\})\) is obtained, where \(\mathfrak {P}\) denotes the power set (the set of all subsets), the predictive distribution of the target modality \(\varvec{x}_{M\mathrm{{new}}}\) is

$$\begin{aligned} P(\varvec{x}_{M\mathrm{{new}}}|\varvec{X}_\mathrm{{new}})=\!\int \!\mathrm{{d}}\varvec{W}_M\mathrm{{d}}\varvec{z}_\mathrm{{new}}P(\varvec{x}_{M\mathrm{{new}}}|\varvec{W}_M,\varvec{z}_\mathrm{{new}})Q(\varvec{W}_M)P(\varvec{z}_\mathrm{{new}}|\varvec{X}_\mathrm{{new}}). \end{aligned}$$
(26)

When the random variable \(\varvec{W}_M\) is replaced with the estimated \(\overline{\varvec{W}}_M\), the predictive distribution is

$$\begin{aligned} P(\varvec{x}_{M\mathrm{{new}}}|\varvec{X}_\mathrm{{new}})&\simeq \int \mathrm{{d}}\varvec{z}_{\mathrm{{new}}}P(\varvec{x}_{M\mathrm{{new}}}|\varvec{z}_{\mathrm{{new}}})P(\varvec{z}_\mathrm{{{new}}}|\varvec{X}_{\mathrm{{new}}}), \end{aligned}$$
(27)
$$\begin{aligned} P(\varvec{x}_{M\mathrm{{new}}}|\varvec{z}_\mathrm{{new}})&\propto \exp \left[ -\frac{1}{2}\overline{\beta }_M\Vert \varvec{x}_{M\mathrm{{new}}}-\overline{\varvec{W}}_M\varvec{z}_\mathrm{{new}}\Vert ^2\right] . \end{aligned}$$
(28)

Since the distribution \(P(\varvec{z}_\mathrm{{new}}|\varvec{X}_\mathrm{{new}})\) is unknown, it is approximated based on the trial distribution \(Q_z(\varvec{z})\) (Eq. (20)). The approximate distribution is obtained by using only the terms related to the \(\varvec{x}_{i\mathrm{{new}}}\) included in \(\varvec{X}_\mathrm{{new}}\):

$$\begin{aligned} \tilde{Q_z}(\varvec{z}_\mathrm{{new}})&=\mathcal {N}\left( \varvec{z}|\overline{\varvec{z}}_\mathrm{{new}},\varvec{\varSigma }_{z\mathrm{{new}}}^{-1}\right) ,\end{aligned}$$
(29)
$$\begin{aligned} \overline{\varvec{z}}_\mathrm{{new}}&=\sum _{i=1}^{M-1}\overline{\beta }_i\varvec{\varSigma }_{z\mathrm{{new}}}^{-1}\overline{\varvec{W}}_i^{\mathrm {T}}\varvec{x}_{i\mathrm{{new}}},\end{aligned}$$
(30)
$$\begin{aligned} \varvec{\varSigma }_{z\mathrm{{new}}}&=\sum _{i=1}^{M-1}\left( \overline{\beta }_i\left( \overline{\varvec{W}}_i^{\mathrm {T}}\overline{\varvec{W}}_i+\varvec{\varSigma }_{W_i}^{-1}\right) \right) +\mathrm{\varvec{I}}. \end{aligned}$$
(31)

Finally, the predictive distribution \(P(\varvec{x}_{M\mathrm{{new}}}|\varvec{X}_\mathrm{{new}})\) is

$$\begin{aligned} P(\varvec{x}_{M\mathrm{{new}}}|\varvec{X}_\mathrm{{new}})&\simeq \int \mathrm{{d}}\varvec{z}_\mathrm{{new}}P\left( \varvec{x}_{M\mathrm{{new}}}|\varvec{z}_\mathrm{{new}}\right) \tilde{Q}_z\left( \varvec{z}_\mathrm{{new}}\right) \nonumber \\&=\mathcal {N}\left( \varvec{x}_{M\mathrm{{new}}}|\overline{\varvec{x}}_{M\mathrm{{new}}},\varvec{\varSigma }_{M\mathrm{{new}}}^{-1}\right) ,\end{aligned}$$
(32)
$$\begin{aligned} \overline{\varvec{x}}_{M\mathrm{{new}}}&=\overline{\varvec{W}}_M\varvec{\varSigma }_{z\mathrm{{new}}}^{-1}\sum _{i=1}^{M-1}\overline{\beta }_i\overline{\varvec{W}}_i^{\mathrm {T}}\varvec{x}_{i\mathrm {new}},\end{aligned}$$
(33)
$$\begin{aligned} \varvec{\varSigma }_{M\mathrm{{new}}}&=\overline{\varvec{W}}_M\varvec{\varSigma }_{z\mathrm{{new}}}^{-1}\overline{\varvec{W}}_M^{\mathrm {T}}+\overline{\beta }_M^{-1}\mathrm{\varvec{I}}. \end{aligned}$$
(34)
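Under the same assumptions, the predictive mean of Eqs. (30)-(33) reduces to the short computation below; the full predictive distribution of Eqs. (32)-(34) additionally carries the covariance \(\varvec{\varSigma }_{M\mathrm{{new}}}^{-1}\). Names are ours.

```python
import numpy as np

def predict_target(X_new, Wbars, Sigma_W_invs, betas):
    """Predictive mean of the target modality (index -1 in Wbars).

    X_new: dict {i: observed source vector x_i_new};
    Wbars[i]: estimated weight matrix; Sigma_W_invs[i]: the (d_z, d_z)
    posterior covariance term of W_i in Eq. (31); betas[i]: noise precisions.
    """
    d_z = Wbars[0].shape[1]
    Sigma_z = np.eye(d_z)                      # Eq. (31)
    rhs = np.zeros(d_z)
    for i, x in X_new.items():                 # observed sources only
        Sigma_z += betas[i] * (Wbars[i].T @ Wbars[i] + Sigma_W_invs[i])
        rhs += betas[i] * (Wbars[i].T @ x)
    z_new = np.linalg.solve(Sigma_z, rhs)      # Eq. (30)
    return Wbars[-1] @ z_new                   # Eq. (33)
```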

4 Preliminary Investigation

We conducted three experiments to investigate the basic characteristics of GBCCA-M2 using artificially generated data. In this section, we first describe the common experimental setup and then explain each experiment.

4.1 Common Experimental Setup

To generate artificial data, we used a simple Gaussian latent model. The latent variables are denoted by \(\varvec{Z}_\mathrm{{gen}}=\bigl \{\varvec{z}_\mathrm{{gen}}(n)\bigr \}_{n=1}^{N}\), \(\varvec{z}_\mathrm{{gen}}(n)\in \mathbb {R}^{d_{z_\mathrm{{gen}}}}\), and the observed modalities by \(\varvec{X}_i=\bigl \{\varvec{x}_i(n)\bigr \}_{n=1}^{N}\), \(\varvec{x}_i(n)\in \mathbb {R}^{d_i}\). In this section, we consider the case of three observed modalities. \(d_{z_\mathrm{{gen}}}\) and \(d_i\) represent the dimensions of the latent variables and modalities, respectively, and N represents the sample size. Latent variables were drawn independently from \(\mathcal {N}\left( \varvec{0},\mathrm{\varvec{I}}_{d_{z_\mathrm{{gen}}}}\right) \). \(\varvec{x}_i(n)\) was generated as \(\varvec{x}_i(n)=\varvec{W}_i\varvec{z}_\mathrm{{gen}}(n)+\varvec{\mu }_i+\varvec{\delta }_i(n)\), where each row of \(\varvec{W}_i\) was drawn from \(\mathcal {N}(\varvec{0},\mathrm{\varvec{I}}_{d_{z_\mathrm{{gen}}}})\), the mean \(\varvec{\mu }_i\) was drawn from \(\mathcal {N}(\varvec{0},\mathrm{\varvec{I}}_{d_i})\), and the covariance of the noise \(\varvec{\delta }_i(n)\) was determined as follows:

$$\begin{aligned} \mathrm{{Cov}}\bigl (\varvec{\delta }_i(n)\bigr )=\alpha \left( \mathrm{\varvec{I}}_{d_i}+\sum _{j=1}^{d_{z_\mathrm{{gen}}}/2}\varvec{u}_j(n)\varvec{u}_j(n)^{\mathrm {T}}\right) . \end{aligned}$$
(35)

\(\varvec{u}_j(n)\) were drawn independently from \(\mathcal {N}\left( \varvec{0}, \mathrm{\varvec{I}}_{d_i}\right) \). The magnitude of the noise is controlled by \(\alpha \), which was varied in the experiment evaluating robustness against noise and fixed in the other experiments. The number of elements in the test data was set to 500. \(\varvec{X}_1\) and \(\varvec{X}_2\) were used as the source modalities and \(\varvec{X}_3\) as the target modality. Evaluation was performed by calculating the cosine similarity between the predicted modality and that of the test data.
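A minimal sketch of this generation process follows; it reflects our reading of the setup, sampling the noise from a zero-mean Gaussian with the covariance of Eq. (35).

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(N, d, d_zgen, alpha, n_mod=3):
    """Generate n_mod modalities: x_i(n) = W_i z_gen(n) + mu_i + delta_i(n)."""
    Z = rng.normal(size=(N, d_zgen))                 # z_gen(n) ~ N(0, I)
    Xs = []
    for _ in range(n_mod):
        W = rng.normal(size=(d, d_zgen))             # rows of W_i ~ N(0, I)
        mu = rng.normal(size=d)                      # mu_i ~ N(0, I)
        X = Z @ W.T + mu
        for n in range(N):                           # element-wise noise
            U = rng.normal(size=(d, d_zgen // 2))    # u_j(n) ~ N(0, I)
            cov = alpha * (np.eye(d) + U @ U.T)      # covariance of Eq. (35)
            X[n] += rng.multivariate_normal(np.zeros(d), cov)
        Xs.append(X)
    return Z, Xs
```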

4.2 Contribution Rates of Elements Missing Modalities

In GBCCA-M2, in order to utilize elements with various missing patterns efficiently, we introduced contribution rates for elements with missing modalities into the likelihood, as shown in Eq. (9). In this experiment, we investigated how prediction performance changes as the contribution rates are varied.

The dimension of each modality was set as \([d_1,d_2,d_3,d_{z_\mathrm{{gen}}}] = [250,250,250,50]\). With three modalities, the missing patterns fall into three categories: elements with one, two, and three modalities, respectively. We denote the number of elements with m observed modalities by \(N^{(m)}\) and set them as \([N^{(1)},N^{(2)},N^{(3)}]=[120,120,120]\) and \([N^{(1)},N^{(2)},N^{(3)}]=[1440,720,120]\) (refer to Fig. 5). Moreover, which modality an element was missing was distributed uniformly within each pattern; this holds in all experiments. In Eq. (9), we fixed \(\eta _3\) at 1.0 and varied \(\eta _1\) and \(\eta _2\) in increments of 0.1 over the range 0 to 1.0. The dimension of the latent variable \(\varvec{z}\) used in the proposed method was set to 150. Experiments were repeated ten times for each pair (\(\eta _1\), \(\eta _2\)), and the average cosine similarity was calculated.

Fig. 5.

Prediction performance when the contribution rates were changed.

The experimental results are shown in Fig. 5. Since the cosine similarity became maximal when \(\eta _2\) was in the range 0.9 to 1.0, \(\eta _2\) should be set to a value close to 1.0. This is because even if one modality is missing, it is still possible to estimate the parameters from the remaining two modalities. On the other hand, since the cosine similarity became maximal when \(\eta _1\) was in the range 0.4 to 0.6, \(\eta _1\) should be set smaller than \(\eta _2\). This is because an element with one modality appears useful for estimating the distribution in the feature space of that modality but appears to degrade the estimation of the relationships between modalities. Moreover, since the \(\eta _1\) and \(\eta _2\) that maximized cosine similarity for \([N^{(1)},N^{(2)},N^{(3)}]=[1440,720,120]\) were lower than for \([N^{(1)},N^{(2)},N^{(3)}]=[120,120,120]\), the contribution rates should be decreased as the number of elements with missing modalities increases.

4.3 The Number of Elements in Training

GBCCA-M2 utilizes elements with missing modalities by including them in the likelihood function. In this experiment, we varied the number of training elements for each missing pattern and investigated whether GBCCA-M2 can utilize elements with missing modalities effectively.

Among the three kinds of missing patterns, the number of elements in two patterns was fixed and the number in the remaining pattern was varied. The fixed number of elements was 60, and the varied number was set to \(60,120,\cdots ,1200\). The dimensions of the modalities and the contribution rates were set as \([d_1,\,d_2,\,d_3,\,d_{z_\mathrm{{gen}}}]=[250,\,250,\,250,\,50]\) and \([\eta _1,\eta _2,\eta _3]=[0.4,0.9,1.0]\). The dimension of the latent variable \(\varvec{z}\) used in GBCCA-M2 was set to 150. We used the following two methods for comparison, sketched below: (1) CC and ridge regression (CC-Ridge) and (2) mean imputation and ridge regression (Mean-Ridge). CC-Ridge removes elements with missing modalities and performs ridge regression on the remaining elements. Ridge regression adds the squared norm of the weights to the least-squares loss and finds the weights minimizing the resulting objective. Mean-Ridge substitutes the mean value of the observed elements for the missing modalities and then performs ridge regression.
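A minimal scikit-learn sketch of the Mean-Ridge baseline follows; missing entries are assumed to be marked as NaN, and all names are ours. CC-Ridge instead drops every training row that contains a NaN before fitting the same model.

```python
import numpy as np
from sklearn.linear_model import Ridge

def mean_ridge(X_train, Y_train, X_test, reg=1.0):
    """Impute missing source features by the column mean, then ridge-regress
    the target modality. NaN marks a missing entry."""
    col_mean = np.nanmean(X_train, axis=0)
    X_tr = np.where(np.isnan(X_train), col_mean, X_train)
    X_te = np.where(np.isnan(X_test), col_mean, X_test)
    model = Ridge(alpha=reg).fit(X_tr, Y_train)      # multi-output ridge
    return model.predict(X_te)
```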

Fig. 6.

Prediction performance when the number of elements was changed.

Figure 6 shows the experimental results. When the number of elements with two modalities was increased in GBCCA-M2, the prediction performance increased approximately monotonically. This may be because elements with two modalities have a positive effect both on estimating the relationship between the non-missing modalities and on estimating the feature space of each non-missing modality. On the other hand, when the number of elements with one modality was increased, the prediction performance improved only while the number of elements was small. This may be because, with fixed contribution rates, increasing the number of elements with one modality increases the negative effect on estimating the relationship between the non-missing modality and the missing modalities. Therefore, if \(\eta _1\) is set appropriately, elements with one modality can be used effectively for learning.

4.4 Evaluating Robustness Against Dimension and Noise

As described in Sect. 1, the features of modalities are likely to be high dimensional and noisy. To show the effectiveness of GBCCA-M2 for such modalities, we conducted experiments evaluating robustness against dimension and noise.

In the experiment evaluating robustness against dimension, we varied the parameter \(\beta \), which scales the dimensions. The dimension of each modality was set to \([d_1, d_2, d_3, d_{z_\mathrm{{gen}}}]=[50\beta , 50\beta , 50\beta , 10\beta ]\), and the dimension of the latent variable \(\varvec{z}\) used in GBCCA-M2 was set to \(30\beta \). We set \(\beta \) to 1, 2, 4, 8, 16, or 32. In the experiment evaluating robustness against noise, we varied \(\alpha \), which controls the magnitude of the noise (Eq. (35)), in increments of 0.1 over the range 0.1 to 3.0. The dimension of each modality was set to \([d_1,d_2,d_3,d_{z_\mathrm{{gen}}}]= [250,250,250,50]\) and the dimension of the latent variable \(\varvec{z}\) used in GBCCA-M2 was set to 150. In both experiments, the number of training elements was set to 120 for every missing pattern, and the contribution rates were set as \([\eta _1,\eta _2,\eta _3]=[0.4,0.9,1.0]\). As comparison methods, we used the same two methods as in the experiment in Sect. 4.3.

Fig. 7.

Prediction performances when the dimension of modalities was changed (left) and when the noise of modalities was changed (right).

Figure 7 shows the experimental results. When the dimension or the noise of the modalities increased, GBCCA-M2 achieved higher prediction performance than the comparison methods. This may be because GBCCA-M2 is based on BCCA, which is effective for high dimensional and noisy features of modalities. The experimental results show that GBCCA-M2 is also effective in such cases.

5 Experiment with Real Data

5.1 Purchaser Behavior Prediction

We conducted an experiment to show the effectiveness of GBCCA-M2 using a real purchaser dataset in which modalities are actually missing. For the purchaser dataset, we used the INTAGE Single Source Panel (i-SSP) dataset from INTAGE Inc. This dataset includes attributes, purchase histories, and television program viewing information for the half year from January 1st, 2016 to June 30th, 2016. For the attribute data, we converted nominal scales such as occupation and residence to one-hot expressions and used proportional scales as they were. The purchase information includes purchase data for beer, chocolate, and shampoo; we used the total number of purchases per manufacturer as one modality. For the television program viewing information, we used the average television viewing time for each program, restricted to programs with a viewing time of 20 hours or more. As a result, the dimension of the attribute information was 89, that of the purchase history was 67, and that of the TV program viewing information was 226. Table 1 indicates the number of elements for each missing pattern. We extracted 100 elements with three modalities randomly as test data and used the remaining elements as learning data. We set the contribution rates and the dimension of the latent variable in GBCCA-M2 as \([\eta _1,\eta _2,\eta _3,d_z]=[0.3,0.8,1.0,30]\). In addition to CC-Ridge and Mean-Ridge, we used CC and BCCA (CC-BCCA), mean imputation and BCCA (Mean-BCCA), and Semi CCA for comparison. Television program viewing information was predicted from the source modalities (i.e., attributes and purchase history). As evaluation indexes, we calculated the following from the predicted vector and the actual vector: (1) cosine similarity, (2) mean absolute error (MAE), and (3) root mean square error (RMSE). We repeated this 30 times and calculated the averages.
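For reference, the three evaluation indexes can be computed directly from a predicted vector and the corresponding true vector, as in this straightforward sketch.

```python
import numpy as np

def cosine_similarity(pred, true):
    return pred @ true / (np.linalg.norm(pred) * np.linalg.norm(true))

def mae(pred, true):
    return np.mean(np.abs(pred - true))

def rmse(pred, true):
    return np.sqrt(np.mean((pred - true) ** 2))
```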

Table 1. The number of elements by missing patterns in purchaser’s data.
Table 2. Comparison of each method in the actual purchaser’s data.

Table 2 shows the experimental results. For all evaluation indexes, GBCCA-M2 achieved the best performance. This may be because GBCCA-M2 is effective for purchaser data, in which the features of modalities are high dimensional and noisy and many elements have missing modalities. These findings clearly show the effectiveness of GBCCA-M2 on a real purchaser dataset.

Table 3. Comparison of each method in the sentence-to-image retrieval.

5.2 Image Retrieval from English and Japanese Sentences

In this section, we report results on image retrieval from English and Japanese sentences, learned with a dataset in which we made some modalities missing intentionally. In addition to the MSCOCO [16] dataset, we used STAIR Captions [24], a Japanese image caption dataset based on the MSCOCO images. As image features, we extracted the 4096-dimensional activations of the 19-layer VGG model [21]; as sentence features, we used tf-idf-weighted bag-of-words vectors. For English, we pre-processed all sentences with WordNet's lemmatizer [4] and removed stop words. For Japanese, we removed stop words and all parts of speech other than nouns, verbs, adjectives, and adjectival verbs. The final dimensions of the English and Japanese sentence features were 6245 and 7278, respectively. In training, we used 9000 elements (i.e., images and their corresponding English and Japanese sentences), made 50% of the modalities missing at random, and reduced the dimension of each modality to 1000 by PCA. For evaluation, we used 1000 elements. We retrieved images from English and Japanese sentences and calculated Recall@K (\(K=1,5,10\)); a sketch of this metric follows. We set the contribution rates and the dimension of the latent variable as \([\eta _1,\eta _2,\eta _3,d_z] = [0.3, 0.8, 1.0, 750]\) and used the same comparison methods as in Sect. 5.1. Table 3 shows the experimental results. GBCCA-M2 gives the best results among all methods. By using GBCCA-M2, we can retrieve images more accurately by utilizing elements with missing modalities.
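As referenced above, Recall@K can be computed as in the following sketch. It assumes that the i-th query sentence corresponds to the i-th gallery image and that retrieval ranks images by cosine similarity in the shared latent space; both are our assumptions about the protocol.

```python
import numpy as np

def recall_at_k(query_z, image_z, k):
    """Fraction of queries whose true image appears in the top-k results."""
    q = query_z / np.linalg.norm(query_z, axis=1, keepdims=True)
    g = image_z / np.linalg.norm(image_z, axis=1, keepdims=True)
    sims = q @ g.T                                   # (n_query, n_gallery)
    topk = np.argsort(-sims, axis=1)[:, :k]          # top-k gallery indices
    hits = (topk == np.arange(len(q))[:, None]).any(axis=1)
    return hits.mean()
```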

6 Conclusion

In this study, we considered two challenges associated with multi-modal learning and proposed GBCCA-M2, which utilizes elements with missing modalities and works well on high dimensional and noisy features of modalities. We conducted experiments using both artificially generated data and real data. The findings are as follows: (1) in order to utilize elements with missing modalities, it is effective to change their contribution rates to the likelihood according to the degree of missingness, (2) GBCCA-M2, which uses a hierarchical Bayesian model, is effective for high dimensional and noisy features of modalities, and (3) because GBCCA-M2 suits the case where many elements have missing modalities and the features of modalities are high dimensional and noisy, it can be used effectively in such multi-modal applications.