1 Introduction

The success of machine learning methods stems from appropriate data representation. Traditionally this requires feature engineering, i.e., handcrafting a set of features potentially useful in the considered problem. However, it would be beneficial to extract features automatically and thus avoid cumbersome preprocessing pipelines for hand-tuning the data representation [2]. Deep learning has turned out to be a suitable approach to automatic representation learning in many domains such as object recognition [23], speech recognition [20], natural language processing [7], neuroimaging [13], multimodal learning from images and text annotations [27], pose recovery [29], or domain adaptation [9].

Fairly simple but still one of the most popular models for unsupervised feature learning is the restricted Boltzmann machine (RBM). Besides automatic feature learning, RBMs can be stacked in a hierarchy to form a deep network [1]. The bipartite structure of the RBM enables block Gibbs sampling, which allows formulating efficient learning algorithms such as contrastive divergence [10]. However, it has lately been argued that the RBM fails to properly reflect statistical dependencies [22]. One possible solution is to apply a higher-order Boltzmann machine [17, 24] to model sophisticated patterns in data.

In this work we follow this line of thinking and develop a more refined model than the RBM to learn features from data. Our model introduces two kinds of hidden units, i.e., subspace units and gate units (see Fig. 1). The subspace units are hidden variables which reflect variations of a feature and thus make the representation more robust to small perturbations of the input. The gate units are responsible for activating the subspace units and can be seen as pooling features composed of the subspace features. The proposed model is based on an energy function with third-order interactions and maintains a conditional independence structure that can be readily exploited for simple and efficient learning.

The paper is organized as follows. In Sect. 2 the proposed model is presented. In Sect. 3 the learning procedure of the subspaceRBM is outlined. In Sect. 4 we relate our approach to other deep models. Next, in Sect. 5, the proposed model is evaluated empirically on two image corpora and the results are discussed. Finally, in Sect. 6, conclusions are drawn and future research directions are indicated.

Fig. 1 A graphical representation of the subspaceRBM. The triangular symbol represents a third-order multiplicative interaction

2 The Model

The RBM is a second-order Boltzmann machine with a restriction on within-layer connections. This model can be extended in a straightforward way to third-order multiplicative interactions of one visible unit \(x_{i}\) and two types of hidden binary units, a gate unit \(h_{j}\) and a subspace unit \(s_{jk}\). Each gate unit is associated with a group of subspace hidden units. The energy function of a joint configuration is defined as follows:

$$\begin{aligned} E(\mathbf {x}, \mathbf {h}, \mathbf {S}|\varvec{\theta }) = - \sum _{i=1}^{D} \sum _{j=1}^{M} \sum _{k=1}^{K} W_{ijk} x_{i} h_{j} S_{jk} - \sum _{i=1}^{D} b_{i} x_{i} - \sum _{j=1}^{M} c_{j} h_{j} - \sum _{j=1}^{M} h_{j} \sum _{k=1}^{K} D_{jk} S_{jk} , \end{aligned}$$
(1)

where \(\mathbf {x} \in \{0,1\}^{D}\) denotes a vector of visible variables, \(\mathbf {h} \in \{0,1\}^{M}\) is a vector of gate units, \(\mathbf {S} \in \{0,1\}^{M\times K}\) is a matrix of subspace units, the parameters are \(\varvec{\theta }= \{\mathbf {W}, \mathbf {b}, \mathbf {c}, \mathbf {D}\}\), \(\mathbf {W} \in \mathbb {R}^{D\times M\times K}\) is a weight tensor, \(\mathbf {b}\in \mathbb {R}^{D}\) is a vector of visible biases, \(\mathbf {c} \in \mathbb {R}^{M}\) is a vector of gate biases, and \(\mathbf {D} \in \mathbb {R}^{M\times K}\) is a matrix of subspace biases.
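To make the notation concrete, the following is a minimal NumPy sketch of the energy in Eq. 1; the array names and shapes are our own illustrative choices and are not part of the paper.

```python
import numpy as np

def energy(x, h, S, W, b, c, Dm):
    """Energy of a joint configuration (Eq. 1).
    x: (n_vis,) binary, h: (n_gate,) binary, S: (n_gate, K) binary,
    W: (n_vis, n_gate, K), b: (n_vis,), c: (n_gate,), Dm: (n_gate, K)."""
    third_order = np.einsum('ijk,i,j,jk->', W, x, h, S)  # sum_ijk W_ijk x_i h_j S_jk
    return -third_order - b @ x - c @ h - np.sum(h[:, None] * Dm * S)
```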

The model defined by the Gibbs distribution with the energy function as in Eq. 1, that is:

$$\begin{aligned} p( \mathbf {x}, \mathbf {h}, \mathbf {S} | \varvec{\theta }) = \frac{1}{Z(\varvec{\theta })} \mathrm {exp}\{-E(\mathbf {x}, \mathbf {h}, \mathbf {S}|\varvec{\theta })\} , \end{aligned}$$
(2)

where

$$\begin{aligned} Z(\varvec{\theta }) = \sum _{\mathbf {x}, \mathbf {h}, \mathbf {S}} \mathrm {exp}\{ -E(\mathbf {x}, \mathbf {h}, \mathbf {S} | \varvec{\theta }) \} \end{aligned}$$
(3)

is the partition function, is further called the subspace restricted Boltzmann machine (subspaceRBM).

For the subspaceRBM the following conditional probabilities hold true:

$$\begin{aligned} p(x_i = 1 | \mathbf {h}, \mathbf {S})&= \mathrm {sigm} \left( \sum _{j} \sum _{k} W_{ijk} h_{j} S_{jk} + b_i \right) , \end{aligned}$$
(4)
$$\begin{aligned} p(s_{jk} = 1 | \mathbf {x}, h_j)&= \mathrm {sigm} \left( \sum _i W_{ijk} x_i h_j + h_j D_{jk} \right) , \end{aligned}$$
(5)
$$\begin{aligned} p(h_j = 1| \mathbf {x})&= \mathrm {sigm}\left( -K\mathrm {log} 2 + c_j + \sum _{k=1}^{K} \mathrm {softplus} \left( \sum _i W_{ijk} x_i + D_{jk} \right) \right) , \end{aligned}$$
(6)

which can be straightforwardly used in formulating a contrastive divergence learning algorithm. Notice that in Eq. 6 the term \(-K\mathrm {log} 2\), which is linear in the number of subspace hidden variables, lowers the gate unit activation. Moreover, the probability of an example \(\mathbf {x}\) is as follows:

$$\begin{aligned} p(\mathbf {x}) \propto \mathrm {exp} \left( \sum _{i} b_{i} x_{i} + \sum _{j} \log \left[ 2^{K} + \mathrm {exp}(c_{j}) \prod _{k} \left( 1 + \mathrm {exp}\left( \sum _{i} W_{ijk} x_{i} + D_{jk} \right) \right) \right] \right) . \end{aligned}$$
(7)
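The conditionals in Eqs. 4–6 translate directly into code. Below is a minimal sketch under the same array layout as above; the helper names are our own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus(z):
    return np.logaddexp(0.0, z)  # numerically stable log(1 + exp(z))

def p_x_given_hS(h, S, W, b):                # Eq. 4
    return sigmoid(np.einsum('ijk,j,jk->i', W, h, S) + b)

def p_S_given_xh(x, h, W, Dm):               # Eq. 5
    return sigmoid(h[:, None] * (np.einsum('ijk,i->jk', W, x) + Dm))

def p_h_given_x(x, W, c, Dm, K):             # Eq. 6 (S marginalised out)
    pre = np.einsum('ijk,i->jk', W, x) + Dm  # shape (n_gate, K)
    return sigmoid(-K * np.log(2.0) + c + softplus(pre).sum(axis=1))
```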

3 Learning

In training, we take advantage of Eqs. 4, 5, and 6 to formulate an efficient three-phase block-Gibbs sampler for the subspaceRBM (see Algorithm 1). First, for given data, we sample gate units from \(p(\mathbf {h}|\mathbf {x})\) with \(\mathbf {S}\) marginalized out. Then, given both \(\mathbf {x}\) and \(\mathbf {h}\), we sample subspace variables from \(p(\mathbf {S}|\mathbf {x}, \mathbf {h})\). Finally, the data can be sampled from \(p(\mathbf {x}| \mathbf {h}, \mathbf {S})\).

We update the parameters of the subspaceRBM using the contrastive divergence learning procedure [8, 10]. For this purpose, we need to calculate the gradient of the log-likelihood function. The log-likelihood gradient takes the form of a difference between two expectations, namely, over the probability distribution with clamped data and over the joint probability distribution of visible and hidden variables. Analogously to the standard RBM, in the subspaceRBM these two expectations are approximated by samples drawn from the three-phase block-Gibbs sampling procedure.

Algorithm 1 Contrastive divergence learning for the subspaceRBM with three-phase block-Gibbs sampling
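For illustration, the sketch below combines the three-phase sampler with one CD-1 parameter update. It reuses the conditional helpers defined after Eq. 7 and uses binary samples in both phases, so it should be read as our reading of the procedure rather than the authors' exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def cd1_step(x, W, b, c, Dm, K, lr=0.01):
    """One CD-1 update for a single binary input x (illustrative sketch)."""
    # positive phase: h ~ p(h|x) with S marginalised out, then S ~ p(S|x, h)
    h0 = (rng.random(c.shape) < p_h_given_x(x, W, c, Dm, K)).astype(float)
    S0 = (rng.random(Dm.shape) < p_S_given_xh(x, h0, W, Dm)).astype(float)
    # negative phase: reconstruct x, then resample h and S
    x1 = (rng.random(b.shape) < p_x_given_hS(h0, S0, W, b)).astype(float)
    h1 = (rng.random(c.shape) < p_h_given_x(x1, W, c, Dm, K)).astype(float)
    S1 = (rng.random(Dm.shape) < p_S_given_xh(x1, h1, W, Dm)).astype(float)
    # approximate log-likelihood gradient: data statistics minus model statistics
    W += lr * (np.einsum('i,j,jk->ijk', x, h0, S0) - np.einsum('i,j,jk->ijk', x1, h1, S1))
    b += lr * (x - x1)
    c += lr * (h0 - h1)
    Dm += lr * (h0[:, None] * S0 - h1[:, None] * S1)
    return x1
```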

4 Related Works

The standard RBM can reflect only second-order multiplicative interactions. However, in many real-life situations, higher-order interactions must be included if we want our model to be effective enough. Moreover, the second-order interactions themselves often carry little or no useful information. In the literature there have been several propositions for extending the RBM to higher-order Boltzmann machines. One such proposal is a third-order multiplicative interaction of two visible binary units \(x_{i}\), \(x_{i'}\) and one hidden binary unit \(h_{j}\) [11, 22], which can be used to learn a representation robust to spatial transformations [19]. Along this line of thinking, our model is also a third-order Boltzmann machine, but with multiplicative interactions of one visible unit and two kinds of hidden units.

The proposed model is closely related to a special kind of spike-and-slab restricted Boltzmann machine [6] called the subspace spike-and-slab RBM (subspace-ssRBM) [5], in which there are two kinds of hidden variables, namely, the spike, which is a binary variable, and the slab, which is a real-valued variable. However, in our approach both the spike and slab variables are discrete. Additionally, in the subspaceRBM the hidden units \(\mathbf {h}\) behave as gates to subspace variables rather than as spikes as in the ssRBM.

Similarly to our approach, gating units were proposed in the Point-wise Gated Boltzmann Machine (PGBM) [25] where chosen units were responsible for switching on subsets of hidden units. The subspaceRBM is based on an analogous idea but it uses sigmoid units only whereas PGBM utilizes both sigmoid and softmax units.

Our model can also be related to RBM forests [15]. The RBM forests assume each hidden unit to be encoded by a complete binary tree. In our approach each gate unit is encoded by subspace units. Therefore, the subspaceRBM can be seen as an RBM forest with a flatter hierarchy of hidden units and hence easier learning and inference.

Lastly, the subspaceRBM with softmax hidden units \({\mathbf {h}}\) turns out to be the implicit mixture of RBMs (imRBM) [21]. However, in our model the gate units can be seen as pooling features, while in the imRBM they determine only one subset of subspace features to be activated. This brings an important benefit over the imRBM because it allows the subspaceRBM to reflect multiple factors in the data.

5 Experiment

Goal In this paper, we present a new model for capturing binary inputs that can be further used as a building block in a deep network. A typical building block of a deeper architecture is the RBM. Therefore, in the experiment we aim at answering the following question:

  • Is the subspaceRBM preferable to the RBM in terms of reconstruction error and quality of the extracted features?

We want to point out that we verify whether the subspaceRBM can be treated as a better alternative to the RBM. We believe that a positive answer to the stated question will give us a good starting point for further experiments with deep models using the subspaceRBM.

Data We performed the experiment using CalTech 101 \(28 \times 28\) Silhouettes (CalTech, for the sake of brevity) and MNIST. The CalTech dataset consists of 4100 training images, 2264 validation images, and 2307 test images. In the dataset the objects are centered and scaled on a \(28 \times 28\) image plane and rendered as filled black regions on a white background [18]. MNIST consists of \(28 \times 28\) images representing hand-written digits from 0 through 9 [16]. The data is divided into 50,000 training examples, 10,000 validation examples, and 10,000 test examples. In the experiments, we performed learning with different numbers of training images (10, 100, and 1000 per digit) and with the full training set.

Training protocol In the experiment, we compared the subspaceRBM with the RBM for the number of gate units equal to \(M = 500\) and different numbers of subspace units \(K \in \{3, 5, 7\}\). The subspaceRBM was trained using the presented contrastive divergence (see Algorithm 1) with a minibatch of size 10. The learning rate was chosen from \(\{0.001, 0.01, 0.1\}\) by model selection on the validation set. The number of iterations (epochs) over the training set was determined by early stopping according to the validation set cross-entropy reconstruction error, with a look-ahead of 5 iterations.

The RBM was used with 500, 1500, 2500 and 3500 hidden units, which corresponds to the number of gate units in the subspaceRBM and to that number multiplied by the number of subspace units. The RBM was trained using contrastive divergence with 1-step Gibbs sampling. The learning rate was determined by model selection on the validation set and its candidate values were the same as in the case of the subspaceRBM. Similarly to the subspaceRBM, the early stopping procedure was used with a look-ahead of 5 epochs.

We evaluated the subspaceRBM as a feature-extraction scheme by plugging it into the classification pipeline developed by [4]. For classification, logistic regression used the probabilities of the gate units, \(p(h_{j}=1|\mathbf {x})\), as inputs. The same was done for the RBM with its hidden unit probabilities.
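As an illustration of this pipeline, a hedged sketch using scikit-learn's logistic regression on the gate probabilities is given below; it reuses p_h_given_x from the sketch after Eq. 7, and the classifier implementation and all names are our own assumptions, since the paper only states that logistic regression was applied to \(p(h_{j}=1|\mathbf {x})\).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def gate_features(X, W, c, Dm, K):
    """Stack p(h_j = 1 | x) for every image as its feature vector."""
    return np.stack([p_h_given_x(x, W, c, Dm, K) for x in X])

# Illustrative usage; X_train, y_train, X_test, y_test are assumed to be
# binarised, flattened 28x28 images and their labels.
# clf = LogisticRegression(max_iter=1000).fit(gate_features(X_train, W, c, Dm, K), y_train)
# test_error = 1.0 - clf.score(gate_features(X_test, W, c, Dm, K), y_test)
```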

We did 3 full runs for each dataset and averaged the results.

Evaluation methodology The performance of the subspaceRBM and the RBM was measured using cross-entropy reconstruction error (reconstruction error, for the sake of brevity), classification error, and mean number of active gate units. The cross-entropy reconstruction error for original object \(\mathbf {x}\) and its reconstruction \(\tilde{\mathbf {x}}\) is defined as follows:

$$\begin{aligned} L(\mathbf {x}, \tilde{\mathbf {x}}) = - \big ( \mathbf {x} \log \tilde{\mathbf {x}} + (1 - \mathbf {x}) \log ( 1 - \tilde{\mathbf {x}} )\big ), \end{aligned}$$
(8)

where \(\tilde{\mathbf {x}}\) is a reconstruction calculated by first sampling the hidden units for the given data using Eqs. 5 and 6, and then sampling the visible variables for the sampled hidden units using Eq. 4. In the case of the RBM, the reconstruction is calculated in an analogous manner.
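A sketch of how this error can be computed for the subspaceRBM, reusing the helpers defined after Eq. 7; we take the visible activation probabilities as \(\tilde{\mathbf {x}}\) to keep the logarithms finite, which is our own assumption about the last step.

```python
import numpy as np

def reconstruction_error(x, W, b, c, Dm, K, eps=1e-7):
    """Cross-entropy reconstruction error (Eq. 8) for a single binary input x."""
    h = (np.random.random(c.shape) < p_h_given_x(x, W, c, Dm, K)).astype(float)  # Eq. 6
    S = (np.random.random(Dm.shape) < p_S_given_xh(x, h, W, Dm)).astype(float)   # Eq. 5
    x_rec = np.clip(p_x_given_hS(h, S, W, b), eps, 1.0 - eps)                    # Eq. 4
    return -np.sum(x * np.log(x_rec) + (1.0 - x) * np.log(1.0 - x_rec))
```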

It has been advocated that the cross-entropy reconstruction error is a good proxy for the log-likelihood when using contrastive divergence learning [1].

Table 1 Average test classification error with one standard deviation for the RBM and different settings of the subspaceRBM evaluated on subsets of MNIST

We would like to highlight that in the experiment we aim at evaluating the capabilities of the proposed model and comparing it with the RBM. Therefore, we refrained from applying sophisticated learning techniques, e.g., weight decay, momentum term, sparsity regularization [12]. We believe that a more advanced training protocol could obscure this comparison. As a consequence, we have obtained results that are worse than the current state of the art, but they allow us to evaluate mainly the models rather than the learning algorithms.

5.1 Results

MNIST The averaged results with one standard deviation of the subspaceRBM and the RBM are presented in Table 1 (for test classification error), in Table 2 (for test reconstruction error), and the average number of active units calculated on test data is outlined in Table 3. A random subset of subspace features for the subspaceRBM (\(M=500\), \(K=7\)) trained on 50,000 images is shown in Fig. 2.

CalTech The summary results of the performance of the subspaceRBM and the RBM are presented in Table 4. A random subset of subspace features for the subspaceRBM (\(M=500\), \(K=3\)) is shown in Fig. 3.

Table 2 Average test reconstruction error with one standard deviation for different settings of the RBM and the subspaceRBM evaluated on subsets of MNIST
Table 3 Number of active units for the RBM and different settings of the subspaceRBM evaluated on subsets of MNIST
Fig. 2 Random subset of subspace features for MNIST (\(N=50{,}000\)) and the subspaceRBM with \(M=500\) and \(K=7\). Three relevant groups of filters are outlined in red, blue and green; they evidently tend to learn a similar pattern with offsets in position, curvature or rotation (Color figure online)

Table 4 Average test results with one standard deviation for different settings of the RBM and the subspaceRBM evaluated on CalTech
Fig. 3 Random subset of subspace features for CalTech and the subspaceRBM with \(M=500\) and \(K=3\). Three relevant groups of filters are outlined in red, blue and green; they evidently tend to learn a similar pattern with offsets in position, curvature or rotation (Color figure online)

5.2 Discussion

We notice that the application of subspace units is beneficial for reconstruction (see Tables 2 and 4). For classification, it is advantageous to use the subspaceRBM with a smaller number of subspace units in the small sample size regime (for MNIST with N equal to 100 and 1000, and for CalTech, see Tables 1 and 4). This result is not surprising, because for over-complete representations simpler classifiers work better. On the other hand, for small sample sizes there is a serious threat of overfitting. Introducing subspace units to the hidden layer restricts the variability of the representation and thus prevents learning noise in the data. In the case of classification with a larger number of observations (for MNIST with N equal to 10,000 and 50,000, see Table 1), the best results were obtained for K equal to 5 and 7. This result suggests that the subspace units indeed lead to features that are more robust to small perturbations.

Comparing the subspaceRBM to RBMs of comparable size, i.e., with \(M \in \{1500, 2500, 3500\}\) hidden units, it turns out that in terms of classification error the RBMs with a larger number of hidden units obtained much better results. However, this result follows from the fact that it is easier to discriminate when more features are available. Of course, this statement holds only if the features represent reasonable patterns (i.e., different from noise) and the sample size is appropriate (see Table 1 for \(N=100\) and Table 4, where larger RBMs tend to be heavily overfitted). Nonetheless, the reconstruction error on every dataset is in favor of the subspaceRBM. This effect can be explained as follows. When reconstructing data, many features are useless but still contribute to the reconstruction, mostly as a source of noise. Therefore, the more features a model has, the more noise is incorporated into the reconstruction. However, it seems that a larger number of subspace units results in better reconstruction. This means that the subspaceRBM indeed captures different forms of a feature and incorporating more subspace units is beneficial.

Finally, it is worth noticing that, on average, the number of active hidden units is higher for the subspaceRBM than for the RBM. This result may be explained by the sum of softplus terms used in calculating the conditional probability (see Eq. 6). The effect of increased activity of hidden units is especially apparent in the case of CalTech, where on average about half of the gate units are active (see Table 4).

6 Conclusion

In this paper, we have proposed an extension of the RBM by introducing subspace hidden units. The formulated model can be seen as a third-order Boltzmann machine with third-order multiplicative interactions. We have shown that the subspaceRBM does not reduce to a vanilla version of the RBM (see Eq. 7). The carried-out experiments have revealed that the proposed model is advantageous over the RBM in terms of reconstruction and classification error.

We see several possible extensions of the outlined approach. In our opinion, the examination of the effect of high activity of gate units is very appealing. It has been advocated [9] that sparse activity of hidden units provides a more robust representation; therefore, we plan to apply some kind of regularization enforcing sparsity [14] or feature robustness [26]. Moreover, it would be beneficial to utilize other learning algorithms instead of contrastive divergence, such as sampling methods [3], score matching [28], and other inductive principles, e.g., maximum pseudo-likelihood [18]. Last but not least, the subspaceRBM can be used as a building block in a deep model. However, we leave the investigation of these issues for future research.