
1 Introduction

Convolutional neural networks (CNNs) have achieved impressive performance on many challenging problems [15, 24] and even surpass human-level accuracy on certain tasks such as ImageNet classification [16]. As CNN-based recognition systems [3] continue to grow, it is critical to improve inference efficiency while maintaining accuracy [5].

Because network compression provides efficient approximations to CNNs, and compressed models require less memory and fewer operations, parameter quantization [8, 18, 30], pruning [11, 13] and low-rank representations [33, 38] have become topics of interest in the deep learning community. Quantization in particular, driven by the boom of AI chips, is becoming the workhorse in industry. While these techniques have improved power efficiency, they still suffer a considerable accuracy loss under low-bit or highly sparse compression, so retraining on the original dataset is usually inevitable. Unfortunately, retraining requires a sufficiently large open training dataset, which is inaccessible in many real-world applications such as medical diagnosis [9], drug discovery and toxicology [1]. It is therefore imperative to avoid retraining, or at least to require no labeled training data.

In this work, we alleviate the accuracy degradation of direct network compression through label-free fine-tuning. Network quantization comprises two primary components: weight quantization and feature map quantization. Intuitively, the quantization error determines the performance loss, so we propose the Quasi-Lloyd-Max algorithm to minimize the weight quantization error. To further improve the accuracy of compressed networks, we investigate the cause of feature map distortion. Viewing CNNs as Bayesian networks, we reveal that the statistic shift of batch normalization causes the accuracy degradation of direct compression: once the compressed parameters deviate from their approximately Gaussian distribution, the previously estimated mean and variance no longer match the corrupted features. By employing re-estimated statistics in batch normalization, the performance of compressed CNNs can be rapidly recovered. Extensive experiments on 4-bit quantization and pruning demonstrate the robustness of this viewpoint.

Compared with conventional label-based compression methods, the main contributions of this paper are as follows:

  • We reveal the hidden factor behind the performance degradation of direct network compression, and show that 4-bit or sparse representations remain capable of the original tasks without retraining.

  • A Quasi-Lloyd-Max algorithm is proposed to minimize the weight quantization error on 4-bit networks.

  • The fine-tuning time decreases from days (GPU) to minutes (CPU) by using limited unlabeled data.

2 Related Work

The redundant parameters of deep neural networks lead to inefficient computation and a large memory footprint. Most compression approaches can be viewed as regularization techniques addressing these problems. Recently, with the emergence of TPUs [22] and low-precision BLAS [10], fixed-point parameter representations have been widely discussed. Traditional hash-based vector quantization, such as HashNet [2], may not directly benefit from customized hardware. In contrast, 8-8-bit structures are readily supported by TensorRT [27] or TPUs [22]. Bit-oriented methods with a potential 64\(\times \) acceleration, such as BC [7], FFN [34], BNN [6] and XNOR-Net [30], compress DNNs to an extreme 1 bit (\(\sim \)32\(\times \) compression), but suffer an irreversible accuracy loss. INQ [39] shows reasonable performance for the \(2^n\) framework; nevertheless, plenty of labeled data is required for retraining.

Early studies of low-rank representations share the same starting point: using low-rank matrix approximation to reduce computation. Conventional network structures with regular filters and large feature maps are friendly to matrix decomposition. Generalized Singular Value Decomposition [38], Tucker Decomposition [23] and Tensor Block Term Decomposition [33, 35] are widely used on AlexNet [24], VGG-16 [31] and GoogleNet [32]. At the cost of a negligible loss in accuracy, they gain several times acceleration together with a certain amount of compression. Very recently, MobileNet [17] with channel-wise convolution has shown the ability to extract discriminative features with limited parameters. Most notably, this network structure is already equivalent to a decomposed matrix, which invalidates current decomposition methods. Similar problems arise in ResNet [16] with its \(1\times 1\) filters.

Weight pruning benefits from Sparse GEneral Matrix Multiplication (Sparse GEMM) and highly optimized hardware designs [14]. Combined with clustering and Huffman coding [13], promising compression results without accuracy loss have been reported. The problem is that the hundreds [11] or even thousands [14] of retraining epochs are time-consuming, and the process still relies heavily on labeled datasets.

By using feature map fitting, [4, 36] implicitly learn from well-trained networks through the Euclidean distance between the feature maps of full-precision and compressed networks. Nevertheless, deeper network structures and imbalanced class samples are a nightmare for hand-tuned layer-by-layer analysis.

3 Weight Quantization

Since quantization has become the mainstream compression technique in industry, we first review why quantization is needed and then discuss three hardware-friendly quantizers under different metrics. The scheme with the least accuracy loss is adopted for the subsequent feature map recovery.

3.1 Cause

Early studies [12, 19] showed that it is possible to train deep neural networks with 16-bit fixed-point numbers. Fixed-point computation, with its high speed and low power consumption, is far friendlier to embedded devices: the smaller circuits allow more arithmetic units to be configured. Moreover, the low-bit data representation minimizes the memory footprint, which reduces data transmission time for customized devices such as FPGAs [12, 13, 25].

3.2 \(\ell _2\) Norm Metric

Since customized hardware units fully support fixed-point multiplication and addition, quantizing float numbers to their nearest fixed-point representations with shift and carry operations readily accelerates inference. Suppose the fixed-point number is represented as \([I_{bit}:F_{bit}]\), where the integer part plus the fraction part yields the real number. Mathematically, this problem, dubbed round-to-nearest, can be stated as follows:

$$\begin{aligned} \mathbf {Q}^*=\arg \min _{\mathbf {Q}}~&J(\mathbf {Q})=~||\mathbf {W}-\mathbf {Q}||_2^2\\ s.t.~~\mathbf {Q}_i\in \{-2^{I}/2,&-2^{I}/2+2^{-F},\ldots ,0,\ldots ,2^{I}/2-2^{-F}\} \end{aligned}$$

where \(\mathbf {Q}\) is forced to fit the large values in \(\mathbf {W}\). This metric minimizes the loss function at the expense of small values and is therefore sensitive to outliers; that is, the large values determine the bit-width selection of \(I_{bit}\) and \(F_{bit}\). To partly solve this problem, a scaling factor \(\alpha \in \mathbb {R}\) is introduced,

$$\begin{aligned} \alpha ^*,&\mathbf {Q}^*=\arg \min _{\alpha >0, \mathbf {Q}}~J(\alpha , \mathbf {Q})~=~||\mathbf {W}-\alpha \mathbf {Q}||_2^2 \end{aligned}$$

It has been shown that a scaling factor can dramatically enlarge the range of representable values [30]. Although the objective is convex in each variable separately, it is not jointly convex, so finding a global minimum of \(J(\alpha , \mathbf {Q})\) is infeasible, especially under the discrete constraint. However, a local minimum can be found by iterative numerical optimization. Consider the following problem,

$$\begin{aligned} \alpha ^*,~\mathbf {Q}^*=\arg \min _{\alpha >0,\,\mathbf {Q}}~\big (\alpha ^2\mathbf {Q}^T\mathbf {Q}-2\alpha \mathbf {Q}^T\mathbf {W}+c\big ), \end{aligned}$$
(1)

where \(\mathbf {Q}\) corresponds to a set of fixed-point numbers and \(c=\sum _i\mathbf {W}_i^2\) is a constant independent of \(\alpha \) and \(\mathbf {Q}\). Thus, for any given \(\mathbf {Q}\), the optimal \(\alpha \) is

$$\begin{aligned} \alpha ^*=\frac{\mathbf {Q}^T\mathbf {W}}{\mathbf {Q}^T\mathbf {Q}}. \end{aligned}$$
(2)

Substituting \(\alpha ^*\) into (1) and taking the partial derivative \(\frac{\partial J(\alpha , \mathbf {Q})}{\partial \mathbf {Q}}\), setting it to zero and projecting the solution onto the given discrete space yields

$$\begin{aligned} \mathbf {Q}^*\approx \mathbf {Fix}(\mathbf {W}/\alpha ^*). \end{aligned}$$
(3)

Algorithm 1 iteratively updates \(\alpha ^*\) and \(\mathbf {Q}^*\) through the quantizer Fix\((\cdot )\), such as round-to-nearest, \(2^n\) (i.e., quantization to the nearest power of 2) or uniform quantization (i.e., quantization to the nearest interval endpoint). Following this iterative update rule, the Euclidean distance between \(\mathbf {W}\) and \(\alpha \mathbf {Q}\) decreases in each iteration.
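A minimal NumPy sketch of this alternating scheme with a \(2^n\) quantizer is given below; the exponent window, initialization and fixed iteration count are our assumptions rather than the paper's exact Algorithm 1.

```python
import numpy as np

def fix_pow2(x, k=4):
    """Project each entry onto the nearest code in {±2^0·Δ, ..., ±2^(k-1)·Δ}
    (illustrative 2^n quantizer; the exponent window is an assumption)."""
    sign = np.sign(x)
    mag = np.clip(np.abs(x), 1e-12, None)                # avoid log2(0)
    exp = np.round(np.log2(mag))
    exp = np.clip(exp, exp.max() - (k - 1), exp.max())   # k magnitude levels
    return sign * 2.0 ** exp

def quasi_lloyd_max(w, k=4, iters=10):
    """Alternate the projection step (Eq. 3) and the closed-form scale
    update (Eq. 2) to reduce ||w - alpha*q||_2^2."""
    w = w.ravel()
    alpha = np.abs(w).mean()                 # simple initialization
    for _ in range(iters):
        q = fix_pow2(w / alpha, k)           # Eq. (3): Fix(W / alpha)
        alpha = q.dot(w) / q.dot(q)          # Eq. (2): optimal scale
    return alpha, q

w = np.random.randn(4096) * 0.05
alpha, q = quasi_lloyd_max(w)
print("l2 quantization error:", np.sum((w - alpha * q) ** 2))
```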

Algorithm 1. Quasi-Lloyd-Max weight quantization
Fig. 1. (a) Quasi-Lloyd-Max convergence comparison for different metrics; the Euclidean distance between \(\mathbf {W}\) and \(\alpha \mathbf {Q}\) is reported. (b) The distributions of quantized 4-bit (\(2^n\) quantization) and full-precision weights of the third convolution layer (some bars are merged for clarity)

3.3 Discrete Entropy Metric

Similar to the squared Euclidean distance (\(\ell _2\)), which is the canonical example of a Bregman distance, another useful measure is the generalized Kullback-Leibler (KL) divergence generated by the convex function \(\sum _ip_i\ln p_i\). In this case,

$$\begin{aligned} \alpha ^*,\mathbf {Q}^*=\arg \min _{\alpha>0, \mathbf {Q}}~D(\alpha \mathbf {Q}||\mathbf {W})&=\sum _{i}\Big (|\mathbf {W}_{i}|\ln \frac{|\mathbf {W}_{i}|}{\alpha |\mathbf {Q}_{i}|}-|\mathbf {W}_{i}|+\alpha |\mathbf {Q}_{i}|\Big )\\ s.t.~~\mathbf {Q}_i&\in \{\pm 2^0\Delta ,\pm 2^1\Delta ,\ldots ,\pm 2^{k-1}\Delta \} \end{aligned}$$

where \(\Delta \) corresponds to the minimum \(2^n\) magnitude in k-bit quantization. Like the \(\ell _2\) norm metric, this loss function is lower bounded by zero.

Differentiating \(D(\alpha \mathbf {Q}||\mathbf {W})\) partially with respect to the elements of \(\mathbf {Q}\) gives

$$\begin{aligned} \frac{\partial D_{(\alpha \mathbf {Q}||\mathbf {W})}}{\partial \mathbf {Q}_i}=-\frac{|\mathbf {W}_i|}{\mathbf {Q}_i}+\alpha \cdot sgn(\mathbf {Q}_i). \end{aligned}$$
(4)

Similarly, with \(\mathbf {Q}\) fixed, we have

$$\begin{aligned} \frac{\partial D_{(\alpha \mathbf {Q}||\mathbf {W})}}{\partial \alpha }=-\frac{1}{\alpha }\sum _{i}|\mathbf {W}_i|+\sum _{i}|\mathbf {Q}_i|. \end{aligned}$$
(5)

Setting both equations to zero gives a local minimum of the KL divergence. Hence, within the Quasi-Lloyd-Max iterations, the solutions to \(D(\alpha \mathbf {Q}||\mathbf {W})\) are \(\alpha ^*=\frac{\sum _i|\mathbf {W}_i|}{\sum _i|\mathbf {Q}^*_i|}\) and \(\mathbf {Q}^*=\mathbf {Fix}(\frac{\mathbf {W}}{\alpha ^*})\).

Mathematically, the generalized Kullback-Leibler divergence is similar to a metric, but it satisfies neither the triangle inequality nor symmetry. We therefore also test the reverse direction \(D(\mathbf {W}||\alpha \mathbf {Q})\) as a weight quantization loss on AlexNet, shown in Fig. 1(b). Following the same procedure, we obtain

$$\begin{aligned} \frac{\partial D_{(\mathbf {W}||\alpha \mathbf {Q})}}{\partial \alpha }&=\sum _i|\mathbf {Q}_i|\ln \alpha +\sum _i|\mathbf {Q}_i|\ln \frac{|\mathbf {Q}_i|}{|\mathbf {W}_i|}\end{aligned}$$
(6)
$$\begin{aligned} \frac{\partial D_{(\mathbf {W}||\alpha \mathbf {Q})}}{\partial \mathbf {Q}_i}&=\alpha \cdot sgn(\mathbf {Q}_i)\cdot \ln \frac{\alpha |\mathbf {Q}_i|}{|\mathbf {W}_i|}. \end{aligned}$$
(7)

In this case, \(\alpha ^*=\exp (\frac{\sum _i|\mathbf {Q}_i^*|\ln \frac{|\mathbf {W}_i|}{|\mathbf {Q}^*_i|}}{\sum _i|\mathbf {Q}^*_i|})\), and setting Eq. (7) to zero yields the same projection \(\mathbf {Q}^*=\mathbf {Fix}(\mathbf {W}/\alpha ^*)\) as before.
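For reference, the three closed-form scale updates can be written compactly. The NumPy sketch below assumes \(\mathbf {W}\) and \(\mathbf {Q}\) contain no zero entries (the \(2^n\) code set excludes zero); the projection step \(\mathbf {Q}^*=\mathbf {Fix}(\mathbf {W}/\alpha ^*)\) is unchanged in every case.

```python
import numpy as np

def alpha_l2(w, q):
    """Eq. (2): alpha* = q^T w / q^T q."""
    return q.dot(w) / q.dot(q)

def alpha_kl_qw(w, q):
    """Minimizer of D(alpha*Q || W): alpha* = sum_i |w_i| / sum_i |q_i|."""
    return np.abs(w).sum() / np.abs(q).sum()

def alpha_kl_wq(w, q):
    """Minimizer of D(W || alpha*Q):
    alpha* = exp( sum_i |q_i| ln(|w_i|/|q_i|) / sum_i |q_i| )."""
    aw, aq = np.abs(w), np.abs(q)
    return np.exp((aq * np.log(aw / aq)).sum() / aq.sum())
```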

Taking the third convolution layer and the second fully connected layer of AlexNet as examples, Fig. 1(a) shows the convergence under different metrics. In our evaluations, all metrics converge within the first few iterations and reach nearly the same quantization error. Since \(\ell _2\) yields a steadier convergence speed, we evaluate the accuracy of the different quantizers under the \(\ell _2\) norm metric. As listed in Table 1 (whole-network quantization except the first layer), \(2^n\) outperforms the other quantizers by a large margin; we therefore follow this setting in the subsequent experiments.

Table 1. Quantizer comparison of 4-bit weights and 8-bit activations (\(\ell _2\) norm)

3.4 Feature-Based Metric

In general, the feature maps extracted from input data are more crucial than the weights in computer vision tasks. Fitting the output features rather than the pre-trained weights can further improve performance [36]. Taking the full-precision features \(\mathbf {Y}\) and the quantized input activations \(\mathbf {\widetilde{X}}\) into account, we obtain the multi-objective optimization problem:

$$\begin{aligned} \alpha ^*,\mathbf {Q}^*=\arg \min _{\alpha >0, \mathbf {Q}}~||\mathbf {W}-\alpha \mathbf {Q}||_2^2+\lambda ||\mathbf {Y}-\alpha \widetilde{\mathbf {X}}\mathbf {Q}^T||_2^2. \end{aligned}$$
(8)

With \(\lambda =0\), Eq. (8) degrades to the \(\ell _2\) metric; for large \(\lambda \), feature map fitting becomes dominant. This problem can be solved by Quasi-Lloyd-Max in a similar way. The closed-form solutions of each step are

$$\begin{aligned}&\alpha ^*=\frac{\lambda \sum _{i=1}^m\mathbf {Q}^T\widetilde{\mathbf {X}}^T_i\mathbf {Y}_i+\mathbf {Q}^T\mathbf {W}}{\lambda \sum _{i=1}^m\mathbf {Q}^T\widetilde{\mathbf {X}}^T_i\widetilde{\mathbf {X}}_i\mathbf {Q}+\mathbf {Q}^T\mathbf {Q}}\end{aligned}$$
(9)
$$\begin{aligned}&(\alpha ^* \lambda \sum _{i=1}^m\widetilde{\mathbf {X}}^T_i\widetilde{\mathbf {X}}_i+\alpha ^* \mathbf {I})\mathbf {Q}=\lambda \sum _{i=1}^m\widetilde{\mathbf {X}}_i^T\mathbf {Y}_i+\mathbf {W}, \end{aligned}$$
(10)

where \(\mathbf {I}\) is the \(n\times n\) identity matrix and m refers to the number of samples. If \(\widetilde{\mathbf {X}}^T_i\widetilde{\mathbf {X}}_i\) is symmetric positive definite, then by using the modified Cholesky decomposition one may factor the left-hand side of Eq. (10) as \(\alpha ^*(\lambda \sum _i\widetilde{\mathbf {X}}^T_i\widetilde{\mathbf {X}}_i+ \mathbf {I})=\mathbf {L}\mathbf {D}\mathbf {L}^T\), where \(\mathbf {L}\) is a lower triangular matrix with unit diagonal and \(\mathbf {D}\) is a diagonal matrix with positive diagonal entries. To solve \(\mathbf {L}\mathbf {D}\mathbf {L}^T\mathbf {x}=\mathbf {y}\), we only need to solve \(\mathbf {L}\mathbf {x}'=\mathbf {y}\) and \(\mathbf {D}\mathbf {L}^T\mathbf {x}=\mathbf {x}'\), which is faster and numerically more stable.
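A sketch of the per-layer solve of Eq. (10) follows, with SciPy's Cholesky routines standing in for the modified LDL\(^T\) factorization described above; the shapes and names (X_tilde stacking the m samples row-wise, Y the full-precision outputs, lam the trade-off weight) are our assumptions.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def solve_Q(W, X_tilde, Y, alpha, lam):
    """Solve (alpha*lam*X~^T X~ + alpha*I) Q^T = lam*X~^T Y + W^T for Q.

    W: (out, n) filters, X_tilde: (m, n) quantized inputs, Y: (m, out) targets.
    The result is continuous; project it onto the fixed-point code set afterwards."""
    n = W.shape[1]
    A = alpha * (lam * X_tilde.T @ X_tilde + np.eye(n))   # SPD left-hand side
    B = lam * X_tilde.T @ Y + W.T                         # one column per filter
    c, lower = cho_factor(A)                              # Cholesky (LDL^T stand-in)
    return cho_solve((c, lower), B).T                     # back-substitution, (out, n)
```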

However, given limited unlabeled data, there is no global measurement to guide the selection of \(\lambda \), and in our experiments the iterative numerical approximation used to solve \(\mathbf {Q}\) is strongly affected by this setting. Hence, the explicit feature-map-based method is not used in our further evaluations. Comparing the various metrics, Table 2 shows that the \(\ell _2\) norm best reflects the weight fitting error.

Table 2. Metric comparison of direct 4-bit \(2^n\) weight quantization

4 Feature Recovery

To further improve the performance of compressed networks, we focus on the “Gaussian-like” feature distribution. From a Bayesian perspective, the conjugate prior induced by batch normalization explains the performance gap between the full-precision network and post hoc compression. Therefore, we can use batch normalization to refine a well-trained network with a low-bit or sparse representation.

4.1 Bayesian Networks

The methodology of CNNs is to find the maximum a posteriori (MAP) weights given a training dataset \(\mathbf {D}\) and a prior distribution \(p(\varvec{\mathcal {W}})\) over the model parameters \(\varvec{\mathcal {W}}\). Suppose that \(\mathbf {D}\) consists of N batch samples \(\{(x_i,y_i)_{i=1:N}\}\); then \(p(\varvec{\mathcal {W}}|\mathbf {D})=\frac{p(\mathbf {D}|\varvec{\mathcal {W}})p(\varvec{\mathcal {W}})}{p(\mathbf {D})}\). Since \(p(\mathbf {D})\) is intractable, it is common to approximate \(p(\varvec{\mathcal {W}}|\mathbf {D})\) by a variational distribution \(q_\tau (\varvec{\mathcal {W}})\), optimizing the variational parameters \(\tau \) so that the Kullback-Leibler (KL) divergence is minimized:

$$\begin{aligned} \mathcal {L}(\tau )&=-\mathbb {E}_{q_\tau (\varvec{\mathcal {W}})}[\log p(\mathbf {D}|\varvec{\mathcal {W}})]+KL(q_\tau (\varvec{\mathcal {W}})||p(\varvec{\mathcal {W}}))\end{aligned}$$
(11)
$$\begin{aligned}&=-\int _{\varvec{\mathcal {W}}} q_\tau (\varvec{\mathcal {W}})\log p(\mathbf {D}|\varvec{\mathcal {W}})d\varvec{\mathcal {W}}+KL(q_\tau (\varvec{\mathcal {W}})||p(\varvec{\mathcal {W}})). \end{aligned}$$
(12)

Equation (12) is the (negative) evidence lower bound (ELBO), assuming i.i.d. observation noise.

In practice, Monte Carlo integration is usually employed to estimate the expectation term \(\mathbb {E}_{q_\tau (\varvec{\mathcal {W}})}[\log p(\mathbf {D}|\varvec{\mathcal {W}})]\). Using a weight sample \(\hat{\mathcal {W}}^i\sim q_\tau (\varvec{\mathcal {W}})\) for each batch i leads to the following approximation:

$$\begin{aligned} \mathcal {L}(\tau )&:=-\frac{1}{N}\sum _{i=1}^N\log p(\mathbf {D}|\hat{\mathcal {W}}^i)+KL(q_\tau (\varvec{\mathcal {W}})||p(\varvec{\mathcal {W}}))\end{aligned}$$
(13)
$$\begin{aligned}&:=\underbrace{-\frac{1}{N}\sum _{i=1}^N\log p(\mathbf {y}_{i}|\mathbf {x}_{i},\hat{\mathcal {W}}^i)}_{negative\ log-likelihood}+\underbrace{KL(q_\tau (\varvec{\mathcal {W}})||p(\varvec{\mathcal {W}})).}_{KL ~divergence} \end{aligned}$$
(14)

In particular, for the batch normalization parameters \(\{\mu _B,\sigma _B\}\in \varvec{\mathcal {W}}\), we regard the inference at training time as a stochastic process: the mean and variance estimated from the samples in a mini-batch are two random variables. Assume M i.i.d. samples with \(\mathbf {z}_i=\overline{Wx}\sim \mathcal {N}(\mu ,\sigma ^2)\) and \(\mu _B=\frac{1}{M}\sum _{k=1}^M\mathbf {z}_k\). By the central limit theorem (CLT), under sufficient random sampling through SGD, we have \(\mu _B\sim \mathcal {N}(\mu ,\frac{\sigma ^2}{M})\). Since \(\mathbb {E}[(\mathbf {z}_i-\mu )^2]=\sigma ^2\), we similarly obtain \(\sigma _B^2\sim \mathcal {N}(\sigma ^2,\frac{\mathbb {E}[(\mathbf {z}_i-\mu )^4]-\sigma ^4}{M})\).
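A short NumPy simulation (with illustrative numbers, not taken from the paper) confirms this behaviour: per-batch means and biased variances of Gaussian activations concentrate around \(\mu \) and \(\sigma ^2\) with the stated variances.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, M, batches = 0.5, 2.0, 256, 10000

# z ~ N(mu, sigma^2); each row is one mini-batch of M activations
z = rng.normal(mu, sigma, size=(batches, M))

mu_B = z.mean(axis=1)            # per-batch mean
var_B = z.var(axis=1)            # per-batch (biased) variance

print(mu_B.mean(), mu_B.var())   # ~ mu and ~ sigma^2 / M = 0.015625
print(var_B.mean(), var_B.var()) # ~ sigma^2 and ~ 2*sigma^4 / M = 0.125
```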

4.2 KL Divergence and Weight Regularization

Probabilistically, \(p(\mathbf {D}|\varvec{\mathcal {W}})=\prod _{i=1}^Np(\mathbf {y}_i|\mathbf {x}_i,\varvec{\mathcal {W}})\), where \(p(\mathbf {y}_i|\mathbf {x}_i,\varvec{\mathcal {W}})\) is the predictive distribution generated by the parametric model \(\varvec{\mathcal {W}}\), e.g., through the cross-entropy criterion for multi-class classification. The negative log-likelihood defines \(\mathcal {L}\) as follows:

$$\begin{aligned} \mathcal {L}(\mathbf {y})&=-\frac{1}{N}\sum _{i=1}^N\log p(\mathbf {y}_i|\mathbf {x}_i,\varvec{\mathcal {W}})+\frac{\lambda }{2}||\varvec{\omega }||_2^2. \end{aligned}$$
(15)

where \(\varvec{\omega }\) denotes the learnable parameters such as weights, while \(\varvec{\mathcal {W}}\) also includes random parameters such as \(\mu _B\) and \(\sigma _B\).

Since both \(\mathcal {L}(\tau )\) and \(\mathcal {L}(\mathbf {y})\) are solved by gradient descent, the second terms of Eqs. (14) and (15) illustrate the connection between the KL divergence (i.e., between the estimated distribution \(q_\tau (\varvec{\mathcal {W}})\) and \(p(\varvec{\mathcal {W}})\)) and weight regularization:

$$\begin{aligned} \frac{\partial KL(q_\tau (\varvec{\mathcal {W}})||p(\varvec{\mathcal {W}}))}{\partial \varvec{\omega }}=\frac{\partial \frac{\lambda }{2}\varvec{\omega }^T\varvec{\omega }}{\partial \varvec{\omega }}. \end{aligned}$$
(16)

The regularization term can be viewed as a log-prior distribution over the weights, e.g., a Gaussian prior derived from the \(\ell _2\) norm. Under a low-bit or sparsity constraint, the penalty term introduces a different prior (e.g., spike-and-slab in pruning), which strongly affects the approximation to \(p(\varvec{\mathcal {W}})\). We now describe how weight compression corrupts the batch normalization parameters.

For the random variables in batch normalization, the KL divergence between the approximation \(\mathcal {N}(\mu _q,\sigma _q^2)\) and the true distribution \(\mathcal {N}(\mu _p,\sigma _p^2)\) is

$$\begin{aligned} KL(q(\mathcal {W})||p(\mathcal {W}))=\frac{(\mu _q-\mu _p)^2}{2\sigma _p^2}+\log \frac{\sigma _p}{\sigma _q}+\frac{\sigma _q^2}{2\sigma _p^2}-\frac{1}{2}. \end{aligned}$$

Since \(\mu _p\) and \(\sigma _p\) do not change during training and are independent of \(\varvec{\omega }\), we have \(\mu _p'=\sigma _p'=0\) and \(\frac{\partial KL}{\partial \omega }=\frac{(\mu _q-\mu _p)\mu _q'}{\sigma ^2_p}+\frac{(\sigma _q^2-\sigma _p^2)\sigma _q'}{\sigma _q\sigma _p^2}\). The optimal approximation \(\mu _q\rightarrow \mu _p\), \(\sigma _q^2\rightarrow \sigma _p^2\) is reached when SGD drives this partial derivative of the regularization term to zero. When we compress a well-trained network, the weight regularization changes implicitly; in other words, the former estimations now carry a large bias. Fortunately, as shown in Sect. 4.1, the expectations of \(\mu _q\) and \(\sigma _q^2\) converge to the true distribution parameters, so the distorted features can be renewed through re-estimation.
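The Gaussian KL term above makes the effect of the statistic shift concrete; the small helper below (ours, for illustration only) returns zero exactly when the estimated statistics match the pre-compression ones and grows as they drift.

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ), as written above."""
    return ((mu_q - mu_p) ** 2 / (2 * sigma_p ** 2)
            + np.log(sigma_p / sigma_q)
            + sigma_q ** 2 / (2 * sigma_p ** 2) - 0.5)

print(gaussian_kl(0.0, 1.0, 0.0, 1.0))   # 0.0: statistics match
print(gaussian_kl(0.3, 1.4, 0.0, 1.0))   # > 0: shifted statistics after compression
```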

Fig. 2. Feature distribution comparison for the 5th batch normalization layer of AlexNet

4.3 Renew Distorted Features

While it is impractical to update the weights through inference on unlabeled data, re-estimating \(\mu _{\mathbf {B}}\) and \(\sigma _{\mathbf {B}}\) is still feasible. Following [21], the mean and variance of the activations satisfy

$$\begin{aligned} \mathbb {E}[\tilde{x}]&:= \mathbb {E}[\tilde{\mu }_{\mathbf {B}}] \end{aligned}$$
(17)
$$\begin{aligned} Var[\tilde{x}]&:= \frac{m}{m-1}\mathbb {E}[\tilde{\sigma }^{2}_{\mathbf {B}}], \end{aligned}$$
(18)

where \(\tilde{\mu }_{\mathbf {B}}=\frac{1}{m}\sum ^{m}_{i=1}{\tilde{x}_{i}}\) and \(\tilde{\sigma }^{2}_{\mathbf {B}}=\frac{1}{m}\sum ^{m}_{i=1}{(\tilde{x}_{i}-\tilde{\mu }_{\mathbf {B}})^{2}}\) are computed over each mini-batch of size m, and the expectations are taken over mini-batches.
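A minimal PyTorch-style sketch of this re-estimation, assuming the compressed model keeps its BatchNorm layers and a loader that yields unlabeled image batches (the helper name and loop are ours, not the paper's implementation): the running statistics are reset and re-accumulated with a cumulative average, and no weight is ever updated.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def renormalize(model, unlabeled_loader, num_batches=1):
    """Re-estimate BatchNorm running mean/variance on unlabeled data (Eqs. 17-18)."""
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
            m.reset_running_stats()
            m.momentum = None            # cumulative moving average over all batches
    model.train()                        # BN consumes batch statistics and updates them
    for i, (images, _) in enumerate(unlabeled_loader):
        model(images)                    # forward pass only; labels are ignored
        if i + 1 >= num_batches:
            break
    return model.eval()
```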

In Bayesian theory, if the posterior distribution is in the same probability distribution family as the prior, the prior is called a conjugate prior for the likelihood function. In particular, the Gaussian distribution is a conjugate prior for a Gaussian likelihood. We have shown that the batch normalization parameters obey a normal distribution; combined with the empirical observation that the output features of batch normalization are “more Gaussian” [20, 21], one may argue that a convolution or inner-product layer tends to behave as a Gaussian likelihood. Thus, after compression, choosing a new Gaussian prior (i.e., re-normalization or re-estimation) makes it more likely that the posterior distribution is also Gaussian:

$$\begin{aligned} P_{Gaussian}\propto P_{likelihood}\times P_{normal}. \end{aligned}$$

Since batch normalization is commonly employed after convolution, the distribution of the distorted features can be renewed directly. After re-normalization, Fig. 2 shows that the distribution has been restored. Nevertheless, interpreting a compressed network as a likelihood function is a weak approximation: the performance of extremely quantized networks, such as binary or ternary ones, will not be improved because the likelihood function itself is corrupted. In those cases, retraining on the original dataset remains inevitable.

5 Experiments

In this section, we verify the effectiveness of the proposed methods on the ImageNet dataset (ILSVRC 2012). Generally speaking, training-free quantization or pruning of deep neural networks is challenging, yet we achieve accuracy much closer to that of full-precision networks. We apply weight pruning and low-bit quantization to three representative CNNs: AlexNet [24], ResNet-18 [16] and MobileNet [17]. We also evaluate ResNet-50 [16] to examine the validity of re-normalization on deeper network structures. All images are resized to 256 pixels at the short side, and a central \(224\times 224\) crop is used for re-normalization and evaluation. No data augmentation is used in any experiment.

Table 3. Results of 8-8-bit (whole network weights & features 8-bit) quantization on the ILSVRC 2012 validation dataset. Round-to-nearest with the \(\ell _2\) metric was adopted for 8-bit weights. For 8-bit feature maps, float numbers are simply quantized to the nearest fixed-point values
Table 4. Final performance of network \(2^{n}\) quantization. The accuracy loss relative to the full-precision network is reported (4-bit weights & 8-bit activations)
Table 5. Quantization comparison for AlexNet and ResNet-18. The Top-1 and Top-5 gaps to the corresponding full-precision network are reported. Label-based retraining methods are marked as “+Label". The bit widths before and after “+" refer to weights and activations respectively. Unreported retraining epochs are shown as “*"; “\(\sim 0\)" means no backward propagation is required

5.1 Network Quantization

8-bit quantization with few samples or, ideally, without any input data is becoming the workhorse in industry. As shown in Table 3, our 8-8-bit models reach accuracy comparable to the full-precision networks. To achieve higher efficiency on embedded devices, we show that even 4-bit weights can approach the 32-bit accuracy level. Using the same 4-bit weights as in Sect. 3.2, we re-normalize those models on 1 K images randomly selected from the ILSVRC 2012 training set without label information.

As shown in Table 4, the performance of the 4-8-bit networks (excluding the first layer) improves greatly over direct quantization. For comparison, Nvidia TensorRT uses 1250 images to update the parameters of its 8-bit networks, while we need 1000 images to learn 4-bit quantization. Results on AlexNet, ResNet-18 and ResNet-50 show steady performance improvements that nearly reach the 32-bit level. MobileNet, with its channel-wise convolution layers, is far more challenging to quantize: after straightforward 4-bit weight quantization, the accuracy drops to nearly zero. This delicate network structure is equivalent to the low-rank representation of Tensor Block Term Decomposition [33]; channel-wise convolution with little redundancy is therefore naturally difficult to compress. Since the runtime of 8-bit MobileNet on CPU is already only 31 ms (TensorFlow 1.1.0), 4-bit quantization would be a trade-off between even higher speed and lower accuracy.

Table 5 further compares accuracy and learning cost. Our 4-8-bit models remain competitive with retraining methods, and in some cases even outperform label-based counterparts on AlexNet. For 4-4-bit, slightly differing from Sect. 3.2, we quantize features to the nearest \(2^n\) (without scale) during re-normalization.

Compared with the 8-8-bit framework, 4-8-bit achieves not only \(2\times \) model compression but also higher runtime speed: a lower bit width enables more fixed-point multiplications at the same chip clock frequency, providing dramatic data-level parallelism and hence higher speedup. Moreover, retraining methods can still benefit from feature map recovery: a 3-8-bit AlexNet with +25.43% Top-1 and +26.86% Top-5 improvement reaches 50.69% (Top-1) and 74.87% (Top-5) accuracy, providing a better starting point for retraining 3-bit networks.

Fig. 3. (a) Normalized accuracy of the compressed networks under different compression rates. “1” indicates original network precision. Direct pruning and the re-normalized network are shown as “P” and “RN”. We stop pruning when the accuracy drops to 0.85 (normalized). (b) Normalized accuracy over re-normalization iterations; 1 K different images were used in each iteration

5.2 Weight Pruning

To further verify the conclusion of Sect. 4.2, we apply network pruning (based on absolute weight values, as sketched below) to well-trained parameters. Figure 3(a) shows the trade-off between compression rate and accuracy. Within one iteration, i.e., using 1 K images, we recover the performance to a practical level (solid line in Fig. 3(a)). This steady performance improvement appears not only in network quantization but also in weight pruning.
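A PyTorch sketch of such magnitude-based pruning with a uniform per-tensor sparsity target; the paper does not give its exact pruning schedule, so the layer selection and threshold rule here are assumptions. In our pipeline this step would be followed by the re-normalization of Sect. 4.3.

```python
import torch

@torch.no_grad()
def magnitude_prune(model, sparsity=0.7):
    """Zero out the smallest-|w| fraction of every conv / fully-connected weight."""
    for p in model.parameters():
        if p.dim() < 2:                            # skip biases and BN parameters
            continue
        k = int(sparsity * p.numel())
        if k == 0:
            continue
        threshold = p.abs().flatten().kthvalue(k).values
        p.mul_((p.abs() > threshold).to(p.dtype))  # keep only large-magnitude weights
    return model
```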

Since AlexNet, with its over-parameterized inner-product layers, is the typical network structure for examining the effectiveness of pruning, we compare the representative pruning approach of [14] with ours in terms of compression rate. As listed in Table 6, our method prunes even more parameters in two layers, especially Fc1, which holds the most parameters in AlexNet, while the overall compression rate of the fully connected layers remains very close. Considering the training cost of both methods, ours has a significant efficiency advantage. Because of the accuracy loss under high compression rates, we show the trade-off between training cost and performance in Table 7.

In our experiments, deeper networks such as ResNet-50 and lightweight structures such as MobileNet show the same behavior. For \(3\times \) pruning, MobileNet achieves a +53.82% Top-5 improvement to 78.43%, keeping 43% of the convolution-layer and 7.3% of the fully-connected-layer parameters. ResNet-50 yields a +6.92% Top-5 improvement to 90.00%, keeping 35% of the convolution-layer and 10% of the fully-connected-layer parameters. The performance improvements are consistent across all our experiments, indicating that higher-performance networks yield correspondingly better recovered accuracy.

Table 6. Model sparsity comparison on AlexNet
Table 7. Comparison of different compressed models in terms of the number of training epochs and the final compression rate. “*" indicates an unreported number of training epochs. Label-based retraining methods are marked as “+Label"

5.3 Time Consumption

As listed in Table 8, most networks take only a few minutes to refine the distorted features, and as illustrated in Fig. 3(b), using more images contributes almost nothing to the final accuracy. Setting the batch size to 1 K is simply a trade-off between memory size and the sampling error of \(\mathbb {E}(\hat{x})\) and \(Var(\hat{x})\). With a large-memory GPU, the whole process may take only a few seconds, reducing the time consumption by several orders of magnitude. We believe that this learning-time speedup with limited unlabeled data is far more practical in real-world applications, since a slight accuracy loss is unnoticeable to customers.

Table 8. Time consumption of feature recovery (1 batch = 1K images), evaluated on Intel Xeon CPU E5-2680 v4 @2.40 GHz x2

6 Conclusion

In this paper, we analyze the compression loss from a Bayesian perspective and show that the misfit of batch normalization statistics is one of the crucial reasons for the performance loss. Using the proposed Quasi-Lloyd-Max algorithm and re-normalization, we quantize networks to 4 bits at nearly full-precision accuracy without retraining. Our pruning experiments further confirm the robustness of this analysis. The learning process is far more efficient than existing methods since considerably less data is required. In conclusion, we partly solve the real-world challenge of compressing deep neural networks with limited unlabeled data, which can be applied in a wide range of applications.