1 Introduction

As one of the most common household appliances, the refrigerator plays a vital role in daily life worldwide. In recent decades, China has ranked first in refrigerator production, and the refrigerator industry has contributed considerably to China’s economy. The National Bureau of Statistics of China [1] reported that China produced 79.92 million refrigerators in 2015.

Various vision-based applications in the refrigerator manufacturing process have been put forward along with the development of industrial automation, especially with the emergence of Industry 4.0 standardization [2–4]. For example, automatic classification is potentially invaluable for industrial automation. It helps to overcome several disadvantages of the manual inspection that is widely used on the production line, e.g., high labor cost and inter-/intra-observer variation.

As an emerging problem, image-based automatic classification of refrigerators is important because it can provide a valuable tool for the industrial automation of refrigerator production. Various machine-learning-based image classification methods have been proposed for industrial automation. However, none of them was designed for the automatic classification of refrigerators, which at present relies mainly on human visual inspection. Machine-learning-based classification methods can be roughly divided into two categories: classifiers combined with handcrafted features, and methods that extract features automatically. We experimentally evaluate support vector machine (SVM)-based methods [5–8] and convolutional neural network (CNN)-based methods on the front-view image of the refrigerator, which is chosen because it captures rich information for recognizing and differentiating refrigerators. Both families of methods are effective in classifying refrigerator front-view images, and the experimental results show that the CNN-based methods outperform the state-of-the-art SVM-based methods with handcrafted features. However, we observe fundamental challenges in the front-view image of the refrigerator, including dense clutter, varying backgrounds, large homogeneous areas, subtle salient regions, and specular reflection.

In the field of machine learning, recent advances [9–11] in deep learning have led to breakthroughs in long-standing tasks such as language translation, text recognition, and image-related problems including feature extraction [12, 13], image segmentation [14–16], and image classification [17–22]. Among these techniques, the CNN is one of the most successful and has been widely applied to image classification [9]. Recently, there has been an increasing trend to develop variants of deep neural networks for tasks that were previously difficult to handle. For instance, in [23], a deep polynomial network was presented for tumor classification with small ultrasound images, and the reported classification accuracy for breast ultrasound images is 92.40 ± 1.1%. In [24], a deep-learning framework was proposed for face attribute prediction in the wild, which is challenging because of complex face variations. It cascades two types of CNNs, which are pre-trained differently and fine-tuned jointly with attribute tags.

In this paper, we present a tool for classifying refrigerators based on images taken from the front view, built on a novel CNN architecture adapted to the particular characteristics of refrigerator images. Our approach is data driven and free from handcrafted image features. It relies on a training process to automatically extract multi-scale image features that combine the local and global characteristics of the refrigerator front-view image. The mapping between a refrigerator and its category, learned from these features, provides a function that automatically assigns a category to a new input image. The proposed CNN architecture takes a triplet of images as input, from which a triplet loss (similarity loss) is computed. In addition, the traditional convolutional layer is replaced by a doubly convolutional layer. The triplet loss and the doubly convolutional layer are discussed in detail in Section 2. Our experimental results on 31,247 refrigerator front-view images of 30 categories show that the proposed CNN architecture achieves a classification accuracy of 99.96%, which is considerably better than conventional classification techniques.

Our work offers the following four contributions:

  1. To the best of our knowledge, this is the first attempt to introduce deep-learning methods into the image classification of refrigerators.

  2. We propose a novel CNN architecture with a triple input and a doubly convolutional layer, both of which are adapted to the characteristics of refrigerator front-view images.

  3. In addition to the softmax loss (classification loss) commonly used in previous CNN frameworks, we introduce a triplet loss (similarity loss) into the proposed architecture. Moreover, the original convolutional layer of conventional CNN architectures is replaced by a doubly convolutional layer.

  4. Our approach clearly outperforms both the state-of-the-art image classification techniques and human visual inspection.

The rest of this paper is organized as follows. In Section 2, we present our proposed refrigerator image classification approach. Section 3 contains experimental results and our discussions. In Section 4, we present the conclusion.

2 Our method

Our method classifies refrigerators according to their front-view images. We have observed various challenges in these images. To address them, we propose a novel CNN architecture, described in Section 2.2.

2.1 The challenges in the front-view image of refrigerator

We observed various challenges in the front-view images of the refrigerator, including dense clutter, varying backgrounds, large homogeneous areas, subtle salient regions, and specular reflection, which are described in detail as follows:

  • Dense clutter. We observe dense clutter in the front-view images of the refrigerator. The paper and sellotape attached to the refrigerator on the left of Fig. 1 are typical examples. Compared with the image on the right of Fig. 1, the image on the left contains redundant information from the paper and sellotape. Without appropriate processing, such clutter would severely distort the classification result.

    Fig. 1 The dense clutter in the front-view images of the refrigerator

  • Varying backgrounds. Because the position and surroundings of the refrigerator on the assembly line change, the backgrounds of the images differ from each other. Figure 2 shows two images of refrigerators that belong to the same category but have different backgrounds. Because the varying backgrounds add unstable factors to the images, they unavoidably increase the complexity of the image classification.

    Fig. 2 The varying backgrounds in the front-view images of the refrigerator

  • Large homogeneous areas and subtle salient regions. The images of the refrigerator contain large homogeneous areas, as shown in Figs. 1, 2, 3, and 4. Meanwhile, the salient regions, including the display, crack, ice maker, and handle of the refrigerator, are subtle compared with the overall image, e.g., the areas enclosed by the red and orange squares in Fig. 3. Both the large homogeneous areas and the small salient regions can cause misclassification and greatly reduce the classification accuracy.

    Fig. 3 The large homogeneous areas and subtle salient regions in the front-view images of the refrigerator

  • Specular reflection. Because refrigerators have mirror-like surfaces, noticeable specular reflection is observed in the captured images. Figure 4 shows two refrigerators of the same category; the specular reflection on the surface of the left refrigerator is more apparent than that on the right one. Specular reflection introduces redundant information about the surroundings into the front-view images of the refrigerator.

    Fig. 4 The specular reflection in the front-view images of the refrigerator

2.2 The architecture of our proposed CNN

Based on the analysis of the challenges in the front-view images of refrigerators, we propose a novel CNN to perform the image classification of refrigerators. Notably, this CNN can also be employed to classify images of other household appliances, including air conditioners, televisions, and washing machines. Each branch of the parameter-sharing CNN contains three convolutional layers and one fully connected layer. A softmax loss and a triplet loss are then combined at the end of the architecture, as shown in Fig. 5.

Fig. 5 The architecture of our proposed deep CNN. The proposed CNN takes a triplet of images as input (the original image, the positive image, and the negative image). Three parameter-sharing CNNs process the original image, the positive image, and the negative image, respectively. The softmax loss and the triplet loss are combined at the end of the architecture. L2 denotes L2-normalization

Each captured image is divided into image patches of the same size. For each image, a positive image (same category) and a negative image (different category) are chosen from the captured images. The three image patches are then fed jointly into our proposed CNN, in which three parameter-sharing CNNs handle the original, the positive, and the negative images, respectively.
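The selection of a positive and a negative image for each original image can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the positive and negative images are chosen, so uniform random sampling is assumed here, and the function name `sample_triplet` and the toy labels are hypothetical.

```python
import random


def sample_triplet(labels, anchor_index, rng):
    """Return (original, positive, negative) indices for one original image.

    Assumption: candidates are drawn uniformly at random; the paper does not
    specify the selection strategy.
    """
    anchor_label = labels[anchor_index]
    # Positive candidates: same category, but not the original image itself.
    positives = [i for i, lab in enumerate(labels)
                 if lab == anchor_label and i != anchor_index]
    # Negative candidates: any image of a different category.
    negatives = [i for i, lab in enumerate(labels) if lab != anchor_label]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative candidate")
    return anchor_index, rng.choice(positives), rng.choice(negatives)


# Toy category labels (hypothetical), one per captured image.
labels = ["A", "A", "B", "B", "C"]
rng = random.Random(0)
original, positive, negative = sample_triplet(labels, 0, rng)
```

Each returned index triple would then select the three image patches that are passed through the three parameter-sharing branches.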

Furthermore, to improve the parameter-sharing efficiency of the CNN, we replace the independent convolutional filters in the convolutional layers with a set of filters that are translated versions of each other. This is implemented by a two-step convolution operation, the so-called doubly convolutional layer, as shown in Fig. 6.

Fig. 6 The structure of a convolutional layer (top) and a doubly convolutional layer (bottom). By adding one extra convolution operation (in the purple rectangle), the independent filters (in the yellow rectangle) are changed into translated versions of each other (in the brown rectangle)
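One way to picture filters that are translated versions of each other is to slice every smaller window out of a single larger meta-filter. The sketch below assumes that reading. The function name `translated_filters` is hypothetical, and treating an 11×11 kernel as the meta-filter and 7×7 as the window size is an interpretation of the first-layer kernel sizes given in this section, not something the text states.

```python
def translated_filters(meta_filter, size):
    """Slice every size x size window out of a square meta-filter.

    Neighbouring windows overlap, so the resulting filters share weights and
    are shifted copies of one another.
    """
    z = len(meta_filter)
    span = z - size + 1  # number of window positions along each axis
    return [
        [row[c:c + size] for row in meta_filter[r:r + size]]
        for r in range(span)
        for c in range(span)
    ]


# Toy 11x11 meta-filter whose entries encode their own position.
meta = [[r * 11 + c for c in range(11)] for r in range(11)]
filters = translated_filters(meta, 7)
# An 11x11 meta-filter yields (11 - 7 + 1)^2 = 25 overlapping 7x7 filters.
```

Under this assumption, one set of 121 meta-filter weights yields 25 filters of 49 weights each, which is where the gain in parameter sharing would come from.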

  • First convolutional layer. Forty-eight kernels of 3×11×11 and 3×7×7 are applied to the input refrigerator image patch, followed by a rectified linear unit (ReLU) layer. The ReLU is applied after the convolution operation to handle the sparseness of refrigerator images.

  • Second convolutional layer. One hundred twenty-eight kernels of 3×9×9 and 3×5×5, followed by a ReLU layer as in the first layer, extract further features.

  • Third convolutional layer. One hundred twenty-eight kernels of 3×7×7 and 3×3×3, followed by a ReLU layer as in the first layer, extract further features.

  • Fully connected layer. Five hundred twelve neurons followed by a ReLU layer perform high-level reasoning, as in conventional neural networks.

In the first convolutional layer, the input image patch is filtered into 48 feature maps with 3×11×11 and 3×7×7 kernels at a stride of 2. The second convolutional layer filters the output of the first convolutional layer into 128 feature maps with 3×9×9 and 3×5×5 kernels. The third convolutional layer in turn filters the output of the second convolutional layer into 128 feature maps with 3×7×7 and 3×3×3 kernels. The fully connected layer that follows has 512 neurons.
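The spatial size of a feature map follows from the standard convolution arithmetic, which the short sketch below works through for the first layer. Several values are assumptions rather than facts from the text: the input width of 227 pixels is taken from the resizing step in Section 3.1, zero padding is assumed, and the helper name `conv_output_size` is hypothetical. Because the text lists two kernel sizes per layer, both are evaluated.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output width of a convolution: floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# Assumed 227-pixel-wide input, stride 2 (from the text), no padding (assumed).
width_with_11 = conv_output_size(227, 11, stride=2)  # (227 - 11) // 2 + 1
width_with_7 = conv_output_size(227, 7, stride=2)    # (227 - 7) // 2 + 1
```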

The softmax loss is located at the end of the CNN branch for the original image and is expressed as:

$$ \text{SoftmaxLoss}=-\sum_{i=1}^{N} \log P\left(\omega_{k}\mid L_{i}\right), $$
(1)

where N denotes the number of input images and P(ω_k|L_i) denotes the probability that the ith image patch, with label L_i, is correctly classified into its class ω_k.
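As a minimal sketch of Eq. (1), the code below sums the negative log-probability that a softmax over the network outputs assigns to the correct class of each image. The function names and the toy logits are hypothetical, and the per-class outputs are assumed to be raw scores (logits) that are turned into probabilities by a softmax.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(batch_logits, labels):
    """Eq. (1): negative sum of log-probabilities of the correct classes."""
    return -sum(
        math.log(softmax(logits)[label])
        for logits, label in zip(batch_logits, labels)
    )


# Two toy images over three classes; labels are class indices.
loss_value = softmax_loss([[2.0, 0.5, 0.1], [0.0, 0.0, 0.0]], [0, 1])
```

With uniform scores over three classes, the second image contributes log 3 to the sum, which is the loss of a classifier that has no preference among the classes.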

In addition to the softmax loss, which has been used in other CNNs, our proposed CNN architecture also includes a triplet loss:

$$ \text{TripletLoss}=\frac{1}{2N}\sum_{i=1}^{N} \max\left(0,D(o_{i},p_{i})-D(o_{i},n_{i})+m\right), $$
(2)

where D(·,·) denotes the squared Euclidean distance between two L2-normalized vectors; o_i, p_i, and n_i respectively denote the L2-normalized vectors from the original image, the positive image, and the negative image, as shown in Fig. 5; and m is a margin hyperparameter. The original image is expected to lie closer to the positive image than to the negative image, so a triplet contributes zero loss once the distance between the original and the negative image exceeds the distance between the original and the positive image by at least m.
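Eq. (2) can be sketched directly from its definition. The function names and the toy two-dimensional feature vectors below are hypothetical; the margin value 0.5 is chosen only for illustration and is not a value reported in this paper.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def squared_distance(a, b):
    """D(.,.) in Eq. (2): squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(originals, positives, negatives, margin):
    """Eq. (2): hinge on D(o, p) - D(o, n) + m, averaged with factor 1/(2N)."""
    total = 0.0
    for o, p, n in zip(originals, positives, negatives):
        o, p, n = l2_normalize(o), l2_normalize(p), l2_normalize(n)
        total += max(0.0, squared_distance(o, p) - squared_distance(o, n) + margin)
    return total / (2 * len(originals))


# Well-separated triplet: positive coincides with the original, negative is orthogonal.
easy = triplet_loss([[1.0, 0.0]], [[2.0, 0.0]], [[0.0, 3.0]], margin=0.5)
# Violating triplet: the roles of positive and negative are swapped.
hard = triplet_loss([[1.0, 0.0]], [[0.0, 3.0]], [[2.0, 0.0]], margin=0.5)
```

In the first case D(o, p) = 0 and D(o, n) = 2, so the hinge is inactive and the loss is 0. In the second case the hinge evaluates to 2 - 0 + 0.5 = 2.5, giving a loss of 1.25 after the 1/(2N) factor.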

The two losses mentioned above are then integrated into a weighted combination:

$$ \text{Loss}=\lambda \text{SoftmaxLoss}+(1-\lambda)\text{TripletLoss}, $$
(3)

where λ denotes the weight that controls the trade-off between the softmax loss and the triplet loss.
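The weighting in Eq. (3) is a convex combination of the two losses, as the small sketch below illustrates. The function name `combined_loss`, the example loss values, and the weight 0.75 are hypothetical and are not values reported in this paper; restricting λ to the interval [0, 1] is an assumption consistent with the form of Eq. (3).

```python
def combined_loss(softmax_loss_value, triplet_loss_value, lam):
    """Eq. (3): lam * SoftmaxLoss + (1 - lam) * TripletLoss."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is assumed to lie in [0, 1]")
    return lam * softmax_loss_value + (1.0 - lam) * triplet_loss_value


# Illustrative loss values only.
only_softmax = combined_loss(2.0, 4.0, lam=1.0)   # triplet term vanishes
only_triplet = combined_loss(2.0, 4.0, lam=0.0)   # softmax term vanishes
weighted = combined_loss(2.0, 4.0, lam=0.75)      # 0.75 * 2.0 + 0.25 * 4.0
```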

The feature maps extracted hierarchically by our proposed CNN are shown in Fig. 7. The extracted features include, but are not limited to, the color, shape, and texture of the refrigerators. They comprise both local and global features of the front-view images of the refrigerator.

Fig. 7 The feature maps extracted hierarchically by our proposed CNN architecture

Figure 7 shows that in the first convolutional layer (the first column in Fig. 7), global features including the shape and edges of the refrigerator are extracted. In the two following convolutional layers (the second and third columns in Fig. 7), local features are extracted hierarchically. Notably, these features, unlike the handcrafted features exploited in [5–8], are extracted automatically by our proposed CNN.

3 Results and discussion

3.1 Experimental samples and pre-processing

All refrigerator images were captured with a digital camera (Canon EOS 760D, Japan) in manual mode, with an exposure level of 0.0 and without zoom or flash. The color space used for classification is sRGB (standard RGB). The captured images have an original size of 3200 × 2400 pixels (0.1 mm/pixel) and are stored in PNG (Portable Network Graphics) format. To reduce the over-fitting that may arise from having only a few images, we enlarge the dataset of captured images with augmentation techniques such as translations and vertical and horizontal reflections. In total, the dataset contains 31,247 refrigerator images belonging to 30 categories.
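The augmentation operations named above (translation, vertical reflection, and horizontal reflection) can be sketched on a tiny image stored as a nested list. The function names, the zero fill value for pixels shifted in from outside the image, and the 2×2 toy image are assumptions for illustration; the paper does not give the translation offsets or the border handling.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Mirror the image top to bottom."""
    return [row[:] for row in image[::-1]]


def translate(image, dx, dy, fill=0):
    """Shift the image by dx columns and dy rows, padding with `fill`."""
    height, width = len(image), len(image[0])
    return [
        [
            image[r - dy][c - dx]
            if 0 <= r - dy < height and 0 <= c - dx < width
            else fill
            for c in range(width)
        ]
        for r in range(height)
    ]


toy = [[1, 2],
       [3, 4]]
augmented = [flip_horizontal(toy), flip_vertical(toy), translate(toy, dx=1, dy=0)]
```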

After data augmentation, the images are resized to 227×227 pixels. Of all the images, 20,000 are used as training data, and the remaining 11,247 are used as testing data.
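The split into training and testing data can be sketched as below. The text does not state how the 20,000 training images are selected, so a seeded random shuffle is assumed here; the function name `split_dataset` and the use of integer image identifiers are likewise hypothetical.

```python
import random


def split_dataset(items, n_train, seed=0):
    """Shuffle a copy of the items and cut it into training and testing parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# One identifier per image in the augmented dataset of 31,247 images.
image_ids = list(range(31247))
train_ids, test_ids = split_dataset(image_ids, 20000)
```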

3.2 Training

During training, each input to the CNN consists of a captured refrigerator image with its classification label, together with a positive image and a negative image for that original image. The proposed CNN is updated by back-propagation, using the softmax loss computed from the original image and the triplet loss computed from the three input images. Training is performed on a high-performance GPU for 10^5 iterations, each of which takes around 2.4 s. Through training, our proposed CNN establishes the mapping between the input refrigerator image and the output refrigerator category.

3.3 Experimental result

After the training process, we conducted experiments to evaluate the performance of our proposed CNN. The experimental result on the full testing data is shown in Fig. 8. Fig. 8 shows that the classification accuracy of our method reaches 99.99% after about 3000 iterations. Meanwhile, the training loss of our method decreases to 0.01. (The decrease of the training loss reflects the convergence of the proposed CNN architecture.)

Fig. 8 The experimental result of our proposed CNN architecture

Notably, because numerous refrigerator images that exhibit dense clutter, varying backgrounds, large homogeneous areas, subtle salient regions, and specular reflection are included in our training process, the negative effects of these challenges on the image classification are significantly reduced by our method. The experimental results show that our method handles these challenges well.

We conducted comparative experiments among the state-of-the-art SVM-based image classification methods [5–8], human visual inspection, and our method. The SVM-based methods exploit different handcrafted features, including single features and composite features.

To compare the performance of our method and the state-of-the-art SVM methods quantitatively, we use Precision and Recall, as expressed in Eqs. (4) and (5).

$$ \text{Precision}=\frac{\text{TP}}{\text{TP+FP}} $$
(4)
$$ \text{Recall}=\frac{\text{TP}}{\text{TP+FN}} $$
(5)

where TP denotes the number of true positives, FP the number of false positives, TN the number of true negatives, and FN the number of false negatives. The comparative experimental results are presented in Table 1.
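For a multi-class problem such as this one, Eqs. (4) and (5) are commonly evaluated per category by treating that category as the positive class. The sketch below assumes this one-versus-rest reading, which the text does not spell out; the function name `precision_recall` and the toy labels are hypothetical and are not experimental data.

```python
def precision_recall(true_labels, predicted_labels, positive):
    """Eqs. (4) and (5) for one category treated as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy example: one image of category A is misclassified as B.
y_true = ["A", "A", "B", "B", "C"]
y_pred = ["A", "B", "B", "B", "C"]
precision_b, recall_b = precision_recall(y_true, y_pred, "B")
precision_a, recall_a = precision_recall(y_true, y_pred, "A")
```

For category B the counts are TP = 2, FP = 1, and FN = 0, giving a Precision of 2/3 and a Recall of 1. For category A they are TP = 1, FP = 0, and FN = 1, giving a Precision of 1 and a Recall of 0.5.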

Table 1 The Precision and Recall of SVM methods, human’s visual system, and our method

The second and third columns of Table 1 show the Precision and Recall on the full testing data, respectively. From these, we conclude that our proposed method outperforms both the state-of-the-art methods and human visual inspection. Note that there are 30 categories of refrigerators in our experiments, which greatly limits the performance of human visual inspection. When the number of classes is small, human visual inspection performs well.

To test the performance of our proposed method in extreme situations, we intentionally captured images with artifacts, as shown in Fig. 9.

Fig. 9 The testing data with different artifacts

The experimental results for these images are shown in the fourth and fifth columns of Table 1. These results show that our method also performs better than the state-of-the-art methods on images with artifacts.

Some misclassifications occur because of the severe artifacts in these images. Most of them might be caused by extreme viewing conditions, e.g., workers standing in front of the refrigerator. Nevertheless, the experimental results show that our method still obtains satisfactory outcomes even in such extreme situations.

3.4 Analysis

As mentioned in Section 2.2, the local and global features are extracted automatically by our proposed CNN. Similar to the human visual system, our proposed CNN can extract global features, including the color and shape of the refrigerators, and local features from the display, crack, ice maker, and handle of the refrigerators. The global features, integrated with the local features, form a layout for each category of refrigerators. The whole layout is used to establish the mapping between the input front-view image of the refrigerator and the output classification label.

Furthermore, unlike previously proposed CNNs, our CNN architecture with the introduced triplet loss significantly enhances the accuracy of image classification by jointly optimizing the classification loss (between the label of the original image and the output classification result) and the similarity loss (among the original image, the positive image, and the negative image), as shown in Fig. 5. The parameter λ in Eq. (3), which balances the softmax loss and the triplet loss, plays an essential role in the proposed CNN architecture. When λ is set to 1 or 0, the objective degenerates to the softmax loss alone or the triplet loss alone, respectively. Through trial and error, we found it reasonable to assign λ a value greater than 0.5, which suggests that the softmax loss should carry more weight than the triplet loss in our proposed CNN architecture. Nevertheless, the triplet loss contributes to the image classification through the complementary information provided by the positive and negative images. It suits the characteristics of refrigerator front-view images: dense clutter, varying backgrounds, subtle salient regions, and specular reflection can all be captured and represented by the combination of the softmax loss and the triplet loss, so that the subtle intra-category and inter-category differences between refrigerators can be distinguished.

Meanwhile, the doubly convolutional layer presented in our approach also positively affects the outcome of the proposed CNN architecture. It not only enhances the parameter-sharing efficiency but also accommodates the characteristics of the captured images. Because the refrigerators were transported on the same conveyor belt and the position of the camera was fixed, the captured images of refrigerators of the same class are translated versions of each other, which suits the feature maps learned by doubly convolutional layers.

The comparative experimental results in Fig. 8 show that our proposed method outperforms the state-of-the-art image classification methods and achieves better accuracy in the image classification of refrigerators. Given this accuracy, our method can satisfy the practical requirements of the industrial automation of refrigerator production.

4 Conclusions

In this paper, we propose a CNN-based image classification method that uses front-view images of the refrigerator and copes with the challenges they present, including dense clutter, varying backgrounds, large homogeneous areas, and specular reflection. Our experimental results validate the ability of the proposed technique to solve the refrigerator classification task in practical industrial automation.

Our work makes the following contributions. First, this is the first time that a machine-learning-based method has been introduced into the classification of refrigerators in place of human visual inspection. Second, the triplet loss and the doubly convolutional layer introduced by our method both improve the accuracy of image classification. To the best of our knowledge, this is an early application of both the triplet loss and the doubly convolutional operation in a CNN-based classifier. Third, similar to the human visual system, our proposed CNN can extract multi-scale features, including both local and global features, from front-view images of the refrigerator. The extraction of multi-scale image features plays a vital role in the human visual system, and our method performs it automatically. Finally, our approach clearly outperforms both the state-of-the-art image classification techniques and human visual inspection. Our proposed CNN-based method is a potential tool for the image classification of refrigerators and of other images with characteristics similar to those of the refrigerator front view.

In the future, we will continue to explore the implicit process of feature extraction by the CNN and attempt to find an explicit mapping between the performance of the CNN and the roles of the local and global features it extracts. We will also investigate further applications of our proposed CNN-based classification method in various fields, including medical image processing, object classification, and recognition [25–29].