A deep learning-based surface defect inspection system using multi-scale and channel-compressed features



Jiangxin Yang, Guizhong Fu, Wenbin Zhu, Yanlong Cao, Yanpeng Cao, Member, IEEE, and Michael Ying Yang, Senior Member, IEEE

Abstract— In machine vision-based surface inspection tasks, defects are typically considered as local anomalies in a homogeneous background. However, industrial workpieces commonly contain complex structures, including hollow regions, welding joints, or rivet holes. Such obvious structural interference will inevitably cause a cluttered background and mislead the classification results. Moreover, the sizes of various surface defects might change significantly. Last but not least, it is extremely time-consuming and not scalable to capture large-scale defect data sets to train deep CNN models. To address these challenges, we first propose to incorporate multiple convolutional layers with different kernel sizes to increase the receptive field and to generate multiscale features. As a result, the proposed model can better handle the cluttered background and defects of various sizes. Also, we purposely compress the size of parameters in the newly added convolutional layers for better learning of defect-related features using a limited number of training samples. Evaluated on a newly constructed surface defect data set (images contain complex structures and defects of various sizes), our proposed model achieves more accurate recognition results compared with the state-of-the-art surface defect classifiers. Moreover, it is a lightweight model and can deliver real-time processing speed (>100 frames/s) on a computer equipped with a single NVIDIA TITAN X Graphics Processing Unit (12-GB memory).

Index Terms— Cluttered background, convolutional neural network (CNN), defect classification, feature extraction, multi-receptive field (MRF), surface inspection.

I. INTRODUCTION

SURFACE defects of raw materials (e.g., steel, plastics, and stones) cause a reduction of corrosion resistance, plasticity, and fatigue limit [1] and thus decrease the quality of the final product. Surface defect inspection plays an important role in many industrial production tasks, providing an essential functionality to reduce resource waste and eliminate risk to human safety [2], [3]. However, it is highly subjective, labor-intensive, and time-consuming to train/deploy human inspectors to perform visual quality inspection on a daily basis [4]. Therefore, it is critically important to develop accurate and fully automatic machine vision-based inspection solutions for assisting or replacing the decisions made by human experts.

Manuscript received October 15, 2019; revised March 20, 2020; accepted March 27, 2020. Date of publication April 10, 2020; date of current version September 15, 2020. This work was supported by the National Natural Science Foundation of China under Grant 51605428, Grant 51575486, and Grant U1664264. The Associate Editor coordinating the review process was Zheng Liu. (Corresponding author: Yanpeng Cao.)

Jiangxin Yang, Guizhong Fu, Wenbin Zhu, Yanlong Cao, and Yanpeng Cao are with the State Key Laboratory of Fluid Power and Mechatronic Systems, School of Mechanical Engineering, Zhejiang University, Hangzhou 310027, China (e-mail: yangjx@zju.edu.cn; 11525024@zju.edu.cn; wenbinzhu@zju.edu.cn; sdcaoyl@zju.edu.cn; caoyp@zju.edu.cn).

Michael Ying Yang is with the Scene Understanding Group, ITC Faculty Geo-Information Science and Earth Observation, University of Twente, 7514 AE Enschede, The Netherlands (e-mail: michael.yang@utwente.nl).

Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIM.2020.2986875

In the past decades, many machine vision-based surface inspection methods have been proposed for noncontact, nondestructive, and fully automatic defect detection/classification of various surface textures. Statistical approaches extract distinctive features of texture images based on autocorrelation function [5], co-occurrence matrix [6], and multiple fractal features [7]. Spectral approaches build the high-level representation of defects using feature extraction techniques, including Fourier transform [8], Gabor filters [9], and wavelet transform [10]. Then, various feature classifiers, such as thresholding scheme [11], support vector machine (SVM) [12], Markov random fields (MRFs) [13], and neural networks (NNs) [14], are utilized to differentiate defect and normal image patches. However, the performance of the abovementioned techniques heavily depends on how well the handcrafted features can depict the visual characteristics of surface defects. It is a challenging task to design the optimal feature representations and achieve accurate recognition results in the presence of “interclass” similarity and “intraclass” diversity of surface defects [15].

Recently, convolutional neural network (CNN) models have significantly boosted the performance of various computer vision tasks, including object detection [16], [17], image segmentation [18], and face recognition [19]. Given a number of training samples, CNNs automatically construct hierarchical features by assembling low-level features to generate high-level representations. Krizhevsky et al. [20] proposed the first deep CNN model (AlexNet), which is implemented on a graphics processing unit (GPU), to achieve high-accuracy classification results in the ImageNet LSVRC 2010 contest. The Visual Geometry Group (VGG) at Oxford University further presented a very deep CNN model (VGG NNs) that is commonly utilized as a backbone architecture to facilitate other computer vision tasks [17]. He et al. [16] proposed a novel residual architecture to improve the training of very deep CNN models and achieve higher accuracy by increasing network depths. A noticeable drawback of the abovementioned deep CNN models is that they contain a large number of parameters and cannot deliver real-time speed. Iandola et al. [21] proposed a lightweight CNN model (SqueezeNet) that achieves the equivalent accuracy of AlexNet while using significantly fewer parameters (i.e., model size of SqueezeNet is less than 0.5 MB). Other lightweight CNN models, including MobileNet [22] and ShuffleNet [23], also attempted to strike a good balance between performance and efficiency.

The recent successful application of CNN models on various computer vision tasks (e.g., target detection and object recognition) has inspired new developments to build accurate and fully automatic industrial inspection systems. Li et al. [24] built up an end-to-end (ETE) surface defect recognition system that generates saliency maps as the classification results of seven types of steel strip defects. Ren et al. [25] obtained pixel-wise prediction by convolving the trained multinomial logistic regression classifier over the input image. However, they directly apply the pretrained Decaf model [26] for defect-specific feature extraction. Fu et al. [27] proposed a compact yet effective CNN model, which emphasizes the fine-tuning of low-level features and incorporates multireceptive fields (MRF), to achieve fast and accurate steel surface defect classification. Note that the surface defects are typically considered as local anomalies in a homogeneous background [25], [27], an assumption that is not satisfied in many practical industrial inspection tasks.

It is not a trivial task to develop deep learning-based surface defect recognition approaches that work reliably in real-world inspection situations. Many industrial workpieces contain obvious structural interference (SI) such as fastener holes, bolt holes, welding joints, and grooves, incurring a cluttered background that misleads the classification results. Discriminating surface defects in a cluttered background is a challenging task. Moreover, the size of various surface defects might change significantly in the captured images, and thus, the proposed CNN classification models are required to handle the defects of various sizes. Finally, it is extremely time-consuming and not scalable to capture large-scale defect data sets to train deep CNN models.

In this article, we propose a surface defect detection and classification framework. It consists of three major processing steps, including region of interest (ROI) extraction, defect classification, and defect localization. First, an ROI area is defined in a real-captured image of the target workpiece via background segmentation and template matching techniques. Then, the extracted ROI is uniformly divided into a number of image patches, and each patch is fed to a CNN-based model for surface defect classification. Finally, spatially adjacent image patches with the same class labels are merged to generate a location map to indicate the positions of various surface defects. Note that the proposed framework can be easily adopted to build accurate and fully automatic industrial inspection applications. The core of our proposed framework is a compact yet effective SqueezeNet-based model to accurately classify surface defects of various sizes in the cluttered background. We incorporate multiple convolutional layers with different kernel sizes to increase the receptive field (RF) and generate multiscale features.

Fig. 1. Hardware setup of the image acquisition system.

Therefore, our proposed model can effectively handle cluttered background and defects of various sizes. Moreover, we experimentally demonstrate that it is feasible to compress the size of parameters in the newly added convolutional layers and achieve improved recognition accuracy using a limited number of training samples. Our experimental results are consistent with many previous research works, such as [28]–[30]. To evaluate the performance of the proposed defect classifier, we further construct a new surface defect data set called USB-SD. More specifically, we capture the images of Universal Serial Bus (USB) connectors that are made of reflective stainless steel and contain complex structures (e.g., hollow areas, welding joints, and rivet holes) and defects of various sizes (e.g., Dent, Spot, Bright Line, and Scratch). The contributions of this article are summarized as follows.

1) We construct a new surface defect data set (USB-SD) that contains the images captured in more practical inspection situations. Different from many previous data sets (e.g., hot-rolled steel [15], [31], wood [32], or fabric [33]), the target workpieces (USB connectors) contain complex structures, including hollow areas, welding joints, and rivet holes. Moreover, the size of different surface defects changes significantly in the USB-SD data set, ranging from ∼50 pixels (e.g., Dent or Spot defects) to >5000 pixels (e.g., Bright Line and Scratch defects).

2) We propose a SqueezeNet-based CNN model that achieves more accurate recognition results compared with the state-of-the-art defect classifiers [15], [24], [25], [27], [34]. It incorporates multiple convolutional layers with different kernel sizes to extract multiscale features and achieve larger RFs. As a result, the proposed model can better handle cluttered background and defects of various sizes. Moreover, we experimentally demonstrate that compressing the size of the extracted multiscale features leads to better training of defect-related features using a limited number of samples.

The rest of this article is organized as follows. Section II presents the details of our visual inspection system and the constructed USB-SD data set. Section III presents the details of the proposed surface defect detection/classification framework. Section IV provides the implementation details of the proposed CNN models. A systematic performance analysis of the proposed SN-MRF-CC model and its comparison with the state-of-the-art alternatives is provided in Sections V and VI. Finally, Section VII concludes this article.

Fig. 2. (a) Data capturing and labeling process and (b) some sample images of the USB-SD data set.

II. IMAGE ACQUISITION SYSTEM CONFIGURATION AND USB-SD DATA SET

The hardware configuration of the image acquisition system is shown in Fig. 1. A 2448 × 2048 monochrome industrial camera and a telecentric lens with a 260-mm working distance are utilized for image capturing. In many industrial inspection tasks, the telecentric lens provides a better alternative to fixed focal length lenses due to its low distortion and invariant magnification. The illumination device provides light stimulation to make the insignificant surface defects visually more obvious. We experimentally evaluated the lighting effects of red and blue color light sources with 30° and 60° incident angles. It is noted that defects on the highly reflective metallic surface can be better visualized by deploying a blue light source with shorter wavelength and setting a larger incident angle. As a result, a blue annular light-emitting diode (LED) lighting device with a 60° incident angle is used in our system. The USB connectors are placed in a fixture platform so that the charge-coupled device camera can adequately cover the entire workpieces.

As shown in Fig. 2(a), the full-size image is uniformly cropped into a number of 200 × 200 image patches, and then, all image patches are manually labeled by a human inspector. The USB-SD data set contains 8100 grayscale images of the normal metallic surface and six typical defects (i.e., Bright Line, Deformation, Dent, Scratch, Spot, and Stain). The training data set contains 6000 images in total, in which 2400 images are captured of the defect-free surfaces and 3600 samples are obtained covering the six defect types. Since surface defects typically occur with low probability, it is impractical to capture equal numbers of normal and defect images in practical industrial inspection applications. For the testing data set (2100 images in total), we collect a new batch of USB connectors and capture 300 samples for each of the seven classes (the normal surface and the six types of defects). Note that image patches in the training and testing data sets are captured using different batches of workpieces. Table I shows the number of manually labeled image patches per class (Normal, Bright Line, Deformation, Dent, Scratch, Spot, and Stain) in the USB-SD data set. The number of image samples with/without obvious SI in each class is also reported. Some sample images of six types of defects and normal surface are shown in Fig. 2(b).

TABLE I

NUMBER OF MANUALLY LABELED IMAGE PATCHES PER CLASS. IN TOTAL, THERE ARE SEVEN TYPES OF SAMPLES INCLUDING NORMAL (NR), BRIGHT LINE (BL), DEFORMATION (DF), DENT (DE), SCRATCH (SC), SPOT (SP), AND STAIN (ST) IN THE USB-SD DATA SET. WE ALSO COUNT THE NUMBER OF IMAGE SAMPLES WITH/WITHOUT OBVIOUS SI IN EACH CLASS
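The uniform cropping of a full-size image into 200 × 200 patches can be illustrated with a minimal sketch. It assumes non-overlapping tiles and drops border pixels that do not fill a whole tile; the text does not state how image borders are handled, and the function name `crop_patches` is hypothetical.

```python
import numpy as np

def crop_patches(image, patch=200):
    """Split a 2-D grayscale image into non-overlapping patch x patch tiles.

    Assumption: border pixels that do not fill a whole tile are dropped.
    Returns a list of ((row, col), tile) pairs, where (row, col) is the
    tile index in the patch grid.
    """
    rows, cols = image.shape[0] // patch, image.shape[1] // patch
    tiles = []
    for r in range(rows):
        for c in range(cols):
            tile = image[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
            tiles.append(((r, c), tile))
    return tiles

# A blank stand-in for one 2448 x 2048 camera frame (height 2048, width 2448).
full = np.zeros((2048, 2448), dtype=np.uint8)
tiles = crop_patches(full)
print(len(tiles))  # 10 rows x 12 columns = 120 patches
```

Keeping the grid index with each tile makes it straightforward to map a patch-level class label back to its position in the full image.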

In Fig. 3(a), we show the comparative samples in our USB-SD and some other surface defect data sets [15], [32], [33]. It is noted that the surface defects are typically considered as local anomalies in the homogeneous background in many previous data sets (e.g., hot-rolled steel [15], wood [32], or fabric [33]). In comparison, our target workpieces (USB connectors) contain many complex structures, including hollow areas, welding joints, and rivet holes. Significant SI will cause cluttered background in a large portion of the input images, as shown in Fig. 3(b). Moreover, the size of different surface defects changes significantly in the USB-SD data set. As shown in Fig. 3(c), the size of Dent or Spot defects is typically less than 50 pixels in a 200 × 200 image patch. In comparison, Bright Line and Scratch defects might cover a large portion of the image. We make use of images in the USB-SD data set to evaluate the performance of different models on classifying surface defects of various sizes in the cluttered background.

III. PROPOSED METHOD

The overall processing flow of the proposed surface defect detection/classification framework is shown in Fig. 4. Given a full-size input image, we first apply background segmentation and template matching techniques to define an ROI area that covers the target workpiece. The extracted ROI is uniformly cropped into a number of image patches, and each patch is fed to a CNN-based model to classify surface defects of various sizes in the cluttered background. Note that the proposed CNN model is built on the pretrained SqueezeNet and further fine-tuned using the labeled images in the USB-SD data set in the training stage. The trained CNN model computes a number of confidence scores to predict the class label (i.e., Normal, Bright Line, Deformation, Dent, Scratch, Spot, and Stain) for each input image patch. Finally, image patches that are spatially adjacent and have the same class label are merged to generate a location map of various surface defects.

Fig. 3. Cluttered background and defects of various sizes in the USB-SD data set. (a) Comparison of several surface defect data sets [15], [32], [33]. (b) Sample images without/with obvious SI. (c) Size of different surface defects changes significantly in the USB-SD data set.

Fig. 4. Overall processing flow of the proposed surface defect detection/classification framework.
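The final merging step can be sketched as a connected-component grouping over the grid of patch labels. This is only an illustration under stated assumptions: 4-connectivity between patches, class index 0 standing for Normal, and a hypothetical function name `merge_patches`; the text does not specify the adjacency rule.

```python
from collections import deque

def merge_patches(labels, normal=0):
    """Group 4-connected patches that share the same non-normal label.

    labels: 2-D list of class indices, one per patch in the grid.
    Returns a list of (class_label, [(row, col), ...]) regions.
    """
    rows, cols = len(labels), len(labels[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or labels[r][c] == normal:
                continue
            cls, queue, cells = labels[r][c], deque([(r, c)]), []
            seen[r][c] = True
            while queue:  # breadth-first flood fill over same-class neighbors
                y, x = queue.popleft()
                cells.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and labels[ny][nx] == cls):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append((cls, cells))
    return regions

# Toy 3 x 4 patch grid: class 1 forms one region, class 4 two separate ones.
grid = [[0, 1, 1, 0],
        [0, 0, 1, 0],
        [4, 0, 0, 4]]
regions = merge_patches(grid)
```

Each returned region can then be drawn as one highlighted area of the location map.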

A. ROI Extraction

In practical industrial inspection situations, the location and orientation of workpieces are typically not precisely fixed. We design a simple yet effective image processing method to define an ROI area in which the target workpiece is covered. First, we apply the Otsu segmentation technique [35] to highlight the image regions corresponding to the target object. The nonparametric Otsu algorithm calculates a single intensity threshold by minimizing the intraclass intensity variance (equivalently, maximizing the interclass variance) to divide the input image into the foreground and background pixels. Then, we compute the location and orientation of the target workpiece in the captured image through template matching of a predefined reference frame. For the USB connectors, we select the region of two hollow windows as the reference frame. Based on the computed orientation, we digitally rotate the input image so that the target workpiece appears vertical in the rectified image. Based on the physical size/shape of workpieces, we use a rectangular bounding box to define an ROI area covering the target workpiece and remove the redundant background. The extracted ROI is uniformly divided into a number of image patches, and each patch is fed to the CNN model for defect classification.
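The thresholding step can be sketched in a few lines of NumPy. The sketch assumes an 8-bit grayscale input and selects the threshold that maximizes the between-class variance; the function name `otsu_threshold` is hypothetical, and which side of the threshold corresponds to the workpiece depends on the lighting setup.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the Otsu threshold t of an 8-bit image (classes: <= t and > t)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)
    w0 = np.cumsum(prob)            # weight of the class with levels <= t
    w1 = 1.0 - w0                   # weight of the class with levels > t
    mu = np.cumsum(prob * levels)   # cumulative first moment up to t
    mu_total = mu[-1]
    # Between-class variance for every candidate t; 0/0 where a class is empty.
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * w0 - mu) ** 2 / (w0 * w1)
    sigma_b = np.nan_to_num(sigma_b, nan=0.0, posinf=0.0, neginf=0.0)
    return int(np.argmax(sigma_b))

# Toy bimodal image: dark background (50) and a bright object (200).
img = np.full((100, 100), 50, dtype=np.uint8)
img[30:70, 30:70] = 200
t = otsu_threshold(img)
mask = img > t  # True on the bright object
```

The resulting binary mask is the input one would pass to the subsequent template matching and rotation steps.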

B. Surface Defect Classification

The SqueezeNet is a lightweight architecture proposed by Iandola et al. [21] to alleviate the computational inefficiency of other very deep CNN models, such as VGG [17] or ResNet [16]. It can achieve high-accuracy recognition results using significantly fewer parameters. Moreover, the SqueezeNet model is easy to fine-tune, less prone to small data set overfitting, and suitable for embedded system implementation. In this article, we adopt the pretrained SqueezeNet model as the backbone architecture for accurate surface defect classification. As shown in Fig. 5, the SqueezeNet contains nine fire modules in which a squeeze convolution layer (using 1 × 1 filters) and two expand layers (using 1 × 1 and 3 × 3 filters) are deployed. The SqueezeNet model is pretrained on ImageNet for the image classification of 1000 categories. Since the USB-SD data set only contains seven categories of sample images (i.e., Normal, Bright Line, Deformation, Dent, Scratch, Spot, and Stain), we modify the number of output channels in the Conv-10 layer to 7 accordingly. Instead of the fully connected layer that is commonly adopted in many CNN-based classification architectures, including AlexNet [20] and VGG [17], a global average pooling (GAP) layer is used to compute the average over the 13 × 13 slices and generate 1 × 1 × 7 tensors. The configurations of individual layers/modules in the baseline SqueezeNet model for surface defect classification on the USB-SD data set are shown in Table II.

The parameters of the SqueezeNet-based model are updated by minimizing a multiclass loss function, which is defined as

$$L = -\sum_{k=1}^{7} t_k \log \Pr(y = k) \tag{1}$$

where $t_k = 1$ when the ground-truth label of an input image is $k$ and $t_k = 0$ otherwise, and $\Pr(y = k)$ is the predicted probability of the $k$th class, which is calculated by utilizing the softmax function as

$$\Pr(y = k) = \frac{e^{G_k}}{\sum_{j=1}^{7} e^{G_j}} \tag{2}$$

where $G_k$ denotes the $k$th output of the GAP layer. Note that the confidence score $\Pr(y = k)$ predicts the existence of a defect of the $k$th category in an image (e.g., the USB-SD data set contains seven categories of surface samples).

Fig. 5. Architecture of pretrained SqueezeNet model.

TABLE II

DETAILED CONFIGURATIONS OF INDIVIDUAL LAYERS/MODULES IN THE SQUEEZENET-BASED MODEL WHICH IS MODIFIED FOR SURFACE DEFECT CLASSIFICATION IN THE USB-SD DATA SET. THE FILTER PARAMETERS ARE INDICATED AS C × W × L, WHERE C IS THE CHANNEL NUMBER, W IS THE KERNEL WIDTH, AND L IS THE KERNEL LENGTH. NOTE THE USB-SD DATA SET ONLY CONTAINS SEVEN CATEGORIES OF SAMPLE IMAGES, AND THUS, WE SET THE NUMBER OF OUTPUT CHANNELS IN THE CONV-10 LAYER TO 7 ACCORDINGLY
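The GAP output, the softmax in (2), and the loss in (1) for a single image can be sketched as follows. This is a minimal NumPy illustration, not the Caffe implementation used in the article; the 13 × 13 × 7 score map is filled with toy values, and the function names are hypothetical.

```python
import numpy as np

def gap(score_map):
    """Global average pooling: (H, W, C) -> (C,), one value G_k per class."""
    return score_map.mean(axis=(0, 1))

def softmax(g):
    """Eq. (2); subtracting the maximum only improves numerical stability."""
    e = np.exp(g - g.max())
    return e / e.sum()

def cross_entropy(probs, label):
    """Eq. (1) for a one-hot target: only the true-class term is nonzero."""
    return -np.log(probs[label])

# Toy Conv-10 output: 13 x 13 spatial positions, 7 class channels.
score_map = np.zeros((13, 13, 7))
score_map[..., 3] = 2.0                  # class index 3 responds most strongly
probs = softmax(gap(score_map))
predicted = int(np.argmax(probs))        # class with the highest confidence
loss = cross_entropy(probs, label=3)
```

Because the target is one-hot, the sum in (1) reduces to the negative log-probability of the ground-truth class, which is what `cross_entropy` computes.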

In many industrial inspection tasks, the target workpieces contain obvious SI, such as fastener holes, bolt holes, grooves, or welding joints. Significant SI will cause a cluttered background in a large portion of the input images and mislead the classification results. Moreover, the size of different types of defects changes significantly in the USB-SD data set. For instance, the size of Dent or Spot defects is typically less than 50 pixels, whereas Bright Line and Scratch defects might cover a large portion of the image (>5000 pixels). To better handle the cluttered background and defects of various sizes, we propose to add an MRF module after the last feature extraction module (Fire 9) to achieve larger RFs. As shown in Fig. 6, we incorporate multiple convolutional layers (MRF-a, MRF-b, and MRF-c) with different kernel sizes (1 × 1, 3 × 3, and 5 × 5) to achieve larger RFs. In CNN models, the RF defines the region in the input space that a particular neuron of the current convolutional layer is referring to. The RF $R_i$ of the $i$th convolutional layer is calculated as

$$R_i = R_{i-1} + (k_i - 1) \times \prod_{j=1}^{i-1} s_j \tag{3}$$

where $R_1 = k_1$, and $k_i$ and $s_i$ are the kernel size and stride of the $i$th convolutional layer, respectively. The RFs of MRF-b and MRF-c layers are increased from 127 (MRF-a) to 159 and 191 by using 3 × 3 and 5 × 5 kernels, respectively.
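The RF recurrence can be checked numerically with a short sketch. The first helper applies the recurrence to an arbitrary list of (kernel, stride) pairs; the three-layer stack in the example is purely illustrative and is not the configuration in Table II. The second helper extends a known RF by one more layer, using a cumulative stride of 16 before the MRF module, which is an assumption inferred from the reported increments (159 − 127 = 2 × 16 and 191 − 127 = 4 × 16).

```python
def receptive_fields(layers):
    """Apply the RF recurrence to a list of (kernel, stride) pairs.

    `jump` holds the product of the strides of all preceding layers.
    """
    rf, jump, out = 1, 1, []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        out.append(rf)
        jump *= stride
    return out

def extend_rf(rf_in, jump, kernel):
    """RF after appending one k x k layer to a stack with cumulative stride `jump`."""
    return rf_in + (kernel - 1) * jump

# Illustrative stack only: 7x7 stride-2 conv, 3x3 stride-2 pool, 3x3 stride-1 conv.
print(receptive_fields([(7, 2), (3, 2), (3, 1)]))   # [7, 11, 19]

# MRF-a, MRF-b, MRF-c on top of an RF of 127 (assumed cumulative stride 16).
print([extend_rf(127, 16, k) for k in (1, 3, 5)])   # [127, 159, 191]
```

The second line of output reproduces the three RF values quoted in the text for the 1 × 1, 3 × 3, and 5 × 5 kernels.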

All input features within an RF contribute to the formulation of the output feature. Therefore, setting a larger RF can improve the capability of CNN models to extract semantic features that are more robust to cluttered background. Moreover, we propose to integrate the outputs of multiple layers with different RFs (e.g., MRF-a, MRF-b, and MRF-c layers) through concatenation fusion to generate multiscale features and improve the classification accuracy for surface defects of various sizes. A shortcut connection is added between the stacked layers to backpropagate gradient signals directly from the higher level layers to lower level ones, alleviating the gradient vanishing/exploding problem [16]. Since MRF-a, MRF-b, and MRF-c layers are randomly initialized, we purposely decrease the number of parameters in the newly added layers (setting the channel number of MRF-a, MRF-b, and MRF-c layers to a smaller number) to achieve better training of defect-related features using a limited number of samples.
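The concatenation fusion of parallel 1 × 1, 3 × 3, and 5 × 5 branches can be sketched in NumPy as follows. Several details are assumptions made for illustration only: a 13 × 13 × 512 input standing in for the Fire 9 output, stride 1 with zero padding, Gaussian weights with a standard deviation of 0.01, and no biases or activation functions. The function names are hypothetical, and the exact wiring of the module in Fig. 6 (including the shortcut connection) is not reproduced.

```python
import numpy as np

def conv2d_same(x, w):
    """Stride-1, zero-padded convolution.

    x: (H, W, Cin) feature map; w: (k, k, Cin, Cout) kernel with odd k.
    """
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    # Sliding windows over the two spatial axes: shape (H, W, Cin, k, k).
    win = np.lib.stride_tricks.sliding_window_view(xp, (k, k), axis=(0, 1))
    return np.einsum("hwcij,ijco->hwo", win, w)

def mrf_module(x, cn=6, kernels=(1, 3, 5), seed=0):
    """Run parallel k x k branches with cn channels each and concatenate them."""
    rng = np.random.default_rng(seed)
    branches = []
    for k in kernels:
        w = rng.normal(0.0, 0.01, size=(k, k, x.shape[-1], cn))  # random init
        branches.append(conv2d_same(x, w))
    return np.concatenate(branches, axis=-1)  # multiscale feature map

x = np.random.default_rng(1).normal(size=(13, 13, 512))  # stand-in for Fire 9 output
features = mrf_module(x)
print(features.shape)  # (13, 13, 18)
```

With six channels per branch, the fused map has only 18 channels, which is the kind of channel compression the paragraph above motivates.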

To sum up, we made two important modifications to the pretrained SqueezeNet model to improve recognition accuracy: 1) incorporating multiple convolutional layers with different kernel sizes to extract multiscale features and achieve larger RFs and 2) compressing the parameters of the newly added convolutional layers to achieve more efficient training and to alleviate small data overfitting. The effectiveness of the proposed modifications is systematically evaluated on the USB-SD data set in Section V.

IV. IMPLEMENTATION DETAILS

The publicly available Caffe platform is used for the proposed CNN model implementation [36]. We use 2400 normal and 3600 defect images (600 samples for each defect category) for model training. The pretrained SqueezeNet model [21] is utilized to initialize the weights of certain convolutional layers in our CNN model, such as Conv1 and nine fire modules (Fires 1–9). The parameters of modified or newly added convolutional layers, including Conv10, MRF-a, MRF-b, and MRF-c, are randomly initialized with a Gaussian distribution. The batch size is set to 32. The maximum training iteration is set to 2000. The learning rate (LR) is initially set to 0.001 and is reduced according to a polynomial formula. The training process is performed using the stochastic gradient descent (SGD) training policy [37] with a momentum of 0.9 and a weight decay of 0.0002. The proposed model is trained on an NVIDIA TITAN X GPU (12-GB memory) within 30 min. In the testing phase, we use 2100 images (300 samples for each class) for performance evaluation. The defect class with the highest confidence score is predicted as our classification result. The performance of defect classifiers is evaluated by computing the recognition accuracy (%), which is defined as the percentage of correctly classified image patches in each class [15], [24], [25], [27].

TABLE III

RECOGNITION ACCURACY (%) OF SQUEEZENET-BASED MODELS INCORPORATING DIFFERENT MRF MODULES (SN-MRF-1, SN-MRF-1+3, AND SN-MRF-1+3+5). THE MAXIMUM RF (MAX RF) OF THREE DIFFERENT MODELS ARE 127, 159, AND 191, RESPECTIVELY

TABLE IV

COMPARATIVE RESULTS OF SN-MRF MODELS INCORPORATING CONVOLUTIONAL LAYERS WITH DIFFERENT CHANNEL NUMBERS (Cn)

Fig. 6. Incorporating multiple convolutional layers (e.g., MRF-a, MRF-b, and MRF-c layers) with different kernel sizes to achieve larger RFs.
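The LR schedule and the evaluation metric described in this section can be sketched as follows. The schedule follows the form of Caffe's "poly" policy, base_lr × (1 − iter/max_iter)^power; the exponent is not given in the text, so `power=1.0` is an assumption, and both function names are hypothetical.

```python
import numpy as np

def poly_lr(iteration, base_lr=0.001, max_iter=2000, power=1.0):
    """Polynomial LR decay; `power` is an assumed value (not stated in the text)."""
    return base_lr * (1.0 - iteration / max_iter) ** power

def per_class_accuracy(y_true, y_pred, num_classes=7):
    """Recognition accuracy (%): share of correctly classified patches per class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc = np.full(num_classes, np.nan)  # NaN marks classes with no test samples
    for k in range(num_classes):
        mask = y_true == k
        if mask.any():
            acc[k] = 100.0 * np.mean(y_pred[mask] == k)
    return acc

lrs = [poly_lr(i) for i in (0, 1000, 2000)]       # LR at start, midpoint, end
acc = per_class_accuracy([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In the toy example, one of the two class-0 samples is misclassified, so the accuracy is 50% for class 0 and 100% for class 1.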

V. PERFORMANCE ANALYSIS

To better handle cluttered background and defects of various sizes presented in the USB-SD data set, we propose to incorporate a number of convolutional layers with different RFs to the baseline SqueezeNet model. Here, we consider a number of alternatives to add convolutional layers, including SN-MRF-1 (adding a single convolutional layer with a 1 × 1 kernel), SN-MRF-1+3 (adding convolutional layers with 1 × 1 and 3 × 3 kernels), and SN-MRF-1+3+5 (adding convolutional layers with 1 × 1, 3 × 3, and 5 × 5 kernels). Note here that the channel number of these newly added convolutional layers is set to 256, which is consistent with the default channel setting in the SqueezeNet model. The comparative results are shown in Table III. It is observed that higher recognition accuracy is achieved by setting a larger RF and integrating multiscale features (SN-MRF-1+3+5 92.5% versus SN-MRF-1+3 91.9% versus SN-MRF-1 91.4%). Such improvement is particularly obvious for the classification of defects of larger sizes (e.g., Deformation: 86.0% versus 84.3% and Stain: 96.7% versus 95.3%).

A large number of labeled samples are typically required to train a CNN model for accurate defect classification. However, such practice often requires large-scale image capturing and annotations, which is costly and unscalable, since inspection requirements change from task to task. In this article, we propose to reduce the number of parameters in the newly added layers so that they can be efficiently trained using a few hundred sample images. In experiments, the channel number (Cn) of convolutional layers in the MRF module is set to a number of values, and their comparative results (accuracy %) are shown in Table IV. We experimentally demonstrate that it is feasible to achieve both higher classification accuracy and faster running time by reducing the size of parameters in the newly added convolutional layers (setting lower channel numbers). The recognition accuracy increases from 92.5% to 95.3% when the channel number of MRF-a, MRF-b, and MRF-c layers decreases from 96 to 6. A reasonable explanation for this phenomenon is that randomly initialized convolutional layers with fewer parameters are more efficient to train, alleviating small training data overfitting. Moreover, the model integrating six-channel MRF-a, MRF-b, and MRF-c layers is significantly smaller than the one using 96-channel convolutional layers in the MRF module (Cn = 6: 3.1 MB versus Cn = 96: 9.9 MB).

TABLE V

CLASSIFICATION ACCURACY (%) OF VARIOUS STATE-OF-THE-ART DEFECT CLASSIFIERS AND DEEP LEARNING-BASED MODELS IN THE USB-SD DEFECT DATA SET. THE TOP THREE RESULTS ARE HIGHLIGHTED IN RED, BLUE, AND GREEN, RESPECTIVELY
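The effect of the channel number on the size of the newly added layers can be illustrated with a simple parameter count. The sketch assumes that each MRF branch is a standard k × k convolution with biases applied to a 512-channel input (the output width of the last fire module in the standard SqueezeNet); it counts only the MRF layers, so it is not expected to reproduce the whole-model sizes quoted in the text. The function name is hypothetical.

```python
def mrf_param_count(in_channels, cn, kernels=(1, 3, 5)):
    """Weights plus biases of parallel k x k conv layers with cn outputs each."""
    return sum(k * k * in_channels * cn + cn for k in kernels)

small = mrf_param_count(512, 6)    # six channels per branch
large = mrf_param_count(512, 96)   # 96 channels per branch
print(small, large, large / small)
```

Under these assumptions the count scales linearly with Cn, so moving from 96 to 6 channels leaves the randomly initialized layers with one sixteenth of the parameters to learn.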

Based on the abovementioned performance analysis experiments, we design an SN-MRF-CC model by integrating an MRF module (adding convolutional layers with 1 × 1, 3 × 3, and 5 × 5 kernels) to the pretrained SqueezeNet model and performing feature channel compression (setting the channel number of newly added MRF-a, MRF-b, and MRF-c layers to 6).

VI. COMPARISONS WITH STATE OF THE ARTS

We compare the proposed SN-MRF-CC model with a number of state-of-the-art surface defect recognition methods [15], [24], [25], [27], [34], [38]. We consider three traditional feature extraction techniques, including gray-level co-occurrence matrix (GLCM) [38], adaptive extended local ternary pattern (AELTP) [34], and adjacent evaluation completed local binary patterns (AECLBP) [15]. The handcrafted features are fed to SVM, nearest neighbor clustering (NNC), and multiple linear regression (ML) for defect classification. Source codes or pretrained models of these feature extractors and classifiers are publicly available. There are a number of deep learning-based surface defect classification methods. In our experiments, we consider the ETE CNN model proposed by Li et al. [24], the Decaf model-based approach (DECAF+MLR) proposed by Ren et al. [25], and the SqueezeNet-based model proposed by Fu et al. [27]. These CNN models are reimplemented according to the original articles and trained/tested based on the USB-SD data set without any data augmentation techniques.

The comparative results are shown in Table V. It is noted that the deep learning-based methods generally perform better than the classification models built on handcrafted features, achieving higher recognition accuracy for various surface defects. The experimental results verify the finding that the learned features can provide better representations of target objects (e.g., surface defects) than handcrafted ones [17], [20].

Fig. 7. Confusion matrix, precision, and recall of our SN-MRF-CC model evaluated on the USB-SD data set that contains seven categories of surface images, including Normal (Nr), Bright Line (BL), Deformation (DF), Dent (De), Scratch (Sc), Spot (Sp), and Stain (St) images.

Another interesting finding is that the built-from-scratch ETE model does not perform well on the USB-SD data set. The underlying reason is that parameters of ETE are randomly initialized and cannot be adequately trained using a small defect-specific image data set [25]. In comparison, Ren et al. proposed to directly apply the pretrained Decaf model [26] without parameter fine-tuning, so the extracted features are suboptimal for the surface defect classification task. To minimize data labeling effort and maximize classification accuracy, it is reasonable to build a classifier based on a pretrained CNN model and then use a small amount of defect-specific training samples to fine-tune its parameters. Compared with another state-of-the-art defect classifier SDC-SN [27], the proposed SN-MRF-CC model further improves the classification accuracy by incorporating multiple convolutional layers with different kernel sizes to increase the RF and generate multiscale features. Such improvement is particularly evident for the defects of larger sizes (e.g., SN-MRF-CC 98.0% versus SDC-SN 90.7% for the large-size Stain defects). Many research articles likewise utilize multiple convolutional layers with different kernel sizes to extract multiscale features and improve the performance in other computer vision tasks [39]–[41]. In Table VI, we show the running time, model size, and recognition accuracy of various deep learning-based approaches. Overall, the ETE model [24] contains the fewest parameters, while its classification accuracy is significantly lower than other alternatives. Our proposed SN-MRF-CC model achieves the highest classification accuracy using fewer or a comparable number of parameters. On a computer equipped with a single NVIDIA TITAN X GPU (12-GB memory), the SN-MRF-CC model can process more than 100 image patches of 200 × 200 pixels per second. It is also worth mentioning that the proposed framework takes less than a second to acquire, transmit, preprocess, and classify a full-size input image of a USB connector. In comparison, it takes 5–10 s for a human inspector to perform a similar quality inspection task. The proposed framework can be easily adopted to build accurate and fully automatic industrial inspection applications.

Fig. 8. Some misclassified images. Green and red indicate the correct (manually labeled) and incorrect classification results, respectively.

Fig. 9. Some comparative detection/classification results for entire workpieces (USB connectors).

TABLE VI

RUNNING TIME (CLASSIFICATION OF 200 × 200 IMAGES), MODEL SIZE, AND RECOGNITION ACCURACY OF VARIOUS DEEP LEARNING-BASED APPROACHES [24], [25], [27] ON THE USB-SD DATA SET

To systematically investigate the classification results of different defect categories, we calculate the confusion matrix, precision, and recall of our SN-MRF-CC model on the USB-SD data set in Fig. 7. In this confusion matrix, the first column indicates the ground-truth defect categories, and the numbers in each row record the prediction results of our model. Note that all correct predictions are recorded in the diagonal cells of the confusion matrix. Overall, our method achieves high-accuracy recognition results for different defect types. Some examples of misclassified images are given in Fig. 8. As shown in the first and second columns in Fig. 8, a number of defect samples are misclassified as normal ones when their visual characteristics are insignificant. It is also observed that some defects located in the boundary areas are not correctly identified, as shown in the third and fourth columns in Fig. 8. In this article, our SN-MRF-CC model only outputs single-class predictions; therefore, image patches that contain defects of multiple categories cannot be correctly classified, as illustrated in the fifth and sixth columns in Fig. 8. In the future, we plan to construct/utilize a larger defect data set and a more comprehensive model to distinguish between defect and normal image samples.
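As a minimal sketch of how per-class precision and recall follow from a confusion matrix laid out as described above (rows are ground-truth categories, columns are predictions), the following Python snippet may help. The function name `precision_recall` and the toy 3-class counts are illustrative assumptions and are not the values reported in Fig. 7.

```python
from typing import List, Tuple


def precision_recall(cm: List[List[int]]) -> Tuple[List[float], List[float]]:
    """Per-class precision and recall from a square confusion matrix.

    Rows index the ground-truth class and columns index the predicted
    class, so correct predictions sit on the diagonal.
    """
    n = len(cm)
    precision, recall = [], []
    for k in range(n):
        tp = cm[k][k]                                   # diagonal cell
        predicted_k = sum(cm[i][k] for i in range(n))   # column sum
        actual_k = sum(cm[k])                           # row sum
        precision.append(tp / predicted_k if predicted_k else 0.0)
        recall.append(tp / actual_k if actual_k else 0.0)
    return precision, recall


# Toy 3-class matrix with made-up counts (NOT the paper's results).
toy_cm = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
toy_precision, toy_recall = precision_recall(toy_cm)
```

Precision for a class divides its diagonal cell by the column sum (everything predicted as that class), whereas recall divides by the row sum (everything that truly belongs to that class).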

After performing defect classification of individual image patches, we merge the results that are spatially adjacent and have the same class label to generate a location map to predict the existence of various defects on the surface of a workpiece. Fig. 9 shows some comparative detection/classification results for different USB connectors. It is noted that a single workpiece might contain multiple types of surface defects located in different image positions. Compared with other state-of-the-art deep learning-based methods [24], [27], our proposed SN-MRF-CC can correctly classify various types of defects, as shown in the first and second columns in Fig. 9. Moreover, it can generate precise bounding boxes to highlight the location of defects, as shown in the third and fourth columns. Finally, our SN-MRF-CC model contains convolutional layers with different kernel sizes to increase the RF and to generate multiscale features. Thus, it can successfully handle the defects of various sizes, as shown in the fifth and sixth columns in Fig. 9.
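The merging of spatially adjacent patches that share a class label can be sketched as a connected-component grouping over the grid of patch predictions. The snippet below is only one plausible reading of that step, not the authors' implementation: it assumes non-overlapping 200 × 200 patches on a regular grid, 4-connectivity, and the label "Nr" for normal patches; the name `merge_patches` and the toy grid are hypothetical.

```python
from collections import deque
from typing import List, Tuple

# (label, x_min, y_min, x_max, y_max) in pixels
Box = Tuple[str, int, int, int, int]


def merge_patches(labels: List[List[str]], patch: int = 200,
                  normal: str = "Nr") -> List[Box]:
    """Group 4-connected patches with the same defect label into boxes."""
    rows, cols = len(labels), len(labels[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes: List[Box] = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or labels[r][c] == normal:
                continue
            label = labels[r][c]
            queue = deque([(r, c)])
            seen[r][c] = True
            r0 = r1 = r
            c0 = c1 = c
            while queue:  # breadth-first flood fill over same-label patches
                y, x = queue.popleft()
                r0, r1 = min(r0, y), max(r1, y)
                c0, c1 = min(c0, x), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and labels[ny][nx] == label):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Convert the patch-grid extent to a pixel bounding box.
            boxes.append((label, c0 * patch, r0 * patch,
                          (c1 + 1) * patch, (r1 + 1) * patch))
    return boxes


# Toy 3 x 4 grid of patch predictions (made-up labels for illustration).
toy_grid = [
    ["Nr", "Sc", "Sc", "Nr"],
    ["Nr", "Nr", "Sc", "Nr"],
    ["St", "Nr", "Nr", "Sp"],
]
toy_boxes = merge_patches(toy_grid)
```

In this toy grid, the three connected Scratch patches collapse into a single box, while the isolated Stain and Spot patches each yield their own box.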

VII. CONCLUSION

In this article, we propose a framework for automatic, machine vision-based surface defect detection/classification. The core of our proposed method is a compact yet effective SqueezeNet-based model, which can accurately classify surface defects of various sizes in a cluttered background. We made two important modifications to the pretrained SqueezeNet model. First, we incorporate multiple convolutional layers with different kernel sizes to extract multiscale features and achieve larger RFs. Second, we reduce the number of parameters of the newly added convolutional layers to achieve more efficient training and to alleviate overfitting on small data sets. The effectiveness of the proposed modifications is systematically evaluated on a newly constructed USB-SD data set. Sample images in the USB-SD data set contain the cluttered background caused by SI, and the size of different surface defects changes significantly. Our proposed SqueezeNet-based model achieves more accurate recognition results compared with the state-of-the-art surface defect classifiers. Moreover, it is a lightweight CNN model and can process more than 100 frames/s on a computer equipped with a single NVIDIA TITAN X GPU (12-GB memory).
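To give a rough sense of why compressing the added layers matters, the following back-of-the-envelope sketch counts the weights of parallel convolutional branches with different kernel sizes. It assumes that compression means reducing the number of output channels of the added layers; the kernel sizes (1, 3, 5) and the channel counts (512 inputs, 512 versus 64 outputs) are hypothetical values chosen for illustration and are not taken from this article.

```python
from typing import Iterable


def conv_params(kernel: int, c_in: int, c_out: int, bias: bool = True) -> int:
    """Weights (+ optional biases) of one k x k convolutional layer."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


def branch_params(kernels: Iterable[int], c_in: int, c_out: int) -> int:
    """Total parameters of parallel branches, one per kernel size."""
    return sum(conv_params(k, c_in, c_out) for k in kernels)


KERNELS = (1, 3, 5)   # hypothetical kernel sizes of the parallel branches
C_IN = 512            # hypothetical number of input channels

full = branch_params(KERNELS, C_IN, 512)       # uncompressed branches
compressed = branch_params(KERNELS, C_IN, 64)  # channel-compressed branches
ratio = full / compressed
```

Under these assumed values, the parameter count of the added branches scales linearly with the number of output channels, so cutting the channels by a factor of eight cuts the new parameters by the same factor, leaving fewer weights to fit from a small training set.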

To develop practical industrial inspection systems, we plan to build a more stable fixture platform to alleviate the vibration effects caused by other machinery in a factory. Moreover, it is critical to generate stable illumination for accurate classification/detection of various surface defects. It is possible to utilize a black-box device to decrease the impact of irrelevant light sources in a typical manufacturing environment. We plan to investigate new machine learning techniques (e.g., incremental learning [42], [43]) to adapt the CNN-based model for new classification tasks using a small number of new training samples. It is also worth investigating how to improve the existing semantic segmentation techniques [18], [44], [45] for fast and accurate pixelwise surface defect detection/classification in industrial inspection tasks.

REFERENCES

[1] B. M. Schönbauer, K. Yanase, and M. Endo, “Influences of small defects on torsional fatigue limit of 17-4PH stainless steel,” Int. J. Fatigue, vol. 100, pp. 540–548, Jul. 2017.

[2] J. Chen, Q. Feng, F. Wang, H. Zhang, and H. Song, “Research on burst tests of pipeline with spiral weld defects,” in Proc. 9th Int. Pipeline Conf., Amer. Soc. Mech. Eng. Digit. Collection, 2012, pp. 53–60.

[3] I. V. Orynyak and S. M. Ageev, “Modeling the limiting plastic state of heavy-walled pipes with axial surface defects,” J. Machinery Manuf. Rel., vol. 38, no. 4, pp. 407–413, Aug. 2009.

[4] N. Neogi, D. K. Mohanta, and P. K. Dutta, “Review of vision-based steel surface inspection systems,” EURASIP J. Image Video Process., vol. 2014, no. 1, p. 50, Dec. 2014.

[5] E. Hoseini, F. Farhadi, and F. Tajeripour, “Fabric defect detection using auto-correlation function,” Int. J. Comput. Theory Eng., vol. 5, pp. 114–117, 2013.

[6] C.-Y. Wen, S.-H. Chiu, W.-S. Hsu, and G.-H. Hsu, “Defect segmentation of texture images with wavelet transform and a co-occurrence matrix,” Textile Res. J., vol. 71, no. 8, pp. 743–749, Aug. 2001.

[7] H.-G. Bu, J. Wang, and X.-B. Huang, “Fabric defect detection based on multiple fractal features and support vector data description,” Eng. Appl. Artif. Intell., vol. 22, no. 2, pp. 224–235, Mar. 2009.

[8] C.-H. Chan and G. K. H. Pang, “Fabric defect detection by Fourier analysis,” IEEE Trans. Ind. Appl., vol. 36, no. 5, pp. 1267–1276, Sep. 2000.

[9] L. Bissi, G. Baruffa, P. Placidi, E. Ricci, A. Scorzoni, and P. Valigi, “Automated defect detection in uniform and structured fabrics using Gabor filters and PCA,” J. Vis. Commun. Image Represent., vol. 24, no. 7, pp. 838–845, Oct. 2013.

[10] W.-C. Li and D.-M. Tsai, “Wavelet-based defect detection in solar wafer images with inhomogeneous texture,” Pattern Recognit., vol. 45, no. 2, pp. 742–756, 2012.

[11] S. Ghorai, A. Mukherjee, M. Gangadaran, and P. K. Dutta, “Automatic defect detection on hot-rolled flat steel products,” IEEE Trans. Instrum. Meas., vol. 62, no. 3, pp. 612–621, Mar. 2013.

[12] M. R. Halfawy and J. Hengmeechai, “Automated defect detection in sewer closed circuit television images using histograms of oriented gradients and support vector machine,” Autom. Construction, vol. 38, pp. 1–13, Mar. 2014.

[13] A. Dogandžić, N. Eua-Anant, and B. Zhang, “Defect detection using hidden Markov random fields,” in Proc. AIP Conf., 2005, vol. 760, no. 1, pp. 704–711.

[14] J. Mirapeix, P. B. García-Allende, A. Cobo, O. M. Conde, and J. M. López-Higuera, “Real-time arc-welding defect detection and classification with principal component analysis and artificial neural networks,” NDT E Int., vol. 40, no. 4, pp. 315–323, Jun. 2007.

[15] K. Song and Y. Yan, “A noise robust method based on completed local binary patterns for hot-rolled steel strip surface defects,” Appl. Surf. Sci., vol. 285, no. 21, pp. 858–864, 2013.

[16] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 770–778.

[17] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. ICLR, 2015, pp. 1–14.

[18] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 3431–3440.

[19] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, “A convolutional neural network cascade for face detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 5325–5334.

[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2012, pp. 1097–1105.

[21] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size,” 2016, arXiv:1602.07360. [Online]. Available: http://arxiv.org/abs/1602.07360

[22] A. G. Howard et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” 2017, arXiv:1704.04861. [Online]. Available: http://arxiv.org/abs/1704.04861

[23] X. Zhang, X. Zhou, M. Lin, and J. Sun, “ShuffleNet: An extremely efficient convolutional neural network for mobile devices,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6848–6856.

[24] L. Yi, G. Li, and M. Jiang, “An end-to-end steel strip surface defects recognition system based on convolutional neural networks,” Steel Res. Int., vol. 88, no. 2, Feb. 2017, Art. no. 1600068.

[25] R. Ren, T. Hung, and K. C. Tan, “A generic deep-learning-based approach for automated surface inspection,” IEEE Trans. Cybern., vol. 48, no. 3, pp. 929–940, Mar. 2018.


[26] J. Donahue et al., “Decaf: A deep convolutional activation feature for generic visual recognition,” in Proc. Int. Conf. Mach. Learn., 2014, pp. 647–655.

[27] G. Fu et al., “A deep-learning-based approach for fast and robust steel surface defects classification,” Opt. Lasers Eng., vol. 121, pp. 397–405, Oct. 2019.

[28] Z. Emersic, D. Stepec, V. Struc, and P. Peer, “Training convolutional neural networks with limited training data for ear recognition in the wild,” in Proc. 12th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), May 2017, pp. 987–994.

[29] M. Dong, D. He, C. Luo, D. Liu, and W. Zeng, “A CNN-based approach for automatic license plate recognition in the wild,” in Proc. BMVC, 2017, pp. 1–11.

[30] H.-C. Cheng and A. Varshney, “Volume segmentation using convolutional neural networks with limited training data,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2017, pp. 590–594.

[31] Y. He, K. Song, Q. Meng, and Y. Yan, “An end-to-end steel surface defect detection approach via fusing multiple hierarchical features,” IEEE Trans. Instrum. Meas., vol. 69, no. 4, pp. 1493–1504, Apr. 2020.

[32] O. Silvén, M. Niskanen, and H. Kauppinen, “Wood inspection with non-supervised clustering,” Mach. Vis. Appl., vol. 13, nos. 5–6, pp. 275–285, Mar. 2003.

[33] C. Kampouris, S. Zafeiriou, A. Ghosh, and S. Malassiotis, “Fine-grained material classification using micro-geometry and reflectance,” in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 778–792.

[34] A. A. Mohamed and R. V. Yampolskiy, “Adaptive extended local ternary pattern (AELTP) for recognizing avatar faces,” in Proc. 11th Int. Conf. Mach. Learn. Appl., Dec. 2012, pp. 57–62.

[35] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE Trans. Syst., Man, Cybern., vol. 9, no. 1, pp. 62–66, Jan. 1979.

[36] Y. Jia et al., “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014. [Online]. Available: https://arxiv.org/abs/1408.5093

[37] L. Bottou, “Stochastic gradient descent tricks,” in Neural Networks: Tricks Trade. Springer, 2012, pp. 421–436.

[38] R. M. Haralick, K. Shanmugam, and I. Dinstein, “Textural features for image classification,” IEEE Trans. Syst., Man, Cybern., vol. SMC-3, no. 6, pp. 610–621, Nov. 1973.

[39] C. Szegedy et al., “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.

[40] X. Wang, H. Ma, X. Chen, and S. You, “Edge preserving and multi-scale contextual neural network for salient object detection,” IEEE Trans. Image Process., vol. 27, no. 1, pp. 121–134, Jan. 2018.

[41] Z. He et al., “MRFN: Multi-Receptive-Field network for fast and accurate single image super-resolution,” IEEE Trans. Multimedia, vol. 22, no. 4, pp. 1042–1054, Apr. 2020.

[42] F. M. Castro, M. J. Marín-Jiménez, N. Guil, C. Schmid, and K. Alahari, “End-to-end incremental learning,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 233–248.

[43] J. Xu, C. Xu, B. Zou, Y. Yan Tang, J. Peng, and X. You, “New incremental learning algorithm with support vector machines,” IEEE Trans. Syst., Man, Cybern., Syst., vol. 49, no. 11, pp. 2230–2241, Nov. 2019.

[44] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei, “Fully convolutional instance-aware semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 2359–2367.

[45] H. Zhang et al., “Context encoding for semantic segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7151–7160.

Jiangxin Yang is currently a Full-Time Professor with the State Key Laboratory of Fluid Power and Mechatronic Systems and the Laboratory of Advanced Manufacturing Technology of Zhejiang Province, School of Mechanical Engineering, Zhejiang University, Hangzhou, China. His research interests are in quality engineering, infrared imaging, and measurement.

Guizhong Fu received the master’s degree in mechanical engineering and automation from Changzhou University, Changzhou, China, in 2015. He is currently pursuing the Ph.D. degree in mechanical engineering and automation with Zhejiang University, Hangzhou, China.

His main research interests include saliency detection, industrial defect detection, and signal processing.

Wenbin Zhu received the bachelor’s degree in mechanical engineering from the Wuhan University of Technology, Wuhan, China, in 2017. He is currently pursuing the Ph.D. degree in mechanical manufacturing and automation with Zhejiang University, Hangzhou, China.

His main research activities are in industrial visual inspection and image-based defect detection.

Yanlong Cao is currently a Full-Time Professor with the State Key Laboratory of Fluid Power and Mechatronic Systems and the Laboratory of Advanced Manufacturing Technology of Zhejiang Province, School of Mechanical Engineering, Zhejiang University, Hangzhou, China. His research interests are in precision design, quality engineering, and measurement.

Yanpeng Cao (Member, IEEE) received the M.Sc. degree in control engineering and the Ph.D. degree in computer vision from The University of Manchester, Manchester, U.K., in 2005 and 2008, respectively.

He is currently a Research Fellow with the College of Mechanical Engineering, Zhejiang University, Hangzhou, China. He worked in a number of research and development institutes, such as the Institute for Infocomm Research, Singapore, the National University of Singapore, Singapore, and the National University of Ireland at Maynooth, Maynooth, Ireland. His major research interests include infrared imaging, sensor fusion, image processing, and 3-D reconstruction.

Michael Ying Yang (Senior Member, IEEE) received the Ph.D. degree (summa cum laude) from the University of Bonn, Bonn, Germany, in 2011.

He is currently an Assistant Professor with the University of Twente, Enschede, The Netherlands, where he is heading a group working on scene understanding. He published over 100 papers in international journals and conference proceedings. His research interests are in the fields of computer vision and photogrammetry with specialization on scene understanding.

Dr. Yang has received the ISPRS President’s Honorary Citation in 2016 and the Best Science Paper Award at BMVC in 2016. He serves as an Associate Editor for the ISPRS Journal of Photogrammetry and Remote Sensing and