Attention combination mechanism
Extracting features from target regions in images is difficult, and the model suffers from high computational cost and low detection accuracy. To address these problems, as shown in Fig. 3, we introduce the lightweight feedforward convolutional attention module CBAM after the Focus module in the backbone of the YOLOv5s network and place the SE-Net (Squeeze-and-Excitation Networks) channel attention module at the end of the backbone. We call this attention combination mechanism, built on the YOLOv5s network model, YOLOv5s-CS. The CBAM module has 128 channels, a convolutional kernel size of 3 and a stride of 2; the SELayer has 1024 channels and a stride of 4.
Convolutional block attention module network
In 2018, Woo et al.25 proposed the lightweight feedforward convolutional attention module CBAM. The CBAM module attends to feature information along both the channel and spatial dimensions and combines the two to obtain more comprehensive and reliable attention information26. CBAM consists of two submodules, the channel attention module (CAM) and the spatial attention module (SAM); its overall structure is shown in Fig. 4a.
The working principle of the CAM is shown in Fig. 4b. First, the feature map F is taken as input. Second, global maximum pooling and global average pooling are applied over the width and height of the feature map to obtain two feature maps of the same size. Third, the two pooled feature maps are fed into the shared parameter network MLP. Finally, the outputs of the shared parameter network are summed and passed through a sigmoid activation function to obtain the channel attention features $M_c$.
The channel attention module CAM is calculated as shown in Formula (1):
$$M_c(F) = \sigma\left(\mathrm{MLP}\left(\mathrm{AvgPool}(F)\right) + \mathrm{MLP}\left(\mathrm{MaxPool}(F)\right)\right) = \sigma\left(W_1\left(W_0\left(F_{\mathrm{avg}}^{c}\right)\right) + W_1\left(W_0\left(F_{\mathrm{max}}^{c}\right)\right)\right)$$
(1)
where σ represents the sigmoid function, MLP represents the shared parameter network, $W_0$ and $W_1$ represent the shared weights, $F_{\mathrm{avg}}^{c}$ is the result of the feature map F after global average pooling, and $F_{\mathrm{max}}^{c}$ is the result of the feature map F after global maximum pooling.
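To make Eq. (1) concrete, the following is a minimal PyTorch sketch of the CAM, not the authors' implementation; the channel reduction ratio of 16 in the shared MLP is an assumed default.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """Channel attention module (CAM) of CBAM, following Eq. (1)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP (weights W0 and W1), implemented with 1x1 convolutions.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),  # W0
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),  # W1
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))  # MLP(AvgPool(F))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))   # MLP(MaxPool(F))
        return self.sigmoid(avg + mx)                # M_c(F)
```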
The working principle of the SAM is shown in Fig. 4c. The feature map F′, obtained by multiplying the input feature map by the output of the CAM, serves as the input to the SAM. First, global maximum pooling and global average pooling are applied along the channel dimension of the feature map to obtain two feature maps of the same size. Second, the two pooled feature maps are concatenated along the channel dimension, and a convolution operation reduces the feature channels to produce a new feature map. Finally, the spatial attention features $M_s$ are generated by a sigmoid activation function.
The spatial attention module (SAM) is calculated as shown in Formula (2):
$$M_s(F) = \sigma\left(f^{7 \times 7}\left(\left[\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)\right]\right)\right) = \sigma\left(f^{7 \times 7}\left(\left[F_{\mathrm{avg}}^{s}; F_{\mathrm{max}}^{s}\right]\right)\right)$$
(2)
where σ is the sigmoid function, $f^{7 \times 7}$ denotes the convolution operation with a filter size of 7 × 7, $F_{\mathrm{avg}}^{s}$ is the result of the feature map after global average pooling, and $F_{\mathrm{max}}^{s}$ is the result of the feature map after global maximum pooling.
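Analogously, a minimal PyTorch sketch of the SAM in Eq. (2), together with the sequential CAM–SAM composition that forms CBAM, could look as follows (reusing the ChannelAttention sketch above; the 7 × 7 kernel follows Eq. (2)):

```python
class SpatialAttention(nn.Module):
    """Spatial attention module (SAM) of CBAM, following Eq. (2)."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = torch.mean(x, dim=1, keepdim=True)   # F_avg^s: mean over channels
        mx, _ = torch.max(x, dim=1, keepdim=True)  # F_max^s: max over channels
        # Concatenate along channels, convolve with the 7x7 filter, apply sigmoid.
        return self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_s(F)


class CBAM(nn.Module):
    """CBAM: channel attention followed by spatial attention."""

    def __init__(self, channels: int):
        super().__init__()
        self.cam = ChannelAttention(channels)
        self.sam = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.cam(x)     # F' = M_c(F) * F
        return x * self.sam(x)  # F'' = M_s(F') * F'
```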
Squeeze and excitation network
In 2018, Hu et al.27 proposed the single-path attention network structure SE-Net. SE-Net applies the idea of an attention mechanism: it models the relationships among feature maps, adaptively learns the importance of each feature map28, and then assigns weights to the original feature maps according to that importance. In this way, SE-Net emphasizes features that are useful for the target task while suppressing useless feature information, and it allocates computational resources rationally across channels so that the trained model achieves better results.
The SE-Net attention module is mainly composed of two parts: the squeeze operation and excitation operation. The structure of the SE-Net module is shown in Fig. 5.
The squeeze operation first uses global average pooling to encode all spatial features on each channel as a local feature. Second, each feature map is compressed into a real number that carries the global information of that feature map. Finally, the squeeze results of all feature maps are combined into a vector that serves as the weights of each group of feature maps. It is calculated as shown in Eq. (3):
$$z_c = F_{\mathrm{sq}}(u_c) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} u_c(i, j)$$
(3)
where H is the height of the feature map, W is the width of the feature map, u is the result after convolution, z is the global attention information of the corresponding feature map, and the subscript c indexes the channel.
After completing the squeeze operation to obtain the channel information, the feature vector is subjected to the excitation operation. First, it passes through two fully connected layers. Second, it uses the sigmoid function. Finally, the output weights are assigned to the original features. It is calculated as follows:
$$s = F_{\mathrm{ex}}(z, W) = \sigma\left(g(z, W)\right) = \sigma\left(W_2\,\delta\left(W_1 z\right)\right)$$
(4)
$$\widetilde{x}_c = F_{\mathrm{scale}}\left(u_c, s_c\right) = s_c \cdot u_c$$
(5)
where σ is the sigmoid activation function, δ represents the ReLU activation function, and $W_1$ and $W_2$ represent two different fully connected layers. The vector s represents the set of feature map weights obtained through the fully connected layers and the activation functions, $\widetilde{x}_c$ is the feature map of channel c, $s_c$ is a weight, and $u_c$ is a two-dimensional matrix.
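A minimal PyTorch sketch of the squeeze, excitation and scale steps in Eqs. (3)–(5) is given below; it is not the authors' implementation, and the reduction ratio of 16 in the fully connected layers is an assumed default.

```python
class SELayer(nn.Module):
    """Squeeze-and-excitation block, following Eqs. (3)-(5)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),  # W1
            nn.ReLU(inplace=True),                                   # delta (ReLU)
            nn.Linear(channels // reduction, channels, bias=False),  # W2
            nn.Sigmoid(),                                            # sigma (sigmoid)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))           # squeeze, Eq. (3): global average pooling
        s = self.fc(z).view(b, c, 1, 1)  # excitation, Eq. (4)
        return x * s                     # scale, Eq. (5): reweight each channel
```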
Target detection layer
Garbage in rural areas tends to be a small target with few pixel features, such as capsules and button batteries. Therefore, we insert a small-target detection layer into the head network of the original YOLOv5s network model to detect small objects and to reduce the missed detections of the original network model. The YOLOv5s network structure with the added small-target detection layer is shown in Fig. 6 and is named YOLOv5s-STD.
In the seventeenth layer of the neck network, operations such as upsampling are performed on the feature maps so that the feature maps continue to expand. Meanwhile, in the twentieth layer, the feature maps obtained from the neck network are fused with the feature maps extracted from the backbone network. We insert a detection layer capable of predicting small targets at the thirty-first layer. To improve detection accuracy, we use a total of four detection layers for the output feature maps, which makes it possible to detect smaller target objects. In addition to the three initial anchor groups of the original model, an additional group of anchor values is added to detect smaller targets. The anchor values of the improved YOLOv5s network model are set to [5, 6, 8, 14, 15, 11], [10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119] and [116, 90, 156, 198, 373, 326].
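For reference, these four anchor groups can be written out as a simple Python structure. The mapping of groups to feature-map strides is our assumption, based on the extra layer being dedicated to small targets; only the anchor values themselves come from the text above.

```python
# Anchor (width, height) pairs in pixels for the four detection layers of
# YOLOv5s-STD, as listed above. The stride assignment is an assumption.
anchors = [
    [(5, 6), (8, 14), (15, 11)],          # added small-target layer (assumed stride 4)
    [(10, 13), (16, 30), (33, 23)],       # original layer (stride 8)
    [(30, 61), (62, 45), (59, 119)],      # original layer (stride 16)
    [(116, 90), (156, 198), (373, 326)],  # original layer (stride 32)
]
```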
Bounding box regression loss function
The loss function is an important indicator of the generalization ability of a model. In 2016, Yu et al.29 proposed using the intersection over union (IoU) as a loss function for bounding box prediction. IoU is a frequently used metric in target detection: it is used not only to separate positive and negative samples but also to measure the similarity between the predicted bounding box and the ground truth bounding box. It is defined in Eq. (6):
$$\mathrm{IoU} = \frac{\left|A \cap B\right|}{\left|A \cup B\right|}$$
(6)
where the value of IoU lies in [0, 1] and A and B are the areas of arbitrary regions. Additionally, when IoU is used as a loss function, it has scale invariance, as shown in Eq. (7):
$$\mathrm{IoU\_Loss} = 1 - \frac{\left|A \cap B\right|}{\left|A \cup B\right|}$$
(7)
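As a concrete illustration of Eqs. (6) and (7), a minimal Python sketch for axis-aligned boxes is given below; the (x1, y1, x2, y2) corner representation is our assumption.

```python
def iou_and_loss(box_a, box_b):
    """IoU (Eq. 6) and IoU_Loss (Eq. 7) for axis-aligned boxes (x1, y1, x2, y2)."""
    # Intersection area |A ∩ B|
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union area |A ∪ B|
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / union
    return iou, 1.0 - iou  # IoU, IoU_Loss
```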
However, when the prediction bounding box and the ground truth bounding box do not intersect, i.e., IoU = 0, the distance between the regions A and B cannot be reflected. The loss function is then not differentiable and cannot be used to optimize two disjoint bounding boxes. Moreover, when the overlapping areas are the same but the boxes overlap in different directions, the IoU loss function cannot distinguish these cases.
To address these issues, Rezatofighi et al. proposed the GIoU (Generalized Intersection over Union)30 in 2019, in which a minimum enclosing rectangle C of A and B is added. Suppose the prediction bounding box is B, the ground truth bounding box is A, the area where A and B intersect is D, and the region enclosing both bounding boxes is C, as shown in Fig. 7.
Then, the GIoU calculation, as shown in Formula (8), is:
$$\mathrm{GIoU} = \mathrm{IoU} - \frac{\left|C - \left(A \cup B\right)\right|}{\left|C\right|}$$
(8)
The GIoU_Loss is calculated as (9):
$$\mathrm{GIoU\_Loss} = 1 - \mathrm{GIoU} = 1 - \mathrm{IoU} + \frac{\left|C - \left(A \cup B\right)\right|}{\left|C\right|}$$
(9)
The original YOLOv5 algorithm uses GIoU_Loss as the loss function. Comparing Eqs. (6) and (8), it can be seen that GIoU adds a penalty term $\frac{\left|C - \left(A \cup B\right)\right|}{\left|C\right|}$ to IoU, which is clearly illustrated in Fig. 7.
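Extending the IoU sketch above, Eqs. (8) and (9) can be computed as follows (same assumed box format):

```python
def giou_and_loss(box_a, box_b):
    """GIoU (Eq. 8) and GIoU_Loss (Eq. 9) for boxes in (x1, y1, x2, y2) format."""
    iou, _ = iou_and_loss(box_a, box_b)
    # Recompute |A ∪ B| for the penalty term.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    # Smallest enclosing box C of A and B.
    area_c = (max(box_a[2], box_b[2]) - min(box_a[0], box_b[0])) * \
             (max(box_a[3], box_b[3]) - min(box_a[1], box_b[1]))
    giou = iou - (area_c - union) / area_c  # penalty |C - (A ∪ B)| / |C|
    return giou, 1.0 - giou                 # GIoU, GIoU_Loss = 1 - GIoU
```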
Although the GIoU loss function solves the problems that the gradient of the IoU loss function cannot be updated when the boxes do not intersect and that the relative direction of the prediction bounding box and the ground truth bounding box is not captured, it still has disadvantages, as shown in Fig. 8.
Figure 8 shows three different position relationships in which the prediction bounding box is completely contained within the ground truth bounding box. The aspect ratio of the green ground truth bounding box is 1:2, and the red prediction bounding box has the same aspect ratio but only one-half the size of the green ground truth bounding box. When the prediction bounding box is completely contained within the ground truth bounding box, GIoU degenerates to IoU, and the GIoU and IoU values for the three different position cases are all 0.45. The GIoU loss function therefore cannot directly reflect the distance between the prediction bounding box and the ground truth bounding box. Hence, we introduce the CIoU (Complete Intersection over Union)31 loss function to replace the original GIoU loss function in the YOLOv5 algorithm and further optimize the prediction bounding box.
Therefore, the CIoU is calculated as (10):
$$\mathrm{CIoU} = \mathrm{IoU} - \frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}} - \alpha v$$
(10)
where b and $b^{gt}$ denote the center points of the prediction bounding box and the ground truth bounding box, respectively, ρ is the Euclidean distance between the two center points, and c is the diagonal length of the minimum enclosing region covering both the prediction bounding box and the ground truth bounding box.
$\alpha$ is the parameter used to balance the scale, and v measures the consistency of the aspect ratio between the prediction bounding box and the ground truth bounding box, as shown in Eqs. (11) and (12).
$$\alpha = \frac{v}{\left(1 - \mathrm{IoU}\right) + v}$$
(11)
$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{\omega^{gt}}{h^{gt}} - \arctan\frac{\omega^{p}}{h^{p}}\right)^{2}$$
(12)
Therefore, the expression of CIoU_Loss can be obtained according to Eqs. (10), (11) and (12).
$$\mathrm{CIoU\_Loss} = 1 - \mathrm{CIoU} = 1 - \mathrm{IoU} + \frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}} + \alpha v$$
(13)
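A self-contained Python sketch of Eqs. (10)–(13), again assuming (x1, y1, x2, y2) corner boxes, is:

```python
import math


def ciou_loss(gt, pred):
    """CIoU_Loss per Eqs. (10)-(13); boxes are (x1, y1, x2, y2), gt = A, pred = B."""
    # IoU (Eq. 6)
    ix1, iy1 = max(gt[0], pred[0]), max(gt[1], pred[1])
    ix2, iy2 = min(gt[2], pred[2]), min(gt[3], pred[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    iou = inter / (area_gt + area_pred - inter)
    # rho^2: squared Euclidean distance between the box centers b and b^gt.
    rho2 = ((gt[0] + gt[2]) / 2 - (pred[0] + pred[2]) / 2) ** 2 \
         + ((gt[1] + gt[3]) / 2 - (pred[1] + pred[3]) / 2) ** 2
    # c^2: squared diagonal of the smallest box enclosing both boxes.
    c2 = (max(gt[2], pred[2]) - min(gt[0], pred[0])) ** 2 \
       + (max(gt[3], pred[3]) - min(gt[1], pred[1])) ** 2
    # Aspect-ratio consistency v (Eq. 12) and trade-off parameter alpha (Eq. 11).
    w_gt, h_gt = gt[2] - gt[0], gt[3] - gt[1]
    w_p, h_p = pred[2] - pred[0], pred[3] - pred[1]
    v = (4 / math.pi ** 2) * (math.atan(w_gt / h_gt) - math.atan(w_p / h_p)) ** 2
    alpha = v / ((1 - iou) + v)
    return 1 - iou + rho2 / c2 + alpha * v  # Eq. (13)
```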
Optimization algorithm
After optimizing the loss function of the network model, the next step is to optimize the training of the network model. The function of the optimizer is to adjust the model parameters to the most appropriate values while making the loss function converge as much as possible32. In the target detection algorithm, the optimizer is mainly used to calculate the gradient of the loss function and to iteratively update the parameters.
The optimizer used in YOLOv5 is stochastic gradient descent (SGD). Since a large number of problems in deep learning satisfy the strict saddle property, the local optimal solutions obtained are almost all equally good; therefore, the SGD algorithm does not become trapped at saddle points and has strong generality. However, the slow convergence speed and the large number of iterations of the SGD algorithm still need to be improved. The Adam (Adaptive Moment Estimation) algorithm combines the first-order momentum used in the SGD algorithm with the second-order momentum used in the AdaGrad and AdaDelta algorithms. The Adam update can be described as follows:
$$m_t = \beta_1 m_{t-1} + \left(1 - \beta_1\right) g_t$$
(14)
$$v_t = \beta_2 v_{t-1} + \left(1 - \beta_2\right) g_t^{2}$$
(15)
$$\widehat{m}_t = \frac{m_t}{1 - \beta_1^{t}}$$
(16)
$$\widehat{v}_t = \frac{v_t}{1 - \beta_2^{t}}$$
(17)
where $\beta_1$ and $\beta_2$ are hyperparameters, g is the current gradient of the error function, $m_t$ is the first-order momentum of the gradient, and $v_t$ is the second-order momentum of the gradient.
Adam is an adaptive stochastic optimization algorithm based on low-order moment estimates. It can replace the traditional first-order stochastic gradient descent procedure, and it adaptively updates the weights of the neural network from the training data during the iterative process. The Adam optimizer occupies fewer memory resources during training and is suitable for problems with sparse gradients and large fluctuations in loss values33. Therefore, we train the network model with the Adam optimization algorithm instead of the SGD optimization algorithm based on the YOLOv5s network model. The calculation is shown in Table 3.
where α is a factor controlling the learning rate of the network, β′ is the exponential decay rate of the first-order moment estimate, β″ is the exponential decay rate of the second-order moment estimate, and ε is a small constant added to the denominator to avoid division by zero.
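A minimal NumPy sketch of one Adam update step following Eqs. (14)–(17) is given below; the default hyperparameter values are the commonly used Adam defaults, assumed here rather than taken from Table 3.

```python
import numpy as np


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam parameter update per Eqs. (14)-(17); t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad       # Eq. (14): first-order moment
    v = beta2 * v + (1 - beta2) * grad ** 2  # Eq. (15): second-order moment
    m_hat = m / (1 - beta1 ** t)             # Eq. (16): bias-corrected m_t
    v_hat = v / (1 - beta2 ** t)             # Eq. (17): bias-corrected v_t
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # weight update
    return theta, m, v
```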