
A Swin Transformer-based model for mosquito species identification

The framework of Swin MSI

We established the first Swin Transformer-based mosquito species identification (Swin MSI) model, built on a self-constructed image dataset and refined through multiple rounds of adjustment and testing. Gradient-weighted class activation mapping (Grad-CAM) was used to visualize the identification process (Fig. 1a). The key Swin Transformer block is described in Fig. 1b. Based on practical needs, Swin MSI was additionally designed to identify the Culex pipiens Complex at the subspecies level (Fig. 1c) and to attribute novel mosquitoes (defined as species beyond the 17 in our dataset) to the correct classification (Fig. 1d). Detailed results are shown in the following sections.
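Conceptually, the inference side of the Fig. 1 pipeline can be sketched as below. This is a minimal illustration, not the authors' released code: the checkpoint name "swin_msi.pth" and the class list are placeholders, and the backbone is the Swin-L/384 configuration selected in the later learning stages. The novel-species branch of Fig. 1d is sketched separately at the end of this section.

```python
# Minimal inference sketch of the Swin MSI pipeline (hypothetical names:
# "swin_msi.pth" and CLASS_NAMES are placeholders, not the authors' release).
import torch
import timm
from PIL import Image
from torchvision import transforms

# Hypothetical label list; the real model distinguishes species x sex classes (Fig. 4).
CLASS_NAMES = [f"class_{i}" for i in range(38)]

# Swin-L backbone at 384 x 384, as selected in the 2nd/3rd learning stages.
model = timm.create_model("swin_large_patch4_window12_384", pretrained=False,
                          num_classes=len(CLASS_NAMES))
model.load_state_dict(torch.load("swin_msi.pth", map_location="cpu"))  # placeholder checkpoint
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((384, 384)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

def top5(image_path: str):
    """Return the five most confident (class, probability) pairs (Fig. 1c)."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)[0]
    conf, idx = probs.topk(5)
    return [(CLASS_NAMES[int(i)], float(c)) for c, i in zip(conf, idx)]
```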

Figure 1

The framework of Swin MSI. (a) The basic architecture for mosquito feature extraction and identification. Attention visualizations generated by the filters at each layer are shown. (b) Details of the Swin Transformer block. (c) For mosquitoes within the 17 species in our dataset, the output is the top 5 species by confidence. (d) For mosquitoes beyond the 17 species (defined as novel species), whether the output is a species or a genus is decided by comparison with a confidence threshold.


Mosquito datasets

We established the highest-resolution and most balanced mosquito image dataset to date. It covers 7 genera and 17 species (including 3 morphologically similar subspecies in the Cx. pipiens Complex), encompassing the most common and epidemiologically important disease-transmitting mosquitoes worldwide, with a total of 9,900 images at a resolution of 4464 × 2976 pixels. The specific taxonomic status and corresponding images are shown in Fig. 2. Owing to the limitations of field collection, Ae. vexans, Coquillettidia ochracea, Mansonia uniformis, An. vagus and Toxorhynchites splendens are represented by only one sex. Apart from these, each species is represented by 300 images covering both sexes; this equal, sufficiently large number per species balances the capacity and variety of the training set.
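For illustration, the balance described above can be audited with a standard torchvision loader. The directory layout ("data/&lt;species&gt;/&lt;image&gt;.jpg") is an assumption, not the authors' published structure.

```python
# Load the image dataset and report per-class counts against the 300-image target.
from collections import Counter
from torchvision import datasets, transforms

dataset = datasets.ImageFolder(
    "data",  # assumed layout: one sub-folder per species/subspecies
    transform=transforms.Compose([
        transforms.Resize((224, 224)),  # original photographs are 4464 x 2976
        transforms.ToTensor(),
    ]),
)

counts = Counter(dataset.targets)
for class_idx, n in sorted(counts.items()):
    flag = "" if n == 300 else "  <-- deviates from the 300-image target"
    print(f"{dataset.classes[class_idx]:<40s} {n}{flag}")
```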

Figure 2

Taxonomic status and index of the mosquito species included in this study. Both male and female mosquitoes were photographed from different angles (dorsal, left side, right side, ventral, etc.). Except for 5 species, each species includes 300 images covering both sexes, and the resolution of the photographs was 4464 × 2976 pixels. Cx. pipiens quinquefasciatus, Cx. pipiens pallens, and Cx. pipiens molestus (subspecies level, dark gray background) are 3 subspecies within the Cx. pipiens Complex (species level).


Workflow for mosquito species identification

A three-stage workflow for building the best deep learning model for mosquito species identification was adopted (Fig. 3). The first learning stage compared three CNNs (Mask R-CNN, DenseNet, and YOLOv5) and three Transformer models (Detection Transformer, Vision Transformer, and Swin Transformer). Based on the performance of the first-stage models against the true mosquito labels, the second learning stage adjusted the model parameters of three Swin Transformer variants (T, B, and L) and compared their performance. The third learning stage tested the effect of different input image sizes (384 × 384 and 224 × 224) on the Swin Transformer-L model. Finally, we proposed a deep learning model for mosquito species identification (Swin MSI) and tested its recognition of different mosquito species, focusing on the identification accuracy for the three subspecies within the Cx. pipiens Complex and on the detection of novel mosquito species.

Figure 3

Flowchart of testing deep learning models for intelligent identification of mosquito species.


Comparison between the CNN model and Transformer model results (1st round of learning)

Figure 4a shows the accuracies obtained for the six computer vision network models on the mosquito image test set. The results show that the Transformer models had a higher mosquito species discrimination ability than the CNNs.

Figure 4

Comparison of the mosquito recognition performance of the computer vision network models and variants. (a) Comparison of mosquito identification accuracy between the 3 CNNs and the 3 Transformers; (b) training set loss curve (blue), validation set loss curve (green) and validation set accuracy curve (orange) of the best-performing CNN (YOLOv5); (c) training set loss curve, validation set loss curve and validation set accuracy curve of the best-performing Transformer (Swin Transformer); (d) Swin-MSI-T test confusion matrix; (e) Swin-MSI-B test confusion matrix; (f) Swin-MSI-L test confusion matrix. In the confusion matrices, odd numbers represent females and even numbers represent males; each small square shows the recognition accuracy, which increases from red to green. An. sinensis: 1, 2; Cx. pipiens quinquefasciatus: 3, 4; Cx. pipiens pallens: 5, 6; Cx. pipiens molestus: 7, 8; Cx. modestus: 9, 10; Ae. albopictus: 11, 12; Ae. aegypti: 13, 14; Cx. pallidothorax: 15, 16; Ae. galloisi: 17, 18; Ae. vexans: 19, 20; Ae. koreicus: 21, 22; Armigeres subalbatus: 23, 24; Coquillettidia ochracea: 25, 26; Cx. gelidus: 27, 28; Cx. tritaeniorhynchus: 29, 30; Mansonia uniformis: 31, 32; An. vagus: 33, 34; Ae. elsaie: 35, 36; Toxorhynchites splendens: 37, 38.


In the CNN training process (YOLOv5), the validation accuracy required more than 110 epochs to reach 0.9, and the validation loss required about 110 epochs to drop to a plateau, while the training loss decreased continuously (Fig. 4b). These results indicate that, compared with the CNN's convergence behavior during the iterative process, the deep learning model based on the Swin Transformer algorithm achieved a higher recognition accuracy in less time.

The Swin Transformer model exhibited the highest test accuracy, 96.3%. During training, its loss stabilized after 30 epochs, and its validation accuracy rose to 0.9 after 20 epochs; the validation loss dropped to 0.36 after 20 epochs, after which the loss curve fluctuated without adverse effects (Fig. 4c). Given this excellent performance, the Swin Transformer model was used as the baseline for the subsequent analyses.
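For concreteness, the kind of train/validation loop that produces the loss and accuracy curves in Fig. 4b,c can be sketched as follows; the optimizer and learning rate are illustrative choices, not the paper's reported hyper-parameters.

```python
# Minimal epoch loop that logs train loss, validation loss, and validation
# accuracy per epoch, for plotting curves like Fig. 4b,c.
import torch
import torch.nn.functional as F

def run_epochs(model, train_loader, val_loader, epochs=30, device="cuda"):
    model.to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)  # illustrative
    history = []
    for epoch in range(epochs):
        model.train()
        train_loss = 0.0
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            loss = F.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
            train_loss += loss.item() * x.size(0)

        model.eval()
        val_loss, correct, total = 0.0, 0, 0
        with torch.no_grad():
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                logits = model(x)
                val_loss += F.cross_entropy(logits, y, reduction="sum").item()
                correct += (logits.argmax(1) == y).sum().item()
                total += y.numel()
        history.append({
            "epoch": epoch,
            "train_loss": train_loss / len(train_loader.dataset),
            "val_loss": val_loss / total,
            "val_acc": correct / total,
        })
    return history  # plot val_acc / val_loss vs. epoch to reproduce Fig. 4b,c-style curves
```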

Swin Transformer model variant adjustment (2nd round of learning)

After testing clarified the superior performance of the Swin Transformer algorithm, we chose different Drop_path_rate, Embed_dim and Depths parameter settings, labeling the parameter sets Swin Transformer-T, Swin Transformer-B, and Swin Transformer-L. Drop_path is an efficient regularization method, and an asymmetric Drop_path_rate is beneficial for supervised representation learning in image classification tasks with Transformer architectures. The Embed_dim parameter is the channel dimension of the features obtained after the input red–green–blue (RGB) image passes through the Swin Transformer block in stage 1. The Depths parameter is the number of Swin Transformer blocks used in each of the four stages. The parameter settings and test results are shown in Table 1. Owing to the increases in the stage-3 Swin Transformer block count and in Embed_dim, the recognition accuracies of the three variants were 95.8%, 96.3%, and 98.2%, with corresponding F1 scores of 96.2%, 96.7% and 98.3%. These variants thus improve mosquito species identification in a manner analogous to a CNN increasing its number of convolutional channels to extract more features and improve its overall classification ability. The Swin Transformer-L variant, which exhibited the highest accuracy, was selected as the baseline for the subsequent work.

Table 1 Parameters and test accuracy of three variants of Swin Transformer.
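The three variants differ only in these parameters and can be instantiated directly from timm's SwinTransformer class, as sketched below. The Embed_dim/Depths/num_heads values are the official Swin-T/B/L sizes; the drop_path_rate values shown are common defaults from the reference implementation and stand in for the exact settings in Table 1.

```python
# Sketch of the three variants compared in Table 1.
from timm.models.swin_transformer import SwinTransformer

def build_variant(embed_dim, depths, num_heads, drop_path_rate, num_classes=38):
    return SwinTransformer(
        img_size=224, patch_size=4, window_size=7,
        embed_dim=embed_dim,        # channel width C of stage 1 (Embed_dim)
        depths=depths,              # Swin blocks per stage; stage 3 grows from T to B/L
        num_heads=num_heads,
        drop_path_rate=drop_path_rate,
        num_classes=num_classes,    # species x sex labels as in Fig. 4
    )

swin_t = build_variant(96,  (2, 2, 6, 2),  (3, 6, 12, 24), 0.2)
swin_b = build_variant(128, (2, 2, 18, 2), (4, 8, 16, 32), 0.5)
swin_l = build_variant(192, (2, 2, 18, 2), (6, 12, 24, 48), 0.5)
```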

By plotting confusion matrices of the test set results for the three Swin Transformer variants, we obtained the proportion of correct and incorrect identifications in each category, visually reflecting the mosquito species discrimination ability (Fig. 4d–f). In each matrix, darker diagonal colors indicate higher identification accuracy for the corresponding mosquito category. Five mosquito species have missing entries because Ae. vexans, Coquillettidia ochracea, Mansonia uniformis, An. vagus and Toxorhynchites splendens were represented in the dataset by only one sex. The confusion matrix in Fig. 4f shows the fewest misidentifications and the highest accuracy in each category, indicating that the Swin Transformer-L model classifies better than the Swin Transformer-T and Swin Transformer-B models.
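Such matrices can be generated from a trained variant and the test loader with standard tooling; a sketch using scikit-learn and matplotlib follows (an assumption about tooling, not the authors' scripts).

```python
# Build and plot a row-normalized confusion matrix like those in Fig. 4d-f.
import torch
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

@torch.no_grad()
def plot_confusion(model, test_loader, class_names, device="cuda"):
    model.to(device).eval()
    y_true, y_pred = [], []
    for x, y in test_loader:
        logits = model(x.to(device))
        y_pred.extend(logits.argmax(1).cpu().tolist())
        y_true.extend(y.tolist())
    cm = confusion_matrix(y_true, y_pred, normalize="true")  # per-class rates
    disp = ConfusionMatrixDisplay(cm, display_labels=class_names)
    disp.plot(cmap="RdYlGn", xticks_rotation="vertical")  # red -> green = higher accuracy
    plt.show()
```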

Effect of the input image size on the discrimination ability (3rd round of learning)

To investigate the relationship between input image size and mosquito species identification performance, we compared input sizes of 224 × 224 and 384 × 384 pixels based on the Swin Transformer-L model and examined the accuracy differences across 8 mosquito categories; the test results are shown in Table 2. With 224 × 224 inputs, the batch_size parameter was set to 16; with 384 × 384 inputs, it was set to 4, under which conditions video memory utilization was 67%. This is consistent with the scaling of the self-attention computation in the Swin Transformer given by Eq. (1) when 384 × 384 images are used. The Swin Transformer-L model required the longest training time in this work, 126 h, exceeding even the 124 h of YOLOv5, previously the most time-consuming model to train; such a long training process more fully reflects the performance differences between models, and the inference speed of the model is not affected by the training time. Compared with the accuracy of 98.2% for 224 × 224 inputs, the 384 × 384 input size raised the mosquito species identification accuracy of the Swin Transformer-L model to 99.04%, an improvement of 0.84 percentage points.

$$\Omega(\text{W-MSA}) = 4HWC^{2} + 2M^{2}HWC$$

(1)
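In Eq. (1), H and W are the feature-map height and width, C is the channel dimension, and M is the window size. A quick check of how the window-attention cost grows with input size, using stage-1 Swin-L values (C = 192) for illustration:

```python
# Evaluate Eq. (1): W-MSA cost is linear in H*W, quadratic only in C and M.
def w_msa_flops(H, W, C, M):
    return 4 * H * W * C**2 + 2 * M**2 * H * W * C

print(w_msa_flops(H=56, W=56, C=192, M=7))   # 224x224 input -> 56x56 patch grid, window 7
print(w_msa_flops(H=96, W=96, C=192, M=12))  # 384x384 input -> 96x96 patch grid, window 12
```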

Table 2 Comparison of recognition accuracy for different input image sizes.

Visualizing and understanding the Swin MSI models

To investigate the differences between the attentional features used by Swin MSI and those used by taxonomists for mosquito species identification, we applied the Grad-CAM method to visualize the Swin MSI attention areas on mosquitoes at different stages. Because the Swin Transformer's multi-head self-attention has different attentional ranges in different stages, different attention weights fall on different parts of the mosquito. In stage 1, the feature dimension of each patch was 4 × 4 × C, so the multi-head self-attention attended to detailed parts of the mosquitoes, such as the legs, wings, antennae, and pronota. In stage 2, the feature dimension of each patch was 8 × 8 × 2C, and the attention focused on body regions such as the head, thorax, and abdomen. In stage 3, with a patch feature dimension of 16 × 16 × 4C, the attention covered most regions of the mosquito, forming a global view of each specimen (Fig. 5). This progression, from details to local regions to the whole mosquito, is essentially the same process taxonomists use when classifying mosquito morphology.
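A minimal Grad-CAM sketch using the open-source pytorch-grad-cam package follows. The package choice, the target layer, and the token-reshape step are assumptions that depend on the Swin implementation version (here, a timm-style Swin whose blocks emit (B, H·W, C) token tensors).

```python
# Grad-CAM visualization sketch for a Swin backbone (as in Fig. 5).
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image

def reshape_transform(tensor, height=7, width=7):
    # tokens (B, H*W, C) -> feature map (B, C, H, W); use 12x12 for 384 inputs
    result = tensor.reshape(tensor.size(0), height, width, tensor.size(2))
    return result.permute(0, 3, 1, 2)

def visualize(model, input_tensor, rgb_float_image, class_idx):
    target_layers = [model.layers[-1].blocks[-1].norm1]  # last-stage block; adjust per version
    cam = GradCAM(model=model, target_layers=target_layers,
                  reshape_transform=reshape_transform)
    grayscale_cam = cam(input_tensor=input_tensor,
                        targets=[ClassifierOutputTarget(class_idx)])[0]
    # Overlay the heat map on the original image (float RGB in [0, 1]).
    return show_cam_on_image(rgb_float_image, grayscale_cam, use_rgb=True)
```

Targeting earlier entries of model.layers instead of the last one yields the stage-1/stage-2 maps discussed above.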

Figure 5

Attention visualization of representative mosquitoes of the genera Aedes, Culex, Anopheles, Armigeres, Coquillettidia and Mansonia. The visualization identifies the regions in the image that explain the classification process. Images of Toxorhynchites, which include only males with obviously distinct morphological characteristics, are not shown.


Ae. aegypti is widely distributed in tropical and subtropical regions worldwide and transmits Zika, dengue and yellow fever. A pair of long-stalked, sickle-shaped white spots on the shoulder sides of the mesoscutum, with a pair of longitudinal stripes running through the whole mesotergum, is the most important morphological identification feature of this species. This feature was the deepest-colored section in the attention visualization, indicating that the Swin MSI model also treated it as the principal distinguishing feature. In addition, the abdominal tergum of Ae. aegypti is black, and segments II–VII have lateral silvery-white spots and basal white bands; the model also focused on these areas.

Cx. tritaeniorhynchus is the main vector of Japanese encephalitis; this mosquito has a small body, a distinctive white ring on the proboscis (its most distinctive morphological feature), and a peppery color over its whole body. The model likewise focused on both the head and abdominal regions of this species.

An. sinensis is the top malaria vector in China; it has no more than three white spots on its anterior wing margin and a distinct white spot on its V5.2 marginal fringe. This feature was captured in stage 2, at which point the model strongly focused on the corresponding area.

The most obvious feature of Armigeres subalbatus is its laterally flattened and slightly downward-curved proboscis; the attention visualization revealed that the model focused on these regions from stage 1 to stage 3. The mesoscutum and abdominal tergum are less important for identification than the proboscis, and the attention visualization correspondingly shows that the neural network focused less on these features.

Coquillettidia ochracea belongs to the genus Coquillettidia and is golden yellow over its whole body, with the most pronounced abdomen among the analyzed species. Consistent with its morphological taxonomy, the model focused on the abdomen of this species.

Mansonia uniformis is a vector of Malayan filariasis. Its abdominal tergum is dark brown, and abdominal segments II–VII have yellow terminal bands and lateral white spots, which are more conspicuous than the dark brown feature on the proboscis. The attention visualization showed that the Swin MSI model was accordingly more concerned with the abdominal features than with the proboscis.

Subspecies-level identification tests of mosquitos in the Culex pipiens Complex

Fine-grained image classification has been the focus of extensive research in the field of computer vision25,26. On the test set constructed herein for the three subspecies of the Cx. pipiens Complex (270 images), the Swin MSI model identified both subspecies and sex with 100% accuracy.

The morphological characteristics of Cx. pipiens quinquefasciatus, Cx. pipiens pallens, and Cx. pipiens molestus within the Cx. pipiens Complex are almost indistinguishable, but their host preferences, self-fertility, breeding environments, and overwintering strategies differ greatly27. Among the features available for morphological classification, the stripes on the abdominal tergum of Cx. pipiens quinquefasciatus are usually inverted triangles that are not connected with the pleurosternums, while those of Cx. pipiens pallens are rectangular and connected with the pleurosternums. Cx. pipiens molestus, an ecological subspecies of the Cx. pipiens Complex, is morphologically even closer to Cx. pipiens pallens. However, taxonomists do not recommend using these unstable features as the main taxonomic basis for differentiation. Analyzing the attention visualizations of the three subspecies (last three rows in Fig. 5), we found that the networks still focused on the abdominal regions (shown in dark red) for Cx. pipiens quinquefasciatus, Cx. pipiens pallens, and Cx. pipiens molestus. These areas of focus differ from those of the human eye, suggesting that the Swin MSI model detects fine-grained features of these three subspecies that are indistinguishable to the naked human eye.

Novel mosquito classification attribution

After performing a confidence check on the successfully identified mosquito images in the dataset, we found the lowest confidence value to be 85%. A higher confidence threshold means stricter evaluation criteria, which better reflects the model's performance; therefore, 0.85 was set as the confidence threshold for judging novel mosquitoes. When identifying 10 mosquito species outside the dataset, the highest species-level confidence was always below 85%; with results output at the genus level (Fig. 1d), the average accuracy of correct attribution was 96.26%, with an F1 score of 98.09% (Table 3). The images tested as novel Ae., Cx. and An. mosquitoes were from Minakshi and Couret et al.28,29.
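A minimal sketch of this decision rule follows. The species-to-genus mapping and the aggregation of species probabilities into a genus score are illustrative assumptions; the text above specifies only that sub-threshold predictions are output at the genus level.

```python
# Novel-species decision rule (Fig. 1d): report the species if the top
# confidence clears the 0.85 threshold, otherwise fall back to the genus.
from collections import defaultdict

THRESHOLD = 0.85
SPECIES_TO_GENUS = {"Ae. aegypti": "Aedes", "Ae. albopictus": "Aedes",
                    "Cx. pipiens pallens": "Culex"}  # hypothetical stub; 17 species in total

def attribute(probs: dict) -> str:
    """probs: species name -> softmax confidence from the classifier."""
    species, conf = max(probs.items(), key=lambda kv: kv[1])
    if conf >= THRESHOLD:
        return species  # known species, top-5 output as in Fig. 1c
    genus_scores = defaultdict(float)
    for sp, p in probs.items():  # assumed aggregation: sum probabilities per genus
        genus_scores[SPECIES_TO_GENUS[sp]] += p
    return max(genus_scores, key=genus_scores.get)  # genus-level output (Fig. 1d)
```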

Table 3 Probability of correct attribution of novel species.

