
Deep learning identification for citizen science surveillance of tiger mosquitoes

Figure 2

Schematic of the labeling process. Participants usually upload several images in a single report. The validator first marks offensive or inappropriate photos as hidden, then picks the best photo; all remaining photos are marked as not classified. In rare cases, two or three images from the same report are annotated. The mosquito images are classified into four categories (Aedes albopictus, Aedes aegypti, other species, or cannot tell), and the confidence of each label is marked as probable or confirmed. In this paper we excluded the not classified, hidden, and cannot-tell images.


Between 2014 and 2019, 7686 citizen-made mosquito photos were labeled through Mosquito Alert by entomology experts, with labels indicating whether Ae. albopictus appears in the photo. The photos were included in reports that Mosquito Alert participants uploaded, and each report could contain several photos (see Fig. 2). The dataset consists of 7168 reports: for 6699 reports the experts labeled only the best photo, for 420 reports they labeled two photos, and for 49 reports three. Although these reports usually contain additional photos, only those with expert labels were used in the analysis, as it cannot be assumed that all photos in a report would have received the same label.

The main goals of Mosquito Alert during this 6-year period were to monitor the spread of Ae. albopictus and provide early detection of Ae. aegypti in Spain. Although people participate in Mosquito Alert all over the world, the majority of the participants and of the photos are from Spain (see Fig. 1). As Ae. aegypti has not been reported in Spain in recent times, most Mosquito Alert participants lived in areas where Ae. aegypti is not present, so most of the photos are of Ae. albopictus. For the detailed yearly distribution of the photos, see Table 1.

Table 1 The collected and expert validated dataset for the period 2014–2019.


A popular deep learning model, ResNet50 [26], was trained and evaluated on the collected dataset with yearly cross-validation. ResNet50 was used because of its wide popularity and its proven classification power on various datasets. As chasing marginal increments of classification power is not a goal of this paper, we do not benchmark other state-of-the-art ImageNet models. Yearly cross-validation was used to rule out information leakage (e.g., a user submitting multiple reports of the same mosquito, which could place near-duplicate photos in both training and test sets).
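The leakage-free split amounts to grouping samples by report year so that no year contributes to both training and testing. A minimal sketch of the fold bookkeeping (model training omitted; `records` is a hypothetical list of labeled samples, not the authors' code):

```python
from collections import defaultdict

def yearly_folds(records):
    """Build cross-validation folds where each fold tests on one
    year's samples and trains on all other years, so near-duplicate
    reports from the same year never span the train/test boundary.

    records: iterable of (sample_id, year, label) tuples.
    Returns a list of (test_year, train_ids, test_ids).
    """
    by_year = defaultdict(list)
    for sample_id, year, _label in records:
        by_year[year].append(sample_id)
    folds = []
    for test_year in sorted(by_year):
        test_ids = by_year[test_year]
        train_ids = [i for y, ids in by_year.items()
                     if y != test_year for i in ids]
        folds.append((test_year, train_ids, test_ids))
    return folds
```

Each fold's `train_ids` would then feed a ResNet50 training run, with metrics pooled over the per-year test sets.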

The trained model is not only capable of generating highly accurate predictions, but it can also ease the human annotators' workload by auto-marking the images on which the neural network is confident and more accurate, leaving the more uncertain cases to the entomology experts. Moreover, visualizing the erroneous predictions revealed a few recurring patterns, which suggest guidelines for taking images that the model can process best.

Several aspects of the dataset were explored as follows.

Classification

Since Mosquito Alert was centered around Ae. albopictus during the relevant time period (2014–2019), the collected dataset is biased towards this species (Table 1). We explored training classifiers on the Mosquito Alert dataset alone, and also tried training on a balanced dataset, where 3896 negative samples of various non-mosquito insects were added from the IP102 [27] dataset. From IP102, images resembling mosquitoes and images of striped insects were selected. Although the presented Mosquito Alert dataset is filtered to contain only mosquito images, in later use non-mosquito images might be uploaded by citizens; training the CNN on a combination of mosquito and non-mosquito images can help the model make correct non-tiger predictions in those cases too. For testing, in each fold only the Mosquito Alert dataset was used.

The trained classifiers achieved an extremely high area under the receiver operating characteristic curve (ROC AUC) score of 0.96 (see Fig. 3). The fact that the ROC AUC score for each fold was always above 0.95 demonstrates the consistency of our classifier. Inspecting the confusion matrix shows that the model makes more false positive predictions (with tiger mosquito defined as the positive outcome) than false negatives, resulting in high sensitivity. Balancing the Mosquito Alert dataset with various insect images from IP102 resulted in a slight performance boost and, as expected, narrowed the gap between the numbers of false positive and false negative samples, see Table 2.
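The two reported quantities can be computed dependency-free as a minimal sketch (in practice a library such as scikit-learn provides `roc_auc_score` and `confusion_matrix`; the helper names below are illustrative):

```python
def roc_auc(y_true, y_prob):
    """ROC AUC as the probability that a random positive sample
    scores above a random negative one (ties count half)."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def confusion(y_true, y_prob, threshold=0.5):
    """Return (tn, fp, fn, tp) with tiger mosquito as the positive
    class; predictions are thresholded at `threshold`."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_prob):
        pred = p >= threshold
        if t == 1 and pred:
            tp += 1
        elif t == 1:
            fn += 1
        elif pred:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp
```

Sensitivity then follows as tp / (tp + fn) over the pooled cross-validated predictions.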

Figure 3

Left: ROC curve calculated on the predictions for the 7686 images in the Mosquito Alert dataset with yearly cross-validation. The blue line shows the case when only the Mosquito Alert dataset was used for training, the orange when the training dataset was balanced with non-tiger mosquito insect images from the IP102 dataset. A zoom into the part of the ROC curve where the two methods differ the most is also highlighted. Right: the confusion matrix calculated on the same predictions when only the Mosquito Alert dataset was used for training. For both, a positive label means a tiger mosquito is present.

Table 2 Yearly cross-validation results using the Mosquito Alert dataset alone and its IP102-augmented version.


How to take a good picture?

Inspecting the weaknesses of a machine learning model is a fruitful way to gain a deeper understanding of the underlying problems and mechanisms. In our case, a careful review of the mispredicted images led to useful insights into what makes a photo hard to classify for the deep learning model. In Fig. 8, a few selected examples are presented. Unlike humans, deep learning models rely more on textures than on shapes [28]. As a consequence, grid-like background patterns or striped objects may easily confuse the machine classifier. A larger, richer training set can help to avoid these pitfalls, but we also have the option to advise the participants. If participants avoid confusing setups when taking photos, this can improve the accuracy of the automated classification. These guidelines can be added to the Mosquito Alert application to help participants take good images of mosquitoes.

  • Do not use striped structures (e.g. a mosquito net or fly swatter) as a background.

  • Avoid complex backgrounds when possible, for example patterned carpets, nets, reflective or shiny surfaces, and bumpy wallpaper.

  • Use a clear, white background (e.g. a sheet of plain paper is perfect) or hold the mosquito with your finger pads.

  • Make sure the mosquito is in focus and covers as large an area of the photo as possible.

In general, it is desirable to have a clean white background with the mosquito centered, and with the image containing as little background as possible.

Dataset size impact on model performance

Modern deep CNNs tend to generate better predictions when trained on larger datasets. In this experiment, we trained a ResNet50 model on 10%, 20%, ..., 90%, 100% of 6686 images and evaluated the model on the remaining 1000 images. The 1000 images were selected from the same year (2019) and all of them came from reports with only one photo. There were 709 tiger mosquitoes among the 1000 test images. ROC AUC and accuracy were calculated with a 500-round bootstrapping of the 1000 test images.
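The bootstrap evaluation can be sketched as follows; `bootstrap_metric` and `accuracy` are illustrative helpers standing in for the per-step evaluation, not the authors' code:

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of test samples whose prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def bootstrap_metric(y_true, y_pred, metric, rounds=500, seed=0):
    """Resample the fixed test set with replacement `rounds` times
    and collect the metric, yielding a mean and standard deviation
    for each training-set size."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(rounds):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx],
                             [y_pred[i] for i in idx]))
    mean = sum(scores) / rounds
    std = (sum((s - mean) ** 2 for s in scores) / rounds) ** 0.5
    return mean, std
```

Running this once per subsampled training size produces the mean and spread plotted in Fig. 4.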

Figure 4

Training a ResNet50 model on a subsampled training dataset. The model was tested against the same 1000 test images at every step, and statistics of the test metrics were calculated with a 500-round bootstrapping. The curve demonstrates the diversity of the Mosquito Alert dataset and suggests that as the dataset grows in the future, classification performance will increase further.


The mean and the standard deviation of the 500 rounds are shown in Fig. 4 for each training data size. From the figure, we can conclude that the predictive power of the model increases as more data are used. The shape of the curves also suggests that performance has not yet plateaued. In the upcoming years, as the dataset size increases, further ROC AUC and accuracy gains are expected.

On measuring image quality

Throughout the examined period, Mosquito Alert outreach promoted a mosquito-targeted data collection strategy. Participants were expected to report two mosquito species (Ae. aegypti and Ae. albopictus). By defining these species as positive samples and all other potential mosquito species as negative, the participants' submission decision becomes a binary classification problem. In the majority of cases, when participants submit an image we should expect that they believe they have a positive sample. Later, based on entomological expert validation, the true label for the image was obtained.

The main goal of such a surveillance system is to keep the sensitivity of the users as high as possible while keeping their specificity at an acceptable level. Therefore, measuring the sensitivity and specificity of the users would be a plausible quality measure. Unfortunately, there is no available information about the non-submitted mosquitoes (the true negative and false negative ones), so sensitivity cannot be measured. Specificity can be measured only in the special case where a user submits no false positive images, yielding a specificity of 1. Based on the latter argument, focusing on metrics derived from the ratio of submitted tiger mosquito images to all submitted images is not meaningful. Instead, quality can be measured by the usefulness of the photos from the viewpoint of the expert validator or a CNN, as presented in the next section.

Quality evolution of the images through time and space

The Mosquito Alert dataset is a unique collection of mosquito images because, among other things, it spans 5 consecutive years (not counting 2014, when fewer than 100 reports were submitted) and it provides geolocation tags. This uniqueness makes it possible to study the temporal and spatial evolution of citizen-based mosquito image quality. To explore such an evolution, we performed two different experiments. Geolocation tags were converted to country-, region-, and city-level information via the geopy Python package. It was found that the vast majority (95%) of the reports came from Spain, so we performed the analysis only on the Spanish data.
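The geolocation step can be sketched as follows. The paper states that geopy was used; the exact geocoder is not specified, so the Nominatim usage in the comment is an assumption, and the geocoding callable is injected here so the per-city aggregation stays testable offline:

```python
from collections import Counter

def city_counts(reports, reverse_geocode):
    """Aggregate reports per (country, city).

    reports: iterable of (lat, lon) tuples from the report metadata.
    reverse_geocode: callable (lat, lon) -> (country, city). In the
    paper this role is played by geopy, e.g. (assumed, needs network):
        from geopy.geocoders import Nominatim
        geolocator = Nominatim(user_agent="mosquito-alert-analysis")
        location = geolocator.reverse((lat, lon), language="en")
    """
    counts = Counter()
    for lat, lon in reports:
        counts[reverse_geocode(lat, lon)] += 1
    return counts
```

Filtering the resulting keys to Spain and the four most active cities yields the per-city report counts behind Fig. 5.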

Figure 5

Number of submitted reports and the fraction for which the entomology expert annotator could tell whether a tiger mosquito was present in the photo. The charts are shown for the four cities where Mosquito Alert was most popular.


First, we explored the fraction of photos marked "cannot tell" by the entomology expert because the photo was not descriptive enough to decide which species was present. Figure 5 shows the ratio of useful mosquito reports, where a species decision was possible, to all mosquito reports. The chart shows this ratio for the four Spanish cities with the most submitted reports (the same information is shown in Supplementary Fig. S1 as a heatmap over Spain). The Mann–Kendall test on the fraction of useful reports yields p-values of 0.09, 0.09, 0.81, and 0.22 for Barcelona, Valencia, Málaga, and Girona, which does not support the presence of a significant trend in image quality, although any conclusion drawn from five data points must be taken with a grain of salt. This says nothing about individual participants' quality progression, because Mosquito Alert is highly open and dynamic, and the set of active participants changes constantly. Of note, over these years tiger mosquitoes spread widely from the east coast to the southern and western regions of Spain [29]. New (and naive) citizen scientists living in the newly colonized regions have been systematically called to action and participation, thus limiting the overall learning rate of the Mosquito Alert participant population. Our results suggest that either a dynamic balance exists between naive and experienced participants over the data collection period, or mosquito photographing skills are independent of user experience level. The expectation would be that as the population of Spain became more aware of the presence of tiger mosquitoes and their associated public health risks, the system should experience an increase in the useful report ratio, at least for tiger mosquitoes, and most tiger mosquito photos may be classified automatically.
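A minimal version of the Mann–Kendall trend test is sketched below (normal approximation, no tie correction; the exact variant used in the analysis is not stated in the text and may differ, e.g. an implementation such as the pymannkendall package):

```python
import math

def mann_kendall(x):
    """Two-sided Mann-Kendall trend test on a sequence x, such as the
    yearly useful-report fractions of one city. Returns (S, p) where
    S counts concordant minus discordant pairs and p uses the normal
    approximation with continuity correction."""
    n = len(x)
    s = sum((x[j] > x[i]) - (x[j] < x[i])
            for i in range(n - 1) for j in range(i + 1, n))
    var = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var)
    elif s < 0:
        z = (s + 1) / math.sqrt(var)
    else:
        z = 0.0
    # Two-sided p-value from the standard normal CDF.
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return s, p
```

With only five yearly points the normal approximation is rough, which is one more reason to treat the reported p-values cautiously.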

Figure 6

1000 random samples were selected from each year's data. Separate ResNet50 models were trained on each year, and each model was tested on the other years' data. Metrics were calculated with 500 rounds of random sampling with replacement from the test data. Left: mean of the 500-round bootstrapped accuracy. Right: mean of the 500-round bootstrapped ROC AUC.


Second, we randomly subsampled 1000 images from each year between 2015 and 2019. We then trained a different ResNet50 on each year's data and generated predictions for the remaining years, each tested separately. This way we can explore whether data from any year is "better training material" than the others. The results (see Fig. 6) show that 2015 is the worst training material, yielding ROC AUC scores of 0.83–0.84 on the test period, while the rest (2016–2019) are similar, with ROC AUC varying between 0.90 and 0.93. The reason the 2015 data was found least favourable for training is its class imbalance: the 2015 data is extremely biased towards tiger mosquitoes (94%), so a model trained on it does not see enough non-tiger mosquito samples, while the other years show lower class imbalance (70–80%), see Table 1. In general, machine learning classifiers require a substantial number of examples of each class (in our case, tiger and non-tiger mosquitoes), so worse performance is expected when training on the 2015 data.

Apart from the varying class imbalance, we can conclude that the Mosquito Alert dataset quality is consistent: we did not find any concerning difference when training and testing our model on any pair of years from 2016 to 2019.

Pre-filtering the images before expert validation

Generating human annotations for an image classification task is a labour-intensive and expensive part of any project, especially if the annotation requires expert knowledge. Therefore, having a model that generates accurate predictions for a well-defined subset of the data saves a lot of time and cost. We assume that the trained classifier is more accurate when the prediction probability is either high or low, and less accurate when it is close to 0.5. With this assumption in mind, one can tune the p_low and p_high probabilities so that images with a prediction probability p_low < p < p_high are excluded from automatic labeling and sent to human validation.

Figure 7

100,000 random (p_low, p_high) threshold pairs drawn on the predictions created via yearly cross-validation. Each time, only samples whose predicted probability fell outside the [p_low, p_high] interval were kept. Each point shows the kept data fraction and the prediction accuracy on it. By varying the lower and upper thresholds, almost 98% of the kept images are correctly predicted while retaining 80% of all images.


Varying p_low and p_high provides a trade-off between prediction accuracy and the portion of images sent to human validation. Based on Fig. 7, sending 20% of the images to human validation while obtaining an almost 98% accurate prediction for the remaining 80% of the dataset is a fruitful way to combine human labour and machine learning.
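The random threshold search behind this trade-off can be sketched as follows; function names are illustrative, and the auto-labeling rule (predict non-tiger below p_low, tiger above p_high) is an assumption consistent with the description above:

```python
import random

def evaluate_interval(y_true, y_prob, lo, hi):
    """Auto-label p < lo as non-tiger (0) and p > hi as tiger (1);
    photos with lo <= p <= hi are deferred to the entomology experts.
    Returns (kept_fraction, accuracy_on_kept), or None if no photo
    is auto-labelled."""
    kept = [(t, 0) for t, p in zip(y_true, y_prob) if p < lo]
    kept += [(t, 1) for t, p in zip(y_true, y_prob) if p > hi]
    if not kept:
        return None
    acc = sum(t == pred for t, pred in kept) / len(kept)
    return len(kept) / len(y_true), acc

def random_threshold_search(y_true, y_prob, rounds=100_000, seed=0):
    """Sample random (lo, hi) pairs to trace the accuracy vs.
    kept-fraction cloud of Fig. 7."""
    rng = random.Random(seed)
    points = []
    for _ in range(rounds):
        lo, hi = sorted((rng.random(), rng.random()))
        res = evaluate_interval(y_true, y_prob, lo, hi)
        if res:
            points.append((res[0], res[1], lo, hi))
    return points
```

Scanning the resulting points for the best accuracy at a given kept fraction recovers the operating point quoted above.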


Source: Ecology - nature.com
