System overview
To address the open-set novel species detection problem, our system uses a two-step image recognition process. Given an image of a mosquito specimen, the first step uses CNNs trained for species classification to extract relevant features from the image. The second step is a novelty detection algorithm that evaluates those features to determine whether the mosquito belongs to one of the sixteen species known to the system's CNNs. This step consists of two stages of machine learning algorithms (Tier II and Tier III). Tier II components evaluate the features directly and are trained on both known and unknown species. Tier III evaluates the answers provided by the Tier II components to produce the final determination, and is trained on known species, the unknown species used to train Tier II, and additional unknown species not seen by previous components. If Tier III determines that the mosquito is not a member of one of the known species, it is classified as an unknown species, novel to the CNNs. This detection algorithm is tested on truly novel mosquito species, never seen by the system during training, as well as on the species used in training. If the system recognizes a mosquito as belonging to one of the sixteen known species (i.e., not novel), the image proceeds to species classification with one of the CNNs used for feature extraction.
Unknown detection accuracy
In distinguishing unknown species from known species, the algorithm achieved an average accuracy of 89.50 ± 5.63% and 87.71 ± 2.57%, average sensitivity of 92.18 ± 6.34% and 94.09 ± 2.52%, and specificity of 80.79 ± 7.32% and 75.82 ± 4.65%, micro-averaged and macro-averaged respectively, evaluated over twenty-five-fold validation (Table 1). Here, micro-average refers to the metric calculated without regard to species, such that each image sample has equal weight; it can be considered an image-sample-level metric. Macro-average refers to the metric first calculated within each species, then averaged across all species within the relevant class (known or unknown); it can be considered a species-level, or species-normalized, metric. Macro-averages tend to be lower than micro-averages when the species with the largest sample sizes have the highest metrics, whereas micro-averages are lower when the species with the smallest sample sizes have the highest metrics. Cross-validation that varied which species were treated as known and unknown produced variable sample sizes in each iteration, because each species had a different number of samples in the generated image dataset. Further sample-size variation resulted from addressing class imbalance in the training set. The mean number of samples varied across the 25 iterations because of the shuffled data partitioning for cross-validation (see Table 1 for generalized metrics; see Supplementary Table 1, Datafolds for detailed sampling data).
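The micro/macro distinction described above can be made concrete with a short sketch. The labels and species below are hypothetical, chosen only to show how the two averages diverge when sample sizes are unequal:

```python
import numpy as np

def micro_macro_accuracy(y_true, y_pred, species):
    """Micro-average: every image sample weighted equally.
    Macro-average: accuracy computed per species, then averaged."""
    y_true, y_pred, species = map(np.asarray, (y_true, y_pred, species))
    correct = y_true == y_pred
    micro = float(np.mean(correct))
    per_species = [np.mean(correct[species == s]) for s in np.unique(species)]
    macro = float(np.mean(per_species))
    return micro, macro

# Toy example: species "A" has four samples, species "B" has one.
y_true  = [1, 1, 1, 1, 0]            # 1 = known, 0 = unknown
y_pred  = [1, 1, 1, 0, 0]
species = ["A", "A", "A", "A", "B"]
micro, macro = micro_macro_accuracy(y_true, y_pred, species)
# micro = 4/5 = 0.8; macro = mean(3/4, 1/1) = 0.875
```

Because species "A" dominates the sample count and carries the only error, the micro-average sits below the macro-average here, mirroring the behavior described above.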
Differences within the unknown species dictated by algorithm structure
The fundamental aim of novelty detection is to determine whether the CNN in question is familiar with the species, or class, shown in the image. CNNs are designed to identify visually distinguishable classes, or categories. In our open-set problem, the distinction between known and unknown species is arbitrary from a visual perspective; it is only a product of the available data. However, the known or unknown status of a specimen is a determinable product of the feature-layer outputs, or features, produced by the CNN’s visual processing of the image. Thus, we take a tiered approach, in which CNNs trained on a specific set of species extract a specimen’s features, and independent classifiers trained on a wider set of species analyze the features produced by the CNNs to assess whether the CNNs are familiar with the species in question. The novelty detection algorithm consists of three tiers, hereafter referred to as Tier I, II, and III, intended to determine whether the specimen being analyzed is from the closed set of species known to the CNN:
Tier I: two CNNs used to extract features from the images.
Tier II: a set of classifiers, such as SVMs, random forests, and neural networks, which independently process the features from Tier I CNNs to distinguish a specimen as either known or unknown species.
Tier III: soft voting of the Tier II classifications, with a clustering algorithm, in this case a Gaussian Mixture Model (GMM), used as an arbiter for low-confidence predictions.
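A minimal sketch of how Tiers II and III fit together, using synthetic features in place of the Tier I CNN outputs. The models, confidence margin, and GMM threshold below are illustrative stand-ins, not the paper's actual components or hyperparameters (the WDNN component is omitted):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))       # stand-in for Tier I feature vectors
y = (X[:, 0] > 0).astype(int)        # stand-in labels: 1 = known, 0 = unknown

# Tier II: independent classifiers evaluating the Tier I features.
tier2 = [SVC(probability=True).fit(X, y),
         RandomForestClassifier(random_state=0).fit(X, y)]

# Tier III: soft voting over Tier II probabilities.
probs = np.mean([clf.predict_proba(X)[:, 1] for clf in tier2], axis=0)
confident = np.abs(probs - 0.5) > 0.25      # illustrative confidence margin

# GMM arbiter fit on known-species features; low likelihood -> unknown.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X[y == 1])
threshold = np.quantile(gmm.score_samples(X[y == 1]), 0.05)
arbiter = (gmm.score_samples(X) > threshold).astype(int)

# Confident soft votes pass through; the GMM decides the rest.
final = np.where(confident, (probs > 0.5).astype(int), arbiter)
```

The design point this sketch captures is that the arbiter is only consulted for samples near the voting boundary, so the GMM never overrides a confident consensus of the Tier II classifiers.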
The tiered architecture necessitated partitioning of groups of species between the tiers, and an overview of the structure is summarized in Fig. 2A. The training schema resulted in three populations of unknown species: set U1, consisting of species used to train Tier I, also made available for training subsequent Tiers II and III; set U2, consisting of additional species unknown to the CNNs used to train Tiers II and III; and set N, consisting of species used only for testing (see Fig. 2B). Species known to the CNNs are referred to as set K. It is critical to measure the difference between these species sets, as any of the species may be encountered in the wild. U1 achieved 97.85 ± 2.81% micro-averaged accuracy and 97.34 ± 3.52% macro-averaged accuracy; U2 achieved 97.05 ± 1.94% micro-averaged accuracy and 97.30 ± 1.41% macro-averaged accuracy; N achieved 80.83 ± 19.91% micro-averaged accuracy and 88.72 ± 5.42% macro-averaged accuracy. The K set achieved 80.79 ± 7.32% micro-averaged accuracy and 75.83 ± 5.42% macro-averaged accuracy (see Table 2). The test set sample sizes for each of the twenty five folds are as follows, (formatted [K-taxa,K-samples;U1-taxa,U1-samples;U2-taxa,U2-samples;N-taxa,N-samples]): [16,683;8,51;10,536;13,456], [16,673;8,51;9,537;13,485], [16,673;8,51;8,523;13,508], [16,673;8,46;6,159;11,869], [16,694;8,51;7,483;10,548], [15,409;9,62;11,2906;8,546], [15,456;9,62;9,2458;12,1024], [15,456;10,67;13,2359;9,1115], [15,456;9,62;8,3189;12,306], [15,456;10,67;10,2874;10,601], [16,543;10,56;12,1450;10,1052], [16,484;9,52;11,2141;10,312], [16,492;10,54;11,2185;12,263], [16,512;8,45;15,2292;10,189], [16,480;9,49;9,1652;13,790], [16,442;9,44;11,1253;11,665], [16,494;10,54;14,1727;10,228], [16,442;9,55;13,1803;10,96], [16,538;10,60;8,1509;9,502], [16,489;10,60;13,1764;9,184], [16,462;8,47;13,1415;11,452], [16,437;8,54;9,1548;11,320], [16,447;8,55;11,654;10,1193], [16,547;8,44;9,1437;11,531], [16,548;7,52;7,1464;11,499]. 
See Supplementary Table 1, Datafolds for more detailed sample information.
The novelty detection architecture was designed with three tiers to assess whether the CNNs were familiar with the species shown in each image. (A) Tier I consisted of two CNNs used as feature extractors. Tier II consisted of classifiers making an initial determination of whether the specimen is known or unknown by analyzing the features of one of the Tier I CNNs, and the logits in the case of the wide and deep neural network (WDNN). In this figure, SVM refers to a support vector machine, and RF refers to a random forest. Tier III makes the final classification, first with soft voting of the Tier II outputs, then passing high-confidence predictions through as the final output and sending low-confidence predictions to a Gaussian Mixture Model (GMM), which serves as the arbiter. (B) Data partitioning for training each component of the architecture is summarized: Tier I is trained on the K set of species, known to the algorithm; the Tier I open-set CNN is also trained on the U1 set of species, the first set of unknown species used in training; Tier II is trained on the K set, the U1 set, and the U2 set of species, the second set of unknown species used in training; Tier III is trained on the same species and data split as Tier II. Data-split ratios were variable for each species over each iteration (Xs,m, where s represents a species, m represents a fold, and X is the percentage of the data devoted to training) for Tiers II and III; Xs,m was adjusted to manage class imbalance within genus across known and unknown classes. Testing was performed on each of the K, U1, and U2 sets, as well as the N set, the final set of unknown species reserved for testing the algorithm, such that it is tested on previously unseen taxa, replicating the plausible scenario to be encountered in deployment of CNNs for species classification. Over the twenty-five folds, each known species was considered unknown for at least five folds and included as novel for at least one fold.
Subsequent species classification
Following the novelty detection algorithm, species identified as known are sent for species classification to the closed-set Xception model used in Tier I of the novelty detection algorithm. Figure 3A shows the species classification results independently over the five folds of Tier I, which achieved a micro-averaged accuracy of 97.04 ± 0.87% and a macro F1-score of 96.64 ± 0.96%. Figure 3B shows the species classification cascaded with the novelty detection methods, where all unknown species are grouped into a single unknown class alongside the known classes in an aggregated mean confusion matrix over the twenty-five folds of the full methods, yielding a micro-averaged accuracy of 89.07 ± 5.58% and a macro F1-score of 79.74 ± 3.65%. The confusion matrix is normalized by species and shows the average classification accuracy and error distribution. The independent accuracy for classifying a single species ranged from 72.44 ± 13.83% (Culex salinarius) to 100 ± 0% (Aedes dorsalis, Psorophora cyanescens), and 15 of the 20 species maintained an average sensitivity above 95%. Test set sample sizes for each species were as follows (formatted as species, [fold1,fold2,fold3,fold4,fold5]): Ae. aegypti: [127,0,133,132,126]; Ae. albopictus: [103,90,0,99,102]; Ae. dorsalis: [43,41,42,0,41]; Ae. japonicus: [162,159,154,156,0]; Ae. sollicitans: [57,0,60,58,60]; Ae. taeniorhynchus: [0,25,27,25,24]; Ae. vexans: [50,48,0,46,49]; An. coustani: [29,21,18,0,22]; An. crucians s.l.: [56,58,61,61,0]; An. freeborni: [87,0,77,79,80]; An. funestus s.l.: [158,174,0,173,175]; An. gambiae s.l.: [182,178,178,0,166]; An. punctipennis: [0,36,31,34,33]; An. quadrimaculatus: [0,28,28,28,30]; Cx. erraticus: [47,47,44,49,0]; Cx. pipiens s.l.: [212,0,218,219,205]; Cx. salinarius: [25,26,0,26,25]; Ps. columbiae: [66,59,67,0,64]; Ps. cyanescens: [0,55,56,54,56]; Ps. ferox: [40,31,41,34,0].
Mean normalized confusion matrices for species classification show the distribution of error within species. The species classification in these confusion matrices was performed by the Tier I CNN, the closed-set Xception model. The confusion matrix conveys the ground truth of the sample horizontally, with labels on the left, and the prediction of the full methods vertically, with labels on the bottom. Accurate classifications lie along the diagonal, where ground truth and prediction match; all other cells of the matrix describe the error. Sixteen species were known for a given fold and 51 species were considered unknown for a given fold, with each of the twenty species used as known classes considered unknown for one fold. (A) The species classification independent of novelty detection shows an average accuracy of 97.04 ± 0.87% and a macro F1-score of 96.64 ± 0.96%, calculated over the five folds of Tier I classifiers, trained and tested on an average of 7174.8 and 1544.6 samples, respectively. Of the error, 73.5% occurred with species of the same genus as the true species. (B) The species classification as a subsequent step after novelty detection yielded 89.07 ± 5.58% average accuracy and a macro F1-score of 79.74 ± 3.65%, trained and tested on an average of 7174.8 and 519.44 samples, evaluated over the twenty-five folds of the novelty detection methods. First, a sample was sent to the novelty detection algorithm. If the sample was predicted to be known, it was then sent to the species classifier, the closed-set Xception model used in Tier I, for classification.
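Normalizing a confusion matrix by species, as in these figures, amounts to row-normalization, so that each row sums to one and the diagonal reads as per-species accuracy. A minimal sketch with a toy 2×2 matrix (the counts are hypothetical):

```python
import numpy as np

def normalize_by_species(cm):
    """Row-normalize a confusion matrix so each row (true species) sums to 1;
    the diagonal then gives per-species accuracy (sensitivity)."""
    cm = np.asarray(cm, dtype=float)
    row_sums = cm.sum(axis=1, keepdims=True)
    # Guard against empty rows (species with no test samples).
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)

cm = np.array([[45, 5],     # species 1: 45 correct, 5 misclassified
               [2, 8]])     # species 2: 8 correct, 2 misclassified
norm = normalize_by_species(cm)
# Diagonal: per-species accuracy = [0.9, 0.8]
```

Averaging the resulting diagonal corresponds to the macro-averaged (species-level) accuracy used throughout these results.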
Many of the species that were part of the unknown datasets had enough data to perform preliminary classification experiments. Thirty-nine of the 67 species had more than 40 image samples. Species classification on these 39 species yielded an unweighted accuracy of 93.06 ± 0.50% and a macro F1-score of 85.07 ± 1.81% (see Fig. 4A). The average F1-score for each species was plotted against the number of specimens contributing samples to that species, which elucidates the relationship between the available training data and accuracy (see Fig. 4B). No species with more than 100 specimens produced an F1-score below 93%.
Species classification across 39 species shows the strength of CNNs for generalized mosquito classification, and elucidates a guideline for the number of specimens required for confident classification. Classification achieved an unweighted accuracy of 93.06 ± 0.50% and a macro F1-score of 85.07 ± 1.81%, trained, validated, and tested over an average of 9080, 1945, and 1945 samples over five folds. (A) The majority of the error in this confusion matrix reflects confusion between species of the same genus. Some of the confusion across genera is intuitive from an entomologist’s perspective, such as the 10.2% of Deinocerites cancer samples classified as Culex spp. Other errors are less intuitive, such as the 28.61% of Culiseta incidens samples classified as Aedes atlanticus. (B) This plot of each species’ average F1-score against the number of specimens that made up its available samples for training and testing shows the relationship between the available data for a given species and classification accuracy. When following the database development methods described in this work, a general guideline of 100 specimens’ worth of data can be extrapolated as a requirement for confident mosquito species classification.
Test set sample size for each species in the 39 species closed-set classification were as follows (formatted as species, [fold1,fold2,fold3,fold4, fold5]): Ae. aegypti: [131,127,127,124,133]; Ae. albopictus: [99,99,107,97,95]; Ae. atlanticus: [15,13,14,14,15]; Ae. canadensis: [17,21,21,21,20]; Ae. dorsalis: [42,41,43,40,43]; Ae. flavescens: [13,14,14,14,14]; Ae. infirmatus: [17,15,19,18,16]; Ae. japonicus: [155,153,151,160,150]; Ae. nigromaculis: [6,6,5,5,5]; Ae. sollicitans: [63,61,58,57,60]; Ae. taeniorhynchus: [30,25,27,25,25]; Ae. triseriatus s.l.: [14,16,17,14,13]; Ae. trivittatus: [28,24,25,24,23]; Ae. vexans: [46,58,57,51,50]; An. coustani: [25,32,27,33,27]; An. crucians s.l.: [64,57,60,59,62]; An. freeborni s.l.: [85,77,82,74,89]; An. funestus s.l.: [181,187,166,175,161]; An. gambiae s.l.: [191,182,178,185,194]; An. pseudopunctipennis: [10,8,12,9,9]; An. punctipennis: [32,28,38,32,32]; An. quadrimaculatus: [30,33,26,37,35]; Coquillettidia perturbans: [31,29,30,32,35]; Cx. coronator: [10,9,10,11,10]; Cx. erraticus: [48,51,49,53,50]; Cx. nigripalpus: [14,14,13,13,13]; Cx. pipiens s.l.: [205,203,216,208,216]; Cx. restuans: [12,13,12,14,12]; Cx. salinarius: [24,25,24,23,24]; Cus. incidens: [9,9,9,9,8]; Cus. inornata: [9,9,8,9,9]; Deinocerites cancer: [10,10,10,10,9]; De. sp. Cuba-1: [16,14,15,14,15]; Mansonia titillans: [15,16,15,14,13]; Ps. ciliata: [29,26,24,23,28]; Ps. columbiae: [62,59,63,60,61]; Ps. cyanescens: [55,54,57,55,55]; Ps. ferox: [32,48,31,36,34]; Ps. pygmaea: [24,25,25,24,25].
Comparison to alternative methods
Some intuitive simplifications of our methods, along with some common direct methods for novel species detection, were compared to our full methods. All compared methods were found to be statistically different from the full methods using McNemar’s test. The compared methods, along with their macro F1-scores, standard deviations, and p-values relative to the full methods, were as follows: (1) soft voting of all Tier II component outputs, without a GMM arbiter (86.87 ± 3.11%, p < 0.00001); (2) the random forest Tier II component, appended to the closed-set classification CNN from Tier I (82.80 ± 3.84%, p < 0.00001); (3) the SVM Tier II component, appended to the closed-set classification CNN from Tier I (82.68 ± 4.51%, p < 0.00001); (4) the WDNN Tier II component, appended to the closed-set classification CNN from Tier I (81.87 ± 4.53%, p < 0.00001); (5) the softmax of the closed-set Xception logits, producing an unknown prediction for specimens where no probability exceeded a threshold determined during training (72.38 ± 4.43%, p < 0.00001); (6) the predicted class of the open-set Xception model, with any genus-level class and the general mosquito class remapped to the unknown class (72.72 ± 4.28%, p < 0.00001); (7) ODIN paired with the closed-set Xception to recognize out-of-distribution classifications (49.58 ± 26.02%, p < 0.00001). Our full novelty detection methods are significantly different from each alternative method tested on these twenty-five folds, but with a macro F1-score and standard deviation (86.24 ± 2.48%) similar to those of the simplified methods. The difference is noticeable in the higher macro-averaged unknown sensitivity of the full methods, seen in Table 3, the advantages of which are discussed in the Discussion.
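Baseline (5), softmax thresholding, can be sketched in a few lines: a specimen is flagged unknown when no class probability clears a threshold. The logits below are hypothetical, and the 0.9 threshold is illustrative; the paper determines its threshold during training:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_with_rejection(logits, threshold=0.9, unknown_label=-1):
    """Predict the argmax class, or reject as unknown when the
    maximum softmax probability falls below the threshold."""
    probs = softmax(np.asarray(logits, dtype=float))
    classes = probs.argmax(axis=-1)
    return np.where(probs.max(axis=-1) >= threshold, classes, unknown_label)

logits = np.array([[8.0, 1.0, 0.5],   # peaked distribution -> class 0
                   [1.2, 1.0, 1.1]])  # flat distribution -> rejected
preds = predict_with_rejection(logits)  # -> [0, -1]
```

This baseline relies entirely on the closed-set model's own confidence, which is what the tiered methods above are designed to improve upon.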
Source: Ecology - nature.com