
FIN-PRINT a fully-automated multi-stage deep-learning-based framework for the individual recognition of killer whales

Bigg’s killer whale photo-identification dataset

The dataset of this study includes photos of Bigg’s killer whale individuals accumulated over a period of 8 years (2011–2018), from the coastal waters of southeastern Alaska down to central California15. None of these animals were directly approached explicitly for this study. All photo-identification data was collected under federally authorized research licenses or from beyond mandated minimum viewing distances.

Supplementary Figure S1 visualizes a series of example images of this dataset. Each image contains one or more individuals. In addition to the identification name of the individual(s), further metadata such as photographer, GPS-coordinates, date, and time are provided. Every identification label is an alphanumeric sequence based on the animals’ ecotype (T—Transient), order of original documentation (e.g. T109), and order of birth (e.g. T109A2—the second offspring of the first offspring of T109)15.

A parsing procedure was designed to verify, analyze, and prepare the image data, guaranteeing adequate preparation for subsequent machine (deep) learning methods. Results of the entire data parsing procedure are presented in Fig. 2 and Supplementary Table S1. Figure 2 visualizes the number of identified individuals, together with the total amount of occurrences in descending order, considering (1) all images, and (2) only photos including a single label. General statistics with respect to the entire dataset are reported in the caption of Fig. 2. Supplementary Table S1 illustrates the 10 most commonly occurring individuals across all 8 years of data, considering all images including single and multiple labels, compared to photos only containing a single label.

The dataset exhibits a substantial class imbalance, as evidenced by the exponential decline in frequencies per killer whale individual (see Fig. 2). Especially for real-world datasets, such unbalanced data partitioning is a common and well-known phenomenon, also referred to as a long-tailed data distribution79. Such long-tailed data distributions are divided into two sections79: (1) the Head region—representing the most commonly identified killer whale individuals, and (2) the Long-Tail region—containing a significantly larger number of killer whale individuals, however with considerably fewer occurrences. For the purpose of this pilot study, the top-100 most commonly occurring killer whale individuals were selected for supervised classification and used as the boundary between the head and long-tail area (see Fig. 2). This top-100 boundary (head region) covers approximately 1/4 of the individuals (100 out of 367), while accounting for about 2/3 of the entire dataset of single-labeled images (55,305 out of 86,789).
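As an illustration of this boundary (not the authors' code), selecting the head region from per-individual image counts could be sketched as follows; `label_counts` is a hypothetical mapping from individual ID to its number of single-labeled images:

```python
from collections import Counter

def head_region(label_counts: Counter, top_k: int = 100) -> set[str]:
    """Return the top-k most frequently photographed individuals (head region);
    all remaining individuals fall into the long-tail region."""
    return {individual for individual, _ in label_counts.most_common(top_k)}

# Example with hypothetical counts:
# counts = Counter({"T109A2": 1200, "T065A": 950, ...})
# head = head_region(counts)   # 100 IDs covering roughly 2/3 of the single-labeled images
```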

Figure 2

Bigg’s killer whale image long-tailed data distribution (2011–2018), summing up a total of 121,095 identification images, with 86,789 containing single labels and 34,306 photos including multiple labels, resulting in 367 identified individuals (average number of images per individual ≈ 456, standard deviation ≈ 442). The two colored graphs visualize the number of identification images per whale in descending order w.r.t. all images, including single and multiple labels (purple curve), and those only containing a single label (green curve). Furthermore, an exemplary data point is visualized for both curves, presenting the number of identification images in relation to a selected number of whales, here the top-100, clearly illustrating the exponential decline. Moreover, the number of animals for which the total amount of identification images is < 10 is marked for both curves. In total, 367 individuals were encountered across 2011–2018. Among them, 128 and 125 were found at least once in each year when considering all images and only those with single labels, respectively.


However, the number of usable and correctly labeled images which can actually be utilized for machine learning must be adjusted downward due to several circumstances. Figure 3a–i visualizes multiple examples of situations where images contain valid labels, but the relevant biometric features are very difficult to recognize or not visible at all. These images cannot be labeled without contextual knowledge, for example by observing previous and/or subsequent images and/or knowing additional information about family-related structures. Therefore, such photos cannot be used for classification of individuals and have to be filtered out in advance.

Another scenario that impacts the final number of usable identification images is visualized in Fig. 3j. While conducting photo-identification in the field, several images are sometimes taken in very short intervals (< 1 s). However, this procedure leads to several very similar images. To avoid biasing the actual multi-class identification performance by including such images in validation and testing, only the first image of a photo series was machine-selected whenever consecutive images shared the same date and photographer and were taken within a time interval of Δ ≤ 5 s. Considering the photo series visualized in Fig. 3j, only the first image was utilized as a potential sample for network validation or testing. The training material for individual classification was unaffected by this time interval rule, since augmentation procedures change the images during training anyway.
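The following Python sketch illustrates one way such a time-interval rule could be implemented; the record structure (photographer and timestamp fields) is a hypothetical stand-in, not the authors' actual data format:

```python
from datetime import datetime
from typing import NamedTuple

class Photo(NamedTuple):
    path: str
    photographer: str      # hypothetical metadata fields
    taken_at: datetime

def first_of_each_series(photos: list[Photo], max_gap_s: float = 5.0) -> list[Photo]:
    """Keep only the first image of every rapid photo series.

    A series is a run of images by the same photographer on the same date
    whose successive timestamps differ by at most ``max_gap_s`` seconds.
    """
    photos = sorted(photos, key=lambda p: (p.photographer, p.taken_at))
    kept: list[Photo] = []
    prev = None
    for p in photos:
        same_series = (
            prev is not None
            and prev.photographer == p.photographer
            and prev.taken_at.date() == p.taken_at.date()
            and (p.taken_at - prev.taken_at).total_seconds() <= max_gap_s
        )
        if not same_series:
            kept.append(p)   # first image of a new series
        prev = p
    return kept
```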

Figure 3

Examples of image content which either lead to completely unusable/invalid data samples, or which make a robust and correct detection/classification much more difficult.


Killer whale dorsal fin/saddle patch detection (FIN-DETECT)

Object detection

In order to extract the regions of interest—killer whale dorsal fin(s) and saddle patch(es)—from the images, an automated and robust object detection has to be conducted. Object detection includes classification and localization of the corresponding object within the respective image36. In this context, circumscribing rectangles, so-called bounding boxes, are drawn around the objects to be recognized. The agreement between a ground-truth bounding box and a predicted bounding box is commonly measured with the Intersection over Union (IoU = Area of Overlap / Area of Union) as a quality criterion80.
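As a minimal illustration of this metric (not the code used in the study), the IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates can be computed as follows:

```python
def iou(box_a: tuple[float, float, float, float],
        box_b: tuple[float, float, float, float]) -> float:
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (empty if the boxes do not intersect).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```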

Two additional evaluation attributes are of essential importance36: (1) the objectness score—describing the probability that an object is present inside a given bounding box, and (2) the class confidences—characterizing the probability distribution over all distinct object classes. The objects to be localized inside an image can strongly vary not only in type and shape, but also in size. Hence, object detection algorithms usually predict a variety of potential bounding boxes. As a result, individual objects may be detected several times by circumscribing bounding boxes located at slightly different positions36. To counteract this phenomenon, non-maximum suppression36 (NMS) is executed to keep only the best-fitting box. Since object detection requires both correct classification and localization, the metrics per class are determined as follows81:

(1) true positive (TP): the target object lies within the predicted bounding box area, the objectness score of the box exceeds a chosen threshold, the object classification and assignment are correct, and the IoU between the predicted and the ground-truth box is higher than a given threshold as well as higher than all other IoUs of potentially overlapping boxes (in case of overlapping boxes, only the box with the highest IoU is considered a TP, whereas all remaining boxes are false positives); (2) false positive (FP): the objectness score of the box exceeds a chosen threshold, but either the target object is not within the predicted circumscribing rectangle, the classification hypothesis is wrong, and/or the IoU is smaller compared to another overlapping bounding box; (3) false negative (FN): the target object is in the image, but no predicted bounding box hypothesis detects the corresponding object properly; (4) true negative (TN): TNs are ignored in object detection, since there is evidently an infinite number of empty boxes with an objectness score below the chosen threshold.

Based on these traditional binary classification scores, target metrics such as precision, recall, F1-score, average precision (AP), and mean average precision (mAP) can be calculated36. The average precision describes the area under the curve (AUC) of a precision/recall graph, transformed into a monotonically decreasing curve beforehand and calculated on the basis of different IoU thresholds36. The AP is calculated for each class, while the mAP refers to the average of all class-related AP scores36. Consequently, AP and mAP are identical unless the number of classes is greater than one36.
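A generic sketch of this AP computation (an illustration of the standard procedure, not the evaluation code of the study) could look like the following, assuming per-detection confidence scores and TP/FP flags have already been determined at a fixed IoU threshold:

```python
import numpy as np

def average_precision(scores: np.ndarray, is_tp: np.ndarray, n_ground_truth: int) -> float:
    """AP as the area under a monotonically decreasing precision/recall curve.

    ``scores``         - confidence of every detection of one class
    ``is_tp``          - boolean flag per detection (True = TP, False = FP)
    ``n_ground_truth`` - number of ground-truth boxes of that class
    """
    order = np.argsort(-scores)                  # sort detections by descending confidence
    tp = is_tp[order].astype(float)
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(n_ground_truth, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-12)
    # Transform the precision curve into a monotonically decreasing one.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Integrate precision over recall (area under the curve).
    recall = np.concatenate(([0.0], recall))
    precision = np.concatenate(([precision[0] if precision.size else 0.0], precision))
    return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))

# mAP is then simply the mean of the per-class AP values.
```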

Detection data

The dataset which was utilized for training and evaluation of FIN-DETECT was generated via a two-step semi-automatic procedure. In a first step, 2,286 images, originating from various months in 2015, were manually annotated with bounding boxes resulting in the Human-Annotated Detection Dataset (HADD)—see Table 1. For this purpose, every dorsal fin and associated saddle patch, visible in each image, were individually circumscribed with a rectangle. FIN-DETECT was trained on HADD using the data distribution reported in Table 1.

This preliminary version of FIN-DETECT was then utilized to automatically apply bounding boxes to randomly chosen unseen images from 2011, 2015, and 2018 in order to enlarge the HADD with machine-identified samples. These samples were not manually verified; however, images with no bounding boxes, as well as those with more bounding boxes than labels, were discarded. After applying these rules, a joint dataset, named the Extended-Annotated Detection Dataset (EADD), was created, consisting of the HADD and all valid machine-identified data samples. The resulting EADD (see Table 1) was utilized to retrain FIN-DETECT, which was ultimately applied to all future killer whale detections.
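A hedged sketch of this plausibility rule (keep a machine-annotated image only if it has at least one box and no more boxes than labels) is shown below; the record fields are hypothetical and only illustrate the described filtering:

```python
from dataclasses import dataclass

@dataclass
class AnnotatedImage:
    path: str
    labels: list[str]                                          # individual IDs noted for this photo
    predicted_boxes: list[tuple[float, float, float, float]]   # xywh, 0/1-normalized

def keep_for_eadd(img: AnnotatedImage) -> bool:
    """Accept a machine-annotated image only if it contains at least one
    predicted bounding box and no more boxes than identification labels."""
    n_boxes = len(img.predicted_boxes)
    return 0 < n_boxes <= len(img.labels)
```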

Table 1 Human-Annotated Detection Dataset (HADD), including human-labeled dorsal fin/saddle-patch bounding boxes, as well as Extended-Annotated Detection Dataset (EADD) containing human- and machine-labeled dorsal fin/saddle-patch bounding boxes.

Network architecture, data preprocessing, training, and evaluation

FIN-DETECT, visualized in Supplementary Fig. S2, is based on an extended version of the original YOLOv376,77 object detection architecture. YOLOv374,75,76 (You Only Look Once) is a real-time, single-stage, multi-scale, and fully-convolutional object detection algorithm, first introduced as YOLOv1 by Redmon et al.74; continuous improvements have led to the most recent version known as YOLOv582. At the time FIN-PRINT was developed, YOLOv3 was the most recent version. FIN-DETECT (see Supplementary Fig. S2) essentially consists of two major network parts74,75,76,83: (1) a feature extraction network, usually referred to as feature extractor and/or backbone network, which learns compressed representations (feature maps) of a given input image and forms the foundation for subsequent detection, and (2) a feature pyramid network, also named head-subnet and/or detector, responsible for detecting objects at three different scales. FIN-DETECT receives as network input a preprocessed, re-scaled, and square 416 × 416 px RGB image (zero-padded in case of a non-square original image), resulting in an input shape of 3 × 416 × 416. The network detects objects on 13 × 13, 26 × 26, and 52 × 52 grids to recognize large, medium, and small patterns76,83 (see Supplementary Fig. S2). Per grid cell, FIN-DETECT predicts a 1 × 21 detection vector, which comprises b = 3 different bounding boxes and c = 2 classes (dorsal fin/saddle patch versus no dorsal fin/saddle patch), combined with four 0/1-normalized bounding box coordinates (xywh) and one objectness score per box, resulting in b × (5 + c) = 21 elements per cell. Consequently, the scale-dependent detection outputs of FIN-DETECT have final output shapes of 13 × 13 × 21, 26 × 26 × 21, and 52 × 52 × 21 (see Supplementary Fig. S2). More detailed information about YOLO in general, YOLOv3, and other YOLO versions can be found in74,75,76,82,84,85.
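To make the output geometry concrete, the following minimal sketch (an illustration, not the network code itself) reproduces the per-cell vector length and the three scale-dependent output shapes from the values stated above:

```python
def yolo_output_shapes(num_boxes: int = 3, num_classes: int = 2,
                       grids: tuple[int, ...] = (13, 26, 52)) -> list[tuple[int, int, int]]:
    """Output tensor shapes for a YOLOv3-style detection head.

    Each grid cell predicts ``num_boxes`` boxes, each with 4 coordinates (xywh),
    1 objectness score, and ``num_classes`` class confidences.
    """
    per_cell = num_boxes * (5 + num_classes)   # 3 * (5 + 2) = 21
    return [(g, g, per_cell) for g in grids]

print(yolo_output_shapes())   # [(13, 13, 21), (26, 26, 21), (52, 52, 21)]
```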

The backbone network (Darknet-5376) of FIN-DETECT was initialized with weights pre-trained on ImageNet86. A detailed overview of all other network hyperparameters is given in Supplementary Table S2. Moreover, FIN-DETECT implements the following YOLOv376 detection parameters: an objectness score threshold of 0.5 (training, validation) and 0.8 (testing), an IoU threshold of 0.5, and an NMS threshold of 0.5. FIN-DETECT reports precision, recall, F1-score, and mean average precision as evaluation metrics. For a given input image, FIN-DETECT returns a text file containing the 0/1-normalized bounding box information (xywh) of every detection hypothesis.
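The post-processing implied by these parameters (objectness thresholding followed by greedy NMS) can be sketched generically as follows; this is an illustrative re-implementation of the standard procedure under the stated thresholds, not FIN-DETECT's own code:

```python
def nms_filter(detections, objectness_thresh: float = 0.8, nms_iou_thresh: float = 0.5):
    """Greedy non-maximum suppression over (box, objectness) detections.

    ``detections`` is a list of ((x1, y1, x2, y2), objectness) tuples.
    """
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    # 1) Discard boxes whose objectness score falls below the threshold.
    candidates = [d for d in detections if d[1] >= objectness_thresh]
    # 2) Keep the highest-scoring box, drop all strongly overlapping ones, repeat.
    candidates.sort(key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in candidates:
        if all(iou(box, kept_box) < nms_iou_thresh for kept_box, _ in kept):
            kept.append((box, score))
    return kept
```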

Killer whale dorsal fin/saddle patch extraction (FIN-EXTRACT)

FIN-EXTRACT facilitates automatic extraction and subsequent rescaling of previously detected image sub-regions, using the bounding box information derived by FIN-DETECT. For each identified bounding box, a square 512 × 512 px RGB sub-image was cropped from the original photo. In a first step, the 0/1-normalized bounding box information (xywh) was multiplied by the original image shape to obtain the correct coordinates within the original image. In case a bounding box was not square, the larger of the two dimensions was used to expand the original detection rectangle into a square. Furthermore, it was verified whether a bounding box extended beyond the edge of the image and, if necessary, the box was shifted accordingly. If the original image was smaller than 512 × 512 px, it was interpolated and resized respectively. Otherwise, a sub-image based on the original bounding box size was cropped and, if applicable, compressed and resized to 512 × 512 px. Depending on the resized bounding box, this may include slightly more background content; however, any kind of zero-padding is avoided for subsequent individual classification. In addition, the image quality of the final extracted sub-image(s) depends on the original image resolution, along with the distance of the individual(s) within the captured photos.
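A simplified sketch of this extraction step is shown below, assuming Pillow for image handling and 0/1-normalized center-based xywh boxes; it approximates the described squaring, edge-shifting, and resizing logic rather than reproducing the original implementation:

```python
from PIL import Image

def extract_square_crop(image: Image.Image,
                        box_xywh: tuple[float, float, float, float],
                        out_size: int = 512) -> Image.Image:
    """Crop a square sub-image around a 0/1-normalized (x, y, w, h) bounding box."""
    img_w, img_h = image.size
    cx, cy, w, h = box_xywh
    # Scale normalized coordinates back to pixels and square the box
    # using the larger of the two dimensions.
    side = int(round(max(w * img_w, h * img_h)))
    side = min(side, img_w, img_h)               # the square cannot exceed the image
    left = int(round(cx * img_w - side / 2))
    top = int(round(cy * img_h - side / 2))
    # Shift the box back inside the image if it extends beyond an edge.
    left = max(0, min(left, img_w - side))
    top = max(0, min(top, img_h - side))
    crop = image.crop((left, top, left + side, top + side))
    # Interpolate smaller crops up / compress larger crops down to out_size.
    return crop.resize((out_size, out_size))
```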

Valid versus invalid (VVI) dorsal fin/saddle patch detection (VVI-DETECT)

VVI detection

Considering potential detection errors (e.g. tail and/or pectoral fins, the triangular-shaped head of the animal, etc.), besides all the challenging situations visualized in Fig. 3a–i, an additional data enhancement step is indispensable (see also examples in Supplementary Fig. S3). All these scenarios result either in completely unusable/invalid images (e.g. missing dorsal fin, no saddle patch, bad angle, distance, detection errors) or in images of insufficient quality (e.g. poor weather conditions, bad exposure, blurred image). Without sufficient domain knowledge and additional meta-information (e.g. images taken shortly before, other animals in the image, family-related structures, etc.), all the aforementioned situations lead to invalid identification images which cannot be classified correctly by human or machine. Detected/extracted RGB sub-images containing a single dorsal fin and saddle patch are considered valid identification images. To filter the majority of invalid samples originating from previous processing levels, a binary classification network was designed to distinguish between two classes—Valid Versus Invalid (VVI)—killer whale identification images prior to final multi-class individual recognition. Supplementary Fig. S3 visualizes some of the challenging pre-detected/-extracted sub-images belonging to the invalid class.

Detection data

In order to train VVI-DETECT, a two-class dataset, named the Valid/Invalid Killer Whale Identification Dataset 2011–2017 (VIKWID11-17), was utilized. Table 2 describes VIKWID11-17 together with the respective data distribution. VIKWID11-17 is a manually labeled data archive based on randomly chosen, previously detected (FIN-DETECT) and extracted (FIN-EXTRACT) sub-images from 2011 to 2017. In addition to multiple valid pre-detected/-extracted identification images of different individuals, the dataset also includes examples of invalid sub-images covering the scenarios illustrated in Fig. 3a–i. Furthermore, the invalid class was extended by examples of images with potential detection errors (noise), such as water, boats, coastline, houses, and/or other landscape backgrounds, to also filter such cases in advance. During data selection, the 5 s interval rule was applied to the validation and test set (see Fig. 3j) so as not to distort classification accuracy in any way.

Table 2 Valid/Invalid Killer Whale Identification Dataset 2011–2017 (VIKWID11-17), a human-annotated dataset consisting of valid and invalid identification images (dorsal fin + saddle-patch), utilized to train, validate, and test VVI-DETECT, after applying the interval rule of 5 s with respect to the validation and test set.

Network architecture, data preprocessing, training, and evaluation

VVI-DETECT, visualized in Supplementary Fig. S3, is a ResNet3478-based convolutional neural network (CNN) designed for binary classification between valid versus invalid (VVI) identification images. Residual networks78 (ResNets) consist of a sequence of residual layers, built up from building blocks comprising concatenations of weight (e.g. convolutional/fully-connected), normalization (e.g. batch-norm87), and activation layers (e.g. ReLU88), together with residual/skip connections78. These connections allow the network to optimize a residual mapping F(x) = H(x) - x with respect to a given input x, rather than directly learning an underlying mapping H(x)78. This type of learning, called residual learning, makes it possible to train deeper models78. The use of different building block types, together with the number of blocks, results in various ResNet architectures, such as ResNet18, ResNet34, ResNet50, ResNet101, and ResNet15278. For more detailed information about the concept of residual learning/networks, see He et al.78. Compared to the original ResNet34 architecture, the size of the initial 7 × 7 convolution kernel was changed to 9 × 9 in order to cover a larger receptive field at the initial stage. As network input, VVI-DETECT receives previously detected (FIN-DETECT) and extracted/reshaped (FIN-EXTRACT) 3 × 512 × 512 RGB images of both classes. The network output is a 1 × 2 probability vector containing class-wise model prediction probabilities (see Supplementary Fig. S3). Based on preliminary investigations, ResNet3478 proved to be the most efficient version for this entire study in terms of performance and computational efficiency compared to other ResNet architectures. VVI-DETECT integrates an augmentation procedure consisting of eight different functions: (1) addition of random Gaussian noise to the image, (2) image rotation by a maximum angle of ± 25 degrees, (3) blurring the image by applying a Gaussian blur, (4) mirroring the picture with respect to the y-axis, (5) edge enhancement within the image, (6) sharpening the input picture, (7) brightening/darkening of the image, and (8) random color change by swapping the RGB channels. Out of this function pool, the number, type, and order of augmentation operations were randomly determined for each image during the training phase (no augmentation during validation and testing). The random number of augmentations per image was drawn from an interval [1, a_max], with a_max ∈ [1, 8] being constant across the entire training. In this study, the maximum augmentation number per image was set to a_max = 5. VVI-DETECT reports accuracy, precision, recall, F1-score, and false-positive rate. A detailed description of all relevant network hyperparameters is provided in Supplementary Table S2.
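A hedged sketch of such a random augmentation policy is given below, using Pillow operations as stand-ins for the eight functions listed above; all concrete parameters (noise strength, blur radius, brightness range) are assumptions for illustration, not the values used in the study:

```python
import random
import numpy as np
from PIL import Image, ImageFilter, ImageEnhance

def add_gaussian_noise(img: Image.Image) -> Image.Image:
    arr = np.asarray(img).astype(np.float32)
    arr += np.random.normal(0.0, 10.0, arr.shape)      # noise strength is an assumption
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

def swap_channels(img: Image.Image) -> Image.Image:
    channels = list(img.split())                        # expects an RGB sub-image
    random.shuffle(channels)
    return Image.merge("RGB", channels)

# The eight augmentation functions named in the text (parameters are illustrative).
AUGMENTATIONS = [
    add_gaussian_noise,                                                       # (1) random noise
    lambda im: im.rotate(random.uniform(-25, 25)),                            # (2) rotation, max ±25°
    lambda im: im.filter(ImageFilter.GaussianBlur(radius=2)),                 # (3) Gaussian blur
    lambda im: im.transpose(Image.FLIP_LEFT_RIGHT),                           # (4) mirror (y-axis)
    lambda im: im.filter(ImageFilter.EDGE_ENHANCE),                           # (5) edge enhancement
    lambda im: im.filter(ImageFilter.SHARPEN),                                # (6) sharpening
    lambda im: ImageEnhance.Brightness(im).enhance(random.uniform(0.6, 1.4)), # (7) brighten/darken
    swap_channels,                                                            # (8) RGB channel swap
]

def augment(img: Image.Image, a_max: int = 5) -> Image.Image:
    """Apply a random number (1..a_max), type, and order of augmentations (training only)."""
    n = random.randint(1, a_max)
    for fn in random.sample(AUGMENTATIONS, n):
        img = fn(img)
    return img
```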

Individual killer whale classification network (FIN-IDENTIFY)

Individual killer whale classification

Robust multi-class killer whale individual classification requires representative and high-quality animal-specific image data in sufficient quantity. However, significant variations can be observed in the total number of animal-specific images (see Fig. 2). In addition, multiple essential data constraints have been introduced which strongly affect the actual amount of usable identification images per individual: (1) only single-labeled images together with exactly one predicted bounding box hypothesis, (2) data enhancement by pre-filtering invalid identification images to avoid the situations visualized in Fig. 3a–i, and (3) the time interval rule of 5 s during network validation and testing to counteract the effect of classifying very similar photos, visualized in Fig. 3j. Moreover, all photos from 2018 were completely withheld and reserved for additional network evaluation purposes. Additionally, all images including more than a single label (in total 34,306 pictures, 2011–2018, see Fig. 2) could not be used for training an initial multi-class identification network due to the label assignment problem. The label assignment problem describes the situation where an image contains multiple individuals and labels, but it is unknown which label belongs to which individual. All these data restrictions and constraints led to a significant qualitative improvement of the material, but also considerably reduced the amount of usable data. In summary, these data limitations led to a final selection of the 100 (out of 367) most commonly single-labeled Bigg’s individuals (see Fig. 2), present across all years (2011–2018) and representing about 64% (55,305 photos) of the entire single-annotated original data from 2011 to 2018 (86,789 images). Among the top-100 killer whales, the smallest individual-specific number of remaining data samples comprised 135 images (see Table 3), which, combined with various image augmentation techniques during model training, still provides sufficient variation and data diversity. Despite previous filtering by VVI-DETECT, and to catch potential errors caused by previous processing levels, the proposed invalid class was also included at this stage, resulting in a final 101-class (100 individuals, 1 rejection class) setup.

Identification data

FIN-IDENTIFY was trained on two different datasets, both illustrated in Table 3. The first dataset, named the Killer Whale Individual Dataset 2011–2017 (KWID11-17), consisted of 39,464 excerpts including only a single label, distributed across 101 classes, and recorded between 2011 and 2017 (see Table 3). All excerpts were machine-annotated by applying FIN-DETECT, FIN-EXTRACT, and VVI-DETECT in sequential order, following the previously mentioned data constraints and restrictions. VVI-DETECT considered an image to be invalid if the network confidence was p_invalid > 0.85. The VIKWID11-17 dataset (see Table 2), on which VVI-DETECT was trained, is completely independent of the data listed in Table 3. KWID11-17 consists of 36,457 images assigned to the valid class, whereas 3,007 photos were added to the invalid class, representing a small portion of the overall amount of detected invalid images across 2011 to 2017 in order to not bias class distributions. Table 3 presents the final data distribution of KWID11-17 as well as dataset-specific statistics.
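The rejection criterion mentioned above amounts to a simple softmax threshold; the sketch below uses a hypothetical PyTorch-style binary classifier as a stand-in for VVI-DETECT rather than the actual model code:

```python
import torch

def is_invalid(model: torch.nn.Module, image: torch.Tensor, threshold: float = 0.85) -> bool:
    """Flag a 3 x 512 x 512 sub-image as invalid if p_invalid exceeds the threshold.

    Assumes ``model`` returns raw logits of shape (1, 2) with index 1 = invalid class.
    """
    model.eval()
    with torch.no_grad():
        logits = model(image.unsqueeze(0))      # add batch dimension
        probs = torch.softmax(logits, dim=1)
    return probs[0, 1].item() > threshold
```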

To add additional data and simultaneously counteract the label assignment problem, the first version of FIN-IDENTIFY, trained on KWID11-17, was applied to all images from 2011 until 2017 including multiple labels and containing one or more of the 100 trained individuals. FIN-IDENTIFY classified every potential detected (FIN-DETECT) and extracted (FIN-EXTRACT) label candidate for each image containing more than one animal. If the best classification hypothesis (class with the highest probability) per sub-image matched one of the original labels applied to that image, it was considered correctly classified and added to the respective class. The resulting extended dataset, entitled the Killer Whale Individual Dataset Extended 2011–2017 (KWIDE11-17), together with the corresponding data distribution, was utilized to train an updated and more robust version of FIN-IDENTIFY (see Table 3). KWIDE11-17 consists of KWID11-17 extended by the additional machine-identified multi-label material, leading to a total number of 65,713 excerpts distributed across 101 classes. The total number of valid identification images is 62,740, whereas the invalid class comprises 2,973 images. KWID11-17 and KWIDE11-17 use the same portion of machine-annotated invalid data excerpts; however, the overall number of samples slightly differs (KWID11-17—3,007 versus KWIDE11-17—2,973) due to a different split in combination with the applied interval rule of 5 s during validation and testing.
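A hedged sketch of this label-assignment step is given below; `classify` is a hypothetical helper returning per-class probabilities for one extracted sub-image, and the data structures are assumptions for illustration only:

```python
from typing import Callable, Sequence
import numpy as np

def resolve_multi_label_image(
    sub_images: Sequence[np.ndarray],                 # extracted crops from one photo
    image_labels: set[str],                           # all individual IDs noted for that photo
    class_names: Sequence[str],                       # the 100 trained individuals (+ invalid class)
    classify: Callable[[np.ndarray], np.ndarray],     # hypothetical: returns class probabilities
) -> list[tuple[np.ndarray, str]]:
    """Assign a label to a sub-image only if its top-1 prediction matches
    one of the labels originally applied to the whole photo."""
    assigned = []
    for crop in sub_images:
        probs = classify(crop)
        best = class_names[int(np.argmax(probs))]     # best classification hypothesis
        if best in image_labels:                      # agreement with the photo-level labels
            assigned.append((crop, best))
    return assigned
```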

Table 3 Killer Whale Individual Dataset 2011–2017 (KWID11-17), including machine-annotated data of valid images (dorsal fin + saddle-patch) for the 100 most commonly photographed individuals satisfying the data constraints (one label per image + exactly one bounding box prediction), in combination with machine-annotated invalid data utilizing VVI-DETECT after applying the interval rule of 5 s.

Network architecture, data preprocessing, training, and evaluation

FIN-IDENTIFY, visualized in Supplementary Fig. S4, is a ResNet3478-based convolutional neural network (CNN) created for multi-class individual classification. The network architecture is identical to VVI-DETECT (see Supplementary Fig. S3) except for the final 101-class output layer (1 × 101 probability vector). FIN-IDENTIFY was trained on the 3 × 512 × 512 sub-images generated by FIN-EXTRACT and, if necessary, filtered by VVI-DETECT (see Fig. 1 and Supplementary Fig. S4). Besides the same network architecture, identical interval rule conditions (5 s) were applied during training. Data augmentation and preprocessing were also identical to VVI-DETECT, and all other required network hyperparameters are listed in Supplementary Table S2. In addition to the overall accuracy, FIN-IDENTIFY reports a top-3 weighted accuracy (TWA) and a top-3 unweighted accuracy (TUA). For the TWA, if the target class is among the top-3 predictions, a rank-dependent weight is assigned (ω1 = 1, ω2 = 0.5, and ω3 = 0.25). For the TUA, a prediction is counted as correct whenever the target individual is among the top-3, independent of its rank. For both metrics, the sum of all weighted or correct predictions, respectively, is divided by the total number of classifications.
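The two top-3 metrics can be expressed compactly as follows; this is a generic sketch of the definitions above, with the weights ω1 = 1, ω2 = 0.5, ω3 = 0.25 taken from the text:

```python
import numpy as np

def top3_accuracies(probs: np.ndarray, targets: np.ndarray,
                    weights: tuple[float, float, float] = (1.0, 0.5, 0.25)) -> tuple[float, float]:
    """Top-3 weighted (TWA) and unweighted (TUA) accuracy.

    ``probs``   - (n_samples, n_classes) predicted class probabilities
    ``targets`` - (n_samples,) integer ground-truth class indices
    """
    top3 = np.argsort(-probs, axis=1)[:, :3]             # indices of the 3 best classes per sample
    twa_sum, tua_sum = 0.0, 0.0
    for ranked, target in zip(top3, targets):
        if target in ranked:
            rank = int(np.where(ranked == target)[0][0])  # 0 = best hypothesis
            twa_sum += weights[rank]                      # rank-dependent weight
            tua_sum += 1.0                                # correct regardless of rank
    n = len(targets)
    return twa_sum / n, tua_sum / n
```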

