
Developments in data science solutions for carnivore tooth pit classification

Sample

A total of 620 carnivore tooth pits were included in the present study. These samples included tooth marks produced by:

  • Brown Bears (Ursus arctos, Ursidae, 69 pits)

  • Spotted Hyenas (Crocuta crocuta, Hyaenidae, 86 pits)

  • Wolves (Canis lupus, Canidae, 80 pits)

  • African Wild Dogs (Lycaon pictus, Canidae, 89 pits)

  • Foxes (Vulpes vulpes, Canidae, 53 pits)

  • Jaguars (Panthera onca, Felidae, 77 pits)

  • Leopards (Panthera pardus, Felidae, 84 pits)

  • Lions (Panthera leo, Felidae, 82 pits)

Samples originated from a number of different sources, including both animals kept in parks and wild animals. The fox sample was obtained entirely from wild animals, while the wolf sample was the only one containing both wild and captive individuals. Preliminary data from these tooth pits revealed captive animals to have tooth pit morphologies statistically equivalent to those of wild animals (|d| = 0.125, p = 9.0e−14, BFB = 1.4e+11), while tooth scores revealed otherwise (|d| = 0.152, p = 0.99, BFB = 3.7e+01 against H_a). Under this premise, and so as to avoid the influence of confounding variables that go beyond the scope of the present study, tooth scores were excluded from the present samples and are under current investigation (data in preparation). Nevertheless, other research has shown tooth pits to be more informative than tooth scores when considering morphology20,23.

When working with tooth mark morphologies, preference is usually given to marks found on long bone diaphyses. This is because diaphyses are denser than epiphyses, and are thus more likely to survive carnivore feeding. Nevertheless, when working with captive or semi-captive animals, controlling the bones that carnivores are fed is not always possible, due to the rules and regulations established by the institution where these animals are kept64. While this was not an issue for the majority of the animals used within the present study, in the case of P. pardus, animals were only fed ribs in articulation with other axial elements. In light of this, a careful evaluation of the effects this may have on the analogy of our samples was performed (Supplementary Appendix 2). These reflections concluded that, in order to maintain a plausible analogy with tooth marks produced by other animals on diaphyses, tooth marks could only be used if found on the shaft of bovine ribs closest to the tubercle, coinciding with the posterior and posterior-lateral portions of the rib, and farthest away from the costochondral junction65. This area of the rib corresponds to label RI3 described by Lam et al.65. Moreover, with a reported average cortical thickness of 2.3 mm (± 0.13 mm) and a Bone Mineral Density of 4490 kg/m³ [213.5, 334.6]66, bovine ribs are frequently employed in bone simulation experiments in both agricultural and general surgical sciences. Finally, considering the grease, muscle and fat content of typical domestic bovine individuals67, alongside the general size of P. pardus teeth, it was concluded that the use of rib elements for this sample was the closest possible analogy to the tooth marks collected from other animals.

Carnivores were fed animals of a number of different sizes, dependent in most cases on the regulations established by the institution where these animals are kept64. Nevertheless, recent research has found statistical similarities between tooth marks found on different animals25, with the greatest differences occurring between large and small sized animals. Considering the typical size of the prey these carnivores consume, this factor was not considered of notable importance for the present study25 (Supplementary Appendix 1).

For the purpose of comparisons, animals were split into 5 groups according to ecosystem as well as taxonomic family. From an ecological perspective, two datasets were defined: (1) the Pleistocene European Taxa dataset, containing U. arctos, V. vulpes, C. crocuta, P. pardus, P. leo and C. lupus; and (2) the African Taxa dataset, containing C. crocuta, P. pardus, L. pictus and P. leo. When considering taxonomic groupings, animals were separated into 3 groups: (1) the Canidae dataset, including V. vulpes, L. pictus and C. lupus; (2) the Felidae dataset, including P. pardus, P. onca and P. leo; and (3) a general Taxonomic Family dataset, including all Canidae in one group and all Felidae in another, followed by Hyaenidae and Ursidae. Some complementary details on each of these carnivores have been included in Supplementary Appendix 1.

All experiments involving carnivores were performed in accordance with the relevant ethical guidelines as set forth by park keepers and general park regulations. No animals were sacrificed specifically for the purpose of these experiments. Likewise, carnivores were not manipulated or handled at any point during the collection of samples. Collection of chewed bones was performed directly by park staff, assisted by one of the authors (JY). The present study followed the guidelines set forth by ARRIVE (https://arriveguidelines.org/) wherever necessary. No licenses or permits were required in order to perform these experiments. Finally, in the case of animals in parks, bone samples were provided by the park according to normal feeding protocols. More details can be consulted in the Extended Samples section of the supplementary files.

3D modelling and landmark digitisation

Digital reconstructions of tooth marks were performed using Structured Light Surface Scanning (SLSS)68. The equipment used in the present study was the DAVID SLS-2 Structured Light Surface Scanner located in the C.A.I. Archaeometry and Archaeological Analysis lab of the Complutense University of Madrid (Spain). This equipment consists of a DAVID USB CMOS Monochrome 2-Megapixel camera and an ACER K11 LED projector. Both the camera and the projector were connected to a portable ASUS X550VX personal laptop (8 GB RAM, Intel® Core™ i5 6300HQ CPU (2.3 GHz), NVIDIA GTX 950 GPU) via USB and HDMI respectively. The DAVID Laser Scanner Professional Edition software is stored on a USB flash drive. The equipment was calibrated using a 15 mm markerboard, with additional macro lenses attached to both the projector and the camera in order to obtain optimal resolution at this scale. Once calibrated, the DAVID SLS-2 produces a point cloud density of up to 1.2 million points, which can be exported for further processing via external software.

The landmark configuration used for this study consists of a total of 30 landmarks (LMs)21: 5 fixed Type II landmarks18 and a 5 × 5 patch of semilandmarks69 (Fig. S2). Of the 5 fixed landmarks, LM1 and LM2 mark the maximal length (l) of each pit. For the correct orientation of the pit, LM1 can be considered to be the point along the maximum length furthest away from the perpendicular axis marking the maximum width (w). LM2 would therefore be the point closest to said perpendicular axis (see variables d1 and d2 in Fig. S2 for clarification). LM3 and LM4 mark the extremities of the perpendicular axis (w), with LM3 being the left-most extremity and LM4 being the right-most extremity. LM5 is the deepest point of the pit. The semilandmark patch is then positioned over the entirety of the pit, so as to capture the internal morphology of the mark.

Landmark collection was performed using the free Landmark Editor software (v.3.0.0.6.) by a single experienced analyst. Inter-analyst experiments prior to landmark collection revealed the landmark model to have a robustly defined human-induced margin of error of 0.14 ± 0.09 mm (Median ± Square Root of the Biweight Midvariance). Detailed explanations as well as an instructional video on how to place both landmarks and semilandmarks can be consulted in the Supplementary Appendix and main text of Courtenay et al.21.

Geometric morphometrics

Once collected, landmarks were formatted as morphologika files and imported into the R free software environment (v.3.5.3, https://www.r-project.org/). Initial processing of these files consisted of the orthogonal tangent projection into a new normalized feature space. This process, frequently referred to as Generalized Procrustes Analysis (GPA), is a valuable tool that allows for the direct comparison of landmark configurations18,19,70. GPA utilises different superimposition procedures (translation, rotation and scaling) to quantify minute displacements of individual landmarks in space71. This in turn facilitates the comparison of landmark configurations, as well as hypothesis testing, using multivariate statistical analyses. Nevertheless, as observations made by Courtenay et al.20,21,25 revealed tooth mark size to be an important conditioning factor in tooth mark morphology, prior analyses of allometry were also performed72. From this perspective, allometric analyses first considered the calculation of centroid sizes across all individuals: the square root of the sum of squared distances of all landmarks of an object from their centroid18. These calculations were then followed by multiple regressions to assess the significance of shape-size relationships. For regression, the logarithm of centroid sizes was used. In cases where shape-size relationships proved significant, final superimposition procedures were performed excluding the scaling step of GPA (form).
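To make the centroid-size and allometry steps concrete, the following is a minimal NumPy sketch (the study's actual implementation was in R, Sup. Appendix 8; the random stand-in data and variable names here are assumptions for illustration only):

```python
import numpy as np

def centroid_size(landmarks):
    """Centroid size: square root of the sum of squared distances
    of all landmarks from their centroid (as defined in the text)."""
    centroid = landmarks.mean(axis=0)
    return np.sqrt(((landmarks - centroid) ** 2).sum())

# Hypothetical stand-in data: 100 configurations of 30 landmarks in 3D
rng = np.random.default_rng(0)
configs = rng.normal(size=(100, 30, 3))
log_cs = np.log([centroid_size(c) for c in configs])

# Allometric regression: regress shape variables on log centroid size.
# `shape` stands in for the Procrustes-aligned coordinates, flattened
# to one row per specimen.
shape = configs.reshape(len(configs), -1)
X = np.column_stack([np.ones_like(log_cs), log_cs])
coefs, *_ = np.linalg.lstsq(X, shape, rcond=None)  # intercepts + slopes
```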

In addition to these analyses, preliminary tests were performed to check for the strength of phylogenetic signals73. This was used as a means of testing whether groups of carnivores produced similar tooth pits to other members of the same taxonomic family. For details on the phylogenies used during these tests, consult Fig. S1 and Supplementary Appendix 1.

For the visualisation of morphological trends and variations, Thin Plate Splines (TPS) and central morphological tendencies were calculated19,71. For ease of pattern visualisation across so many landmarks, final calculations converted each of these mean landmark configurations into visual meshes using Delaunay 2.5D triangulation algorithms74 in Python (v.3.7.4, https://www.python.org/).
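A minimal sketch of this meshing step using SciPy's Delaunay triangulation (the example configuration is a random stand-in; "2.5D" here means triangulating in the x-y plane while retaining z as depth):

```python
import numpy as np
from scipy.spatial import Delaunay

# Hypothetical mean landmark configuration: 30 landmarks in 3D (x, y, z)
rng = np.random.default_rng(1)
mean_config = rng.normal(size=(30, 3))

# 2.5D triangulation: triangles are built from the x-y projection,
# then lifted back onto the surface using the z (depth) coordinate.
tri = Delaunay(mean_config[:, :2])
mesh_faces = mean_config[tri.simplices]  # shape (n_triangles, 3, 3)
```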

Once normalised, landmark coordinates were processed using dimensionality reduction via Principal Components Analysis (PCA). In order to identify the optimal number of Principal Component Scores (PC Scores) that best represented morphological variance, permutation tests were performed, comparing the observed variance explained by each PC with the permuted variance over 50 randomized iterations75. Multivariate Analysis of Variance (MANOVA) tests were then performed on these select PCs to assess the significance of multivariate morphological variance among samples.
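One common form of this permutation test, sketched below with scikit-learn (our reading of ref. 75; the retention rule and stand-in data are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 60))  # stand-in for Procrustes coordinates

observed = PCA().fit(X).explained_variance_ratio_

# Permute each variable independently to destroy covariance structure,
# then re-fit the PCA; repeat over 50 randomized iterations.
permuted = np.empty((50, observed.size))
for i in range(50):
    X_perm = np.column_stack([rng.permutation(col) for col in X.T])
    permuted[i] = PCA().fit(X_perm).explained_variance_ratio_

# Retain PCs whose observed explained variance exceeds the permuted mean
retained = np.where(observed > permuted.mean(axis=0))[0] + 1
print("PCs to retain:", retained)
```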

Geometric Morphometric applications were programmed in the R programming language (Sup. Appendix 8).

Robust statistics

While GPA is known to normalize data76, this does not always hold true. Under this premise, caution must be taken when performing statistical analyses on these datasets. Taking this into consideration, prior to all hypothesis testing, normality tests were also performed. These included Shapiro tests and the inspection of Quantile-Quantile graphs. In cases where normality was detected, univariate hypothesis tests were performed using traditional parametric Analysis of Variance (ANOVA). For multivariate tests, such as MANOVA, calculations were derived using the Hotelling-Lawley test statistic. When normality was rejected, robust alternatives to each of these tests were chosen. In the case of univariate testing, the Kruskal-Wallis non-parametric rank test was preferred, while for MANOVA calculations, Wilks' Lambda was used.
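A minimal sketch of this univariate decision rule (the authors worked in R; SciPy is used here for illustration, and the α used for the Shapiro test is an assumption):

```python
from scipy import stats

def univariate_test(*groups, alpha=0.005):
    """Shapiro-Wilk normality check per group; parametric ANOVA if
    all groups look normal, Kruskal-Wallis rank test otherwise."""
    normal = all(stats.shapiro(g).pvalue > alpha for g in groups)
    if normal:
        return "ANOVA", stats.f_oneway(*groups)
    return "Kruskal-Wallis", stats.kruskal(*groups)

# Usage with three hypothetical samples
a, b, c = (stats.norm.rvs(size=30, random_state=s) for s in (1, 2, 3))
print(univariate_test(a, b, c))
```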

Finally, in light of some of the recommendations presented by the American Statistical Association (ASA), as debated in Volume 73, Issue Sup1 of The American Statistician77,78, the present study considers p-values of >2σ from the mean to indicate only suggestive support for the alternative hypothesis (H_a). p < 0.005 or, where possible, 3σ was therefore used as a threshold to conclude that H_a is "significant". In addition, Bayes Factor Bound (BFB) values (Eq. 1) have also been included alongside all corresponding p-values79. Unless stated otherwise, BFBs are reported as the odds in favor of the alternative hypothesis (BFB:1). More details on BFB, Bayes Factors and the 3σ threshold have been included in Supplementary Appendix 3. General BFB calibrations in accordance with Benjamin and Berger's Recommendation 0.379, as well as False Positive Risk values according to Colquhoun's proposals80, have also been included in Table S20 of Supplementary Appendix 3.

$$\begin{aligned} BFB = \frac{1}{-e \; p \; \log (p)} \end{aligned}$$

(1)
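As a worked illustration of Eq. (1) (a minimal sketch; the function name is ours):

```python
import numpy as np

def bayes_factor_bound(p):
    """Eq. (1): BFB = 1 / (-e * p * ln(p)); meaningful for p < 1/e."""
    return 1.0 / (-np.e * p * np.log(p))

print(bayes_factor_bound(0.005))  # ≈ 13.9, i.e. at most ~14:1 odds for H_a
```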

All statistical applications were programmed in the R programming language (Sup. Appendix 8).

Computational learning

Computational Learning employed in this study consisted of two main types of algorithm: unsupervised and supervised. The concept of “learning” in AI refers primarily to the creation of algorithms that are able to extract patterns from raw data (i.e. “learn”), based on their “experience” through the construction of mathematical functions38,81. The basis of all AI learning activities is the combination of multiple components, including linear algebra, calculus, probability theory and statistics. From this, algorithms can create complex mathematical functions using many simpler concepts as building blocks38. Here we use the term “Computational Learning” to refer to a very large group of sub-disciplines and sub-sub-disciplines within AI. Deep Learning and Machine Learning are terms frequently used (and often debated); however, many more branches and types of learning exist. Under this premise, and so as to avoid complication, the present study has chosen to summarise these algorithms using the term “Computational”.

Similar to the concepts of Deep and Machine Learning, many different types of supervision exist. The terms supervised and unsupervised refer to the way raw data is fed into the algorithm. In most literature, data will be referred to via the algebraic symbol x, whether this be a vector, scalar or matrix. The objective of an algorithm is to find patterns among a group of x. In an unsupervised context, x is directly fed into the algorithm without further explanation, and the algorithm is forced to search for patterns that best explain the data. In supervised contexts, x is associated with a label or target, usually denominated y. Here the algorithm will try to find the best means of mapping x to y. From a statistical perspective, this can be explained as p(y | x). In sum, unsupervised algorithms are typically used for clustering tasks, dimensionality reduction or anomaly detection, while supervised learning is typically associated with classification tasks or regression.

The workflow used in the present study begins with dimensionality reduction, as explained earlier, using PCA. While preliminary experiments were performed using non-linear dimensionality reduction algorithms, such as t-distributed Stochastic Neighbor Embedding (t-SNE)82 and Uniform Manifold Approximation and Projection (UMAP)83, PCA was found to be the most consistent across all datasets, a point which should be developed in detail in further research. Once dimensionality reduction had been performed, and prior to any advanced computational modelling, datasets were cleaned using unsupervised Isolation Forests (IFs)84, as sketched below. Once anomalies had been removed, data augmentation was performed using two different unsupervised approaches: Generative Adversarial Networks (GANs)38,39,40,41 and Markov Chain Monte Carlo (MCMC) sampling44. Data augmentation was performed for two primary reasons: (1) to simulate larger datasets, ensuring supervised algorithms have enough information to train from, and (2) to balance datasets so each sample has the same size. Both MCMCs and GANs were trialed and tested using robust statistics to evaluate the quality of augmented data41. Once the best model had been determined, each of the datasets was augmented to a total sample size of n = 100. In the case of the Taxonomic Family dataset, augmentation was performed until all samples had the same size as the largest sample.
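A minimal sketch of the Isolation Forest cleaning step, using scikit-learn (the stand-in data is an assumption; the GAN and MCMC augmentation stages are omitted here for brevity):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 10))  # stand-in for one sample's PC scores

# Unsupervised anomaly detection: fit_predict returns +1 for inliers
# and -1 for anomalies; only inliers are kept prior to augmentation.
mask = IsolationForest(random_state=0).fit_predict(X) == 1
X_clean = X[mask]
print(f"kept {mask.sum()} of {len(X)} individuals")
```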

Once augmented, samples were used for the training of supervised classification models. Two classification models were tried and tested: Support Vector Machines (SVMs)85 and Neural Support Vector Machines (NSVMs)86,87. NSVMs are an extension of SVMs that use Neural Networks (NNs)38 as feature extractors, substituting the kernel functions typically used in SVMs. Hyperparameter optimization for both SVMs and NSVMs was performed using Bayesian Optimization Algorithms (BOAs)88.
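Below is a minimal sketch of the SVM stage with Bayesian hyperparameter optimisation, here via scikit-optimize's BayesSearchCV as one possible BOA implementation (the authors' exact configuration is in Sup. Appendix 8; the search space and stand-in data are assumptions):

```python
import numpy as np
from sklearn.svm import SVC
from skopt import BayesSearchCV           # scikit-optimize
from skopt.space import Real

rng = np.random.default_rng(4)
X_train = rng.normal(size=(200, 12))      # stand-in for augmented PC scores
y_train = rng.integers(0, 2, size=200)    # stand-in for class labels

# Bayesian optimisation over the RBF-SVM hyperparameters C and gamma
search = BayesSearchCV(
    SVC(kernel="rbf"),
    {"C": Real(1e-3, 1e3, prior="log-uniform"),
     "gamma": Real(1e-4, 1e1, prior="log-uniform")},
    n_iter=32, cv=5, random_state=0,
)
search.fit(X_train, y_train)
best_svm = search.best_estimator_
```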

Supervised computational applications were performed in both the R and Python programming languages (Sup. Appendix 8). For full details on both unsupervised and supervised computational algorithms, consult the Extended Methods section of the Supplementary Materials.

Evaluation of supervised learning algorithms took into account a wide array of popular evaluation metrics in machine and deep learning. These included: Accuracy, Sensitivity, Specificity, Precision, Recall, Area Under the receiver operating characteristic Curve (AUC), the F-Measure (also known as the F1 Score), Cohen’s Kappa (κ) statistic, and model Loss. Each of these metrics, with the exception of loss, is calculated from confusion matrices, measuring the ratio of correctly classified individuals (True Positives & True Negatives) as well as misclassified individuals (False Positives & False Negatives). For more details see Supplementary Appendix 6.

Accuracy is simply reported as either a decimal [0, 1] or a percentage. Accuracy is a metric often misinterpreted, as explained in Supplementary Appendix 6, and should always be considered in combination with other values, such as Sensitivity or Specificity. Both Sensitivity and Specificity are reported as decimals [0, 1], and are used to evaluate the proportions of correct classifications and misclassifications. AUC values are derived from receiver operating characteristic curves, a method used to balance and graphically represent the rate of correctly and incorrectly classified individuals. The closer the curve gets to the top left corner of the graph, the better the classifier, while a diagonal line represents a random classifier (poor model). In order to quantify the curvature of the graph, the area under the curve (AUC) can be calculated, with AUC = 1 being a perfect classifier and AUC = 0.5 being a random classifier. The κ statistic is a measure of observer reliability, usually employed to test the agreement between two systems. When applied to confusion matrix evaluations, κ can be used to assess the probability that a model will produce an output ŷ that coincides with the real output y. κ values typically range between [0, 1], with κ = 1 meaning perfect agreement, κ = 0 being random agreement, and κ = 0.8 typically used as a threshold to define a near-perfect or perfect algorithm.

While in the authors’ opinion, AUC, Sensitivity and Specificity values are the most reliable evaluation metrics for studies of this type (Supp. Appendix 6), for ease of comparison with other papers or authors who choose to use other metrics, we have also included Precision, Recall and F-Measure values. Precision and Recall play a similar role to Sensitivity and Specificity, with Recall being equivalent to Sensitivity, and Precision measuring the proportion of positive predictions that are correct. Precision and Recall, however, differ from their counterparts in being more robust to imbalanced datasets. F-Measures are a combined evaluation of these two measures. For more details consult Supplementary Appendix 6.
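For illustration, the confusion-matrix-based metrics above can all be computed with scikit-learn. The sketch below uses a binary case for brevity (the study itself is multi-class, which requires per-class averaging; the stand-in labels and scores are assumptions):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

rng = np.random.default_rng(5)
y_true = rng.integers(0, 2, size=100)   # stand-in true labels
y_score = rng.random(size=100)          # stand-in predicted probabilities
y_pred = (y_score > 0.5).astype(int)    # thresholded predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
report = {
    "Accuracy":    accuracy_score(y_true, y_pred),
    "Sensitivity": tp / (tp + fn),      # identical to Recall
    "Specificity": tn / (tn + fp),
    "Precision":   precision_score(y_true, y_pred),
    "Recall":      recall_score(y_true, y_pred),
    "F-Measure":   f1_score(y_true, y_pred),
    "Kappa":       cohen_kappa_score(y_true, y_pred),
    "AUC":         roc_auc_score(y_true, y_score),
}
print(report)
```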

Loss metrics were reported using the Mean Squared Error (Eq. 2):

$$\begin{aligned} MSE = \frac{1}{n} \sum_{i = 1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2} \end{aligned}$$

(2)

Loss values closest to 0 indicate greater confidence when using the model to make new predictions.

Final evaluation metrics were reported when using algorithms to classify only the original samples, without augmented data. Augmented data was, therefore, solely used for training and validation. Finally, so as to assess the impact data augmentation has on supervised learning algorithms, algorithms were also trained on the raw data. This was performed using 70% of the raw data for training, while the remaining 30% was used as a test set.
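A minimal sketch of the baseline 70/30 split on the raw, non-augmented data (stratification by class is our assumption; the text specifies only the proportions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X_raw = rng.normal(size=(620, 12))    # stand-in for the raw PC scores
y_raw = rng.integers(0, 8, size=620)  # stand-in for the 8 carnivore labels

# 70% of the raw data for training, the remaining 30% held out as a test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.3, stratify=y_raw, random_state=0)
```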

