WOODIV, a database of occurrences, functional traits, and phylogenetic data for all Euro-Mediterranean trees

The geographic area covered by the WOODIV database is the Euro-Mediterranean region, as defined by Médail et al.¹. The northern Mediterranean region was selected following the definition of terrestrial ecoregions of the world by Olson et al.¹³. The study area covers all or part of the following countries and islands: Albania, Croatia, Cyprus, France, Greece, Italy, Malta, Montenegro, Portugal, Slovenia, Southern Macedonia, and Spain, including the Balearic archipelago, Corsica, Sardinia, Sicily, and Crete.

We focused on the 245 tree taxa (210 species and 35 subspecies) identified in the Euro-Mediterranean checklist from Médail et al.¹. These taxa belong to 33 families and 64 genera and include 46 endemics (as defined by Médail et al.¹, i.e. range-restricted taxa in and outside of the study area).

Observed occurrence data

We collected tree occurrence data (at the species or subspecies level) from 23 sources: national databases and floras, regional databases, and publications (Table 1). Experts provided some still-unpublished records at the grid level specifically for this project, covering southern Macedonia, Malta, Montenegro, and Sicily (four sources, Table 1).

Table 1 Sources of the occurrence records, giving the name of the dataset (Source name; ined. if unpublished), the Type of data (records with geographic coordinates (records), records at the grid level (gridded records), or atlas-type (atlas) data), and the Countries/Islands covered by the source.

At the subspecies level, the WOODIV database lacks occurrences for 11 of the 35 subspecies listed by Médail et al.¹. When aggregated at the species level (to match the taxonomic resolution of the functional and phylogenetic data, which are available at the species level only), the WOODIV database lacks occurrences for only 3 of the 210 species from the Médail et al.¹ checklist (n = 207; Table 2; Supplementary Table 2): Pyrus elaeagrifolia Pall., which occurs in Albania and Macedonia (and in northeastern Greece but outside the Mediterranean biome), and P. syriaca Boiss. and Tamarix passerinoides Desv., which occur in Cyprus and in Sardinia, respectively.

Table 2 Summary of the availability of data in the WOODIV database: total number of species among the 210 species from the Médail et al.¹ checklist with (1) observed occurrences; (2) functional traits data, including the detail of the number of species with available data for 4 traits: adult plant height (Height), seed mass (SeedMass), specific leaf area (SLA) and wood density (SSD) (see “Functional data” section); and, (3) genetic data including the detail of the number of species with available data for 3 DNA-regions: matK, rbcL and psbA-trnH (see “Genetic data” section).

In addition, because of the taxonomic heterogeneity of the different data sources, we recommend aggregating the occurrences of certain tree taxa at the species-group level (see sections Data Records and Usage Notes), i.e. aggregating Pinus uncinata DC. and P. mugo Turra into P. mugo aggr.; Juniperus deltoides R.P.Adams and J. oxycedrus L. into J. oxycedrus aggr.; and Alnus lusitanica Vít, Douda & Mandák, A. rohlenae Vít, Douda & Mandák, and A. glutinosa (L.) Gaertn. into A. glutinosa aggr. The WOODIV database thus contains reliable occurrences of 200 species and three aggregated species (n = 203; Table 2; Supplementary Table 2).

The raw dataset obtained from gathering occurrences from all sources included a total of 1,248,701 occurrence records distributed across the participating countries.

The raw occurrence data were aggregated at a resolution of 10 × 10 km in line with an INSPIRE¹⁴-compliant 10 × 10 km grid (coordinate reference system EPSG:4258, i.e. ETRS89). This gridding procedure provided a way to standardize data from different sources. We selected this spatial grain because it was the finest resolution available for some countries of the study area (e.g. Slovenia, Croatia, Greece). Sources of occurrence data with a resolution coarser than 10 × 10 km (e.g. Atlas Florae Europaeae¹⁵) were not considered. The considered area includes 10,042 grid cells with at least one occurrence record (Fig. 1a). The occurrence dataset provided by the WOODIV database includes 140,279 occurrences, i.e. records of species considered native in a given grid cell, aggregated to the 10 × 10 km grid with duplicate records of a species within a cell removed.
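The gridding and de-duplication step described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes coordinates already expressed in metres in some projected system, and the function names and sample records are hypothetical.

```python
import math
from collections import defaultdict

CELL_SIZE_M = 10_000  # 10 x 10 km cells (assumes coordinates in metres)

def cell_id(x_m, y_m, cell_size=CELL_SIZE_M):
    """Return the (column, row) index of the grid cell containing a point."""
    return (math.floor(x_m / cell_size), math.floor(y_m / cell_size))

def grid_occurrences(records):
    """Collapse raw records into unique species presences per grid cell."""
    species_by_cell = defaultdict(set)
    for species, x_m, y_m in records:
        # A set keeps each species only once per cell (duplicate removal).
        species_by_cell[cell_id(x_m, y_m)].add(species)
    return species_by_cell

# Hypothetical raw records: (species, easting in m, northing in m)
raw_records = [
    ("Quercus ilex", 4_001_200.0, 2_203_900.0),
    ("Quercus ilex", 4_008_750.0, 2_201_100.0),      # same cell: duplicate
    ("Pinus halepensis", 4_003_000.0, 2_205_500.0),  # same cell, other species
    ("Quercus ilex", 4_012_000.0, 2_203_000.0),      # neighbouring cell
]

gridded = grid_occurrences(raw_records)
n_cells = len(gridded)
n_occurrences = sum(len(species) for species in gridded.values())
print(n_cells, n_occurrences)
```

In this toy input, four raw records collapse to three gridded occurrences in two cells, which mirrors how 1,248,701 raw records reduce to a much smaller number of cell-level presences.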

Fig. 1

Geographic scope of the WOODIV database, spatial distribution, and validation of tree occurrences. (a) Number of species within a 10 × 10 km grid cell based on modelled occurrence data for the 171 modelled species, with the addition of the occurrence data of the 21 small-range species; and, within grid cells of Atlas Florae Europaeae (AFE; 50 × 50 km), (b) number of species with presences recorded in AFE but not in the WOODIV dataset, among the 104 species present in both the AFE and WOODIV data; and (c) number of species with presences recorded in the WOODIV dataset but not in AFE, among the 104 species present in both the AFE and WOODIV data.

Modelled occurrence data

The WOODIV database provides modelled occurrences of the species from the Médail et al.¹ checklist. From the 10 × 10 km gridded observed occurrence data, we modelled the distribution of each species across the Euro-Mediterranean area using Species Distribution Models (SDM). SDM statistically relate species occurrence records to environmental variables to predict the potential distribution of species¹⁶.

Due to the extent of the study area, we only related species occurrence to climate gradients¹⁷. Bioclimatic variables were extracted from the CHELSA database V1.2¹⁸, available at a resolution of 30 arc-sec (http://chelsa-climate.org/), and then averaged to a 10 × 10 km resolution. The selection of the environmental predictors for niche modeling is a source of uncertainty in model predictions that can be reduced with sound statistical methods and ecological knowledge of the target species¹⁹. We also focused on proximal predictors that directly influence species distribution and selected a low number of predictive variables to reduce the issues of model overfitting and multicollinearity²⁰. We selected four bioclimatic variables that previous studies had reported to be relevant predictors of the distribution of plant species, especially in environments such as those that characterize the Mediterranean Basin²¹⁻²⁴: “Minimum temperature of the coldest month” (Bio06, in °C) quantifies potentially lethal frost events and, more generally, stress due to low temperatures; “Total annual precipitation” (Bio12, in mm) approximates average water availability; “Precipitation of the driest month” (Bio14, in mm) describes the extremes associated with drought events and stress due to low water availability; and “Temperature seasonality” (Bio04, dimensionless) describes the variability of temperature during the year. All selected predictors showed variance inflation factor (VIF²⁵) values below 5, indicating that no predictor was strongly correlated with a linear combination of the other predictors (VIF Bio04 = 1.68, VIF Bio06 = 2.06, VIF Bio12 = 1.53, and VIF Bio14 = 2.07).
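For readers unfamiliar with the VIF check, the sketch below shows one standard way to compute it: the VIF of predictor j equals 1 / (1 − R²ⱼ), which is also the j-th diagonal element of the inverse of the predictors' correlation matrix. The helper names and the toy values are hypothetical and are not CHELSA data; the source does not state which implementation the authors used.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equally long columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centred = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centred]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
         for cj, nj in zip(centred, norms)]
        for ci, ni in zip(centred, norms)
    ]

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """VIF per predictor: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[i][i] for i in range(len(columns))]

# Toy predictors (invented values) whose pairwise correlation is 0.8.
predictor_a = [1.0, 2.0, 3.0, 4.0, 5.0]
predictor_b = [2.0, 1.0, 4.0, 3.0, 5.0]
print(vif([predictor_a, predictor_b]))
```

With two predictors correlated at r = 0.8, both VIF values equal 1 / (1 − 0.64), i.e. about 2.78, which would pass the threshold of 5 used in the text.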

We related species occurrence to these four bioclimatic variables using the Random Forest algorithm²⁶. As only presence data are archived in the WOODIV database, we randomly sampled a number of pseudo-absences equal to the number of observed occurrences²⁷. This random selection of pseudo-absences was repeated 10 times for each species. A comparison with the floras indicated that occurrence data in the Italian Peninsula, Sardinia and/or Sicily were highly unrepresentative of the distribution of some species (n = 84; see Supplementary Table 3). To overcome this potential bias in the models, we did not include these regions in the model calibration step (Supplementary Table 3). The model was projected in these areas after having tested the similarity in the variables between the projection dataset (Italy, Sicily, and Sardinia) and the fitting dataset (the rest of the study area). Indeed, when model predictions are projected into regions not analyzed in the fitting data, it is necessary to measure the similarity between the new environments and those in the training sample²⁸, as models are less reliable when predicting outside their domain²⁹. Similarity analyses computed using ExDet³⁰ indicated that all covariables in the projected area are within the univariate range of the fitting area and that there is no change in correlation between covariables (NT1 and NT2 = 0).
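The pseudo-absence step can be illustrated with a short sketch. It assumes, as a simplification not stated in the source, that pseudo-absences are drawn uniformly from grid cells without a recorded presence; the function name, the seed, and the toy grid are hypothetical.

```python
import random

def sample_pseudo_absences(all_cells, presence_cells, n_repeats=10, seed=0):
    """Draw n_repeats pseudo-absence sets, each as large as the presence set."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    presence = set(presence_cells)
    # Assumption: candidates are all cells with no recorded presence.
    candidates = sorted(c for c in all_cells if c not in presence)
    if len(candidates) < len(presence):
        raise ValueError("not enough candidate cells for pseudo-absences")
    return [rng.sample(candidates, len(presence)) for _ in range(n_repeats)]

# Toy 10 x 10 grid with 15 presence cells.
cells = [(col, row) for col in range(10) for row in range(10)]
presences = cells[:15]
draws = sample_pseudo_absences(cells, presences)
print(len(draws), len(draws[0]))
```

Each of the 10 draws is paired with the same presences to form one of the 10 datasets per species that the following paragraph splits further.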

Each of these 10 datasets (per species) was then randomly split into two datasets to evaluate model performance on pseudo-independent data³¹: 70% of the data was used to calibrate models and the 30% remaining data was used to evaluate model performance using the True Skill Statistic (TSS³²) and the Area Under the Curve (AUC) of the receiver-operating characteristic (ROC) plot³³ metrics. This split-sample step was repeated 10 times resulting in 100 models per species.
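As a rough illustration of the evaluation step, the sketch below implements a 70/30 random split and the two metrics named above. TSS is sensitivity + specificity − 1; AUC is computed here through its rank interpretation (the probability that a randomly chosen presence scores higher than a randomly chosen absence, with ties counted as one half). Function names and the toy labels are hypothetical; biomod2 performs these steps internally.

```python
import random

def split_sample(rows, calib_fraction=0.7, rng=None):
    """Randomly split rows into calibration and evaluation subsets."""
    rng = rng or random.Random()
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * calib_fraction)
    return shuffled[:cut], shuffled[cut:]

def tss(observed, predicted):
    """True Skill Statistic for 0/1 observations and 0/1 predictions."""
    tp = sum(o == 1 and p == 1 for o, p in zip(observed, predicted))
    fn = sum(o == 1 and p == 0 for o, p in zip(observed, predicted))
    tn = sum(o == 0 and p == 0 for o, p in zip(observed, predicted))
    fp = sum(o == 0 and p == 1 for o, p in zip(observed, predicted))
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both presences and absences are required")
    return tp / (tp + fn) + tn / (tn + fp) - 1

def auc(observed, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for o, s in zip(observed, scores) if o == 1]
    neg = [s for o, s in zip(observed, scores) if o == 0]
    if not pos or not neg:
        raise ValueError("both presences and absences are required")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy evaluation data (invented): 3 presences and 3 pseudo-absences.
observed = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.6]
predicted = [1 if s >= 0.5 else 0 for s in scores]
print(tss(observed, predicted), auc(observed, scores))

calibration, evaluation = split_sample(list(range(10)), rng=random.Random(1))
print(len(calibration), len(evaluation))
```

Repeating the split 10 times on each of the 10 pseudo-absence datasets gives the 10 × 10 = 100 models per species mentioned in the text.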

For each of the 171 modelled species, a mean model (from the 100 replicates) was then used to predict potential species distribution. Predicted probabilities of occurrence were finally converted into presence/absence using the threshold maximizing the TSS. We fitted all models in the R environment³⁴ using the package biomod2³⁵,³⁶.
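The averaging and thresholding logic can be sketched as follows, assuming each replicate yields one predicted probability per grid cell. This is an illustrative simplification with hypothetical names and invented numbers, not the biomod2 implementation; candidate thresholds are simply the distinct predicted probabilities.

```python
def mean_prediction(replicates):
    """Cell-wise mean of predicted probabilities across replicate models."""
    return [sum(values) / len(values) for values in zip(*replicates)]

def tss_at(observed, probabilities, threshold):
    """TSS obtained when cells with probability >= threshold count as presences."""
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(o == 1 and p == 1 for o, p in zip(observed, predicted))
    fn = sum(o == 1 and p == 0 for o, p in zip(observed, predicted))
    tn = sum(o == 0 and p == 0 for o, p in zip(observed, predicted))
    fp = sum(o == 0 and p == 1 for o, p in zip(observed, predicted))
    return tp / (tp + fn) + tn / (tn + fp) - 1

def best_threshold(observed, probabilities):
    """Return the candidate threshold with the highest TSS (lowest on ties)."""
    candidates = sorted(set(probabilities))
    return max(candidates, key=lambda t: tss_at(observed, probabilities, t))

# Two toy replicates (invented values) over four grid cells.
replicates = [
    [0.9, 0.7, 0.4, 0.2],
    [0.7, 0.5, 0.2, 0.0],
]
observed = [1, 1, 0, 0]

mean_probs = mean_prediction(replicates)
threshold = best_threshold(observed, mean_probs)
binary_map = [1 if p >= threshold else 0 for p in mean_probs]
print(threshold, binary_map)
```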

The WOODIV database provides modelled occurrences of each of the 171 species for each 10 × 10 km grid cell (Fig. 1a). Thirty-two species with fewer than 10 occurrence records were not modelled (Supplementary Table 3). Among these 32 species, 21 are small-ranged species whose distribution is limited to a few grid cells (Supplementary Table 3). The observed occurrence records for these 21 species can be considered representative of their distribution, and we therefore recommend using the non-modelled records of these species for analyses. The occurrences of the remaining 11 species should be considered unrepresentative of their distribution.

Functional data

Four functional traits were considered in this project: adult plant height (Height), seed mass (SeedMass), specific leaf area (SLA), and wood density (StemSpecDens). These traits have been proposed to reflect a global spectrum of plant strategies³⁷,³⁸: Height is a commonly measured proxy for individual size and reflects several aspects including resource acquisition, competitive ability, or dispersal capacity. SeedMass represents the trade-off between fecundity, seed survival, and dispersal. SLA (the ratio between leaf area and dry mass) is correlated to photosynthetic capacity and leaf life span and is an indirect measure of the return on investments in carbon gain compared to water loss. StemSpecDens is a key component of woody plant growth linked to the mechanical support of the stem and its growth rate.

We compiled the values for these traits at the species level for the trees from the Médail et al.¹ checklist, referring mostly to two databases: TRY⁹ and BROT 2.0³⁹. Supplementary values were obtained from more specific databases (Global Wood Density Database⁴⁰, Kew Seed Information Database⁴¹) or from the scientific literature and atlases⁴²⁻⁶¹. In total, 92% of the entries were extracted from TRY, 7% from BROT 2.0, and the remainder were retrieved from the other sources. The original ID of records from the TRY and BROT databases is provided so that users who need contextual information can refer to the complete observation.

The WOODIV database lacks all trait data for only 6 of the 210 species from the checklist (Table 2, Supplementary Table 2): Alnus lusitanica Vít, Douda & Mandák, Alnus rohlenae Vít, Douda & Mandák, Malus dasyphylla Borkh., Quercus infectoria Olivier, Tamarix arborea Ehrenb. ex Bunge, and Tamarix passerinoides Del. ex Desf.

Adult plant height and seed mass data were available for more than 75% of the 210 species (Table 2; Fig. 2a), whereas wood density and specific leaf area were available for only around 50%. The WOODIV database includes all four trait values for 41% of the 210 species (Fig. 2b; Supplementary Table 2), three trait values for 56% more species.

Fig. 2

Prevalence of trait and genetic data among the 210 species from the Médail et al.¹ checklist: (a) For each of the four considered functional traits (adult plant height (Height), seed mass (SeedMass), wood density (SSD) and specific leaf area (SLA)), percentage of the 210 species with existing data; (b) Percentage of the 210 species for which zero to four functional traits are available; (c) For each of the three considered DNA regions (matK, rbcL and psbA-trnH), percentage of the 210 species with existing data (in grey, species with only one available sequence for the considered region; in black, species with a consensus sequence for that region); and (d) Percentage of the 210 species for which zero to three DNA regions are available.

The database provides an R script that can be used to estimate missing trait values using the taxonomic classification if needed.
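The source does not describe how that R script works. Purely to illustrate what taxonomy-based gap filling can look like, the sketch below fills a missing trait value with the mean of congeneric species and, failing that, with the mean of species in the same family. The strategy, the function name, and all numeric values are assumptions made for this example and do not come from the WOODIV database.

```python
from statistics import mean

def impute_trait(taxonomy, trait_values):
    """Fill missing values with the genus mean, then the family mean.

    taxonomy: {species: (genus, family)}
    trait_values: {species: value or None}
    Species with no measured relative stay None.
    """
    def observed_mean(level_index, name):
        values = [
            trait_values[sp]
            for sp, ranks in taxonomy.items()
            if ranks[level_index] == name and trait_values.get(sp) is not None
        ]
        return mean(values) if values else None

    filled = {}
    for species, (genus, family) in taxonomy.items():
        value = trait_values.get(species)
        if value is None:
            value = observed_mean(0, genus)       # try congeners first
        if value is None:
            value = observed_mean(1, family)      # then the whole family
        filled[species] = value
    return filled

# Invented heights in metres, for illustration only (not WOODIV data).
taxonomy = {
    "Quercus ilex": ("Quercus", "Fagaceae"),
    "Quercus suber": ("Quercus", "Fagaceae"),
    "Quercus infectoria": ("Quercus", "Fagaceae"),
    "Fagus sylvatica": ("Fagus", "Fagaceae"),
    "Castanea sativa": ("Castanea", "Fagaceae"),
    "Tamarix arborea": ("Tamarix", "Tamaricaceae"),
}
heights = {
    "Quercus ilex": 20.0,
    "Quercus suber": 16.0,
    "Quercus infectoria": None,
    "Fagus sylvatica": 40.0,
    "Castanea sativa": None,
    "Tamarix arborea": None,
}

filled_heights = impute_trait(taxonomy, heights)
print(filled_heights)
```

Imputed values of this kind should be flagged as estimates in any downstream analysis, since they carry no species-specific information.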

Genetic data

Three different DNA regions from the plastid genome, corresponding to the most commonly used DNA barcode regions⁶²⁻⁶⁴, were considered in this project: the ribulose-bisphosphate carboxylase large-subunit gene (rbcL), the maturase-K gene (matK), and the psbA-trnH intergenic spacer (trnH).

In a first step, we collected all sequences from GenBank (https://www.ncbi.nlm.nih.gov/genbank/) for the three DNA regions available at the species level for the species from the Médail et al.¹ checklist (rbcL: n = 650 sequences for 146 species; matK: n = 644 sequences for 127 species; trnH: n = 493 sequences for 129 species). To fill the gaps, we obtained DNA from fresh samples collected in the field or gathered from herbarium specimens (Supplementary Table 4). DNA extraction and sequencing were performed at INRA-URFM, Avignon (France) and the National Research Council (IBBR-CNR), Florence (Italy) (rbcL: n = 233 for 125 species; matK: n = 162 for 91 species; trnH: n = 200 for 120 species). Methods used for DNA isolation and Sanger sequencing are described by Albassatneh et al.⁶⁵. When more than one sequence was available for a given DNA region and species, a sequence alignment was performed to check data quality and a taxon-consensus sequence was generated. Consensus sequences were built using the IUPAC-IUB ambiguity code⁶⁶ for a total of 119 (rbcL), 109 (matK), and 110 (trnH) species, respectively (Fig. 2c). All newly created sequences were uploaded to GenBank.
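To show what building a consensus with IUPAC ambiguity codes means in practice, here is a minimal sketch. It assumes pre-aligned sequences of equal length that contain only A, C, G, T and gap characters, and it ignores gaps when at least one sequence has a base at that position; these choices and the function name are assumptions for illustration, not the authors' pipeline.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases observed.
IUPAC_CODES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

def consensus(aligned_sequences):
    """Return one consensus sequence for a set of aligned sequences."""
    lengths = {len(seq) for seq in aligned_sequences}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    result = []
    for column in zip(*aligned_sequences):
        bases = frozenset(b for b in (c.upper() for c in column) if b in "ACGT")
        # Columns made only of gaps are kept as a gap in the consensus.
        result.append(IUPAC_CODES[bases] if bases else "-")
    return "".join(result)

# Three toy sequences of one species for one DNA region (invented).
print(consensus(["ACGT-A", "ACGTTA", "ATGCTA"]))
```

Positions where the sequences disagree are encoded rather than resolved: C/T becomes Y in the toy example, so no variant is silently discarded.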

The WOODIV database lacks DNA-region sequence data for only 6 of the 210 species from the Médail et al.¹ checklist (Table 2, Fig. 2d): Alnus lusitanica Vít, Douda & Mandák, Cytisus aeolicus Guss., Celtis planchoniana K.I. Chr., Salix appendiculata Vill., Tamarix hampeana Boiss. & Heldr., and Tamarix minoa J.L. Villar, Turland, Juan, Gaskin, M.A. Alonso & M.B. Crespo.

Phylogeny

The WOODIV database provides a phylogram including the 204 species for which sequence data were available for at least one DNA region (Supplementary Table 2) and phylograms including the 210 species from the Médail et al.¹ list (Supplementary Fig. 1).

Uneven taxon sampling focused on a single biogeographic area, such as ours, can bias phylogenetic inferences⁶⁷. Our goal here is to provide DNA sequence data that can be readily re-used to estimate, e.g., comparable phylogenetic diversity indices, not to make phylogenetic inferences per se. To illustrate our DNA-sequence data and to facilitate their use in future analyses (to calculate phylogenetic diversity, for example), we constructed a molecular phylogeny encompassing the 204 Euro-Mediterranean tree species. Each DNA region was independently aligned using the MAFFT program⁶⁸ and parsed using the program Gblocks⁶⁹ to exclude segments characterized by several variable positions or gaps from the final alignments. An appropriate substitution model of sequence evolution was selected for each of the three plastid DNA regions using the Akaike Information Criterion (AIC) as implemented in the jModelTest 2 program⁷⁰. The optimal substitution model identified was the same for all three regions: GTR + I + G. We obtained a concatenated matrix with 1615 aligned bases. We performed a Maximum Likelihood analysis⁷¹ as implemented in the RAxML V8 program⁷². The DNA sequence matrix of 1615 sites was analyzed using three partitions with the GTRGAMMAI model (GTR + Gamma substitution model + proportion of invariant sites). We searched for the optimal tree by running at least 20 independent maximum likelihood analyses; full analyses also included 100 bootstrap replicates⁷².
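The concatenation into a partitioned matrix can be sketched as below. The example assumes that a species lacking a region is padded with gap characters, and it writes partition boundaries in the "DNA, name = start-end" style used by RAxML partition files; the function name and the toy alignments are hypothetical, and the real matrix has 1615 sites rather than the 9 shown here.

```python
def concatenate(alignments, gap="-"):
    """Join per-region alignments into one supermatrix with partition lines.

    alignments: {region: {species: aligned_sequence}}, in the desired order.
    Returns ({species: concatenated_sequence}, [partition lines]).
    """
    species = sorted({sp for aln in alignments.values() for sp in aln})
    matrix = {sp: "" for sp in species}
    partitions = []
    start = 1
    for region, aln in alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{region}: sequences are not aligned")
        length = lengths.pop()
        for sp in species:
            # Assumption: missing regions are padded with gaps.
            matrix[sp] += aln.get(sp, gap * length)
        partitions.append(f"DNA, {region} = {start}-{start + length - 1}")
        start += length
    return matrix, partitions

# Toy alignments (invented); species_b has no matK sequence.
toy_alignments = {
    "rbcL": {"species_a": "ACGT", "species_b": "ACGA"},
    "matK": {"species_a": "GG-"},
    "trnH": {"species_a": "TT", "species_b": "TA"},
}

supermatrix, partition_lines = concatenate(toy_alignments)
print(supermatrix)
print("\n".join(partition_lines))
```

Keeping the three regions as separate partitions lets model parameters be estimated per region even though the same substitution model was selected for all three.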

For users who would like to work on the complete pool of 210 tree species, we also built a 210-species phylogram including all Euro-Mediterranean trees. The six missing species for which no DNA-region sequence was available were added to the phylogenetic tree using the Simulation with Uncertainty for Phylogenetic Investigations (SUNPLIN) method⁷³, with 100 replicates. The geometric median tree was computed from the set of 100 replicates with the medTree function from the R package treespace⁷⁴. Both the median tree and the set of 100 replicates are provided in the WOODIV database, together with the molecular tree with 204 species.

Source: Ecology - nature.com
