Natural and anthropogenic factors drive large-scale freshwater fish invasions

Invasion

We used freshwater fish biodiversity data collated by and described in Milardi, et al.⁴⁷. In summary, the dataset included 3777 sites sampled 1999–2014, recorded a total of 99 different fish species (35 of which were exotic and already established, even if some are restricted to areas with thermal springs), spanned > 11 degrees of longitude (~ 1200 km) and 10 degrees of latitude (~ 1100 km), covering streams at altitudes -2.7–2500 m above sea level. Community turnover was not a relevant factor in our study, because fish communities are typically stable over these timescales and the data was collected in a restricted timeframe within each area^29,39. Furthermore, time elapsed since last introductions was sufficient to analyze distribution patterns after major invasions had already occurred see e.g.^23,48.

Abundance of each species sampled during the monitoring was recorded with Moyle classes (Moyle and Nichols, 1973), which were weighted according to body-size classes in order to obtain a body-mass-corrected abundance, hereafter referred to simply as abundance. We then calculated an invasion degree, i.e. the share of introduced species in freshwater fish communities, based on the abundance of introduced and native species see e.g.^9,49. A high invasion degree equals to a high share of introduced species and a low share of native species.

We also selected the top 10 invasive species as further response variables, under the assumption that these would be the main components of the invasion degree, but would respond to different invasion drivers based on each species’ ecology. Invasiveness rank was defined through an index obtained by multiplying colonization (share of sites colonized) and prevalence (average relative abundance in the fish community) of each introduced species. The relative abundance of each of these species in the fish community was used as a response variable, being a measure comparable to invasion degree for single species.

Invasion drivers

We tested a combination of geographical, climate and anthropogenic impact factors as potential drivers of invasion. To avoid temporal mismatches, we chose time periods that overlapped as much as possible with our biological data.

We used basin area, altitude and slope (derived from a seamless digital elevation model of the whole Italian territory at 10 m resolution, Tarquini, et al.⁵⁰) as geographical variables.

We derived climate data from available series of long-term national monitoring (http://www.scia.isprambiente.it/). We used daily air temperature (2000–2009), measured at a total of 2266 sites throughout the country, as a proxy for temperature regimes. We also used cumulated annual precipitations, number of annual dry days (precipitation < 1 mm) and maximum number of consecutive dry days (all from 2000–2009), measured at a total of 3098 sites throughout the country, as a proxy for hydrological regimes.

We used an index the 2009 Human Footprint⁵¹, based on eight variables (built-up environments, population density, electric power infrastructure, crop lands, pasture lands, roads, railways, and navigable waterways), as a proxy for overall anthropogenic impact and propagule pressure. The lower the proxy, the smaller the anthropogenic impact. We also further explored single components of anthropogenic impact, by analyzing separately variables related to land use in 2012 (Copernicus Land Monitoring Service—https://land.copernicus.eu/pan-european/corine-land-cover/).

We further used an index see e.g.⁵², based on the concentration of 7 different parameters linked to nutrient levels (oxygen saturation, biochemical oxygen demand, chemical oxygen demand, NH₄, NO₃, total P and E. coli levels), measured at monthly intervals 2006–2010, at 1636 sites throughout the country, as a proxy for eutrophication levels. High proxy values correspond to low eutrophication levels. We further added the intensity of animal farming in 2010 (numbers of animals reared, ISTAT—http://dati.istat.it/Index.aspx?DataSetCode=DCSP_CONSISTENZE).

Finally, we used the presence of migration barriers as a proxy for riverine habitat fragmentation. We detected migration barrier locations through high-resolution cloud-free satellite images, and manually classified them in 4 categories (small jump, high jump, minor dam, major dam) according to their type and size (as gauged from visual characteristics, e.g. the presence of upstream retention basins).

Estimating invasion drivers at fish sampling points

We derived elevation of each fish sampling point from the DEM, which we similarly used to calculate the total area of the basin above the site, cropped at a 10 km distance from the fish sampling point. In lowland areas, where basin determinations were uncertain due to low elevation gradients, we derived the same variables from an area of 10 km radius around the sampling point. We derived the slope using a 10 km long river network segment, centered on the fish sampling point.

We used daily temperatures, cumulated annual precipitations, number of annual dry days, and maximum number of consecutive dry days, to build mean integrated nested Laplace approximated (INLA) annual layers for the decade 2000–2009⁵³. We assigned to each fish sampling point the value of the mean interpolated in a 5 km radius around the point (temperature regime proxy) or the mean over the basin above the sampling point (hydrological regime proxy).

We calculated the minimum, maximum, sum and mean values of the overall anthropogenic impact proxy and expressed them as densities in the cropped basin above the sampling point. We used a 10 km cutoff under the assumption that it would capture the most relevant pressures for any given sampling point, that other pressures further upstream would be less relevant as their pressure would be partly dampened by environmental buffers and avoid overlap between different sampling points along the same watercourse. We used a similar calculation for the density of animal farming (number of poultry, sheep, pigs and cattle), which were converted in livestock units (poultry = 0.01, cattle = 1, sheep = 0.1, pigs = 0.5) so that they could be combined into one variable. Similarly, we used the percentage of each land cover class in the cropped basin, aggregated at the highest level (i.e. Artificial Surfaces, Agricultural areas, Forest and semi-natural areas, Wetlands, Water bodies).

We calculated the eutrophication proxy as a mean of INLA-interpolated annual layers and projected these over the river network. We then used the mean (and relative SD) of the proxy using a 10 km long river network segment, centered on the fish sampling point.

For habitat fragmentation we used three variables: the number of reachable barriers along the river network, the mean distance of reachable barriers, and a habitat fragmentation index. This index used reachable barriers only, and was built as:

$$frac{1}{{frac{pi }{2}}}tan left( {frac{1}{barrier ,distance} times barrier ,category } right)$$

(1)

To vary non-linearly between 0 (least fragmented) and 1 (most fragmented). We chose a 10 km cutoff for these variables, as it was in line with other measures and the average distance most freshwater fish species are expected to move up or downstream⁵⁴, recognizing that some species have both shorter and longer migration spans.

Data analysis

After a preliminary analysis we retained the following variables for a full analysis and grouped them in 4 large invasion drivers groups: Geography (slope, altitude), Climate (mean values of temperature and precipitation, as well as mean maximum number of dry days (drought)), Human factors (mean densities of human footprint and livestock units, mean eutrophication index and habitat fragmentation index) and Land use (percentages of broad land cover classes). We used mean densities, rather than absolute values, trying to reduce any area-dependent effects.

We developed Booster Regression Tree (BRT) models for both invasion degree and the top 10 invasive taxa. BRT models are among a family of techniques used to advance single-classification or regression trees by combining the results of sequentially fit regression trees to reduce predictive error and improve overall performance^55,56,57. BRT models have been shown to have superior performance over linear modeling techniques especially with data that are often highly skewed, such as environmental data^55,58, and are considered an efficient method to describe non-linear relationships between variables and automatically incorporate interactions between variables. We reduced explanatory variables in each final BRT model by using a combination of variable importance (VI) scores, evaluation of interactions, and partial dependency responses (see below) following the approach outlined by Elith, et al.⁵⁵ to minimize overfitting. All variables with VI < 7 were deleted and the remaining variables were used to develop the final BRT model. Calculations of VI values are based on the number of times a variable is selected for splitting, weighted by the squared improvement to the models as a result of each split, averaged over all trees. After a first run, we used the BRT analysis residuals to test for spatial autocorrelation (SAC) through the Moran’s I test and, where SAC was found, we built a SAC autocovariate that was fed into the model to account for SAC. Final model choice relied on best model fit, and residuals were tested to confirm that spatial autocorrelation was reduced. The relative importance of each variable is scaled so that the sum adds to 100, with higher numbers indicating stronger influence on the modeled response. When two variables that we interpreted as explaining the same type of variation within the same stressor type, and showing the same type of response pattern, occurred in the top 10 most important variables, we kept only one variable in the final model, unless dropping one of the variables reduced the CV R² (cross-validation R²) beyond a reasonable level, based on expert judgement. We used CV R² (cross validation) values instead of R² to compare performance among BRT models because CV R² values are more conservative and less likely to be inflated by potential overfitting. We calculated CV R² values by holding 25% (bag fraction) of the sites out for each regression tree split, then used the withheld sites to test the percentage of deviance explained by the split⁵⁵. We used partial dependency plots to visualize the direction of individual drivers effects on the response variable, after accounting for the average effects of all other explanatory variables in each final model^55,56. A partial dependency plot is a scatter plot of an individual driver vs biotic metric and the modeled response form for that metric, where the response curves indicate the general shape, direction, and potential breakpoints (i.e., effect levels) for each driver. We ran Moran’s I test and built a SAC autocovariate using a Voroni tessellation with the functions of the spdep package⁵⁹, including testing for negative SAC. We ran BRT models using the gbm package^60,61 and specific code from Elith, et al.⁵⁵. Because Elith et al.’s code optimizes the number of trees run in each model, the number of trees can vary for each model; however, all models had at least 1000 trees.

We investigated the collinearity of variables through the variance inflation factor (VIF) within each variable group, and we excluded collinear variables (VIF > 8) from variation partitioning and RDA analyses. We performed variation partitioning through a partial regression to find the relative contributions of each group of invasion drivers (i.e., Geography, Climate, Human factors and Land use) in explaining invasion degree. The total variation was thus partitioned into different components: the variance explained by a single group of explanatory variables, the variance jointly explained by variables of two or three groups and unexplained variance (Legendre & Legendre, 2012). The significance of unique fractions was tested using permutation-based ANOVA with 999 permutations⁶². Geography (slope), Climate (mean temperature, mean precipitation, mean drought), Human factors (human footprint, animal farming, eutrophication, river fragmentation) and Land use (all land use subclasses) were ultimately retained. We used the varpart function of the “vegan” R package⁶² to partition the variance, and the “eulerr” R package⁶³ to represent the outputs through area-proportional Euler-Venn diagrams.

We also used Redundancy Analysis (RDA) to investigate the variation of the top 10 invasive species explained by invasion drivers⁶⁴, using adjusted R² values to report the variance explained. We used the RDA function of the “vegan” R package⁶² to run this analysis and test the significance of axes using permutation-based ANOVA with 499 permutations.

All analyses were performed in R³¹.

Natural and anthropogenic factors drive large-scale freshwater fish invasions

Invasion

Invasion drivers

Estimating invasion drivers at fish sampling points

Data analysis

Waste-derived biochar for water pollution control and sustainable development

ReaLSAT, a global dataset of reservoir and lake surface area variations

ITALIAN LANGUAGE

ENGLISH LANGUAGE