
Modeling Posidonia oceanica shoot density and rhizome primary production

Study area and environmental variables

The data set used in this study included 192 sites in which lepidochronological data and shoot density were acquired between 1994 and 2003. The rhizome primary production of P. oceanica was estimated following the standardized approach defined by Pergent-Martini et al.12.

The spatial coverage of the data set was not uniform across the Italian Seas. In fact, the sampling sites were mainly concentrated in five Italian regions, i.e. Liguria, Tuscany, Lazio, Basilicata and Apulia (Fig. 1).

Figure 1

Sampling sites from which field data and indirect measurements were collected (red circles). Data from several sampling stations (N = 6 to 15) are available at each site.


In line with the main aim of the study, the environmental variables were all acquired from maps and other related information sources (Table 1). A detailed explanation of these variables and of the methodology used for their acquisition is given in the supplementary materials.

Table 1 Environmental factors used as predictive variables for developing P. oceanica models.


Since these environmental factors were used as predictive variables in the modeling procedure, their selection was based on the ecological nature of the modelled processes, taking into account their influence on the latter. For instance, it is well known that depth plays a crucial role in determining the properties of P. oceanica meadows, such as density and productivity, as it is strictly related to other fundamental environmental factors, e.g. light. Therefore, both depth and gradient were considered as predictive variables, as well as the profile of the isobaths, described as either linear, convex or concave. The presence of sources of disturbance, such as sewage discharge or similar pollution, was also taken into account, as an increase in turbidity following an excessive enrichment from nutrient inputs might entail a reduction of water transparency and light penetration, which in turn can alter the ecological properties of a P. oceanica meadow. Like the sea floor typologies, i.e. sand, rock and matte, the sources of disturbance were represented as binary variables, since only indirect methods of data acquisition, e.g. maps, were intended to be used. With such data sources, only a qualitative assessment could be performed with good confidence. A quantitative coding of those predictive variables would instead require expensive and time-consuming field activities, which would represent a major drawback for the proposed approach.
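As an illustration, a minimal R sketch of how such map-derived information might be encoded is given below; the column names and example values are hypothetical and do not necessarily match those listed in Table 1.

```r
# Hypothetical encoding of map-derived predictors: continuous variables are kept
# numeric, the isobath profile is a categorical factor, while sea floor typologies
# and sources of disturbance are coded as presence/absence (binary) variables.
predictors <- data.frame(
  depth            = c(8.5, 22.0),   # m
  gradient         = c(3.0, 12.0),   # slope of the sea floor
  isobath_profile  = factor(c("linear", "concave"),
                            levels = c("linear", "convex", "concave")),
  sewage_discharge = c(0L, 1L),      # binary: presence of a disturbance source
  sand             = c(1L, 0L),      # binary sea floor typologies
  rock             = c(0L, 1L),
  matte            = c(1L, 0L)
)
```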

The data set was partitioned into two subsets, i.e. training and test sets, for modeling purposes. Data partitioning represents a critical step in modeling, whose aim is to obtain two subsets that are as independent from each other as possible, while remaining representative of the modelled problem, in order to avoid modeling artifacts and to ensure the applicability of the resulting models18.

Accordingly, the partitioning was not based on a random selection of the data; rather, the subsets were obtained on the basis of the following approach. The data were stratified according to depth, i.e. they were sorted by depth and assigned to one of the following bathymetric classes: [0, 5] m, (5, 10] m, (10, 15] m, (15, 20] m, (20, 25] m and (25, 35] m. These classes comprised 16.67%, 23.96%, 27.08%, 17.71%, 9.90% and 4.69% of the total number of records, respectively. Subsequently, within each bathymetric class, about 70% of the data (n = 136) were assigned to the training set, while the remaining ones (n = 56) were assigned to the test set. While the former subset, comprising the majority of the data, was used for the training procedure of the Machine Learning algorithm, i.e. Random Forest19, the test subset was only used a posteriori to evaluate model performance.
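A minimal R sketch of this depth-stratified partitioning is reported below; the data frame and column names ('meadows', 'depth') are hypothetical.

```r
set.seed(42)  # illustrative seed, for reproducibility of the split

# Assign each record of the hypothetical 'meadows' data frame to a bathymetric class
breaks <- c(0, 5, 10, 15, 20, 25, 35)
meadows$bathy_class <- cut(meadows$depth, breaks = breaks, include.lowest = TRUE)

# Within each class, allocate about 70% of the records to the training set
train_idx <- unlist(lapply(split(seq_len(nrow(meadows)), meadows$bathy_class),
                           function(idx) idx[sample.int(length(idx),
                                                        round(0.7 * length(idx)))]))

# Drop the helper column so that it is not used as a predictor later on
keep <- setdiff(names(meadows), "bathy_class")
train_set <- meadows[train_idx, keep]
test_set  <- meadows[-train_idx, keep]
```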

The rationale behind the aforementioned approach is that depth plays a paramount ecological role in regulating both P. oceanica shoot density and rhizome primary production, as previously noted. In fact, a wide range of environmental conditions are related to depth, such as light, water movement and sedimentation flows, which in turn strictly affect the structure, the functioning and the ecological condition of P. oceanica meadows. Therefore, by using the abovementioned strategy for the data allocation, the inherent variability of the ecological patterns was properly distributed between the subsets, thus ensuring the possibility of obtaining ecologically sound models.

Random Forest

The Random Forest (RF) is a Machine Learning technique which fits an ensemble of Classification Trees and combines their predictions into a single model19.

RF has proven effective in a wide range of applications, as it is able to address both regression and classification problems20, as well as to perform cluster analysis and missing value imputation21,22.

RF has been used for predicting the current and potential future spatial distribution of plant species23, as well as for estimating marine biodiversity on the basis of sea floor hardness24. RF has also been applied in ecological applications as a classification tool for the assessment of the vulnerability of P. oceanica meadows over a large spatial scale25, and for land cover classification using remote sensing data26,27.

This method relies upon one of the main features of Machine Learning methods, namely that an ensemble of ‘weak learners’ usually outperforms a single ‘strong learner’19. As a matter of fact, each Classification Tree in the forest represents a weak learner, i.e. a single model, trained on a partly independent data subset, i.e. on a bootstrap sample. Each Classification Tree provides predictions based on the data contained in its bootstrap sample, and many trees are combined into an ensemble model, i.e. into a ‘forest’. The overall output of a RF is obtained by averaging the outcomes of all the trees for regression applications, while it is based on majority voting for classification problems.

The diversity of the trees in the forest is ensured by the use of random subsets of data for the tree-building process, i.e. bootstrap samples, as well as by making a random subset of predictive variables available for the tree splitting procedure. These features allow the RF to reduce the correlation among its Classification Trees, while keeping the variance relatively small, thus leading to a more robust model19.

The selection of a random subset of predictive variables at each split ensures that a certain level of randomness is maintained during the tree construction process28, and is necessary for the proper functioning of RF. As a matter of fact, the size of the random subset of predictive variables available for the tree splitting procedure represents a tuning parameter, defined as mtry. The latter, together with the minimum number of records to be contained in each leaf, called nodesize, are the main tuning parameters that deeply affect RF performance21,29.

In his original work, Breiman19 suggested setting the mtry value to p/3 for regression applications, where p is the total number of predictors, and tuning it from half to twice that value. On the other hand, nodesize and ntree (the latter being the total number of Classification Trees in the forest) are more related to the generalization ability of the RF and to the overall complexity of the model. Growing a very large forest, e.g. ntree > 500, or growing the trees to achieve a high degree of purity at their leaves, e.g. nodesize < 5, could substantially increase the computational costs, leading to an extremely complex model21. It has been largely demonstrated that these parameters have to be tuned considering the available data, as large data sets might require larger nodesize and smaller ntree values25,28,30,31,32.

Since the goal in modelling is to obtain a model showing a high level of accuracy while presenting an appropriate level of complexity, which might vary according to the nature of the modelled process but should not be exceeded18, in this study the RF training, involving the calibration of the tuning parameters, was performed as follows.

The mtry parameter was tested in the [3, 12] range, as the data set included 18 predictive variables, while nodesize was tested in the [1, 10] interval, setting ntree to 1000. The moderate size of the available data set (N = 192) allowed quite large forests to be grown, with trees developed almost to their maximum depth. The resulting 100 RF configurations (i.e. 10 mtry values, times 10 nodesize values, times 1 ntree value) were trained using only the data contained in the training set, while their performances were assessed on the basis of the withheld data, i.e. the test set.
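A hedged sketch of this grid search, based on the randomForest package mentioned below and on the hypothetical 'train_set'/'test_set' objects from the partitioning sketch, is shown here; the response column name ('shoot_density') is also hypothetical, and the two subsets are assumed to contain only the 18 predictive variables plus the response.

```r
library(randomForest)

# Grid of tuning parameters: with p = 18 predictors, Breiman's default mtry = p/3 = 6,
# and tuning from half to twice that value gives the [3, 12] range tested here.
grid  <- expand.grid(mtry = 3:12, nodesize = 1:10)   # 10 x 10 = 100 configurations
ntree <- 1000

# Train each configuration on the training set and score it on the withheld test set
eval_config <- function(i) {
  rf   <- randomForest(shoot_density ~ ., data = train_set,
                       mtry = grid$mtry[i], nodesize = grid$nodesize[i], ntree = ntree)
  pred <- predict(rf, newdata = test_set)
  obs  <- test_set$shoot_density
  c(mtry = grid$mtry[i], nodesize = grid$nodesize[i],
    MSE  = mean((obs - pred)^2),
    R2   = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2))
}
results <- as.data.frame(do.call(rbind, lapply(seq_len(nrow(grid)), eval_config)))

# Keep the configuration with the highest R2 and refit it as the final density model
best <- results[which.max(results$R2), ]
rf_density <- randomForest(shoot_density ~ ., data = train_set,
                           mtry = best$mtry, nodesize = best$nodesize,
                           ntree = ntree, importance = TRUE)
```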

The abovementioned RF training was performed for developing all the predictive models, i.e. (1) the shoot density (as shoots m−2) model, (2) the rhizome primary production (as g DW m−2 y−1) model based on known shoot density, and (3) the cascaded rhizome primary production (as g DW m−2 y−1) model based on predicted shoot density (see “Cascaded approach for modeling P. oceanica rhizome primary production” section).

Both model training and evaluation (see “Model evaluation” section) were performed in the R33 environment using the randomForest20 package, which implements the original RF algorithm developed by Breiman19.

Cascaded approach for modeling P. oceanica rhizome primary production

As previously noted, shoot density is one of the fundamental parameters in the estimation of rhizome primary production, based on the standardized approach proposed by Pergent-Martini et al.12.

Since data on shoot density are obtained through laborious, usually expensive and time-consuming field activities, we proposed a cascaded approach aimed at modeling the rhizome primary production of P. oceanica using predicted shoot density values, rather than observed ones. From a general perspective, the use of predicted values of shoot density could eliminate survey costs.

From a methodological perspective, predicted shoot density values are meant to be used at run time; thus, they were used when assessing the RF performance, while observed data were only used during the training procedure. In other words, predicted values of shoot density, provided by our predictive model, were included in the test set for evaluating the performance of the cascaded model of rhizome primary production, while the training procedure of the latter was carried out using data obtained from direct measurements, i.e. observed shoot density data.

The rationale behind this solution is based on practical as well as methodological reasons. In fact, it has to be considered that during the training phase the RF aims at detecting patterns in the data, learning how the predictive variables are related to the target. On the other hand, when the model is applied to the test set, its ‘learning ability’ is assessed, and the way the model is meant to be applied in practice, i.e. with no field measurements, must be taken into account.

Accordingly, the use of the observed data on P. oceanica shoot density in the training procedure allowed the RF to learn the underlying interactions between the predictive variables, including shoot density, and the target, i.e. rhizome primary production. Conversely, the use of predicted data during the model evaluation allowed testing the capability of the RF to model the multifaceted relationships between predictive variables and target, including the density-productivity ones.
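A minimal sketch of this cascaded training/evaluation scheme is reported below, reusing the hypothetical 'rf_density' model from the previous sketch and hypothetical data frames 'train_prod'/'test_prod' holding the 18 environmental predictors, the observed shoot density and the response 'rhizome_production'; the tuning values are purely illustrative and would in practice come from the same grid search.

```r
# Training: the rhizome production model learns from OBSERVED shoot density values
rf_production <- randomForest(rhizome_production ~ ., data = train_prod,
                              mtry = 6, nodesize = 5, ntree = 1000)  # illustrative tuning

# Evaluation: observed shoot density in the test set is replaced by the values
# PREDICTED by the shoot density model, mimicking run-time conditions
test_cascaded <- test_prod
test_cascaded$shoot_density <- predict(rf_density, newdata = test_prod)
pred_production <- predict(rf_production, newdata = test_cascaded)
```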

Model evaluation

As previously noted, the models’ performance was evaluated using the data included in the test set, i.e. those never seen by the RF during its training. The performance of each model was evaluated by computing the coefficient of determination (R2), which measures the proportion of target variance explained by the model, and the Mean Squared Error (MSE). The final models for P. oceanica shoot density and rhizome primary production were selected on the basis of the R2 value, i.e. the models showing the best predictive ability (maximum R2 value) were selected as the final ones.
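For reference, the two metrics can be computed as follows for vectors of observed ('obs') and predicted ('pred') values on the test set; these are the same formulas used in the tuning sketch above.

```r
mse <- mean((obs - pred)^2)                                # Mean Squared Error
r2  <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)  # proportion of variance explained
```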

Afterwards, for developing the cascaded model of P. oceanica rhizome primary production, the predicted values provided by the best-performing shoot density model, i.e. the most accurate one, were included among the test set data of the cascaded model. The final cascaded primary production model of P. oceanica rhizomes was also chosen on the basis of the R2 value, thus selecting the RF showing the best performance.

Relative importance of predictive variables

The assessment of the relative importance of the predictive variables is performed during the RF training on the basis of a permutation procedure. The importance of any given predictive variable is estimated from the increase in the error rate observed when that predictive variable is randomly permuted19,21. The relative importance of the predictive variables is computed using the Out-Of-Bag (OOB) data, i.e. the records not included in the bootstrap sample used for the tree-building process. These OOB data are passed down the tree previously grown using the bootstrap sample, obtaining a first set of predictions. The OOB records are then passed down the same tree once more, with the values of each predictive variable, one at a time, randomly permuted, while those of the others are left unchanged. During this second step, new predictions for the modified OOB records are obtained, which are aggregated tree by tree as the forest is constructed. Finally, the overall deviation between the estimates provided by the original and the modified OOB records is computed and regarded as a measure of the relative importance of each predictive variable25,32.
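As a brief illustration, these permutation-based importance scores can be extracted from the randomForest fit obtained in the tuning sketch above (importance = TRUE enables the OOB permutation procedure at training time); for regression, type = 1 returns the mean increase in MSE observed when each predictor is permuted.

```r
var_imp <- importance(rf_density, type = 1)                      # %IncMSE per predictor
var_imp[order(var_imp[, 1], decreasing = TRUE), , drop = FALSE]  # most important first
varImpPlot(rf_density)                                           # optional graphical summary
```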

