Study site
The study site was the Kamigamo Experimental Station of Kyoto University, located in a suburban area of Kyoto, Japan (Supplementary Figure S1). The site lies in a warm, humid climate zone at an elevation of 109–225 m above sea level; the mean annual precipitation and temperature are 1582 mm and 14.6 °C, respectively. The overall area is 46.8 ha, of which 65% is naturally regenerated forest consisting primarily of Japanese cypress (Chamaecyparis obtusa) and some broad-leaved trees such as oaks (Quercus serrata and Quercus glauca). A further 28% is planted forest, mainly of non-native coniferous species, and the remaining 7% consists of sample gardens, nurseries, and buildings.
In this work, we focused on the northern part (11 ha) of the Kamigamo Experimental Station, which contains a naturally regenerated forest of Japanese cypress and a managed forest of Metasequoia (Metasequoia glyptostroboides), eastern white pine (Pinus strobus), slash pine (Pinus elliottii), and loblolly pine (Pinus taeda).
Remote sensing data
Flight campaigns were conducted around noon in two seasons: on October 2, 2016, at the end of the green leaf season, and on November 20, 2016, at the peak of the autumn leaf colouring season. We used a DJI Phantom 4 UAV (DJI, Shenzhen, China), whose onboard camera has a 1/2.3-inch CMOS sensor that captures RGB spectral information. The UAV was operated automatically using the DroneDeploy v2.66 application (https://www.dronedeploy.com, Infatics Inc., San Francisco, United States). On October 2, the overlap and sidelap were both set to 75% and the flight height to 80 m above the take-off ground level. However, some of the resulting images could not be aligned; we therefore changed the overlap to 80% and the flight height to 100 m on November 20. We used 10 ground-control points (GCPs) to reduce the positional error between the GPS coordinates and the images. From the UAV images, we produced an orthomosaic photo and a digital surface model (DSM) using the Agisoft PhotoScan Professional v1.3.4 software (https://www.agisoft.com, Agisoft LLC, St. Petersburg, Russia). An orthomosaic photo is an image composed of multiple overhead images corrected for perspective and scale. The parameter settings used to generate the orthomosaic photo are shown in Supplementary Table S1; these parameters are for November 20, and the parameters for October 2 differ only in that the ground sampling distance (GSD) was approximately one centimetre. For November 20, the GSD of the orthomosaic photo and the DSM was approximately 5 cm and 10 cm, respectively.
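As a rough cross-check of the flight heights against the reported resolutions, the standard ground sampling distance relationship can be evaluated. The sketch below uses nominal Phantom 4 camera values (sensor width, image width, and focal length) that are assumptions not taken from the text; the GSDs reported above come from the photogrammetric processing and need not match these nominal estimates exactly.

```python
# Rough ground sampling distance (GSD) check for the two flight heights.
# The camera specifications below are nominal, assumed values for the
# Phantom 4 and are not taken from the text.

SENSOR_WIDTH_MM = 6.17   # assumed 1/2.3-inch sensor width
IMAGE_WIDTH_PX = 4000    # assumed native image width
FOCAL_LENGTH_MM = 3.61   # assumed focal length

def gsd_cm(flight_height_m: float) -> float:
    """Return the approximate ground sampling distance in cm per pixel."""
    gsd_m = (SENSOR_WIDTH_MM * flight_height_m) / (FOCAL_LENGTH_MM * IMAGE_WIDTH_PX)
    return gsd_m * 100.0

for height in (80.0, 100.0):
    print(f"{height:.0f} m flight height -> ~{gsd_cm(height):.1f} cm/px")
```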
Segmentation and preparation of supervised data
The technological workflow of the individual tree image segmentation and extraction method we used is summarised in Fig. 1. First, we segmented each tree crown using the UAV image (orthomosaic photo), the DSM, and a slope model. Second, we visually constructed a ground truth map. Third, we extracted each tree image with its ground truth label. Further details are given in the sections from “Object-based tree crown segmentation” to “Tree image extraction with ground truth label”.
Workflow for supervised image extraction.
Object-based tree crown segmentation
At the segmentation stage, we delineated individual tree crowns. First, we constructed a slope model by calculating the slope from the DSM using the ArcGIS Desktop v10.4 software (https://www.esri.com, Environmental Systems Research Institute, Inc., Redlands, United States). The slope model represents the maximum rate of elevation change between each cell and its neighbours, so that the borders between trees are emphasised. From the orthomosaic photo, the DSM, and the slope model, tree crown segmentation was performed in the eCognition Developer v9.0.0 software (https://www.trimble.com, Trimble, Inc., Sunnyvale, United States) using the ‘Multiresolution Segmentation’ algorithm36. The parameter values were adjusted by trial and error. The tree crown map produced by this segmentation is shown in Fig. 2, with enlarged images for visual confirmation of the result, and the best parameters are presented in Supplementary Table S2.
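The slope model itself was produced with the ArcGIS slope tool; purely as an illustration, the following sketch reproduces the same idea (the maximum rate of elevation change between a cell and its neighbours) with NumPy using Horn's finite differences, assuming the DSM is available as a 2-D array with a known cell size.

```python
import numpy as np

def slope_degrees(dsm: np.ndarray, cell_size: float) -> np.ndarray:
    """Approximate GIS-style slope (Horn's method) from a DSM array.

    dsm: 2-D array of surface elevations (m); cell_size: pixel size (m).
    Returns slope in degrees, emphasising abrupt height changes at crown edges.
    """
    # Pad the edges so every cell has a full 3 x 3 neighbourhood.
    z = np.pad(dsm, 1, mode="edge")
    a, b, c = z[:-2, :-2], z[:-2, 1:-1], z[:-2, 2:]
    d, f = z[1:-1, :-2], z[1:-1, 2:]
    g, h, i = z[2:, :-2], z[2:, 1:-1], z[2:, 2:]

    # Horn (1981) finite differences, as used by common GIS slope tools.
    dz_dx = ((c + 2 * f + i) - (a + 2 * d + g)) / (8.0 * cell_size)
    dz_dy = ((g + 2 * h + i) - (a + 2 * b + c)) / (8.0 * cell_size)
    return np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))
```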
Whole-area and representative enlarged tree crown map. The blue lines show the boundaries of the segmented polygons. The white rectangle shows the location of the enlarged area, and the light blue polygons were used for evaluating the accuracy of tree segmentation. This map was constructed via multiresolution segmentation using colour, the DSM, and the slope model. This figure was created using ArcGIS Desktop v10.6 software (https://www.esri.com, Environmental Systems Research Institute, Inc., Redlands, United States).
We then evaluated the accuracy of the segmentation. Each segmented crown was placed into one of the following five categories according to its spatial relationship with the visually confirmed reference crown. The five categories, defined following a previous study37 and illustrated in Supplementary Figure S2, are as follows (a minimal sketch of one possible implementation of these rules is given after the list).
(a) Matched: if the overlap between the segmented polygon and the reference crown was more than 80%, the segmented polygon was categorised as “Matched”.
(b) Nearly matched: if the overlap between the segmented polygon and the reference crown was 60–80%, the segmented polygon was categorised as “Nearly matched”.
(c) Split: if the overlap between the segmented polygon and the reference crown was 20–60%, the segmented polygon was categorised as “Split”.
(d) Merged: if multiple reference crowns were covered by the segmented polygon and at least one overlap was more than 20%, the segmented polygon was categorised as “Merged”. If the segmented polygon contained reference crowns of only one class, it was categorised as “one-class merged”; if it contained reference crowns of multiple classes, it was categorised as “multiple-class merged”.
(e) Fragmented: if one or more reference crowns were covered by the segmented polygon and their respective overlaps were all less than 20%, the segmented polygon was categorised as “Fragmented”.
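The sketch below shows one possible implementation of these rules using Shapely. It assumes that “overlap” is measured as the fraction of the reference crown area covered by the segmented polygon; distinguishing one-class from multiple-class merges would additionally require the class labels of the reference crowns.

```python
from shapely.geometry import Polygon

def overlap_fraction(segment, reference):
    """Fraction of the reference crown area covered by the segmented polygon."""
    return segment.intersection(reference).area / reference.area

def categorise(segment, reference_crowns):
    """Assign one of the five categories to a segmented polygon (sketch only)."""
    fractions = [overlap_fraction(segment, ref) for ref in reference_crowns]
    covered = [f for f in fractions if f > 0.20]
    if len(covered) > 1:
        return "Merged"       # one- vs multiple-class merged needs reference labels
    best = max(fractions, default=0.0)
    if best > 0.80:
        return "Matched"
    if best > 0.60:
        return "Nearly matched"
    if best > 0.20:
        return "Split"
    return "Fragmented"

# Example with hypothetical square crowns.
seg = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])
ref = Polygon([(1, 1), (9, 1), (9, 9), (1, 9)])
print(categorise(seg, [ref]))   # -> "Matched"
```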
We calculated the segmentation accuracy of trees in four areas (Areas 1–4). Area 1 was a deciduous coniferous forest and Area 2 was an eastern white pine forest; for these, we evaluated the entire area. Area 3 was a slash pine and loblolly pine forest, for which we evaluated part of the area. Area 4 was a naturally regenerated forest, for which we evaluated a 1-ha subset. Some segmented polygons contained multiple tree crowns, but the method largely succeeded in separating trees of each class (Table 1).
Ground truth label attachment to tree crown map
After segmentation, we classified the segmented images into the following seven classes: deciduous broad-leaved tree, deciduous coniferous tree, evergreen broad-leaved tree, Chamaecyparis obtusa, Pinus elliottii or Pinus taeda, Pinus strobus, and non-forest. The ‘non-forest’ class included understory vegetation and bare land, as well as artificial structures. To define these classes, we conducted a field survey: we established three rectangular plots of 30 m × 30 m and recorded the tree species, and we chose classes that could be identified from the November 20 drone images. The Pinus elliottii or Pinus taeda class combined two Pinus species because they are difficult to distinguish in drone images. At the ground-truth map-making phase, we visually attached a class label to each tree crown, using the nearest neighbour classification in the eCognition software, which has previously been used for forest mapping38, to improve operational efficiency (Fig. 3). More specifically, we chose some image objects as training samples and applied the algorithm to all tree crowns. In subsequent steps, we improved the accuracy of the ground truth map by adding wrongly classified objects, with their correct classes, to the training samples.
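The nearest neighbour classification and its refinement were carried out interactively in eCognition; purely as an illustration of the underlying idea, the hedged scikit-learn sketch below assigns each crown to the class of its nearest training sample in a hypothetical object-feature space and indicates where the iterative correction of training samples would enter. All feature values and labels here are dummy data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Hypothetical per-crown object features (e.g. mean R, G, B and mean crown height).
crown_features = rng.random((500, 4))

# A small set of visually labelled training samples (7 classes, dummy labels).
sample_idx = rng.choice(500, size=50, replace=False)
sample_labels = rng.integers(0, 7, size=50)

# Nearest neighbour rule: each crown takes the class of its closest sample.
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(crown_features[sample_idx], sample_labels)
predicted = clf.predict(crown_features)

# Iterative refinement: wrongly classified crowns identified by visual inspection
# would be appended to sample_idx / sample_labels with their correct class,
# and the classifier re-fitted until the ground truth map is satisfactory.
```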
Segmentation and ground truth map-making result. The tree classes in the image on the left are represented by the colours explained in the legend on the right. This figure was created using ArcGIS Desktop v10.6 software (https://www.esri.com, Environmental Systems Research Institute, Inc., Redlands, United States).
Tree image extraction with ground truth label
From the orthomosaic photos of the two seasons and the ground truth map, we extracted each tree image with its class label using the ‘Extract by Mask’ function in ArcGIS. Some images were inappropriate, such as fragments of trees, images that were difficult to interpret or classify visually, and images containing multiple classes; we therefore manually deleted inappropriate images and moved wrongly classified images into the correct class by group consensus. Representative images of the tree classes are shown in Figs. 4 and 5. The numbers of extracted and retained images are shown in Supplementary Table S3. After this screening, the number of images per class ranged from 37 to 416. The images varied widely in size, but the longest side of the largest image was approximately 400 pixels.
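The extraction was done with ArcGIS’s ‘Extract by Mask’; the following sketch shows an equivalent clipping step in Python with rasterio and fiona, assuming hypothetical file names and a ‘class’ attribute on the crown polygons (neither is specified in the text).

```python
import os

import fiona
import rasterio
from rasterio.mask import mask

os.makedirs("trees", exist_ok=True)

# Hypothetical inputs: labelled crown polygons and one season's orthomosaic.
with fiona.open("tree_crowns_with_labels.shp") as crowns, \
        rasterio.open("orthomosaic_nov20.tif") as ortho:
    for idx, crown in enumerate(crowns):
        label = crown["properties"]["class"]                 # assumed attribute name
        clipped, transform = mask(ortho, [crown["geometry"]], crop=True)
        profile = ortho.profile.copy()
        profile.update(height=clipped.shape[1], width=clipped.shape[2],
                       transform=transform)
        with rasterio.open(f"trees/{label}_{idx}.tif", "w", **profile) as dst:
            dst.write(clipped)
```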
Representative extracted images from each class in the November 20 images. These images were well segmented at the individual tree crown level; however, the Pinus strobus image includes several trees. The non-forest image shows the roof of a house.
Representative extracted images from each class in the October 2 images. These images were extracted from the same tree crown map polygons as the November 20 images.
After extraction, we resized the October 2 images to the size of the corresponding November 20 images in order to align the conditions of the two seasons. Thus, all images were adjusted to the size corresponding to a flight height of approximately 100 m.
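A minimal sketch of this resizing step with Pillow is shown below; the file names are hypothetical, and each October image is assumed to have a November counterpart extracted from the same crown polygon.

```python
from PIL import Image

# Hypothetical file pair: the same crown polygon extracted from both seasons.
img_oct = Image.open("trees_oct02/pinus_strobus_012.png")
img_nov = Image.open("trees_nov20/pinus_strobus_012.png")

# Resize the October image (80 m flight, finer GSD) to the November image size
# so that both seasons share the ~100 m-flight pixel dimensions.
img_oct_resized = img_oct.resize(img_nov.size, resample=Image.BILINEAR)
img_oct_resized.save("trees_oct02_resized/pinus_strobus_012.png")
```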
Machine learning
To construct a model for object identification, we used the publicly available PyTorch v0.4.1 package39 as the deep learning framework and four standard neural network models, namely AlexNet23, VGG1640, ResNet18, and ResNet15241, for fine-tuning. Fine-tuning is an effective method for improving learning performance, especially when the amount of data is insufficient for training from scratch42. We took each neural network model pre-trained on the ImageNet dataset43 and trained all of its layers on our data. At the CNN training phase, we augmented the training images eight-fold by flipping and rotating them; further augmentation did not improve accuracy. For the input to the CNN, we applied a ‘random resized crop’ to 224 × 224 pixels for training, which crops the given image to a random size and aspect ratio. For validation and testing, we resized the images to 256 × 256 pixels and applied a ‘centre crop’ of 224 × 224 pixels. These cropping operations extracted a single resized patch from each image. The ranges of the other learning settings are outlined in Supplementary Table S4.
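The sketch below is not the authors’ training script but illustrates the described setup under stated assumptions: an ImageNet-pretrained ResNet18 with all layers trainable, the training-time random resized crop, and the resize-plus-centre-crop pipeline for validation and testing. The deterministic eight-fold flip/rotation augmentation is approximated here by random flips, and the data directory layout, batch size, optimiser, and learning rate are assumptions (the actual ranges are in Supplementary Table S4).

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

NUM_CLASSES = 7  # the seven tree/non-forest classes described above

# Training pipeline: flip augmentation plus a random resized crop to 224 x 224.
train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])

# Validation/testing pipeline: resize to 256 x 256, then a single centre crop.
eval_tf = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Fine-tuning: start from ImageNet weights and train all layers on our data.
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Hypothetical directory layout: one sub-folder per class label.
train_set = datasets.ImageFolder("data/train", transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

model.train()
for images, labels in loader:          # one epoch shown; assumed hyperparameters
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```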
To benchmark the performance of the CNNs, we used a support vector machine (SVM) as a comparative machine learning method. As features, we used the mean and standard deviation of each band together with grey-level co-occurrence matrix (GLCM) texture values. The GLCM is a spatial co-occurrence matrix that tabulates the relationships between pixel values and is used to compute texture statistics44. Computing the GLCM for images with a large bit depth is computationally expensive; in our case, the images converted to grey scale were 8-bit data. Because reducing the bit depth is known to cause only a minor decrease in classification accuracy, we rescaled the images from 8-bit to 5-bit45,46. From the GLCM, we extracted five texture features: angular second moment (ASM), contrast, dissimilarity, entropy, and homogeneity. These features are defined in Eqs. (1)–(5):
$$ASM=\sum_{i,j}^{N}\left(P_{i,j}\right)^{2}$$
(1)
$$Contrast=\sum_{i,j}^{N}P_{i,j}\left(i-j\right)^{2}$$
(2)
$$Dissimilarity=\sum_{i,j}^{N}P_{i,j}\left|i-j\right|$$
(3)
$$Entropy=-\sum_{i,j}^{N}P_{i,j}\log\left(P_{i,j}\right)$$
(4)
$$Homogeneity=\sum_{i,j}^{N}P_{i,j}/\left(1+\left(i-j\right)^{2}\right)$$
(5)
where $P_{i,j}$ is the value of the normalised GLCM at row $i$ and column $j$. We computed these GLCM texture features at each pixel, excluding pixels close to the image margin, and then calculated their mean and standard deviation for each image. Another important parameter that affects classification performance is the kernel size47,48. To determine the most suitable kernel size for the GLCM operation, we calculated the GLCM texture features with kernel sizes of 3, 11, 19, 27, 35, 43, 51, and 59. For SVM validation, we used a radial basis function (RBF) kernel and conducted a parameter grid search over gamma from $10^{-1}$ to $10^{-5}$ and cost from 1 to $10^{5}$. From the grid search, we obtained the best validation accuracy and the best parameters at each GLCM kernel size (Supplementary Figure S3). The validation accuracy increased slightly with kernel size and stopped increasing at the 51 × 51 kernel size. Considering this result, we adopted the 51 × 51 kernel size and the corresponding best parameters: gamma and cost were $10^{-2}$ and $10^{3}$ in the fall peak season, and $10^{-3}$ and $10^{4}$ in the green leaf season, respectively. We then used these parameters for SVM learning and the comparative evaluation.
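For illustration, the sketch below computes the five GLCM features with scikit-image after requantising to 5 bits and runs the RBF-kernel grid search with scikit-learn. It simplifies the procedure described above by computing a single GLCM per image rather than per-pixel statistics over a 51 × 51 moving window, and the image data and labels are dummy placeholders; scikit-image has no built-in entropy property, so Eq. (4) is computed directly.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def glcm_features(gray_8bit):
    """Five GLCM texture features for one grey-scale image (simplified to a
    single GLCM per image instead of 51 x 51 moving-window statistics)."""
    img5 = (gray_8bit // 8).astype(np.uint8)          # requantise 8-bit -> 5-bit (32 levels)
    glcm = graycomatrix(img5, distances=[1], angles=[0],
                        levels=32, symmetric=True, normed=True)
    p = glcm[:, :, 0, 0]
    asm = graycoprops(glcm, "ASM")[0, 0]                      # Eq. (1)
    contrast = graycoprops(glcm, "contrast")[0, 0]            # Eq. (2)
    dissimilarity = graycoprops(glcm, "dissimilarity")[0, 0]  # Eq. (3)
    entropy = -np.sum(p * np.log(p + 1e-12))                  # Eq. (4)
    homogeneity = graycoprops(glcm, "homogeneity")[0, 0]      # Eq. (5)
    return np.array([asm, contrast, dissimilarity, entropy, homogeneity])

# Dummy grey-scale images and labels standing in for the real per-image features.
rng = np.random.default_rng(0)
gray_images = [rng.integers(0, 256, size=(64, 64), dtype=np.uint8) for _ in range(140)]
labels = np.arange(140) % 7                           # 7 classes, 20 images each

X = np.vstack([glcm_features(img) for img in gray_images])
grid = {"gamma": [10.0 ** e for e in range(-5, 0)],   # 1e-5 ... 1e-1
        "C": [10.0 ** e for e in range(0, 6)]}        # 1 ... 1e5
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=4)
search.fit(X, labels)
print(search.best_params_, search.best_score_)
```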
For machine learning, we divided the data into training, validation, and testing sets. The validation dataset was used for tuning hyperparameters such as the learning rate and batch size for deep learning, and the kernel size, cost, and gamma values for the SVM. In the testing phase, we used data that had not been used for training or parameter tuning. Validation accuracy is not suitable for comparing performance as a final result, because the hyperparameters are tuned to maximise accuracy on the validation data, so validation accuracy can exceed testing accuracy; using testing data removes this parameter-tuning bias. We also used a form of cross-validation because the amount of data was limited and we wanted to reduce chance variation in the accuracy estimates. Specifically, we randomly divided all images evenly into four datasets and used two of them for training, one for validation, and one for testing. We then successively interchanged the datasets used for training, validation, and testing, repeating the process four times. For the accuracy evaluation and confusion matrix, we used the total accuracy over all the images.
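One possible reading of this rotation scheme is sketched below; the exact assignment of the four parts to roles in each repetition is an assumption, since the text only states that the roles were interchanged successively.

```python
import numpy as np

def four_fold_rotation(n_images: int, seed: int = 0):
    """Yield (train, val, test) index arrays for the four-part rotation scheme:
    the images are split evenly into four parts; two parts train, one validates,
    one tests, and the roles rotate across the four repetitions."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(n_images), 4)
    for r in range(4):
        train = np.concatenate([parts[r], parts[(r + 1) % 4]])
        val = parts[(r + 2) % 4]
        test = parts[(r + 3) % 4]
        yield train, val, test

for train_idx, val_idx, test_idx in four_fold_rotation(100):
    print(len(train_idx), len(val_idx), len(test_idx))
```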
For these computations, we used a built-to-order (BTO) desktop computer with an Intel Xeon E5-2640 CPU, 32 GB of RAM, and an NVIDIA GeForce GTX 1080 graphics card, running Ubuntu 16.04.
Evaluation
For evaluation, we used the overall accuracy, Cohen’s Kappa coefficient49, and the macro-averaged F1 score. The F1 score is the harmonic mean of precision and recall. In this study, the number of images per class varied considerably, and the overall accuracy, which is typically used to evaluate machine learning performance, is sensitive to such class imbalance. Therefore, in addition to the overall accuracy, we used the Kappa coefficient and the F1 score, which are suitable for evaluating accuracy on imbalanced datasets, to obtain an objective evaluation. Additionally, we used the per-class F1 score to evaluate the accuracy of each class.
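These metrics correspond directly to standard scikit-learn functions; the sketch below shows how they would be computed from pooled test labels and predictions (the label vectors here are placeholders).

```python
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             confusion_matrix, f1_score)

# y_true / y_pred would be the pooled labels and predictions over all four
# test folds; the values below are placeholders for the seven classes.
y_true = [0, 1, 2, 2, 3, 4, 5, 6, 6, 1]
y_pred = [0, 1, 2, 3, 3, 4, 5, 6, 2, 1]

overall_accuracy = accuracy_score(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")     # mean of per-class F1
per_class_f1 = f1_score(y_true, y_pred, average=None)    # one F1 score per class
cm = confusion_matrix(y_true, y_pred)
```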