Sentinel2GlobalLULC: A Sentinel-2 RGB image tile dataset for global land use/cover mapping with deep learning

To build Sentinel2GlobalLULC, we followed two main steps. First, we established a spatio-temporal consensus between 15 global LULC products for 29 LULC classes. Then, we extracted the maximum number of Sentinel-2 RGB images representing each class. Each image is a tile that has 224 × 224 pixels at 10 × 10 m spatial resolution and was built as a cloud-free composite from all the Sentinel-2 images acquired between June 2015 and October 2020. Both tasks were implemented using GEE, an efficient programming, processing and visualisation platform that allowed us to have free manipulation and access to all used LULC products and Sentinel-2 imagery, simultaneously.

Finding spatio-temporal agreement across 15 global LULC products

To establish the spatio-temporal consensus between different LULC products for each one of the 29 LULC classes, we followed four steps: (1)Identification of the LULC products to be used in the consensus, (2)Standardization and harmonization of the LULC legend that was subsequently used to annotate the image tiles, (3)Spatio-temporal aggregation across LULC products, and (4)Spatial reprojection and tile selection based on optimized spatial purity thresholds.

Global LULC products selection

The adopted purity measure for spatio-temporal agreement across the 15 global LULC products we selected from GEE (Table 2) aims to find areas of high consensus to maximize the annotation quality. Spatial and temporal consensus across such rich diversity of LULC products, in terms of spatial resolution, time coverage, satellite source, LULC classes and accuracy, was used as a source of robustness for our subsequent LULC annotation. Products outside GEE were not used due to computing limitations.

Table 2 Main characteristics of the 15 global Land-Use and Land-Cover (LULC) products available in Google Earth Engine (GEE) that were combined to find consensus in the global distribution of 29 main LULC classes.

Full size table

Standardization and Harmonization of LULC legends

Land cover (LC) data describes the main type of natural ecosystem that occupies an area; either by vegetation types such as shrublands, grasslands and forests, or by other biophysical classes such as permanent snow, bare land and water bodies. Land use (LU) includes the way in which humans modify or exploit an area, such as urban areas or agricultural fields.

To build our 29 LULC classes nomenclature, we established a standardization and harmonization approach based on expert knowledge. During this process, we took into account both the needs of different practitioners in the global and regional LULC mapping field and the thematic resolution of the global LULC legends available in GEE. Our nomenclature consists of 23 LC and 6 LU distinct classes identified through specific consensus rules across 15 LULC products (see Table 4). A six-level (L0 to L5) hierarchical structure was adopted in the creation of these 29 LULC classes (Fig. 2). To facilitate the inter-operability of our 29 legends at the finest level L5 across all LULC products and with the widely used FAO’s hierarchical Land Cover Classification System (LCCS)¹, we have established an LULC classification system where the 29 classes can be mapped directly to FAO’s LCCS as explained in the table of Supplementary File 1. The LC part in our dataset contains 20 terrestrial ecosystems and 3 aquatic ecosystems. The terrestrial systems are: Barren lands, Grasslands, Permanent snow, Moss and Lichen lands, Close shrublands, Open shrublands, in addition to 12 Forests classes that differed in their tree cover, phenology, and leaf type. The aquatic classes are: Marine water bodies, Continental water bodies, and Wetlands; furthermore, wetlands were divided into 3 classes: Marshlands, Mangroves and Swamps. The LU part is composed of urban areas and 5 coarse cropland types that differed in their irrigation regime and leaf type. In Table 3, you can find the semantic definition of each one of the 29 classes in Sentinel2GlobalLULC. We provided a table in Supplementary File 2, for a more detailed definition of each LULC class.

Fig. 2

Tree representation of the six-level (L0 to L5) hierarchical structure of the Land-Use and Land-Cover (LULC) classes contained in the Sentinel2GlobalLULC dataset. Outter circular leafs represent the final or most detailed 29 LULC classes (C1 to C29) of level L5. The followed path to define each class is represented through inner ellipses that contain the names of intermediate classes at different levels between the division of the Earth’s surface (square) into LU and LC (level L0) and the final class circle (level L5). All LULC classes belong to three levels at least, except the 12 forest classes that belong to L5 only.

Full size image

Table 3 Semantic signification of each one of the 29 Land Use and Land Cover (LULC) classes contained in the Sentinel2GlobalLULC dataset according to the six-level (L0 to L5) hierarchical structure.

Full size table

Combining products across time and space

For each one of the 29 LULC classes, we combined in space and time the global LULC information among the 15 GEE LULC products. This way, each image was annotated with a LULC class only if all combined products agreed in its corresponding tile (i.e., 100% of agreement in space and time). For each product and LULC type, we first set one or more criteria to create a global mask at the native resolution of the product in which each pixel was classified as 1 or 0 depending on whether it met the criteria for belonging to that LULC type or not, respectively (see first stage in Table 4). For certain LULC classes, some products did not provide any relevant information, so they were not used. For example (Table 4), in Grasslands (C3), Open Shrublands (C4) and Close Shrublands (C5), we combined 14 products, while in UrbanBlUpArea (C29) and Permanent Snow (C23) we only combined 10 and 7 products, respectively.

Table 4 First stage of the rule set criteria used to find consensus across the 15 Land-Use and Land-Cover (LULC) products available in Google Earth Engine (GEE) for each of the 29 LULC classes contained in the Sentinel2GlobalLULC dataset.

Full size table

Then (see second stage in Table 5), for each LULC type, we calculated the average of all the masks obtained from each product to create a final global probability map from all products with values ranging between 0 and 1. Value 1 meant that all products agreed to assign that pixel to a particular LULC class, while 0 meant that none of the products assigned it to that particular class (Fig. 3). These 0-to-1 values are interpreted as the spatio-temporal purity level of each pixel to belong to a particular LULC class and are provided as metadata with each image.

Table 5 Second stage of the rule set criteria used to find consensus across the 15 Land-Use and Land-Cover (LULC) products available in Google Earth Engine (GEE) for each of the 29 LULC classes contained in the Sentinel2GlobalLULC dataset.

Full size table

Fig. 3

Example of the process of building the final global probability map for one of the 29 Land-Use and Land-Cover (LULC) classes (e.g. C1: “Barren”) by means of spatio-temporal agreement of the 15 LULC products available in Google Earth Engine (GEE). The final map is normalized to values between 0 (white, i.e., areas with no presence of C1 in any product) and 1 (black spots, i.e., areas containing or compatible with the presence of C1 in all 15 products), whereas the shades of grey corresponds to the values in between (i.e., areas that did not contain or were not compatible with the presence of C1 in some of the products). This process is divided into two stages: the first stage (the blue part, see details in Table 4) and the second stage (the yellow part, see details in Table 5). LULC products available for several years are represented with superposed rectangles, while single year products are represented with single rectangles. GMP: global probability map, NA: Not Available.

Full size image

As an example of the first stage (see details in Table 4), to specify if a given pixel belongs to Dense Evergreen Needleleaf Forest, we evaluated its tree cover level using “ ≤ “ and “ ≥ “, while for bands containing the leaf type information, we used the equal operator “ = “. For the spatio-temporal combination of multiple criteria we have used the following operators: “AND”,“OR” and “ADD”. For example, we combined the tree cover percentage criteria with the leaf type criteria using “AND” to select forest pixels that met both conditions. To combine many years instances of the same product, we used “ADD”, except for product P13, where we used “AND” to identify permanent water areas only. Whenever we used the “ADD” operator, we normalized pixel values afterwards to bring it back to a probability interval between 0 and 1 using the division by the total number of combined years or criteria.

In the second stage (see details in Table 5), we combined for each LULC class the 15 global probability maps previously derived from each product to create a final global probability map (Fig. 3). This combination was carried out using various operators such as “ADD”, “MULTIPLY” and “OR”, depending on the LULC type. When “ADD” was used, the final pixel values were normalized by dividing the final addition value of each pixel by the total number of added products. The “MULTIPLY” operator was mostly used at the end, to remove urban areas from non-urban LULC classes, or to remove water from non-water LULCs. The multiplication operator was also adopted to make sure that a certain criteria was respected in the final probability map. For instance, for the swamp class, we multiplied all pixels in the final stage by a water mask where saline water areas have a value of 0 in order to eliminate mangrove from swamp pixels and vice versa. Finally, we used “OR” operator between different water related products to take advantage of the fact that they complement each other in terms of spatial-temporal coverage and accuracy.

In GEE, when two products are aggregated using “ADD”, “MULTIPLY” or any other operator, the output is aggregated at the spatial resolution of the product at the left of the operator. Hence, to maintain the finest spatial resolution in the final probability map, we multiplied everything by product P15 and placed it at the left of the final “MULTIPLY” operation (See Table 5). Hence, all the 29 final probability maps were generated at the P15 spatial resolution of 30 m/pixel (except the urban class C29 which maintained the 30 m/pixel resolution of product P14).

Re-projection and Selection of purity threshold

Since our objective was finding pure Sentinel-2 image tiles of 224 × 224 10-m pixels representing each LULC class, we reprojected the 30 m/pixel probability maps to 2240 m/pixel using the spatial mean reducer in GEE. That is, each pixel value at 2240 m resolution was computed using the mean over all the 30m-pixel values contained within it. Hence, the resulting pixel values at 2240 m resolution represent the purity level that each Sentinel-2 image tile of 224 × 224 10-m pixels has. We illustrated the reprojection and selection processes in Fig. 4.

Fig. 4

Example of the workflow to obtain a Sentinel-2 image tile of 2240 × 2240 m for one of the 29 Land-Use and Land-Cover (LULC) classes (e.g. C1: “Barren”). The process starts with the reprojected final global probability map obtained from stage two (Table 5) and ends with its exportation to the repository of a Sentinel-2 image tile of 224 × 224 pixels. The white rectangle is the only one having a probability value of 1 (Recall that the purity threshold used for Barren was 1, i.e., 100%). The black pixels has a null probability value, while the probability values between 0 and 1 are represented in gray scale levels.

Full size image

For each one of the reprojected maps, we defined a pixel value threshold to decide whether a given 2240 × 2240 m tile was representative of each LULC class or not. Since training DL image classification models needs a large number of high quality (both in terms of image quality and annotation quality) image tiles to reach a good accuracy, when the spatial purity of 100% (full agreement across products in all the pixels of the 224 × 224 tile) resulted in a small number of agreement tiles for a particular class, the purity threshold was decreased for that class until the number of tiles was larger than 1000 or further decreased in less abundant classes to a minimum of 75% of purity. The found purity value is always provided as metadata for each image in the dataset, so the user can always restrict its analysis to those image tiles and classes at any desired purity level. Decreasing the purity threshold down to 75% for the less abundant classes (e.g swamp, mangrove, etc.) was a trade-off between maintaining a good data annotation quality and providing a sufficient number of tiles in each class. In Table 6, we present the number of agreement tiles found at different purity thresholds ranging from 75% to 100% for each LULC class. This spatial purity was not further decreased since machine learning image classification models are known to be robust when the target class is spatially dominant in each training image (it occupies more than 60% of the pixels in the scene)⁴². On the other hand, when the number of pure tiles for a LULC class was too large to be downloaded (i.e., greater than 14000), we applied a selection algorithm as described in the Supplementary File 3, to download a maximum of 14000 spatially representative images. For this, the world was divided into a one-degree squared cell grid. If a cell contained less than 50 image tiles, we selected them all. If it contained more than 50, we applied that automatic maximum geographic distance algorithm that selected images as far from each other as possible in a number proportional to the number of existing images in that cell. The map in Fig. 6 shows the global distribution of the selected 194877 image tiles contained in Sentinel2GlobalLULC and distributed in 29 LULC classes.

Table 6 Summary of the varying number of found and eventually selected Sentinel-2 image tiles of 224 × 224 pixels depending on the different consensus level reached across the 15 Land-Use and Land-Cover (LULC) products available in Google Earth Engine (GEE) for each of the 29 LULC classes contained in the Sentinel2GlobalLULC dataset.

Full size table

Data extraction

Sentinel2GlobalLULC provides the user with two types of data: Sentinel-2 RGB images (jpeg and geotif versions) and CSV files with associated metadata. In the following subsections, we describe the process for associating metadata, including the Global Human Modification (GHM) index.

Global human modification index extraction

As an additional metadata related to the level of human influence in each image, we calculated for each tile in GEE, the spatial mean of the global human modification index for terrestrial lands⁴³, where 0 means no human modification and 1 means complete transformation. Since the original GHM product was mapped at 1 × 1 km resolution, we reprojected it to 2240 × 2240 m using the same reprojection procedure explained in (Re-projection and Selection of purity threshold).

CSV files generation

Once the tiles were selected, for each LULC class we listed the image tiles in descendent order of purity. Metadata included: geographical coordinates of each tile centroid, tile purity value, name and ID of the LULC class, and average GHM index for that tile. Then, we used the geographical coordinates of each tile to identify its exact administrative address geolocation. To implement this reverse geo-referencing operation, we used a free request-unlimited python module called reverse_geocoder. This way, we assigned a country code, two levels of administrative departments, and the locality to each tile.

For LULC classes that had more than 14000 pure tiles, we have released the coordinates before and after the distance-based selection in case the user wants to download more tiles or use our consensus coordinates for other purposes.

Sentinel-2 RGB images exportation

After extracting all these pieces of information and grouping them into CSV files, we went back to the geographic center coordinates of each tile and used them to extract the corresponding 224 × 224 Sentinel-2 RGB tiles using GEE. Each exported image was identical to the 2240 × 2240 m area covered by its Sentinel-2 tile.

We chose “Sentinel-2 MSI (Multi-Spectral Instrument) product” since it is free and publicly available in GEE at the fine resolution of 10 × 10 m. We chose “Level-1C” (i.e., top-of-atmosphere reflectance) since it provides the longest data availability of Sentinel-2 images without any modification of the data. To build RGB images, we extracted the three bands B4, B3 and B2 that correspond to Red, Green and Blue channels, respectively. More bands available in Sentinel-2 or even in Sentinel-1 images can be incorporated in the future to our dataset. However, computational limitations (i.e., the size of the dataset would be impractical) did not allowed us to handle it as a first goal. In addition, the spatial resolution of the images would be heterogeneous across bands.

To minimize the inherent noise due to atmospheric conditions (e.g. clouds, aerosols, smoke, etc.) that could affect the satellite RGB images, every image was built as a temporal aggregation of all images gathered by Sentinel-2 satellites between June 2015 and October 2020. During this aggregation, only the highest quality images in the corresponding image collection were considered, as we firstly discarded all image instances where the cloud probability exceeded 20% according to the metadata provided in their corresponding Sentinel-2 collection. Then, we calculated the 25th-percentile value between all remaining images for each reflectance band (R, G, and B), and built the final image with the obtained 25-percentile values in each pixel for its RGB bands. The 25th-percentile choice was adopted giving its suitability in atmospheric noise reduction^{44,45,46,47,48}.

Usually, Sentinel-2 MSI product includes true colour images in JPEG2000 format, except for the “Level-1C” collection used here. The three original bands (B4, B3, and B2) required a saturation mapping of their reflectance values into 0–255 RGB digital values. Thus, we mapped the saturation reflectance of 3558 into 255 to obtain true RGB channels with digital values between 0 and 255. The choice of these mapping numbers was taken from the Sentinel-2 true colour image recommendations section of Sentinel user guidelines. Finally, after exporting the selected tiles for each LULC class as “.tif” images, we converted them into “.jpeg” format using a lossless conversion algorithm.

Technical implementation

To implement all our methodology steps, we first created a javascript in GEE for each LULC class. Each script is a multi-task javascript where we implemented a switch command to control which task we want to execute (between the spatio-temporal aggregation task, the spatial reprojection and tiles selection task, or the data exportation task). In each one of these scripts, we selected from GEE LULC datasets repository the 15 LULC products used to build the consensus of that LULC class. Each script was responsible of elaborating the spatio-temporal combination of the selected products and generating the final consensus map for that LULC class as described in the subsection “Combining products across time and space”. Then, it exports the final global probability map as an asset into GEE server storage to make its reprojection faster. In the same script, once the consensus map exportation was done, we imported it from the GEE assets storage and reprojected it to 2240 × 2240 m resolution; then, we exported the new reprojected map into GEE assets storage again to make its analysis and processing faster. Afterwards, we imported the reprojected map into the same script and applied different processing tasks. During this processing phase, many purity threshold values were evaluated. Then, we elaborated in this same script the pure tiles identification and their center coordinates exportation into a CSV file. A distinct GEE script was developed to import, reproject and export the global GHM map. The resulted GHM map was saved as an asset too, then imported and used in each one of the 29 LULC multi-task scripts.

A python script was developed separately to read the exported CSV files for each LULC class and apply the reverse geo-referencing on their pure tiles coordinates then add the found geolocalization data (country code, locality…etc) to the original CSV files as new columns. Then, another python script was implemented to read the new resulted CSV files with all their added columns (reverse geo-referencing data, GHM data) and use the center coordinates of each pure tile in that class to export first its corresponding Sentinel-2 satellite geotiff image within GEE through the python API. Finally, after downloading all the selected geotiff images from our Google drive, we created another python script to convert these geotiff images into JPEG format.

Source: Ecology - nature.com

Sentinel2GlobalLULC: A Sentinel-2 RGB image tile dataset for global land use/cover mapping with deep learning

Finding spatio-temporal agreement across 15 global LULC products

Global LULC products selection

Standardization and Harmonization of LULC legends

Combining products across time and space

Re-projection and Selection of purity threshold

Data extraction

Global human modification index extraction

CSV files generation

Sentinel-2 RGB images exportation

Technical implementation

The expanding value of long-term studies of individuals in the wild

Advancing the energy transition amidst global crises

ITALIAN LANGUAGE

ENGLISH LANGUAGE