Statistically enriched geospatial datasets of Brazilian municipalities for data-driven modeling

The procedure began by obtaining the boundaries of Brazil’s municipalities, which are the most precise spatial reference units available from the Brazilian Ministry of Health for data records on diseases and health events. The boundaries were obtained from the geographic database of the Brazilian Institute of Geography and Statistics (IBGE)¹¹, corresponding to the territorial grid of 2015, with a total of 5,570 Brazilian municipalities.

A broad and diverse set of thematic data was used to compose the datasets, spanning a range of time periods (from 1981 to 2021) according to the temporal regularity of individual layers (annual, quinquennial, atemporal, or without temporal regularity), thus covering spatial and temporal variations over Brazil’s territory. It is worth noting that between 1981 and 2021 the number of municipalities grew from 3,991 to 5,570¹², which led to major changes to their boundaries, in addition to the creation of the state of Tocantins in 1988 as a result of the division of the state of Goiás¹³. Most of the changes, though, are subdivisions of one municipality into two or more municipalities. To provide statistics that are invariant over the period, we would have to resort to clusters of municipalities (“artificial municipalities”) by means of the Minimum Comparable Areas (MCA) strategy¹⁴. Because that process is time-consuming, we preferred to characterize only the current territorial division, thus providing the most refined statistical characterization of Brazil’s municipalities. Still, readers may find it useful to aggregate our characterization according to an MCA territorial division; for that we refer the reader to the article by Ehrl¹⁴.

A total of 19 thematic layers were used, obtained from different Brazilian government and international agencies (Tables 1 and 2, illustrated by Figs. 1–4). Each layer may have multiple thematic classes or variables, depending on the nature of the theme, totaling 642 thematic classes or variables. For each class, 18 descriptive statistics were calculated (9 raw statistics plus 9 normalized by the municipality’s area; see Table 3) for all the available years, totaling 11,556 attributes per municipality (642 × 18).

Table 1 Thematic layers comprising the dataset collection.

Table 2 Original data format, resulting geometry, unit and scale/resolution of the thematic layers.

Fig. 1

Examples of thematic layers with annual temporality in the territorial extension of the municipality of Rio de Janeiro.

Fig. 2

Examples of thematic layers that are atemporal or have no temporal regularity, in the territorial extension of the municipality of Rio de Janeiro.

Fig. 3

Examples of bioclimatic variables from Worldclim in the territorial extension of the municipality of Rio de Janeiro.

Fig. 4

Climate data for total precipitation, maximum, mean and minimum temperature from Worldclim in the territorial extension of the municipality of Rio de Janeiro for the month of January.

Table 3 Statistics calculated for the features/variables in the scope of the municipalities.

The annual thematic layers for land use and land cover include 25 thematic classes from 1985 to 2020 for the entire Brazilian territory, with a spatial resolution of 30 m. (The exception is the Fernando de Noronha archipelago, municipality geocode 260545, for which there is no land use/cover data because no historical series of Landsat satellite images exists for that region.) These layers were produced and made available by the online platform MapBiomas¹⁵, collection 6.0. Annual land use and land cover maps were produced via automatic classification processes applied to Landsat satellite images¹⁶. The MapBiomas Project is a multi-institutional initiative coordinated by the Greenhouse Gas Emissions Estimation System (SEEG) of the Climate Observatory and consists of a collaborative network of cocreators including nongovernmental organizations (NGOs), universities, and companies. The objective is to produce annual land cover and land use maps of Brazil from 1985 to the present.

The annual temperature and precipitation layers include 19 different types of data from 1981 to 2020 for the entire land surface, with a spatial resolution of 5 km (0.05°). These fields were derived from two different observational gridded datasets, one for precipitation and another for temperature. The observed precipitation came from the Climate Hazards Group Infrared Precipitation with Stations data (CHIRPS)¹⁷, with a daily temporal resolution and a spatial resolution of approximately 5 km (0.05°). The observed temperature was drawn from the NCEP Climate Forecast System Reanalysis (NCEP/CFSR)¹⁸ at a 6-hour temporal resolution and a spatial resolution of approximately 50 km (0.5°). The NCEP/CFSR gridded dataset was spatially downscaled to a higher spatial resolution of 5 km (0.05°) using bilinear interpolation in order to match the spatial resolution of CHIRPS. (As with land use and land cover, there is no temperature/precipitation data for the Fernando de Noronha archipelago, geocode 260545.)
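The bilinear downscaling step can be illustrated with a minimal sketch. It is not the authors' code: the function name `bilinear_resample`, the nested-list grid and the toy values are assumptions made for illustration, and a real workflow would rely on a geospatial library that also handles georeferencing and missing values.

```python
def bilinear_resample(grid, factor):
    """Upsample a 2-D grid (list of rows) by an integer factor.

    Each output value is a distance-weighted average of the four
    surrounding coarse-grid nodes; coarse-grid nodes are preserved.
    """
    rows, cols = len(grid), len(grid[0])
    if rows < 2 or cols < 2 or factor < 1:
        raise ValueError("need at least a 2x2 grid and factor >= 1")
    out = []
    for i in range((rows - 1) * factor + 1):
        y = i / factor                  # fractional row in the coarse grid
        y0 = min(int(y), rows - 2)      # upper neighbour row
        ty = y - y0                     # vertical weight in [0, 1]
        row = []
        for j in range((cols - 1) * factor + 1):
            x = j / factor              # fractional column in the coarse grid
            x0 = min(int(x), cols - 2)  # left neighbour column
            tx = x - x0                 # horizontal weight in [0, 1]
            top = grid[y0][x0] * (1 - tx) + grid[y0][x0 + 1] * tx
            bottom = grid[y0 + 1][x0] * (1 - tx) + grid[y0 + 1][x0 + 1] * tx
            row.append(top * (1 - ty) + bottom * ty)
        out.append(row)
    return out


# Hypothetical 2x2 block of coarse temperatures, refined by a factor of 2.
coarse = [[20.0, 22.0], [24.0, 26.0]]
fine = bilinear_resample(coarse, 2)
```

Going from 0.5° to 0.05° corresponds to a factor of 10 in each direction. Interpolation of this kind smooths the coarse field onto a finer grid; it does not add information beyond what the 0.5° data contain.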

The quinquennial layers for Population Count and Population Density were obtained through NASA’s Earth Observing System Data and Information System (EOSDIS) from the Socioeconomic Data and Applications Center (SEDAC)¹⁹, which is hosted by the Center for International Earth Science Information Network (CIESIN) at Columbia University. This dataset estimates the population count for the years 2000, 2005, 2010, 2015 and 2020, based on national censuses and population records, and is available as raster data with a spatial resolution of 1 km. The official population data from the IBGE census are not used because they are available only as tabular aggregate counts per census sector or municipality and therefore cannot yield meaningful descriptive statistics.

Atemporal data include the following themes: Climatological Normals for Temperature; Altitude; Geomorphology; Soils; Phytophysiognomies; and Biome boundaries. Climatological Normals for Temperature came from Worldclim²⁰ and correspond to observational data, representative of 1950 to 2000, which were interpolated to a resolution of 1 km. These temperature values are in degrees Celsius, but for historical reasons they are scaled by a factor of 10. The mean, minimum and maximum temperature values used include information from different remote sensors onboard the MODIS and NOAA satellites, which operate jointly to capture surface temperature and air humidity values. Besides the annual temperature data, we also included climatological normal data because they provide monthly mean values for temperature. These values complement the annual information (considerably influenced by climate events like El Niño and La Niña) and serve as an important reference on seasonal temperature variation patterns, a factor that directly influences the reproduction and survival dynamics of species such as vectors. The altitude data came from NASA’s Shuttle Radar Topography Mission digital elevation model (SRTM) 1 ArcSecond Global, conceived to provide consistent high-quality near-global elevation data²¹. The original data are radar images with a spatial resolution of 30 m, version 3, reprocessed to fix inconsistencies and fill missing data (“voids”). The other themes (Geomorphology, Soils, Phytophysiognomies, and Biome boundaries) were obtained from IBGE²². These provide regional details and were constructed from the interpretation of satellite images and various field studies throughout Brazil beginning in 1990²³.

The layers without temporal regularity include: Mining Areas; Roads; Railways; Waterways or watercourses; Hydroelectric Plants; Dams; Conservation Units; Indigenous Lands; and Zone Climates and Regional Subunits. The Mining Areas layer has 336 classes, representing the different types of minerals explored in Brazil’s territory, provided by the Brazilian National Mining Agency (ANM). The boundaries of Conservation Units were provided by the Brazilian Ministry of Environment (MMA). The other layers are single classes of Roads, Railways, Waterways/watercourses, Hydroelectric Plants and Dams, obtained from the Continuous Cartographic Bases²⁴, and Indigenous lands and Quilombola territories²⁵, all of these datasets from IBGE. The roads category comprises all its available classifications, covering data from subcategories such as highways and dirt roads. The same unification was adopted for the railways and waterways categories. The layer on Zone Climates and Regional Subunits represents the different climate zones in Brazil’s territory, grouped by temperature and humidity. This layer also identifies the climate types, characterized by shades and hues: tropical, subtropical, mild mesothermal, and median mesothermal²⁶.

Considering the heterogeneity of the data sources and the structural particularities of the thematic layers acquired, it was essential to pre-process and structure the datasets before calculating the descriptive statistics. All the raw data, whose total size amounted to 195 GB, were pre-processed in QGIS v3.10²⁷. This stage required standardizing the geospatial data’s cartographic characteristics, correcting topological errors, eliminating duplicate information, and harmonizing the attribute tables. The data were organized in two major groups: vector data and raster (matrix) data.

To process the Land Use and Land Cover features at the original 30 m spatial resolution, we first had to break down each annual raster (1985 to 2020) into 5,569 smaller raster pieces, one for each municipality, using the gdalwarp tool from the Geospatial Data Abstraction Library (GDAL). Next, we converted all the resulting rasters to vector format (geopackage) via the script gdal_polygonize.py, also from GDAL. The conversion was necessary because the vector format (geopackage) allowed the calculation of the polygons’ statistics for all the Land Use and Land Cover features, which is not possible in raster format with the techniques and functions used (described in the Code availability section). All that pre-processing took about 600 hours running in parallel on an Intel Core i7 computer with 8 physical CPU cores and 64 GB of RAM.
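The clip-then-polygonize sequence might look roughly like the sketch below, which only builds the command lines and does not run them. The file names, the geocode value and the attribute name `geocode` are hypothetical, and the exact options the authors used are not given in the text; the flags shown (`-cutline`, `-cwhere`, `-crop_to_cutline`, `-of`, `-f`) are standard GDAL options but should be checked against the installed GDAL version.

```python
import shlex


def clip_command(year_raster, municipalities_file, geocode, out_tif,
                 geocode_field="geocode"):
    """Build a gdalwarp call that crops one annual raster to one municipality.

    geocode_field is a hypothetical attribute name in the boundary file.
    """
    return [
        "gdalwarp",
        "-cutline", municipalities_file,             # polygon layer used as mask
        "-cwhere", f"{geocode_field} = '{geocode}'",  # select one municipality
        "-crop_to_cutline",                          # shrink extent to that polygon
        "-of", "GTiff",
        year_raster, out_tif,
    ]


def polygonize_command(in_tif, out_gpkg):
    """Build a gdal_polygonize.py call that writes the polygons to a GeoPackage."""
    return ["gdal_polygonize.py", in_tif, "-f", "GPKG", out_gpkg]


# Hypothetical inputs: one MapBiomas year and one municipality geocode.
clip = clip_command("landcover_1985.tif", "municipalities_2015.gpkg",
                    "123456", "landcover_1985_123456.tif")
poly = polygonize_command("landcover_1985_123456.tif",
                          "landcover_1985_123456.gpkg")
print(shlex.join(clip))
print(shlex.join(poly))
```

Each command list could be passed to `subprocess.run(cmd, check=True)`; because every municipality and year is independent, the calls are straightforward to distribute across CPU cores.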

The data on Temperature, Precipitation, Population Count/Density, Altitude, and Climatological Normals, also provided in matrix format, were converted to point geometry, since they are inherently point observations that their sources had interpolated before making them available. The conversion of Altitude from raster to vector was the most computationally demanding operation because it required processing 10.6 billion points (spread across 821 tiles of 3601 × 3601 points each) at a resolution of 30 m. It took about one month of uninterrupted parallel processing on a 20-core Intel Xeon E5-2690 machine with 128 GB of RAM.
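As a rough illustration of the raster-to-point conversion, the sketch below turns each valid cell of a north-up grid into a point at the cell centre. The function name, the nodata value and the toy tile are assumptions; the authors' actual tooling is described in the Code availability section of the paper.

```python
def raster_to_points(values, origin_x, origin_y, pixel_size, nodata=None):
    """Yield (x, y, value) for the centre of every valid cell.

    values     -- 2-D list of rows, first row is the northernmost
    origin_x/y -- coordinates of the top-left corner of the raster
    pixel_size -- cell width and height in the same units as the origin
    """
    for row_idx, row in enumerate(values):
        y = origin_y - (row_idx + 0.5) * pixel_size   # rows run southwards
        for col_idx, value in enumerate(row):
            if nodata is not None and value == nodata:
                continue                              # skip empty cells
            x = origin_x + (col_idx + 0.5) * pixel_size
            yield (x, y, value)


# Hypothetical 2x2 tile with one nodata cell.
tile = [[10, -9999], [30, 40]]
points = list(raster_to_points(tile, -44.0, -22.0, 0.5, nodata=-9999))

# Size of the altitude job reported in the text: 821 tiles of 3601 x 3601 cells.
total_points = 821 * 3601 * 3601
```

The product `821 * 3601 * 3601` gives 10,646,072,021, consistent with the 10.6 billion points reported for the altitude layer.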

For the vector data, it was first necessary to homogenize the cartographic references using South America Albers Equal Area Conic (EPSG:102033) for data requiring calculation of areas (polygons), South America Equidistant Conic (EPSG:102032) for data requiring calculation of distances (lines), and SIRGAS 2000 Geodetic Reference (EPSG:4674) for data with restricted localization (points)²⁸. It was also necessary to correct some topological errors in the vector data regarding the line and polygon geometries, which are artifacts introduced during the data construction/vectorization stage. The vector data correspond to the following themes: Geomorphology; Soils; Phytophysiognomies; Biome Boundaries; Mining Areas; Roads; Railways; Waterways or watercourses; Hydroelectric Plants; Dams; Conservation Units; Indigenous lands and Quilombola territories; Zone Climates and Regional Subunits.

For the statistical description of the municipalities’ socioenvironmental characteristics, we calculated measures of central tendency, such as mean and median, and measures of dispersion, such as maximum and minimum values, standard deviation, and percentiles. For each descriptive statistic we also calculated a corresponding normalized statistic by dividing the original statistic’s value by the municipality’s area. The values were normalized because of the wide variation in the territorial area of Brazil’s municipalities. For example, Altamira, in the state of Pará, is Brazil’s largest municipality, with an area of 159,533 km², while Santa Cruz de Minas, in the state of Minas Gerais, is the smallest one, with only 3.565 km² ²⁹. This wide territorial variability might otherwise skew the modeling towards distorted correlations, such as an apparent relation between higher proportions of natural or anthropic features and a higher concentration of cases that is merely due to the municipality’s larger territorial dimensions.

Once the geospatial data had been structured in this way, we intersected them with the municipal boundaries by means of different routines from PostGIS³⁰, an extension that adds spatial and geographic objects to the PostgreSQL object-relational database.

Calculation of the descriptive statistics

The meaning of the statistics described in Table 3 depends on both the feature’s geometry and its unit of measurement, which are reported in Table 2 for each thematic layer.

For polygons, such as conservation units, the area of each unit is computed in square meters and the set of all conservation units’ areas in the municipality forms the statistical population upon which the descriptive statistics will be calculated for that municipality. This means that the minimum statistic will refer to the smallest area among the conservation units in the municipality, the mean statistic to the average area, the count statistic will refer to the number of conservation units in the municipality, and so forth. Analogously, when the feature type is line, e.g. roads, the set of all road stretches’ lengths (in meters) is the statistical population.

The procedure differs slightly for point features, such as altitude and temperature. In this case, except for the count statistic (which refers to the number of points in the municipality), the actual value at each feature point is taken; for instance, the altitude or temperature at a given location. Unlike the polygon and line cases, the associated unit cannot be predefined (as square meters or meters); it depends on the actual unit of the underlying layer: for altitude it is meters, but for temperature it could be either Celsius or Kelvin. Some point-type features, such as hydroelectric plants, do not have a unit per se, i.e. they merely refer to a quantity. Once the set of all point-type feature values is taken, we have a statistical population of values and the calculation of the statistics proceeds exactly as described for the other two feature types.
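The idea that each municipality contributes one statistical population per feature, whatever the geometry, can be sketched as follows. The choice of statistics here (count, sum, mean, median, minimum, maximum, standard deviation) is an assumption based on the measures named in the text; the authoritative list is Table 3, and whether the authors used the population or the sample standard deviation is not stated.

```python
import statistics


def describe(population):
    """Summarise one municipality's population of feature values.

    For polygons the values are areas (m2), for lines they are lengths (m),
    and for points they are the values sampled at each point.
    """
    values = list(population)
    if not values:
        return {"count": 0}
    return {
        "count": len(values),
        "sum": sum(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        # Population standard deviation; an assumption, see lead-in.
        "std": statistics.pstdev(values),
    }


# Hypothetical areas (m2) of four conservation units in one municipality.
unit_areas = [100.0, 200.0, 300.0, 400.0]
summary = describe(unit_areas)
```

With polygon input, `summary["min"]` is the smallest conservation unit in the municipality and `summary["count"]` is the number of units; passing road-stretch lengths or altitude samples instead changes the unit but not the procedure.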

For each descriptive statistic, there is a corresponding normalized one, which is calculated by dividing the statistic by the municipality’s area (in m²). These normalized statistics complement the set of descriptive information and convey a notion of proportion or density. For example, for a polygon-type thematic layer the statistic sum_normalized corresponds to the fraction of the municipality’s area occupied by that layer, while for line-type layers such as roads it gives an estimate of density (length per unit area).
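A minimal sketch of the normalization, assuming the statistics are held in a dictionary and that normalized names follow the `sum_normalized` pattern mentioned in the text (the function name and the numbers are hypothetical):

```python
def normalize(stats, municipality_area_m2):
    """Divide every statistic by the municipality's area in square meters."""
    if municipality_area_m2 <= 0:
        raise ValueError("area must be positive")
    return {
        f"{name}_normalized": value / municipality_area_m2
        for name, value in stats.items()
    }


# Hypothetical case: conservation units covering 2.5 km2 in total
# inside a municipality of 10 km2 (both expressed in m2).
raw = {"sum": 2_500_000.0}
normalized = normalize(raw, 10_000_000.0)
```

Here `normalized["sum_normalized"]` is 0.25, meaning a quarter of the municipality is covered by the layer. For a line-type layer the same division yields meters of feature per square meter of territory.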

Source: Ecology - nature.com
