The network nature of language endangerment hotspots
Database utilized
The database comprises information obtained with permission from the Catalogue of Endangered Languages, which is hosted on the Endangered Languages Project platform (https://www.endangeredlanguages.com/). The Endangered Languages Project was first developed and launched by Google, and is currently overseen by the First Peoples’ Cultural Council and the Institute for Language Information and Technology at Eastern Michigan University. Information about the languages in this project is provided by the Catalogue, which is produced by the University of Hawai’i at Mānoa and Eastern Michigan University, with funding provided by the U.S. National Science Foundation (Grants #1058096 and #1057725) and the Luce Foundation. The project is supported by a team of global experts comprising its Governance Council and Advisory Committee.
In general, the Catalogue aims to present all languages that communities and scholars have identified as being at some level of risk, as well as languages that have become dormant. In addition to being the largest database of endangered languages globally, the Catalogue is updated periodically based on feedback gathered from language communities and scholars worldwide. The data therefore represents the most accurate information available about each language’s vitality at the time it was retrieved. At that time, 3423 languages in the Catalogue were determined to be at various levels of risk. Each language’s risk level is assessed using the Language Endangerment Index, which was developed for the Catalogue’s purposes.
The Index assesses the level of endangerment of a given language based on four factors: intergenerational transmission (whether the language is being passed on to younger generations), absolute number of speakers, speaker number trends (whether numbers are stable, increasing, or decreasing), and domains of language use (whether the language is used in a wide range of domains or only a limited one). The levels of endangerment that the Index generates are ‘safe’, ‘vulnerable’, ‘threatened’, ‘endangered’, ‘severely endangered’, and ‘critically endangered’. Languages for which it remains unclear whether the language has gone extinct, or whose last fluent speaker is reported to have died in recent times, are referred to as ‘dormant’. Given that the focus of the Catalogue is languages that are at some level of threat, safe languages are generally excluded. Where locality information is available, each language is also accompanied by its latitudinal and longitudinal coordinates.
Steps taken to prepare the data for network analysis
The data obtained from the Catalogue was further organized and cleaned for analysis.
1. Identifier code
Where available, the ISO 639-3 code for each language was used as its unique identifier. Otherwise, its LINGUIST List local use code was used; local use codes are temporary codes that are not in the current version of the ISO 639-3 standard. For languages with neither, unique 3-letter codes were constructed.
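
The priority order described above can be sketched as a small fallback function. This is only an illustrative sketch: the function name, argument names, and the example codes are hypothetical and do not come from the Catalogue or the original data file.

```python
from typing import Optional


def choose_identifier(iso_639_3: Optional[str],
                      linguist_list_code: Optional[str],
                      constructed_code: str) -> str:
    """Return the first available identifier in the priority order
    ISO 639-3 > LINGUIST List local use code > constructed 3-letter code."""
    for code in (iso_639_3, linguist_list_code):
        # Treat None and blank cells as "not available".
        if code and code.strip():
            return code.strip()
    return constructed_code


# Hypothetical codes, used only to show the fallback behaviour.
print(choose_identifier("abc", "0xy", "zz1"))
print(choose_identifier(None, "0xy", "zz1"))
print(choose_identifier("", None, "zz1"))
```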
2. Endangerment level
Each language’s endangerment level appeared together with a level of certainty score in the same cell in the original data file. Both pieces of information were split into separate columns and only endangerment levels were utilized.
For languages where the Catalogue provided different data depending on the resource consulted, the alternative data was listed in additional columns. In these cases, the endangerment level used was the one with the most complete and up-to-date information. If no data on endangerment level was available, this was also recorded.
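
As a rough illustration of the splitting step, the sketch below separates an endangerment level from a certainty score held in the same cell. The cell layout assumed here, a level followed by a parenthesized certainty note such as "Endangered (80 percent certain)", is an assumption made for the example; the layout of the original data file is not described in the text, and the function name is hypothetical.

```python
import re
from typing import Optional, Tuple


def split_level_and_certainty(cell: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Split a combined cell into (endangerment level, certainty note).

    Assumes the hypothetical layout "<level> (<certainty>)"; the
    parenthesized part is optional. Empty cells give (None, None).
    """
    if cell is None or not cell.strip():
        return None, None
    match = re.match(r"^\s*([^()]+?)\s*(?:\((.*)\))?\s*$", cell)
    if match is None:
        # Unexpected layout: keep the raw text as the level.
        return cell.strip(), None
    level, certainty = match.group(1), match.group(2)
    return level.strip().lower(), certainty.strip() if certainty else None


print(split_level_and_certainty("Severely Endangered (60 percent certain)"))
print(split_level_and_certainty("Vulnerable"))
print(split_level_and_certainty(""))
```

Only the first element of each returned pair would be carried forward, since only endangerment levels were used in the analysis.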
3. Coordinates
Where exact coordinates were not available, they were approximated using Google Maps based on the location description provided in the Catalogue source (e.g., the Tel Aviv district), obtained from other sources such as Glottolog or the UNESCO Atlas of the World’s Languages in Danger, or approximated from maps provided in other sources. ‘NA’ was entered in the coordinates field if none could be found.
Coordinates found to be inaccurate were rejected, for example when the coordinates provided pointed to a location outside the country in which the language is reported to be found. The steps above were then followed to populate the coordinates field.
Where a language appears in more than one country, each country is listed in a separate row as a separate entry. Where there were two sets of coordinates for a country, the set that best corresponded with the written description in the Catalogue source, had greater detail, or was more recent was chosen. Where there were more than two sets of coordinates, a middle point was chosen as being representative of the language’s location by plotting all coordinates on MapCustomizer (www.mapcustomizer.com).
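
The middle point described above was chosen by plotting the coordinates on a map. For readers who want a reproducible approximation, the sketch below computes a spherical centroid instead: it averages the points as unit vectors and converts the mean vector back to latitude and longitude. This is a substitute technique offered for illustration, not the procedure used in the study, and the function name is hypothetical.

```python
import math
from typing import Iterable, Tuple


def spherical_midpoint(points: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Return the (lat, lon) in degrees of the normalized mean of the
    input points, each given as (lat, lon) in degrees."""
    x = y = z = 0.0
    count = 0
    for lat, lon in points:
        phi, lam = math.radians(lat), math.radians(lon)
        # Convert each point to a unit vector and accumulate.
        x += math.cos(phi) * math.cos(lam)
        y += math.cos(phi) * math.sin(lam)
        z += math.sin(phi)
        count += 1
    if count == 0:
        raise ValueError("at least one point is required")
    norm = math.sqrt(x * x + y * y + z * z)
    if norm < 1e-12:
        raise ValueError("midpoint is undefined for points that cancel out")
    return math.degrees(math.asin(z / norm)), math.degrees(math.atan2(y, x))


# Three hypothetical coordinate sets for one language.
print(spherical_midpoint([(0.0, -10.0), (0.0, 10.0), (0.0, 0.0)]))
```

Averaging unit vectors rather than raw latitudes and longitudes avoids a misleading result when the points straddle the 180° meridian.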
4. Language family
In the Catalogue, information about language family may be multi-tiered. For example, Laghuu falls under the Lolo-Burmese branch of the Sino-Tibetan family. For this study, the broadest family is used; in the case of Laghuu, the label ‘Sino-Tibetan’ is used.
Mixed languages, pidgins, and creoles have all been categorized as ‘contact languages’.
Language isolates are listed as ‘isolates’.
5. Region
The Catalogue groups ‘Mexico, Central America, Caribbean’ together as a single region. In this study, Central America and the Caribbean are listed as separate regions, with Mexico falling under Central America.
Network construction
A spatial network of endangered languages was constructed from the database. Each node represented an endangered language, and edges (links) represented the distance between the locations of the languages as specified in the database. A distance matrix containing the distances between all endangered languages was computed using functions from the ‘geosphere’ R package. Specifically, Haversine distances were computed for each pair of longitude and latitude points in the dataset. The Haversine distance is the shortest distance between two points on a spherical earth, also referred to as the “great-circle distance” [29]. The radius of the earth used in the calculation was 6,378,137 m (for more details see: https://www.rdocumentation.org/packages/geosphere/versions/1.5-14/topics/distHaversine).
Sensitivity analyses of edge thresholds
The distance matrix can be viewed as a fully connected network with weighted, undirected links. Because we set out to capture the strongest, or “closest”, spatial relationships among the endangered languages, an edge threshold was applied to the distance matrix such that only edges with distances below the xth percentile of all distances were retained in the spatial network. This approach allows for the analysis of the most meaningful (i.e., the physically closest) spatial relations in the dataset and how they relate to language endangerment status. The retained edges were then converted into unweighted connections to create a simple unweighted, undirected graph for analysis. To determine the value of x (i.e., the percentile at which the edge threshold is applied), we constructed 10 spatial networks that retained edges with distances below the 1st, 2nd, 3rd, …, 10th percentile (in increments of 1%) of all distances in the matrix.
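
The construction above can be sketched in a few lines. The study used the ‘geosphere’ R package; the Python version below is an illustrative re-implementation of the Haversine formula with the same radius, not the authors' code. The percentile cutoff uses a nearest-rank rule, which is an assumption, since the text does not say how the percentile was computed. The function names and the four coordinate pairs are hypothetical.

```python
import math
from itertools import combinations

EARTH_RADIUS_M = 6_378_137.0  # radius quoted in the text


def haversine_m(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_M):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # min() guards against rounding pushing the argument slightly above 1.
    return 2 * radius * math.asin(min(1.0, math.sqrt(a)))


def threshold_edges(coords, percent):
    """Return the unweighted edges whose distance is at or below the
    nearest-rank `percent` percentile of all pairwise distances.

    coords maps a language identifier to (lat, lon) in degrees.
    """
    pairs = sorted(
        (haversine_m(*coords[u], *coords[v]), u, v)
        for u, v in combinations(sorted(coords), 2)
    )
    if not pairs:
        return set()
    # Nearest-rank percentile (an assumption; other definitions exist).
    rank = max(1, math.ceil(percent * len(pairs) / 100))
    cutoff = pairs[rank - 1][0]
    return {(u, v) for dist, u, v in pairs if dist <= cutoff}


# Hypothetical locations, used only to exercise the functions.
coords = {"A": (0.0, 0.0), "B": (0.0, 1.0), "C": (0.0, 10.0), "D": (50.0, 50.0)}
print(round(haversine_m(0.0, 0.0, 0.0, 1.0), 2))
print(sorted(threshold_edges(coords, 50)))
```

Repeating the call with percent set to 1 through 10 would give the ten candidate networks compared in the sensitivity analysis.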
Additional information about the distances represented by the edges in each of the 10 networks is provided in Supplementary Information.
These 10 networks were then analyzed for their macro- and meso-scale network properties. A summary of the macro- and meso-scale network measures used in this analysis, together with their definitions, is provided in Table 1. The 10 networks show similar patterns in their network structures.
Table 1. An overview of macro- and meso-level network measures of spatial networks with different thresholds.
Results
As expected, network density and average degree, which serve as indicators of the number of edges relative to the number of nodes in the network, increased as the edge threshold used to connect nodes became more liberal. The relatively high values of C (i.e., high levels of local clustering among nodes) and low values of ASPL (i.e., relatively short paths despite the large size of the network) suggested the presence of small-world structure [30]. The community detection analysis using the Louvain method [31] indicated strong evidence of community structure in the networks, suggesting the presence of clusters of endangered languages.
The 5% edge threshold was the point at which the vast majority of nodes fell within the largest connected component of the network. Because the 5% network was not too fragmented, we report in the following subsections the analyses conducted on its largest connected component; the smaller connected components were excluded. Please see Supplementary Information for additional details on the rationale for selecting the 5% network for further analyses.
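
To make the measures named above concrete, the sketch below computes density, average degree, the average local clustering coefficient (C), and the average shortest path length (ASPL) on the largest connected component of a small graph. It is a plain-Python illustration under stated assumptions: the function names and the toy edge list are hypothetical, the software used for these measures in the study is not named in this section, and Louvain community detection is left out because it does not fit in a short example.

```python
from collections import deque
from itertools import combinations


def build_adjacency(edges):
    """Turn undirected (u, v) edges into a dict of neighbor sets."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Hop counts from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def largest_component(adj):
    """Adjacency dict restricted to the largest connected component."""
    seen, best = set(), set()
    for start in adj:
        if start not in seen:
            comp = set(bfs_distances(adj, start))
            seen |= comp
            if len(comp) > len(best):
                best = comp
    return {node: adj[node] & best for node in best}


def density(adj):
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def average_degree(adj):
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def average_clustering(adj):
    """Mean local clustering; nodes with fewer than two neighbors count as 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k >= 2:
            links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
            total += 2 * links / (k * (k - 1))
    return total / len(adj)


def average_shortest_path_length(adj):
    """Mean hop distance over all ordered node pairs of a connected graph."""
    total = pairs = 0
    for source in adj:
        dist = bfs_distances(adj, source)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# Toy graph: a triangle with a pendant node, plus a separate pair.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("E", "F")]
lcc = largest_component(build_adjacency(edges))
print(sorted(lcc), density(lcc), average_degree(lcc))
print(average_clustering(lcc), average_shortest_path_length(lcc))
```

Because the adjacency dict is built from edges, isolated nodes never enter it; in a real analysis they would need to be counted separately when reporting how many nodes fall outside the largest component.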
Note, however, that our results are robust across spatial networks of various edge thresholds (for reasons of space, a complete summary of all reported analyses conducted on all 10 spatial networks is given in Supplementary Information).
Macro-level analysis: assortative mixing of endangerment statuses
Method
To investigate the macro-level structure of the spatial network of endangered languages, we computed the assortativity coefficient of the spatial network. Specifically, we wanted to know whether the endangerment statuses of the languages tended to cluster at the global level of the entire network. If the assortativity coefficient is positive, languages in the network tend to be connected to languages of similar levels of endangerment. If the assortativity coefficient is negative, languages in the network tend to be connected to languages of dissimilar levels of endangerment.
Results
There is a significant positive correlation (Spearman’s rank correlation) between the endangerment statuses of connected pairs of endangered languages in the network, r = 0.20, p
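
A minimal sketch of this kind of assortativity measure is given below: a Spearman rank correlation computed over the endangerment levels found at the two ends of every edge, with each undirected edge counted in both directions. The ordinal coding of levels as integers, the function names, and the toy data are assumptions made for illustration; the sketch does not compute a p-value and is not the authors' implementation.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all levels are equal")
    return cov / math.sqrt(var_x * var_y)


def spearman_assortativity(edges, level):
    """Spearman correlation between the levels at the two ends of each edge.

    level maps a node to an ordinal code (hypothetical coding, e.g.
    1 = vulnerable ... 5 = critically endangered).
    """
    xs, ys = [], []
    for u, v in edges:
        # Count each undirected edge in both directions.
        xs += [level[u], level[v]]
        ys += [level[v], level[u]]
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy data: languages link to neighbors of the same level.
level = {"a": 1, "b": 1, "c": 5, "d": 5}
print(spearman_assortativity([("a", "b"), ("c", "d")], level))
```

Counting each edge in both directions makes the coefficient symmetric, so the result does not depend on the order in which the two endpoints of an edge happen to be stored.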