Sampling and genotyping
We collected hair and cheek swab samples from 77 men from geographically and linguistically distinct groups of the Caucasus: Kartvelian speakers from Georgia and Turkey, Northeast Caucasian speakers and Turkic speakers from the Russian Federation and Armenian speakers from Georgia’s southern province of Javakheti, descendants of the families displaced from Mush and Erzurum provinces of eastern Turkey in the early nineteenth century (Table 1, Fig. 1). To maximize the representativeness of the genetic signature of each population, the samples were collected from locals with no ancestors from outside of the respective ethnic/geographic population over the last three generations. DNA was extracted from follicles of 10–12 male chest hairs and cheek swab samples. Extraction was performed using Qiagen DNeasy Blood and Tissue kit, following the manufacturer’s recommendations (Qiagen, Valencia, CA, USA). The DNA samples were genotyped for 693,719 autosomal and 17,678 X-chromosomal SNPs by Family Tree DNA (FTDNA—Gene By Gene, Ltd, Houston, TX, www.familytreedna.com).
The distribution of the study populations: averaged centroids of ancient populations (uniquely colored points in the main map, see Table 2 for details) and hubs of the modern Caucasian populations (identified in the inset map, see Table 1 for details). Glacial human refugia extracted from Gavashelishvili and Tarkhnishvili5 are shaded in purple. The map is generated using QGIS Desktop 3.10.6-A Coruña (https://qgis.org).
Our dataset of modern Caucasian genotypes was supplemented with published 10 modern Mbuti (Supplementary Table S1) and 122 Upper Paleolithic-Mesolithic human genotypes, retrieved as a part of 1240 K dataset from David Reich’s Lab website, Harvard University (https://reich.hms.harvard.edu/downloadable-genotypes-present-day-and-ancient-dna-data-compiled-published-papers; see Supplementary Table S2 for details). The ancient genotypes were selected such that they either dated from the LGM or fell within the glacial refugia identified by Gavashelishvili and Tarkhnishvili5. We did so in order to maximize the genetic signature of potential refugial populations in our analysis. We divided the ancient genotypes into 2000-year-long intervals, and then grouped each of these intervals into geographic units (hereafter ancient populations, Table 2, Fig. 1). The modern and ancient genotypes were merged using PLINK 1.9 (PLINK 1.9: www.cog-genomics.org/plink/1.9/27.
Ethics statement
The research team members, through their contacts in the studied communities, inquired whether locals would voluntarily participate in genetic research that would help clarify the genetic makeup of the Caucasus. A verbal agreement was made with volunteer donors of DNA samples, according to which the results would be communicated, electronically or in hard copy, with participants individually. Participants were informed that, upon the completion of the lab work, the research would be published without mentioning the names of sample donors. Those who agreed provided us with the envelopes containing their chest hairs or cheek swab samples, with the birthplace of their ancestors (last three generations) written on the envelope or a piece of paper. In accordance with the preferences of the sample donors, the agreement was verbal and not written. The envelopes and papers are stored as evidence of voluntary provision of the samples and the related information. Analysis of data was done anonymously, using only location and ethnic information; only the first and third authors of the manuscript had access to names associated with the samples. Therefore, this study was based on noninvasive and nonintrusive sampling (volunteers provided hair and swab samples they collected themselves), and the information destined for open publication does not contain any personal information. The study methodology and the procedure of verbal consent was discussed in detail with and approved by the members of the Ilia State University Commission for Ethical Issues before the field survey started, and the commission decided that formal ethical approval was not needed for conducting this study. This is confirmed in a letter from the commission chairman, a copy of which has been provided to the journal editor as part of the submission process.
Genetic affinity and geography
First, we measured genetic affinity between the modern Caucasian populations, and between the modern populations and the ancient populations of hunter-gatherers, and then tested whether the genetic affinity between these populations was determined by geographic features. Data were mapped using QGIS Desktop 3.10.6-A Coruña, whereas graphs were created using the “ggplot2” package28 in R version 3.5.229.
To evaluate genetic affinities and structure of the modern populations, we used Wright’s fixation index (Fst), inbreeding coefficient, admixture analysis and the principal component analysis (PCA). For these procedures we filtered the raw SNP genotypes in PLINK 1.9, first removing all SNPs with the minor allele frequency < 0.05, followed by LD pruning of markers that exceeded the pairwise correlation threshold of r2 > 0.3, calculated in windows of 50 bp size and 10 bp steps (–maf 0.05 –indep-pairwise 50 10 0.3). Since all individuals in our dataset possess a single copy of the X-chromosome, we did not expect any differential ploidy bias, and SNPs on the X were treated similarly to those on the autosomes. Fst pairwise values were calculated using the smartpca program of EIGENSOFT30 with default parameters, inbreed: YES, and fstonly: YES. The relationship between the modern populations based on Fst values was visualized by constructing a neighbor-joining tree using the “ape” package31 in R version 3.5.2. The average and standard deviation of the inbreeding coefficient for each population was calculated using “fhat2” estimate of PLINK 1.9. The LD pruned genotypes were used in ADMIXTURE 1.3.032, performed in unsupervised mode in order to infer the population structure from the modern individuals. The number of clusters (k) was varied from 2 to 7 and the fivefold cross-validation error was calculated for each k33. We conducted principal components analysis in the smartpca program of EIGENSOFT30, using default parameters and the lsqproject: YES and numoutlieriter: 0 options. Eigenvectors of principal components were inferred with the modern populations from the Caucasus, while the ancient populations were then projected onto the PCA plots. We also assessed the relatedness between sampled individuals using kinship coefficients estimated by KING34.
To quantify genetic affinities between the modern and ancient populations, we used the programs qp3Pop and qpDstat in the ADMIXTOOLS suite (https://github.com/DReichLab35 for f3- and f4-statistics, respectively. f3-statistics of the form f3(X,Y,Outgroup) measure the amount of shared genetic drift of populations X and Y after their divergence from an outgroup. We used an ancient population and a modern Caucasian population for X, Y and Mbuti as an outgroup. f4-statistics of the form f4(Outgroup,Test;X,Y) show if population Test is equally related to X and Y or shares an excess of alleles with either of the two. In the f4-statistic calculation we used Mbuti for Outgroup, a modern population of the Caucasus for Test, and X and Y for contemporaneous ancient populations. This meant that f4 < 0 indicated higher genetic affinity between the test population and X, while f4 > 0 indicated higher genetic affinity between the test population and Y.
To quantify geographic features, we derived least-cost paths and measured least-cost distances (LCD) between the modern and ancient populations using the Least Cost Path Plugin for QGIS. The computation of LCD considers a friction grid that is a raster map where each cell indicates the relative difficulty (or cost) of moving through that cell. A least-cost path minimizes the sum of frictions of all cells along the path, and this sum is the least-cost distance (LCD). For impedance to human movement and expansion, we used 15 geographic features (Table 3). All gridded geographic features (i.e. raster layers) were resampled to a resolution of 1 km using the nearest-neighbor assignment technique. All possible subsets of the 15 geographic features, that did not cancel out each other, were used to calculate different variables of LCD. We assumed that most human movements occurred during climate warming events when the earth’s surface was not dramatically different from that of today, and hence used the current data of the geographic features.
Linking genetic affinity and geography
Generalized additive models (GAMs) were used to fit the outgroup f3-statistic to time and variously calculated LCD between the modern and ancient populations using the “mgcv” package36 in R version 3.5.2. Time between the modern and ancient populations was measured in BP (years before present, defined by convention as years before 1950 CE). We used GAMs because without any assumptions they are able to find nonlinear and non-monotonic relationships. GAMs were fitted using a Gamma family with a log link function. Penalized thin plate regression splines were used to represent all the smooth terms. The restricted maximum likelihood (REML) estimation method was implemented to estimate the smoothing parameter because it is the most robust of the available GAM methods36.
Model and variable selection were performed by exploring LCD, time BP and the interaction term. The predictive power of the models was evaluated through a tenfold cross-validation. The cross-validation of many models was handled through R’s parallelization capabilities37,38. The best model was selected by the mean squared error of the cross-validation. Akaike’s Information Criterion (AIC) is generally used as a means for model selection. However, we preferred cross-validation for model selection because AIC a priori assumes that simpler models with the high goodness of fit are more likely to have the higher predictive power, while cross-validation without any a priori assumptions measures the predictive performance of a model by efficiently running model training and testing on the available data.
We additionally validated the effect of different subsets of geographic features by assessing the relationship between statistically significant values of f4-statistic (i.e. |Z|> 3) and each subset. The relationship between f4-statistic of the form of f4(Outgroup,Test;X,Y) and geographic features was determined by measuring the agreement between the negative/positive signs of f4-statistic and the difference in LCD (LCD.D) for each pair of contemporaneous ancient populations X and Y. LCD.D was calculated as (LCD1–LCD2), where LCD1 was least-cost distance between the test population and X, and LCD2 was least-cost distance between the test population and Y. LCD.D < 0 indicated less least-cost distance between Test and X, while LCD.D > 0 indicated less least-cost distance between Test and Y. So, the same sign of f4 and LCD.D values indicated agreement between geographic proximity and genetic affinity. We used Cohen’s kappa39 to measure the agreement.
In order to test if geographic features (Table 3) accounted for present-day genetic differentiation in the Caucasus, we measured the relationship between Fst and LCD across the modern populations using the Mantel test in the “vegan” package40 in R version 3.5.2. In addition, we checked whether contribution from ancient samples was related to today’s genetic differentiation. To do so, we calculated median of f3-statistic of ancient populations of each geographic grouping (e.g. the following 6 populations made up one group: Balkans 39,950–41,950 BP, Balkans 37,950–39,950 BP, Balkans 31,950–33,950 BP, Balkans 9950–11,950 BP, Balkans 7950–9950 BP, Balkans 5950–7950 BP). Then we measured the manhattan distance of f3 median values of all combinations of the geographic groupings between the modern populations and compared the results to Fst and LCD using the Mantel test.
Source: Ecology - nature.com