Fast and accurate population admixture inference from genotype data from a few microsatellites to millions of SNPs

Overall strategy

An admixture analysis aims to estimate the admixture proportions (or ancestries), Q, of each sampled individual in a given number, K, of source populations (Pritchard et al. 2000), and the characteristic allele frequencies, P, at each locus of each inferred source population. Even though Q is frequently of primary interest, P must be estimated simultaneously because we have genotype data only and Q is highly dependent on P, which actually defines the source populations. For N individuals from K source populations genotyped at L loci with a total number of A alleles, the numbers of independent variables in Q and P are V_Q = (K − 1)N and V_P = (A − L)K, respectively. The high dimensionality of an admixture analysis, with V = V_Q + V_P = (K − 1)N + (A − L)K variables, not only incurs a large computational burden, but also poses a high risk of non-convergence (to the global maximum) for any algorithm, especially when either Q or P is expected to be poorly estimated in difficult situations such as low differentiation or a small sample (say, a couple) of individuals from each source population.
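To give a feel for the scale of the problem, the variable counts above can be evaluated for a concrete design. The following is a minimal sketch; the function name and the example design (500 individuals, K = 5, 10,000 biallelic SNPs) are illustrative assumptions, not values taken from a specific analysis in this study.

```python
from typing import Tuple


def n_free_parameters(n_individuals: int, n_populations: int,
                      n_loci: int, n_alleles_total: int) -> Tuple[int, int, int]:
    """Return (V_Q, V_P, V) as defined in the text."""
    v_q = (n_populations - 1) * n_individuals          # V_Q = (K - 1) N
    v_p = (n_alleles_total - n_loci) * n_populations   # V_P = (A - L) K
    return v_q, v_p, v_q + v_p


# Hypothetical design: N = 500, K = 5, L = 10,000 biallelic SNPs, so A = 2L.
v_q, v_p, v = n_free_parameters(500, 5, 10_000, 20_000)
print(v_q, v_p, v)  # 2000 50000 52000
```

For SNP data, V_P usually dominates V unless N is very large relative to L.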

I propose a two-step procedure with corresponding algorithms to reduce the risk of non-convergence, to speed up the computation, and to make more accurate inferences of both Q and P. In the first step, I assume a mixture model (Pritchard et al. 2000; Falush et al. 2003) in which individuals in a sample can come from different source populations, but each individual’s genome comes exclusively from a single population. Under this simplified probabilistic model, I conduct a clustering analysis to obtain estimates of both individual memberships and allele frequencies of each cluster by a global maximisation algorithm, simulated annealing, with extra care taken over convergence (details below). In the absence of admixture and with sufficient information for complete recovery of population structure, the estimated individual memberships and allele frequencies of the clusters are expected to be equivalent to Q (with element q_ik = 1 and q_il = 0 if individual i is inferred to be in cluster k where l ≠ k) and P, respectively. Otherwise, they are expected to be good approximations of Q and P, because an admixed individual i with the highest ancestral proportion from a population would be expected to be assigned (exclusively) to that population. In the second step, I assume an admixture model (Pritchard et al. 2000; Falush et al. 2003) to refine estimates of Q and P, using an EM algorithm and the starting values of the parameters (Q and P) obtained from the clustering analysis. Because the starting values are already close to the truth, the algorithm is fast and has a much lower risk of converging to a local maximum than the original EM algorithms (Tang et al. 2005; Alexander et al. 2009).

Clustering analysis

I assume N diploid individuals are sampled from K source populations. The origin of a sampled individual from the K source populations is unknown, which is the primary interest of structure analysis. However, if it is (partially) known, this information can be used to supervise (help) the clustering analysis of other sampled individuals of unknown origins. Each individual’s genome comes exclusively from one of the K unknown source populations (i.e., mixture model, no admixture). I assume each individual is genotyped at L loci, with a diploid genotype {x_il1, x_il2} for individual i (=1, 2, …, N) at locus l (=1, 2, …, L). The task of the clustering analysis is to sort the N individuals with genotype data X = {x_ila:i = 1, 2, …, N; l = 1, 2, …, L; a = 1, 2} into K clusters, with each representing a source population. No assumption is made about the evolutionary relationships of the populations, which, when summarized by F statistics, are estimated from the same genotype data in both clustering and admixture analyses.

Suppose, in a given clustering configuration Ω = {Ω₁, Ω₂, …, Ω_K}, cluster k (=1, 2, …, K), Ω_k, contains a set of N_k (with N_k > 0 and \(\sum_{k=1}^{K} N_k \equiv N\)) individuals, denoted by Ω_k = {ω_k1, ω_k2, …, ω_kN_k} where ω_kj is the index of the jth individual in cluster k. The genotype data of the N_k individuals in cluster k is X_k = {x_ila: i ∈ Ω_k; l = 1, 2, …, L; a = 1, 2}. The log-likelihood of Ω_k is then the log probability of observing X_k given Ω_k

$$\mathcal{L}_k\left(\mathbf{\Omega}_k\right) = \mathrm{Log}\,\mathrm{P}\left(\mathbf{X}_k \mid \mathbf{\Omega}_k\right) = \sum_{l=1}^{L} \sum_{j=1}^{J_l} c_{klj}\,\mathrm{Log}\left(p_{klj}\right)$$

(1)

where c_klj and p_klj are the count of copies and the frequency, respectively, of allele j at locus l in cluster k, and J_l is the number of alleles at locus l. Given Ω_k, c_klj is counted from genotype data X_k, and allele frequency p_klj is estimated by

$$p_{klj} = \left(p_{lj} + c_{klj}\right) \Big/ \sum_{m=1}^{J_l} \left(p_{lm} + c_{klm}\right)$$

(2)

where p_lj is the frequency of allele j at locus l in the entire population represented by the K clusters. p_lj is calculated by

$$p_{lj} = \sum_{k=1}^{K} c_{klj} \Big/ \sum_{m=1}^{J_l} \sum_{k=1}^{K} c_{klm} = c_{lj} \Big/ \sum_{m=1}^{J_l} c_{lm}$$

(3)

where \(c_{lm} = \sum_{k=1}^{K} c_{klm}\) is the count of allele m (=1, 2, …, J_l) at locus l in the entire sample of individuals.

Under the mixture model above, clusters are only weakly dependent (with the extent of dependency decreasing with an increasing value of K) and the total log-likelihood of the clustering configuration, Ω = {Ω₁, Ω₂, …, Ω_K}, is thus

$$\mathcal{L}\left(\mathbf{\Omega}\right) = \sum_{k=1}^{K} \mathcal{L}_k\left(\mathbf{\Omega}_k\right),$$

(4)

where \(\mathcal{L}_k\left(\mathbf{\Omega}_k\right)\) is calculated by (1).
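Equations (1) to (4) can be evaluated directly from per-cluster allele counts. The sketch below is a minimal illustration in plain Python, assuming the counts are already tabulated as a nested list `counts[k][l][j]` (copies of allele j at locus l in cluster k); the function names and this data layout are assumptions made for the example, not the data structures of PopCluster.

```python
import math
from typing import List

LocusCounts = List[List[int]]  # counts_k[l][j] for one cluster


def pooled_frequencies(counts: List[LocusCounts]) -> List[List[float]]:
    """Eq. (3): frequency p_lj of allele j at locus l in the pooled sample."""
    pooled = []
    for l in range(len(counts[0])):
        c_l = [sum(cluster[l][j] for cluster in counts)
               for j in range(len(counts[0][l]))]
        total = sum(c_l)
        pooled.append([c / total for c in c_l])
    return pooled


def cluster_frequencies(counts_k: LocusCounts,
                        pooled: List[List[float]]) -> List[List[float]]:
    """Eq. (2): cluster frequencies p_klj, with p_lj acting as pseudo-counts."""
    freqs = []
    for c_l, p_l in zip(counts_k, pooled):
        denom = sum(p + c for p, c in zip(p_l, c_l))
        freqs.append([(p + c) / denom for p, c in zip(p_l, c_l)])
    return freqs


def cluster_log_likelihood(counts_k: LocusCounts,
                           freqs_k: List[List[float]]) -> float:
    """Eq. (1): sum of c_klj * Log(p_klj); terms with c_klj = 0 contribute 0."""
    return sum(c * math.log(p)
               for c_l, p_l in zip(counts_k, freqs_k)
               for c, p in zip(c_l, p_l) if c > 0)


def total_log_likelihood(counts: List[LocusCounts]) -> float:
    """Eq. (4): sum of the cluster log-likelihoods."""
    pooled = pooled_frequencies(counts)
    return sum(cluster_log_likelihood(c_k, cluster_frequencies(c_k, pooled))
               for c_k in counts)


# Toy data: two clusters, one biallelic locus, each cluster fixed for one allele.
counts = [[[4, 0]], [[0, 4]]]
print(cluster_frequencies(counts[0], pooled_frequencies(counts)))
print(total_log_likelihood(counts))
```

In the toy data the pooled frequencies are 0.5 each, so the allele absent from a cluster still receives a small non-zero frequency (0.5/5 = 0.1) rather than zero. This is the role the prior term p_lj plays in (2).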

It is worth noting that allele frequencies, P, are modelled as hidden or nuisance variables and are estimated as a by-product of maximising (4) for estimates of Ω. Yet, careful modelling of P proves important for estimating Ω, as the two are highly dependent. Bayesian admixture methods assume that allele frequencies p_kl = {p_kl1, p_kl2, …, \(p_{klJ_l}\)} follow a Dirichlet distribution, \(\mathcal{D}\left(\lambda_1, \lambda_2, \ldots, \lambda_{J_l}\right)\) (e.g., Foreman et al. 1997; Rannala and Mountain 1997; Pritchard et al. 2000). For any population k, the uncorrelated (Pritchard et al. 2000) and correlated (Falush et al. 2003) allele frequency models assume λ_j = 1 and \(\lambda_j = p_{0lj}(1 - F_k)/F_k\), respectively, for j = 1, 2, …, J_l. In the latter model, p_0lj is the frequency of allele j at locus l in the ancestral population (common to the K derived populations), and F_k is the differentiation of population k from the ancestral population. In contrast, likelihood admixture methods (e.g., Tang et al. 2005; Alexander et al. 2009; Frichot et al. 2014) and non-model based clustering methods (e.g., K-means method, Jombart et al. 2010) do not use any prior, which is equivalent to assuming p_lj ≡ 0 for j = 1, 2, …, J_l in Eq. (2). However, properly modelling prior allele frequencies, as carefully considered in Bayesian methods (Pritchard et al. 2000; Falush et al. 2003), becomes important in situations where allele frequencies are not well defined or tricky to estimate, such as when few individuals are sampled from a source population or when rare alleles are present. The frequentist estimator (2) is similar in spirit to the Bayesian correlated allele frequency model (Falush et al. 2003), and leads to accurate results in various situations to be shown in this study. I have also tried alternatives such as p_lj ≡ 1/J_l (which is similar to the uncorrelated allele frequency model of Pritchard et al. 2000) or p_lj ≡ 0 (which is equivalent to the treatment in previous likelihood admixture analysis or non-model based clustering analysis) as the prior term in (2), but neither works as well as (2) with p_lj from (3), and both can yield much less accurate results in difficult situations (below).

Scaling for unbalanced sampling

The Bayesian admixture model of STRUCTURE assumes that an individual i’s ancestry, q_i = {q_i1, q_i2, …, q_iK}, follows a prior Dirichlet probability distribution \(\mathbf{q}_i \sim \mathcal{D}\left(\alpha_1, \alpha_2, \ldots, \alpha_K\right)\) (Pritchard et al. 2000; Falush et al. 2003). By default, α₁ = α₂ = ··· = α_K = α, which essentially assumes that an individual has its ancestry originating from each of the assumed K populations at an equal prior probability of 1/K. To model unequal sample sizes such that an individual comes from a more intensively sampled population at a higher prior probability, STRUCTURE also offers an alternative prior, α₁ ≠ α₂ ≠ ··· ≠ α_K. It has been shown that, when sampling intensity is heavily unbalanced among populations, the default prior can lead to the splitting of a large cluster and the merging of small clusters, while the alternative prior yields much more accurate results (Wang 2017). These priors have a large impact on admixture analysis; applying the default prior to data of highly unbalanced samples leads to inaccurate Q estimates even when many informative markers are used (Wang 2017).

Unfortunately, current non-model based or likelihood-based admixture analysis methods do not utilise these or other priors for handling unbalanced sampling. As a result, they can give inaccurate admixture estimates, just like STRUCTURE under the default ancestry prior model, for data from highly unbalanced sampling. To reduce the cluster split and merge problems, herein I propose the following method to scale the likelihood of a cluster by the size, the number of individual members, of the cluster.

The original log-likelihood of cluster k, \(\mathcal{L}_k\left(\mathbf{\Omega}_k\right)\), is calculated by (1). It is then scaled by the cluster size, N_k, as

$$\mathcal{L}_{Sk}\left(\mathbf{\Omega}_k\right) = \mathcal{L}_k\left(\mathbf{\Omega}_k\right) \Big/ \left(1 + e^{sN_k/\left(8N\right)}\right),$$

(5)

where s is the scaling factor taking values 1, 2, 3 for weak, medium and strong scaling, respectively. This scaling scheme encourages large clusters and discourages small clusters. Although (5) is an empirical rather than an analytically derived equation and is thus not guaranteed to be optimal, extensive simulations (some shown below) verify that the scaling scheme works very well for data from highly unbalanced sampling, yielding accurate clustering analysis results and thus similarly or more accurate admixture estimates than STRUCTURE under its alternative ancestry model. The most appropriate scaling level (1, 2 or 3) for a particular dataset depends on how unbalanced the sampling is, how differentiated the populations are, and how informative the markers are. For example, a low scaling level, s = 1, is appropriate when many markers are genotyped for a set of lowly differentiated (low F_ST) populations. Usually, we do not know these factors when analysing the data. Therefore, when the data are suspected to be unbalanced in sampling among populations, they are better analysed comparatively with different levels of scaling (0 for no scaling, and 1, 2, and 3). When the applied level of scaling is too low, large populations tend to be split and small populations tend to be merged. When the applied level of scaling is too high, small populations tend to be merged among themselves or with a large population. The appropriate scaling level can be determined with the help of internal information, such as the consistency of replicate runs at the same scaling level and the same K value, and external information, such as sampling locations, in examining the admixture estimates.
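The effect of (5) is easiest to see numerically. The following minimal sketch (function and argument names are assumptions for illustration) applies the scaling to a cluster log-likelihood. Because log-likelihoods are negative, dividing by a larger denominator makes the scaled value less negative, which is how a larger N_k is favoured.

```python
import math


def scaled_log_likelihood(log_lik_k: float, n_k: int, n_total: int, s: int) -> float:
    """Eq. (5): cluster log-likelihood divided by 1 + exp(s * N_k / (8 N))."""
    return log_lik_k / (1.0 + math.exp(s * n_k / (8.0 * n_total)))


# Hypothetical numbers: the same raw log-likelihood assigned to a small and a
# large cluster in a sample of N = 100, at each scaling level.
for s in (0, 1, 2, 3):
    small = scaled_log_likelihood(-100.0, 10, 100, s)
    large = scaled_log_likelihood(-100.0, 90, 100, s)
    print(s, round(small, 3), round(large, 3))
```

With s = 0 the denominator is 2 for every cluster, so the ranking of clustering configurations is unchanged; the size dependence appears only for s > 0 and grows with s.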

Simulated annealing algorithm

A likelihood function with many variables, such as (4), is difficult to maximise for estimates of the variables. Traditional methods, such as the derivative based Newton-Raphson algorithm (e.g., Tang et al. 2005) and the non-derivative based EM algorithm (Dempster et al. 1977; Tang et al. 2005; Alexander et al. 2009), may converge to a local rather than the global maximum for a large scale problem with ridges and plateaus (Goffe et al. 1994). Although trying multiple replicate runs with different starting values and choosing the run with the highest likelihood could reduce the risk of landing on a local maximum, a global maximum cannot be guaranteed regardless of the number of runs. The Bayesian approach as implemented in STRUCTURE (Pritchard et al. 2000) has a similar problem, as different replicate runs of the same data with the same parameter and model choices but different random number seeds may yield different admixture estimates and likelihood values (Tang et al. 2005; below).

Simulated annealing (SA) was developed to optimise very large and complex systems (Kirkpatrick et al. 1983). Using the Metropolis algorithm (Metropolis et al. 1953) from statistical mechanics, SA can find the global maximum by searching both downhill and uphill and by traversing deep valleys on the likelihood surface to avoid getting stuck on a local maximum (Kirkpatrick et al. 1983; Goffe et al. 1994). It has proved highly powerful in pedigree reconstruction (Wang 2004; Wang and Santure 2009) from genotype data, which is probably more difficult than population structure reconstruction (i.e., clustering analysis) because the genetic structures (i.e., sibships) of the former are, in general, more numerous, more complicated with hierarchy, and smaller (thus more elusive and more difficult to define) than those in the latter. Herein I propose an SA algorithm for a population clustering analysis, as detailed in Supplementary Appendix 1.
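For readers unfamiliar with SA, the Metropolis acceptance rule at its core can be sketched in a few lines. The code below is a generic textbook loop, not the algorithm of Supplementary Appendix 1: the proposal move (reassigning one random individual), the geometric cooling schedule, all default parameter values, and the toy scoring function are assumptions chosen for illustration. Unlike the clustering model above, it does not enforce N_k > 0.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def anneal_clusters(n_individuals: int, n_clusters: int,
                    log_lik: Callable[[List[int]], float],
                    t_start: float = 1.0, t_end: float = 1e-3,
                    cooling: float = 0.95, moves_per_temp: int = 200,
                    seed: int = 0) -> Tuple[List[int], float]:
    """Maximise log_lik(assignment) by simulated annealing (needs n_clusters >= 2)."""
    rng = random.Random(seed)
    assign = [rng.randrange(n_clusters) for _ in range(n_individuals)]
    current = log_lik(assign)
    best_assign, best = assign[:], current
    temperature = t_start
    while temperature > t_end:
        for _ in range(moves_per_temp):
            # Propose moving one random individual to a different cluster.
            i = rng.randrange(n_individuals)
            old = assign[i]
            new = rng.randrange(n_clusters - 1)
            if new >= old:
                new += 1
            assign[i] = new
            delta = log_lik(assign) - current
            # Metropolis rule: always accept improvements; accept a worse
            # configuration with probability exp(delta / temperature).
            if delta >= 0 or rng.random() < math.exp(delta / temperature):
                current += delta
                if current > best:
                    best_assign, best = assign[:], current
            else:
                assign[i] = old  # reject and restore
        temperature *= cooling  # geometric cooling (an assumed schedule)
    return best_assign, best


def toy_score(values: Sequence[float]) -> Callable[[List[int]], float]:
    """Stand-in objective: minus the within-cluster sum of squares of 1-D values."""
    def score(assign: List[int]) -> float:
        total = 0.0
        for k in set(assign):
            members = [v for v, a in zip(values, assign) if a == k]
            mean = sum(members) / len(members)
            total -= sum((v - mean) ** 2 for v in members)
        return total
    return score


values = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
labels, best = anneal_clusters(len(values), 2, toy_score(values))
print(labels, best)
```

In a real clustering analysis the `log_lik` callable would be the (scaled) likelihood (4), and recomputing only the two clusters affected by a move would avoid re-evaluating the whole configuration at every proposal.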

Admixture analysis

Under the mixture model, the above clustering analysis partitions the N sampled individuals into a predefined K clusters, each representing a source population. The properties (e.g., genetic diversity) of and the relationships (e.g., F_ST) among these populations can be learnt from the inferred clusters. However, the clustering results are accurate only when the mixture model is valid. For a sample containing a substantial proportion of highly admixed individuals (i.e., who have recent ancestors from multiple source populations), the clustering results are just approximations. In such a case, the admixture model is more appropriate and can be used to refine the mixture analysis results by inferring the admixture proportions (or ancestry coefficients) of each sampled individual.

Under the admixture model (Pritchard et al. 2000), an individual i’s ancestry (or admixture proportions) can be characterised by a vector q_i = {q_i1, q_i2, …, q_iK}, where q_ik is the proportion of its genome coming from (contributed by) source population k. Equivalently, q_ik can also be taken as the probability that an allele sampled at random from individual i comes from source population k. Obviously, we have q_ik ≥ 0 and \(\sum_{k=1}^{K} q_{ik} \equiv 1\). The overall admixture extent of individual i can be measured by \(M_i = 1 - \sum_{k=1}^{K} q_{ik}^2\), the probability that the two alleles at a randomly drawn locus come from different source populations. Individual i is purebred and admixed when M_i = 0 and M_i > 0, respectively. An F₁ and F₂ hybrid individual i is expected to have M_i = 0.5 and M_i = 0.625, respectively.
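The admixture extent M_i is a one-line computation from q_i. The helper below is a minimal illustration (the function name is an assumption), checked against the purebred and F₁ cases stated above.

```python
from typing import Sequence


def admixture_extent(q_i: Sequence[float]) -> float:
    """M_i = 1 - sum_k q_ik^2 for one individual's ancestry vector."""
    return 1.0 - sum(q * q for q in q_i)


print(admixture_extent([1.0, 0.0]))  # purebred: 0.0
print(admixture_extent([0.5, 0.5]))  # F1 hybrid of two populations: 0.5
```

For a given K, M_i is largest when ancestry is spread evenly (q_ik = 1/K), where it equals 1 − 1/K.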

The task of an admixture analysis is to infer q_i for each individual i, denoted by Q = {q₁, q₂, …, q_N}. The log-likelihood function is

$$\mathcal{L}\left(\mathbf{Q}, \mathbf{P} \mid \mathbf{X}\right) = \sum_{i=1}^{N} \sum_{l=1}^{L} \sum_{a=1}^{2} \mathrm{Log}\left(\sum_{k=1}^{K} q_{ik}\, p_{klx_{ila}}\right)$$

(6)

Note (6) is essentially the same as those proposed in previous studies (e.g., Tang et al. 2005; Alexander et al. 2009). It assumes independence of individuals conditional on the genetic structure defined by Q, and independence of alleles both within and between loci. The former can be violated when the data have genetic structure in addition to the subpopulation structure defined by Q, such as the presence of familial structure (Rodríguez‐Ramilo and Wang 2012) or inbreeding (Gao et al. 2007) within a subpopulation. The assumption of independence among loci is violated for markers in linkage disequilibrium. It, as well as the assumption of independence between paternal and maternal alleles within a locus, is also violated due to admixture (Tang et al. 2005) or inbreeding (Gao et al. 2007). However, (6) is a good approximation and works well in general even when these assumptions are violated, as checked by extensive simulations.

If P were known, it would be trivial to estimate Q from X. Unfortunately, usually, the only information we have is genotype data X, from which we must infer K, Q and P jointly. Herein I modify the EM algorithm of Tang et al. (2005) to solve (6) for maximum likelihood estimates of Q and P given K, as detailed in Supplementary Appendix 2.

Despite essentially the same likelihood function, my EM algorithm differs from that of Tang et al. (2005) in three aspects. First, I use the clustering results of the mixture model as initial values of Q. Even in the worst scenario of many highly admixed individuals included in a sample, the clustering results should still be much closer to the true Q than a random guess, as used in previous likelihood methods (Tang et al. 2005; Alexander et al. 2009). It is possible (and indeed it has been trialled) to use the results of a faster non-model based clustering method, such as the K-means method, in place of those of the likelihood-based clustering method with the simulated annealing algorithm as described above. However, such non-model based methods are less reliable and less accurate, especially in difficult situations (below). Second, rather than updating Q and P in alternation, I update Q to asymptotic convergence under a given P. I then update P using the converged Q. This two-step iteration process is repeated until the convergence of both Q and P is reached. Third, the allele frequencies for a specific individual i are calculated by excluding the genotypes of the individual, which are then used in the EM procedure for iteratively updating q_i.
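To make the structure of such an EM iteration concrete, the sketch below evaluates (6) and implements the textbook EM updates for Q given P and for P given Q. It is not the modified algorithm of Supplementary Appendix 2: the leave-one-out allele frequencies (third aspect) are omitted, missing genotypes are not handled, and the data layout (`genotypes[i][l]` as a pair of allele indices, `p[k][l][j]`, `q[i][k]`), function names, and toy starting values are assumptions. All q and p values are assumed strictly positive, since EM cannot move an exact zero.

```python
import math
from typing import List, Tuple

Genotypes = List[List[Tuple[int, int]]]  # genotypes[i][l] = (allele, allele)


def log_likelihood(genotypes: Genotypes, q, p) -> float:
    """Eq. (6): sum over individuals, loci and the two allele copies."""
    n_pops = len(p)
    return sum(math.log(sum(q[i][k] * p[k][l][x] for k in range(n_pops)))
               for i, geno in enumerate(genotypes)
               for l, alleles in enumerate(geno) for x in alleles)


def _posterior(q_i, p, l, x):
    """P(allele copy came from population k | q_i, P), for every k."""
    w = [q_i[k] * p[k][l][x] for k in range(len(p))]
    total = sum(w)
    return [v / total for v in w]


def update_q(genotypes: Genotypes, q, p):
    """One EM update of Q with P held fixed."""
    new_q = []
    for i, geno in enumerate(genotypes):
        resp = [0.0] * len(p)
        for l, alleles in enumerate(geno):
            for x in alleles:
                for k, r in enumerate(_posterior(q[i], p, l, x)):
                    resp[k] += r
        new_q.append([r / (2 * len(geno)) for r in resp])
    return new_q


def update_p(genotypes: Genotypes, q, p):
    """One EM update of P with Q held fixed."""
    expected = [[[0.0] * len(locus) for locus in pop] for pop in p]
    for i, geno in enumerate(genotypes):
        for l, alleles in enumerate(geno):
            for x in alleles:
                for k, r in enumerate(_posterior(q[i], p, l, x)):
                    expected[k][l][x] += r
    return [[[c / sum(locus) for c in locus] for locus in pop] for pop in expected]


# Toy data: 4 individuals, 2 biallelic loci, K = 2; soft starting values.
genotypes = [[(0, 0), (0, 0)], [(0, 0), (0, 1)], [(1, 1), (1, 1)], [(1, 1), (0, 1)]]
q = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
p = [[[0.8, 0.2], [0.8, 0.2]], [[0.2, 0.8], [0.2, 0.8]]]
for _ in range(20):            # outer iterations
    for _ in range(10):        # several Q updates under fixed P, then one P update
        q = update_q(genotypes, q, p)
    p = update_p(genotypes, q, p)
print(log_likelihood(genotypes, q, p))
```

Each update is an EM step on one block of parameters with the other held fixed, so the log-likelihood (6) cannot decrease from one update to the next. The nested loop mirrors the second aspect described above, with fixed iteration counts standing in for a proper convergence test.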

Optimal K

The above-described clustering analysis and admixture analysis are conducted by assuming a given number of source populations, K. Apparently, different genetic structures would be inferred from the same genotype data if different K values are assumed. In some cases, a reasonable K value is roughly known. For example, individuals might be sampled from K known discrete locations (say, lakes), and the purposes of a structure analysis are to confirm that populations from different locations are indeed differentiated and thus distinguishable, to identify migrants between the locations, and to find out the patterns of genetic differentiations (e.g., whether isolation by distance applies or not). In many other cases, however, we may have no idea of the most likely K value. For example, individuals might be sampled from the same breeding or feeding ground and we wish to know how many populations are using the same ground, and to learn the properties of these populations from the individuals sampled and assigned to them. In such a situation of hidden genetic structure, we need first to identify the most likely one or more K values, and then investigate the corresponding structure/admixture.

Estimating the most likely K value from genotype data is difficult (Pritchard et al. 2000). Although many methods have been proposed and applied (see review by Wang 2019), they are all ad hoc to some extent and may be inaccurate in difficult situations such as highly unbalanced sampling from different populations and low differentiation (Wang 2019). Herein I propose two ad hoc estimators of K that can be calculated from the clustering analysis presented in this study. They have a satisfactory accuracy as checked by many test datasets, simulated and empirical.

The first estimator is based on the second order rate of change of the estimated log-likelihood as a function of K in a clustering analysis, D_LK2. This estimator is similar in spirit to the ∆K method of Evanno et al. (2005), but does not use the mean and standard deviation of log-likelihood values among replicate runs (for a given K value) because the standard deviation (the denominator of ∆K) is frequently zero thanks to the convergence of our clustering analysis by the simulated annealing algorithm.
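The exact definition of D_LK2 is given in Supplementary Appendix 3 and is not reproduced here. As a rough illustration of the idea only, the sketch below uses the plain second difference |L(K + 1) − 2L(K) + L(K − 1)| of the log-likelihood, i.e., the numerator of Evanno et al.'s ∆K without division by a standard deviation. This stand-in formula, the function name, and the log-likelihood values are all assumptions made up for the example.

```python
from typing import Dict


def second_order_change(log_liks: Dict[int, float]) -> Dict[int, float]:
    """|L(K+1) - 2 L(K) + L(K-1)| for every K that has both neighbours."""
    return {
        k: abs(log_liks[k + 1] - 2.0 * log_liks[k] + log_liks[k - 1])
        for k in sorted(log_liks)
        if k - 1 in log_liks and k + 1 in log_liks
    }


# Made-up log-likelihoods that rise steeply up to K = 3 and then level off.
log_liks = {1: -1000.0, 2: -800.0, 3: -600.0, 4: -590.0, 5: -585.0}
change = second_order_change(log_liks)
print(change)                       # {2: 0.0, 3: 190.0, 4: 5.0}
print(max(change, key=change.get))  # 3
```

A statistic of this kind cannot be evaluated at the smallest or largest K analysed, because both neighbouring K values are needed.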

The second estimator, denoted by F_STIS, is based on Wright’s (1984) F-statistics. The best K should produce the strongest population structure, with high differentiation (measured by F_ST) of each inferred cluster and low deviation from Hardy-Weinberg equilibrium (measured by F_IS) within each inferred cluster. Details of how to calculate the two estimators are in Supplementary Appendix 3.

Simulations

To evaluate the accuracy, robustness, and computational efficiency of the new methods implemented in PopCluster in comparison with other methods, I simulated and analysed data with different population structures and sampling intensities. The simulation procedure described below is implemented in the software package PopCluster.

Simulation 1, small samples

A population becomes difficult to define genetically when few individuals from it are sampled and included in an admixture analysis. However, a small sample of individuals can be common in practice when, for example, archaeological samples (usually few) are used in studying ancient population structure or in studying the relationship between ancient and current populations (e.g., Lazaridis et al. 2014). In a mixed stock analysis (Smouse et al. 1990) or a wildlife forensic analysis of source populations, there might also be few sampled individuals representing a rare population. To investigate the impact of sample sizes on an admixture analysis, I simulated 10 populations in an island model with F_ST = 0.05. N_k (=2, 3, …, 10 and 20) individuals were sampled from each of the 10 populations, or 1 individual was sampled from each of the first five populations and 2 individuals were sampled from each of the last five populations (the case N_k = 1.5, Table 1). Other simulation parameters are summarized in Table 1.

Table 1 Simulation parameters.

Simulation 2, many populations

Admixture becomes increasingly difficult to infer with an increasing K, the number of assumed populations, because the dimensions of both Q and P increase linearly with K. This contrasts with the number of individuals, N, and the number of loci, L, which determine the dimensions of Q only and P only, respectively. Therefore, the scale of an admixture analysis, in terms of the number of parameters to be estimated, is predominantly determined by K rather than N or L. I simulated data with a widely variable number of populations (K = [6, 100]) to see if the structure can be accurately reconstructed by using relatively highly informative markers (parameters in Table 1), especially when K is large, a situation rarely considered in previous simulation studies.

Simulation 3, spatial admixture model

The spatial admixture model resembles isolation by distance where population structure changes gradually as a function of geographic location. Under this model, populations are not discrete as assumed by admixture models and have no recognisable boundaries, posing challenges to an admixture analysis. To simulate the spatially gradual changes in genetic structure, I assume source populations 1, 2, …, K are equally spaced in that order along a line (say, a river in reality). Sampled individuals 1, 2, …, N are also equally spaced in that order on the same line. The admixture proportions of individual i, q_i = {q_i1, q_i2, …, q_iK}, being the proportional genetic contributions to i from source populations k, are a function of the individual’s proximity to these K source populations. Formally, we have

$$q_{ik} = \frac{q_{ik}^{\ast}}{\sum_{k=1}^{K} q_{ik}^{\ast}}$$

(7)

where

$$q_{ik}^{\ast} = \left[1 - \left(\frac{i - 1}{N - 1} - \frac{k - 1}{K - 1}\right)^2\right]^S$$

and parameter S is used to regulate the admixture extent of the N sampled individuals. Under this spatial admixture model, an individual i’s admixture (q_i) is determined by its location, or the distances from the K source populations. The first and the last sampled individuals (i = 1, N) always have the least admixture, measured by \(M_i = 1 - \sum_{k=1}^{K} q_{ik}^2\). q₁₁ (=q_NK) is always the largest among the q_ik values for i = 1, 2, …, N and k = 1, 2, …, K. Given a desired value of q₁₁ and K, the scalar parameter S can be solved from the above equations. Given K, N and S, q_i of an individual i can then be calculated from the above equations. In this study, I simulated and analysed samples generated with parameters K = 5, N = 500, L = 10000 SNPs, and q₁₁ varying between 0.5 and 1.0 (Table 1).
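The two steps described above, computing q_i from (7) and solving S for a desired q₁₁, can be sketched as follows. The function names and the use of bisection are assumptions made for illustration; the text does not state which root-finding method was used. Bisection is applicable because q₁₁ increases monotonically with S, from 1/K at S = 0 (Python evaluates 0.0 ** 0.0 as 1.0) towards 1 as S grows, so the target must lie strictly between 1/K and 1.

```python
from typing import List


def spatial_q(i: int, n: int, k_pops: int, s: float) -> List[float]:
    """Eq. (7): admixture proportions of individual i (1-based) of n on the line."""
    position = (i - 1) / (n - 1)
    raw = [
        (1.0 - (position - (k - 1) / (k_pops - 1)) ** 2) ** s
        for k in range(1, k_pops + 1)
    ]
    total = sum(raw)
    return [r / total for r in raw]


def solve_s(q11_target: float, k_pops: int,
            hi: float = 1e4, tol: float = 1e-10) -> float:
    """Bisection for S giving q_11 = q11_target (needs 1/K < target < 1)."""
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # q_11 does not depend on N: individual 1 always sits at position 0.
        if spatial_q(1, 2, k_pops, mid)[0] < q11_target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


s = solve_s(0.5, 5)                  # K = 5, desired q_11 = 0.5
print(s)
print(spatial_q(1, 500, 5, s))       # first individual, closest to population 1
print(spatial_q(250, 500, 5, s))     # an individual near the middle of the line
```

The limiting case q₁₁ = 1.0 corresponds to S → ∞ and cannot be reached by a finite S, so it would need to be treated separately.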

Simulation 4, low differentiation

Population structure analysis becomes increasingly difficult with a decreasing differentiation, usually measured by F_ST, among subpopulations. Fortunately, with genomic data of many SNPs, it is still possible to detect weak and subtle population structures (Patterson et al. 2006) as demonstrated in human fine-structure analysis (e.g., Leslie et al. 2015). I simulated data with varying weak population structures (low F_ST, Table 1) and otherwise ideal population (only 3 equally differentiated subpopulations) and sampling conditions (i.e., a large sample of individuals per subpopulation, and many SNPs). The number of SNPs used in the analysis was L = 1000/F_ST such that in principle the population structures should be inferred with roughly equal power and accuracy. Because L is large for low F_ST, STRUCTURE analysis was abandoned due to computational difficulties.

Simulation 5, unbalanced sampling

Samples of individuals from different source populations are rarely identical in size in practice. Frequently, different source populations are represented by different numbers of individuals in a sample. The impact of unbalanced sampling and how to mitigate it in applying STRUCTURE have been investigated (e.g., Puechmaille 2016; Wang 2017). Similar problems exist for other admixture or clustering analysis methods but have not been studied yet. The same population structure and unbalanced sampling schemes (see parameters in Table 1) used in Wang (2017) were used to simulate data, which were then analysed by various methods to understand their robustness to unbalanced sampling.

Simulation 6, computational efficiency

Samples from a variable number of populations (Table 1) were analysed by the four programs on a Linux cluster to compare their computational efficiencies. Each program used a single core (no parallelisation) of a processor (Intel Xeon Gold 6248 2.5 GHz) for a maximal allowed time of 48 or 72 (when K = 1024 only) hours. Default parameter settings were used for all four programs. For STRUCTURE, both burn-in and run lengths were set to 10⁴, although much higher burn-in is required for convergence when K is large (say K > 20). The running time for STRUCTURE is thus conservative, especially when K is not small.

Further simulations were conducted to investigate the effects of high admixture and the presence of familial relationships and inbreeding on the relative performance of different admixture analysis methods, as detailed in Supplementary Appendix 4.

In all simulations except for the spatial admixture model, I assumed a population with K discrete subpopulations in Wright’s (1931) island model in equilibrium among mutation, drift and migration. For a locus l (=1, 2, …, L) with J_l alleles, allele frequencies of the ancestral population, p_0l = {p_0l1, p_0l2, …, \(p_{0lJ_l}\)}, were drawn from a uniform Dirichlet distribution, \(\mathcal{D}\left(\lambda_1, \lambda_2, \ldots, \lambda_{J_l}\right)\), where λ_j = 1 for j = 1, 2, …, J_l. Given p_0l, allele frequencies of subpopulation k (=1, 2, …, K), p_kl = {p_kl1, p_kl2, …, \(p_{klJ_l}\)}, were drawn from a Dirichlet distribution, \(\mathcal{D}\left(\lambda_1, \lambda_2, \ldots, \lambda_{J_l}\right)\), where \(\lambda_j = \left(\frac{1}{F_{ST}} - 1\right)p_{0lj}\) for j = 1, 2, …, J_l (Nicholson et al. 2002; Falush et al. 2003). Given p_kl and the admixture proportion q_i of individual i, two alleles at locus l were drawn independently to form the individual’s genotype. The multilocus genotype of an individual was obtained by combining single locus genotypes sampled independently, assuming linkage equilibrium. N_k individuals were drawn at random from population k (= 1, 2, …, K), which were then pooled and subjected to a structure analysis.
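The sampling scheme described above can be sketched with the Python standard library alone. This is an illustrative reading of the paragraph, not the simulation code of PopCluster: the function names are assumptions, Dirichlet draws are obtained by normalising independent gamma variates, and each allele copy is drawn by first picking its source population according to q_i.

```python
import random
from typing import List, Sequence, Tuple


def dirichlet(alphas: Sequence[float], rng: random.Random) -> List[float]:
    """Draw from a Dirichlet distribution by normalising gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_locus(n_alleles: int, n_pops: int, f_st: float,
                   rng: random.Random) -> Tuple[List[float], List[List[float]]]:
    """Ancestral frequencies p_0l and subpopulation frequencies p_kl at one locus."""
    p0 = dirichlet([1.0] * n_alleles, rng)           # uniform Dirichlet
    scale = 1.0 / f_st - 1.0                         # lambda_j = (1/F_ST - 1) p_0lj
    pk = [dirichlet([scale * p for p in p0], rng) for _ in range(n_pops)]
    return p0, pk


def simulate_genotype(q_i: Sequence[float], pk: List[List[float]],
                      rng: random.Random) -> Tuple[int, int]:
    """Two allele copies drawn independently given q_i and p_kl."""
    alleles = []
    for _ in range(2):
        k = rng.choices(range(len(q_i)), weights=q_i)[0]        # source population
        alleles.append(rng.choices(range(len(pk[k])), weights=pk[k])[0])
    return alleles[0], alleles[1]


rng = random.Random(1)
p0, pk = simulate_locus(n_alleles=4, n_pops=3, f_st=0.05, rng=rng)
purebred = simulate_genotype([1.0, 0.0, 0.0], pk, rng)   # individual from population 1
admixed = simulate_genotype([0.5, 0.5, 0.0], pk, rng)    # F1-like individual
print(purebred, admixed)
```

A multilocus genotype would be obtained by repeating `simulate_locus` and `simulate_genotype` independently for each of the L loci, in line with the linkage equilibrium assumption.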

For the spatial population and sampling model, allele frequencies at a locus l, p_0l and p_kl, are generated as before, assuming F_ST = 0.05 among K = 5 subpopulations. A number of N = 500 individuals, equally spaced on the line between source populations 1 and 5, are sampled. The admixture proportion of individual i, q_i, is determined by its location, calculated by Eq. (7). Given p_kl and q_i, the multilocus genotype of individual i is simulated as described above.

For each parameter combination, 100 replicate datasets were simulated, analysed and assessed for estimation accuracy. Each dataset was analysed for admixture by different methods (see below for details) with an assumed K as used in simulations. I did not consider estimating the optimal K by analysing a simulated dataset in a range of possible K values. This is because, like previous studies (e.g., Pritchard et al. 2000; Alexander et al. 2009), I am more concerned with admixture inference under a given K, which is important of itself and forms the basis for inferring the optimal K as well. This is also because it is almost impossible computationally to estimate the optimal K for so many replicate datasets and so many parameter combinations in a large-scale simulation study like the present one, even when using large computer clusters. The optimal K was estimated for several empirical datasets (below).

Measurement of accuracy

Inference accuracy could be assessed by comparing, for each individual i, the agreement between simulated ancestry coefficients, q_i, and estimated ancestry coefficients, \(\widehat{\mathbf{q}}_i\), obtained by an admixture analysis assuming the true/simulated subpopulation number K. Because the reconstructed populations are labelled arbitrarily (Pritchard et al. 2000), no meaningful results can be gained by comparing q_i and \(\widehat{\mathbf{q}}_i\) directly, however. It is possible to relabel the reconstructed populations and find the labelling scheme that has the maximum agreement between q_i and \(\widehat{\mathbf{q}}_i\) as the measurement of accuracy. However, there are K! possible labelling schemes, making the approach difficult to calculate when K is large (say, K > 50).

The labelling becomes irrelevant when pairs of individuals are considered for the co-assignment probabilities (or coancestry) (Dawson and Belkhir 2001). I calculate and use the average difference between simulated and estimated coancestry for pairs of sampled individuals to measure the average assignment error, AAE (Wang 2017),

$$AAE = \left(\frac{1}{N\left(N - 1\right)/2} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left(\sum_{k=1}^{K} q_{ik}q_{jk} - \sum_{k=1}^{K} \widehat{q}_{ik}\widehat{q}_{jk}\right)^2\right)^{1/2}.$$

(8)

The minimum value of AAE is 0, when ancestry (admixture) is inferred perfectly. The maximum value is 1, when there are no admixed individuals in the sample, individuals from the same source population are always assigned to different populations and individuals from different source populations are always assigned to the same population. It is worth noting that the minimum AAE value of 0 is always possible for any population structure. However, the maximum value varies and can be much smaller than 1, depending on the actual underlying population structure. With an increasing K value or increasing admixture (i.e., q_ik→1/K for any individual i), the maximum value of AAE tends to decrease. For this reason, AAE cannot be compared fairly between different genetic structures (e.g., different K values, different actual Q for a given K, or different sizes of subsamples from the source populations) for measuring the relative inference qualities. However, it can always be used to compare the accuracy of different inference methods for a given simulated genetic structure and a given sample.
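As an illustration of Eq. 8, the sketch below computes AAE from a simulated and an estimated ancestry matrix, each given as a list of N rows of K proportions. The function names and the toy matrices are hypothetical and chosen only to show that AAE is unaffected by cluster labels.

```python
from itertools import combinations
from math import sqrt

def coancestry(q, i, j):
    """Co-assignment probability of individuals i and j: sum_k q_ik * q_jk."""
    return sum(a * b for a, b in zip(q[i], q[j]))

def average_assignment_error(q_true, q_est):
    """Root mean squared difference in pairwise coancestry (Eq. 8)."""
    n = len(q_true)
    pairs = list(combinations(range(n), 2))  # the N(N-1)/2 pairs with i < j
    total = sum(
        (coancestry(q_true, i, j) - coancestry(q_est, i, j)) ** 2
        for i, j in pairs
    )
    return sqrt(total / len(pairs))

# Toy data (made up): 4 unadmixed individuals, K = 2.
q_true = [[1, 0], [1, 0], [0, 1], [0, 1]]
q_swapped = [[0, 1], [0, 1], [1, 0], [1, 0]]  # same clusters, labels swapped
q_wrong = [[1, 0], [0, 1], [1, 0], [0, 1]]    # each true pair split up

aae_swapped = average_assignment_error(q_true, q_swapped)
aae_wrong = average_assignment_error(q_true, q_wrong)
print(aae_swapped, aae_wrong)
```

In this toy case the label swap gives an AAE of 0, while the worst clustering available for this structure gives a value below 1, in line with the observation that the attainable maximum depends on the underlying population structure.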

Analysis of real datasets

An ant dataset

This dataset was originally used in a study of the mating system of an ant species, Leptothorax acervorum (Hammond et al. 2001). Ten sampled colonies, A, B, C, D, E, F, G, H, I, and J, contribute respectively 9, 7, 47, 45, 45, 45, 45, 45, 44, and 45 diploid workers to a sample of 377 individuals. For this species, we know that each colony is headed by a single diploid queen mated with a single haploid male. Therefore, workers from the same colony are full-sibs and workers from different colonies are non-sibs. Each sampled worker was genotyped at up to 6 microsatellite loci, which have 3 to 22 alleles per locus observed in the 377 individuals. This dataset was analysed to reconstruct the genetic structure of the sample, which in this case is the family structure. ADMIXTURE and sNMF cannot handle multiallelic marker data, and therefore only STRUCTURE and PopCluster were used to analyse this dataset.

For STRUCTURE, I used the default parameter settings, except for the burn-in and run lengths, which were both set to 10⁵ to reduce the risk of non-convergence. Two analyses were conducted. First, optimal K values were determined using three estimators (Wang 2019) calculated from STRUCTURE outputs, and using the D_LK2 estimator of PopCluster. For this purpose, 20 replicate runs for each possible K value in the range [1, 15] were conducted by both STRUCTURE and PopCluster. Second, assuming K = 10, 100 replicate runs (each with a distinct seed for the random number generator) were conducted by both STRUCTURE and PopCluster to investigate their convergence.
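The replicate design described above can be laid out programmatically. The following is a minimal sketch that only builds the list of runs (K, replicate number, seed) and guarantees a distinct seed per run; the function name replicate_plan and the master seeds are hypothetical, and the sketch does not invoke STRUCTURE or PopCluster.

```python
import random

def replicate_plan(k_values, n_reps, master_seed=12345):
    """Return a list of (K, replicate, seed) tuples with a distinct seed per run."""
    k_values = list(k_values)
    rng = random.Random(master_seed)  # master seed makes the plan reproducible
    n_runs = len(k_values) * n_reps
    # Sampling without replacement guarantees that all seeds differ.
    seeds = rng.sample(range(1, 2**31 - 1), n_runs)
    plan = []
    for idx, k in enumerate(k_values):
        for rep in range(n_reps):
            plan.append((k, rep + 1, seeds[idx * n_reps + rep]))
    return plan

# Analysis 1: 20 replicate runs for each K in [1, 15].
k_scan = replicate_plan(range(1, 16), 20)
# Analysis 2: 100 replicate runs at K = 10 to examine convergence.
convergence = replicate_plan([10], 100, master_seed=67890)

print(len(k_scan), len(convergence))
```

Each tuple can then be passed to whichever program is being run, using that program's own options for K and the random seed.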

An Arctic charr dataset

Shikano et al. (2015) sampled 328 Arctic charr individuals from 6 locations in northern Fennoscandia: two lakes (Galggojavri and Gallajavri) and one pond (Leenanlampi) in the Skibotn watercourse, which drains into the Atlantic Ocean, and three lakes (Somasjärvi, Urtas-Riimmajärvi and Kilpisjärvi) in the Tornio-Muoniojoki watercourse, which drains into the Baltic Sea. Individuals were genotyped at 15 microsatellite loci to study the genetic structure and demography. The data were again analysed by STRUCTURE and PopCluster but not by ADMIXTURE and sNMF because the markers are multiallelic. I conducted two separate analyses of the genotype data. First, I estimated the most likely K value by each program, making 20 replicate runs with each K value in the range [1, 10]. Second, I investigated the convergence of each program by conducting 100 replicate runs of the data at K = 6. STRUCTURE analyses were run with default parameter settings except that both burn-in and run lengths were set to 10⁵.

A human SNP dataset

Using FRAPPE (Tang et al. 2005), Li et al. (2008) studied the world-wide human population structure represented by 938 individuals sampled from 51 populations of the Human Genome Diversity Panel (HGDP). Each individual was genotyped at 650000 common SNP loci. The data were expanded to include genotypes of 1043 individuals at 644258 SNPs, available from http://www.cephb.fr/en/hgdp_panel.php#basedonnees. In this study, the expanded data were comparatively analysed by PopCluster, ADMIXTURE, and sNMF, assuming K = 7 clusters (regions) as in the original study (Li et al. 2008). STRUCTURE was too slow to analyse this big dataset and thus it was abandoned.

The human 1000 genomes phase I dataset

The dataset (Abecasis et al. 2012), available from https://www.internationalgenome.org/data/, has 1092 human individuals sampled from 14 populations across all continents, with each individual having 38 million SNP genotypes. After removing monomorphic loci (note that, in contrast to other studies, no pruning was applied regarding missing data, minor allele frequency or linkage disequilibrium), genotypes at L = 38035992 SNPs were analysed by PopCluster and sNMF, assuming K = 9 clusters (regions). Both STRUCTURE and ADMIXTURE were too slow to analyse this huge dataset and were therefore not used. No attempt was made to find the optimal K for this dataset as was done for the ant and Arctic charr datasets, because too much computational time is required for PopCluster or sNMF to analyse the data with a number of replicate runs at each of a number of K values even when using a large cluster, and because there might be multiple K values that explain the data equally well (at different spatial and time scales). For a better understanding of the world-wide human population genetic structure, the data should be analysed with at least one replicate under each of a number of possible K values, say K in the range [5, 12], to reveal and compare the genetic structure. This study analysed the data at a single value, K = 9, for the purpose of demonstrating the capacity of different methods and comparing the admixture estimates of PopCluster and sNMF at this particular value of K. Because of the incompleteness of the analysis, the biological interpretations of the results should be taken with caution.
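The monomorphic-locus filter mentioned above can be sketched as follows. The genotype coding (counts 0, 1 or 2 of one allele per individual, with -1 for missing data), the function names, and the treatment of loci with no data are all assumptions for illustration; the study does not describe how the filter was implemented.

```python
def is_monomorphic(genotypes, missing=-1):
    """True if only one allele is observed at a diallelic locus.

    genotypes: per-individual counts (0, 1, 2) of one allele; `missing`
    marks missing data (an assumed coding).
    """
    observed = [g for g in genotypes if g != missing]
    if not observed:
        return True  # assumption: a locus with no data is uninformative
    allele_count = sum(observed)
    # Monomorphic when the counted allele is absent or fixed in the sample.
    return allele_count == 0 or allele_count == 2 * len(observed)

def drop_monomorphic(loci, missing=-1):
    """Keep only the loci at which both alleles are observed."""
    return [g for g in loci if not is_monomorphic(g, missing)]

# Toy data (made up): 4 loci, each genotyped in 4 individuals.
loci = [
    [0, 0, 0, 0],    # monomorphic: counted allele absent
    [0, 1, 2, -1],   # polymorphic, one missing genotype
    [2, 2, -1, 2],   # monomorphic: counted allele fixed
    [1, 1, 1, 1],    # polymorphic: all heterozygotes
]
kept = drop_monomorphic(loci)
print(len(kept))
```

For tens of millions of SNPs, a streaming or array-based version of the same test would be needed; the list-based form here is only meant to show the criterion.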

Comparative analyses by different software

I compared the accuracy and computational time of STRUCTURE (Pritchard et al. 2000; Falush et al. 2003), ADMIXTURE (Alexander et al. 2009), sNMF (Frichot et al. 2014) and PopCluster in analysing both the simulated and the empirical datasets described above. Quite a few other model-based methods implemented in various software exist. I chose STRUCTURE and ADMIXTURE because they are the most popular model-based admixture analysis methods used for small and large datasets, respectively. I also chose sNMF because it is a very fast model-based method that works for huge datasets for which other methods, such as ADMIXTURE, fail to run or take an unrealistically long time to run.

STRUCTURE can handle both diallelic (such as SNPs) and multiallelic (such as microsatellites) markers, but runs too slowly to analyse large datasets with many markers, many individuals, or many populations. It was therefore used to analyse all simulated and empirical datasets with no more than 10000 loci. The default parameter setting was used for most datasets, with a burn-in length of 10⁴ and a run length of 10⁴ iterations. For better convergence, the burn-in and run lengths were increased to 10⁵ iterations for analyses involving a large number of simulated populations (say, when K ≥ 10) or for analyses of empirical datasets. For unbalanced sampling, the alternative ancestry model instead of the default model was used by setting POPALPHAS = 1.

Both ADMIXTURE and sNMF were developed specifically for diallelic markers and could not analyse multiallelic marker data. In this study, they were used to analyse SNP data only. For the human 1000 genomes phase I data, however, ADMIXTURE could not complete the analysis within a realistic period of time (48 h, the maximum allowed in the Linux cluster used for the analysis) even when the maximal number of parallel threads was used. Therefore, only sNMF and PopCluster were used to analyse this dataset.

To understand the relative computational efficiency and how much speedup can be gained by parallelisation, ADMIXTURE, sNMF and PopCluster were used to analyse the HGDP dataset and the 1000 genomes dataset with a variable number of parallel threads on a Linux cluster with many nodes, each having 32 cores. The maximum wall clock time allowed for a job on the cluster is 48 h.
