
# Rank-invariant estimation of inbreeding coefficients

### Statistical sampling

We can describe the dependence between pairs of uniting alleles in a single population without invoking an evolutionary model for the history of the population. In this “statistical sampling” framework (Weir, 1996) we do not consider the variation associated with evolutionary processes but we do consider the variation among samples from the same population. Although extensive sets of genetic data allow individual-level inbreeding coefficients to be estimated with high precision, we start with population-level estimation.

Allelic dependencies can be quantified with the within-population inbreeding coefficient, written here as \(f_W\) to emphasize it is a within-population quantity, defined by

$${H}_{l}=2{p}_{l}(1-{p}_{l})(1-{f}_{W})$$

(1)

where \(H_l\) is the population proportion of heterozygotes for the reference allele at SNP \(l\) and \(p_l\) is the population proportion of that allele. The same value of \(f_W\) is assumed to apply for all SNPs. An immediate consequence of this definition is that the population proportions of homozygotes for the reference and alternative alleles are \(p_l^2+p_l(1-p_l)f_W\) and \((1-p_l)^2+p_l(1-p_l)f_W\) respectively. This formulation allows \(f_W\) to be negative, with the maximum of \(-p_l/(1-p_l)\) and \(-(1-p_l)/p_l\) as lower bound. It is bounded above by 1. Hardy–Weinberg equilibrium (HWE) corresponds to \(f_W=0\), and textbooks (e.g., Hedrick, 2000) point out that negative values of \(f_W\) indicate more heterozygotes than expected under HWE.

Observed heterozygote proportions \(\tilde{H}_l\) have \(H_l\) as within-population expectation \(\mathcal{E}_W\) over samples from the study population, \(\mathcal{E}_W(\tilde{H}_l)=H_l\), and this would provide a simple estimator of \(f_W\) if the population allele proportions were known. In practice, however, these proportions are unknown. Steele et al. (2014) suggested use of data external to the study sample to provide reference allele proportions in forensic applications where a reference database is used for making inferences about the population relevant for a particular crime. The more usual approach is to use study sample proportions \(\tilde{p}_l\) in place of the true proportions \(p_l\), as in equation 1 of Li & Horvitz (1953):

$$\hat{f}_{W_l}=1-\frac{\tilde{H}_l}{2\tilde{p}_l(1-\tilde{p}_l)}$$

(2)

The moment estimator in Eq. (2) is also an MLE of \(f_W\) when only one locus is considered, but it is biased (Robertson & Hill, 1984): not only is it a ratio of statistics, but the expected value \(\mathcal{E}_W[2\tilde{p}_l(1-\tilde{p}_l)]\) over repeated samples of \(n\) individuals from the population is \(2p_l(1-p_l)[1-(1+f_W)/(2n)]\) (e.g., Weir, 1996, p39).
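As an illustration only, the single-SNP estimator in Eq. (2) and the expectation quoted above can be sketched in a few lines of Python. The function names and the use of genotype counts as input are assumptions made for this example, not part of any published software.

```python
def f_hat_single_snp(n_ref_hom, n_het, n_alt_hom):
    """Moment estimator of f_W at one SNP (Eq. 2) from genotype counts."""
    n = n_ref_hom + n_het + n_alt_hom
    h_tilde = n_het / n                          # observed heterozygote proportion
    p_tilde = (2 * n_ref_hom + n_het) / (2 * n)  # sample reference-allele proportion
    denom = 2 * p_tilde * (1 - p_tilde)
    if denom == 0:
        raise ValueError("SNP is monomorphic in the sample")
    return 1 - h_tilde / denom


def expected_sample_heterozygosity(p, f_w, n):
    """E_W[2 p~ (1 - p~)] = 2 p (1 - p) [1 - (1 + f_W) / (2n)]."""
    return 2 * p * (1 - p) * (1 - (1 + f_w) / (2 * n))


# Hypothetical counts: 30 reference homozygotes, 40 heterozygotes, 30 alternative homozygotes.
print(f_hat_single_snp(30, 40, 30))
# Expected value of the denominator statistic for p = 0.5, f_W = 0, n = 10.
print(expected_sample_heterozygosity(0.5, 0.0, 10))
```

The second function makes the source of the small-sample bias visible: the denominator statistic falls short of \(2p_l(1-p_l)\) by a factor that depends on both \(n\) and \(f_W\).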

This approach can be used to estimate the within-population inbreeding coefficient \(f_j\) for each individual \(j\) in a sample from one population. These are the “simple” estimators of Hall et al. (2012) and the \(\hat{f}_{\mathrm{HOM}_j}\) of Yengo et al. (2017):

$$\hat{f}_{\mathrm{HOM}_{jl}}=1-\frac{\tilde{H}_{jl}}{2\tilde{p}_l(1-\tilde{p}_l)}$$

(3)

The sample heterozygosity indicator \(\tilde{H}_{jl}\) is one if individual \(j\) is heterozygous at SNP \(l\) and is zero otherwise. Averaging Eq. (3) over individuals gives the estimator based on SNP \(l\) in Eq. (2).

A single SNP provides estimates that are either 1 or a negative value depending on \(\tilde{p}_l\), so many SNPs are used in practice. In both Hall et al. (2012) and Yengo et al. (2017) data were combined over loci as weighted or “ratio of averages” estimators:

$$\hat{f}_{\mathrm{Hom}_j}=1-\frac{\sum_l \tilde{H}_{jl}}{\sum_l [2\tilde{p}_l(1-\tilde{p}_l)]}$$

(4)

Gazal et al. (2014) referred to this estimator as \(f_{\mathrm{PLINK}}\) as it is an option in PLINK. We show below the good performance of this weighted estimator for large sample sizes and large numbers of loci. We will consider throughout that a large number \(L\) of SNPs are used so that ratios of sums of statistics over loci, such as in Eq. (4), have expected values equal to the ratio of expected values of their numerators and denominators. Ochoa & Storey (2021) showed statistics of the form \(\tilde{A}_L/\tilde{B}_L\), where \(\tilde{A}_L=\sum_{l=1}^{L}a_l/L\) and \(\tilde{B}_L=\sum_{l=1}^{L}b_l/L\), have expected values that converge almost surely to the ratio \(A/B\) when \(\mathcal{E}_W(\tilde{A}_L)=Ac_L\) and \(\mathcal{E}_W(\tilde{B}_L)=Bc_L\). This result rests on the expectations \(\mathcal{E}_W(a_l)=Ac_l\) and \(\mathcal{E}_W(b_l)=Bc_l\) with \(c_L=\sum_{l=1}^{L}c_l/L\). It requires \(a_l\) and \(b_l\) to both be no greater than some finite quantity \(C\), \(c_L\) to converge to a finite value \(c\) as \(L\) increases, and \(Bc\) not to be zero. For the ratio in Eq. (4), \(a_l=\tilde{H}_{jl}\) and \(b_l=2\tilde{p}_l(1-\tilde{p}_l)\), so \(A=(1-f_j)\), \(B=1\) for large sample sizes \(n\), and \(c_L=\sum_l 2p_l(1-p_l)/L\le 1/2\). The conditions are satisfied provided at least one SNP is polymorphic. For an “average of ratios” estimator of the form \(\sum_{l=1}^{L}(a_l/b_l)/L\), the denominators \(b_l\) can be very small and convergence of its expected value is not assured.
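The ratio-of-sums form of Eq. (4) can be sketched as follows, assuming genotypes are held as a list of per-individual lists of reference-allele dosages (0, 1 or 2) with no missing values. The data layout and function names are illustrative assumptions.

```python
def sample_allele_freqs(X):
    """Sample reference-allele proportions p~_l from a dosage matrix X[j][l]."""
    n, L = len(X), len(X[0])
    return [sum(X[j][l] for j in range(n)) / (2 * n) for l in range(L)]


def f_hom(X):
    """Ratio-of-sums estimator of f_j (Eq. 4), one value per individual."""
    p = sample_allele_freqs(X)
    denom = sum(2 * pl * (1 - pl) for pl in p)   # sum over loci of 2 p~ (1 - p~)
    if denom == 0:
        raise ValueError("no polymorphic SNP in the sample")
    # A dosage of 1 marks a heterozygote, so the numerator is a count of heterozygous SNPs.
    return [1 - sum(1 for x in row if x == 1) / denom for row in X]


# Hypothetical toy data: 4 individuals, 4 SNPs.
X = [[0, 0, 2, 2],
     [1, 1, 1, 1],
     [2, 1, 0, 1],
     [1, 2, 1, 0]]
print(f_hom(X))
```

Because the sum over loci is taken before dividing, a SNP with a very small \(2\tilde{p}_l(1-\tilde{p}_l)\) contributes little to the denominator rather than producing an extreme per-SNP ratio.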

As an alternative to using sample allele frequencies, Hall et al. (2012) used maximum likelihood to estimate population allele proportions for multiple loci whereas Ayres & Balding (1998) used Markov chain Monte Carlo methods in a Bayesian approach that integrated out the allele proportion parameters. Neither of those papers considered data of the size we now face in sequence-based studies of many organisms, and we doubt the computational effort to estimate, or integrate over, hundreds of millions of allele proportions in Eqs. (2) or (4) adds much value to inferences about f. The allele-sharing estimators we describe below regard allele probabilities as unknown nuisance parameters and we show how to avoid estimating them or assigning them values.

Hall et al. (2012) used an EM algorithm to find MLEs for \(f_j\) when population allele proportions were regarded as being known and equal to sample proportions. Alternatively, a grid search can be conducted over the range of validity for the single parameter \(f_j\) that maximizes the log-likelihood

$$\ln[\mathrm{Lik}(f_j)]=\mathrm{Constant}+\sum_{l=1}^{L}\left\{\tilde{H}_{jl}\ln(1-f_j)+(1-\tilde{H}_{jl})\ln[1-2\tilde{p}_l(1-\tilde{p}_l)(1-f_j)]\right\}$$
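A minimal sketch of the grid search, assuming the sample allele proportions are all strictly between 0 and 1 (monomorphic SNPs removed beforehand) and taking the range of validity to be the lower bounds stated earlier for \(f_W\), applied per SNP. The grid resolution and function names are arbitrary choices for illustration.

```python
import math


def log_lik(f, het, p):
    """Log-likelihood of f_j up to an additive constant.

    het[l] is the heterozygosity indicator and p[l] the sample allele proportion.
    """
    total = 0.0
    for h, pl in zip(het, p):
        if h:
            total += math.log(1 - f)   # ln[2 p (1 - p)] is absorbed in the constant
        else:
            total += math.log(1 - 2 * pl * (1 - pl) * (1 - f))
    return total


def mle_grid(het, p, steps=2000):
    """Grid search for f_j strictly inside its range of validity."""
    # Lower bound: every SNP needs non-negative homozygote probabilities.
    lower = max(max(-pl / (1 - pl), -(1 - pl) / pl) for pl in p)
    best_f, best_ll = None, -math.inf
    for k in range(1, steps):          # open interval (lower, 1)
        f = lower + (1 - lower) * k / steps
        ll = log_lik(f, het, p)
        if ll > best_ll:
            best_f, best_ll = f, ll
    return best_f


# Hypothetical individual: 3 heterozygous SNPs out of 10, all with p~ = 0.5.
print(mle_grid([1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [0.5] * 10))
```

When every SNP has \(\tilde{p}_l=0.5\), setting the derivative of the log-likelihood to zero gives \(1-2h/L\) for \(h\) heterozygous SNPs out of \(L\), which coincides with Eq. (4) and provides a simple check on the search.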

Estimation of the within-population inbreeding coefficients \(f_W\) (\(F_{IS}\) of Wright, 1922) and \(f_j\) does not require any information beyond genotype proportions in samples from a study population, nor does it make any assumptions about that population or the evolutionary forces that shaped the population. The coefficients are simply measures of dependence of pairs of alleles within individuals.

### Genetic sampling

Inbreeding parameters of most interest in genetic studies are those that recognize the contribution of previous generations to inbreeding in the present study population. This requires accounting for “genetic sampling” (Weir, 1996) between generations, thereby leading to an identity-by-descent (ibd) interpretation of inbreeding: ibd alleles descend from a single allele in a reference population. It also allows the prediction of inbreeding coefficients by path counting when pedigrees are known (Wright, 1922). If individual \(J\) is ancestral to both individuals \(j'\) and \(j''\), and if there are \(n\) individuals in the pedigree path joining \(j'\) to \(j''\) through \(J\), then \(F_j=\sum(0.5)^n(1+F_J)\) where \(F_J\) is the inbreeding coefficient of ancestor \(J\) and \(F_j\) is the inbreeding coefficient of offspring \(j\) of parents \(j'\) and \(j''\). The sum is over all ancestors \(J\) and all paths joining \(j'\) to \(j''\) through \(J\). The expression is also the coancestry \(\theta_{j'j''}\) of \(j'\) and \(j''\): the probability an allele drawn randomly from \(j'\) is ibd to an allele drawn randomly from \(j''\).
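The path-counting formula can be checked on standard pedigrees with a short sketch. Representing each path as a pair (number of individuals in the path, inbreeding coefficient of the common ancestor) is an assumption made for this example.

```python
def path_count_inbreeding(paths):
    """Inbreeding coefficient F_j by path counting.

    `paths` holds one (n, F_J) pair per path joining the parents through a
    common ancestor J: n individuals in the path, F_J the ancestor's inbreeding.
    """
    return sum(0.5 ** n * (1 + F_J) for n, F_J in paths)


# Offspring of full sibs: two common ancestors (the shared parents),
# each path has 3 individuals (sib, parent, sib); ancestors not inbred.
print(path_count_inbreeding([(3, 0.0), (3, 0.0)]))   # 0.25

# Offspring of first cousins: two common grandparents, 5 individuals per path.
print(path_count_inbreeding([(5, 0.0), (5, 0.0)]))   # 0.0625
```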

The allele proportion \(p_l\) in a study population has expectation \(\pi_l\) over evolutionary replicates of the population from an ancestral reference population to the present time. Sample allele proportions \(\tilde{p}_l\) provide information about the population proportions \(p_l\), and their statistical sampling properties follow from the binomial distribution. We do not invoke a specific genetic sampling distribution for the \(p_l\) about their expectations \(\pi_l\), although we do assume the second moments of that distribution depend on probabilities of ibd for pairs of alleles. One consequence of the assumed moments is that the probability of individual \(j\) in the study sample being heterozygous, i.e., the total expected value \(\mathcal{E}_T\) of the heterozygosity indicator over replicates of the history of that individual, is

$$\mathcal{E}_T(\tilde{H}_{jl})=2\pi_l(1-\pi_l)(1-F_j)$$

(5)

The quantity \(F_j\) is the individual-specific version of \(F_{IT}\) of Wright (1922) and we can regard it as the probability the two alleles at any locus for individual \(j\) are ibd. There is an implicit assumption in Eq. (5) that the reference population needed to define ibd is infinite and in HWE: there is probability \(F_j\) that \(j\) has homologous alleles with a single ancestral allele in that population and probability \((1-F_j)\) of \(j\) having homologous alleles with distinct ancestral alleles there. In the first case, the single ancestral allele has probability \(\pi_l\) of being the reference allele for that locus; in the second, the implicit assumption is that two ancestral alleles are both the reference type with probability \(\pi_l^2\). This does not mean there is an actual ancestral population with those properties, any more than use of \(\mathcal{E}_T\) means there are actual replicates of the history of any population or individual, and we note that Eq. (5) does not allow higher heterozygosity than predicted by HWE. Nonetheless, the concept of ibd allows theoretical constructions of great utility and we now present a framework for approaching empirical situations.

Inbreeding, or ibd, implies a common ancestral origin for uniting alleles and statements about sample allele proportions \(\tilde{p}_l\) require consideration of possible ibd for other pairs of alleles in the sample. The total expectation of \(2\tilde{p}_l(1-\tilde{p}_l)\) over samples from the population and over evolutionary replicates of the study population is (Weir, 1996, p176)

$$\mathcal{E}_T[2\tilde{p}_l(1-\tilde{p}_l)]=2\pi_l(1-\pi_l)\left[(1-\theta_S)-\frac{1}{2n}\left(1+F_W-2\theta_S\right)\right]$$

(6)

where \(F_W\) is the parametric inbreeding coefficient averaged over sample members, \(F_W=\sum_{j=1}^{n}F_j/n\), and \(\theta_S\) is the average parametric coancestry in the sample, \(\theta_S=\sum_{j=1}^{n}\sum_{j'\ne j}\theta_{jj'}/[n(n-1)]\). Equivalent expressions were given by McPeek et al. (2004) and DeGiorgio and Rosenberg (2009). We note the relationship \(f_W=(F_W-\theta_S)/(1-\theta_S)\) given by Wright (1922) and we showed in WG17 the equivalent expression \(f_j=(F_j-\theta_S)/(1-\theta_S)\) for individual-specific values (\(\theta_S\) is Wright’s \(F_{ST}\)).

For a large number of SNPs, the expectation of a ratio estimator of the type considered here is the ratio of expectations (Ochoa & Storey, 2021). Therefore, the total expectations of the \(\hat{f}_{\mathrm{Hom}_j}\), taking into account both statistical and genetic sampling, are

$$\mathcal{E}_T(\hat{f}_{\mathrm{HOM}_j})=1-\frac{1-F_j}{(1-\theta_S)-\frac{1}{2n}\left(1+F_W-2\theta_S\right)}=\frac{f_j-\frac{1}{2n}(1+f_W)}{1-\frac{1}{2n}(1+f_W)}$$

(7)

For all sample sizes, \(\hat{f}_{\mathrm{HOM}_j}\) has an expected value less than the true value \(f_j\), with the bias being of the order of \(1/n\). The ranking of \(\mathcal{E}_T(\hat{f}_{\mathrm{HOM}_j})\) values, however, is the same as the ranking of the \(f_j\) and, therefore, of the \(F_j\). For large sample sizes, Eq. (7) reduces to \(\mathcal{E}_T(\hat{f}_{\mathrm{HOM}_j})=f_j\). Averaging over individuals shows that \(\mathcal{E}_T(\hat{f}_{\mathrm{HOM}})=f_W\): the population-level estimator in Eq. (2) has total expectation of \(f_W\), not \(F_W\).
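The two forms of Eq. (7), the direction of the bias, and the preservation of ranking can be verified numerically. The parameter values below are invented purely for illustration, and the function names are not from any package.

```python
def within(F, theta_S):
    """Within-population coefficient f = (F - theta_S) / (1 - theta_S)."""
    return (F - theta_S) / (1 - theta_S)


def expected_f_hom_total(F_j, F_W, theta_S, n):
    """First form of Eq. (7), in terms of the ibd parameters."""
    denom = (1 - theta_S) - (1 + F_W - 2 * theta_S) / (2 * n)
    return 1 - (1 - F_j) / denom


def expected_f_hom_within(f_j, f_W, n):
    """Second form of Eq. (7), in terms of within-population parameters."""
    c = (1 + f_W) / (2 * n)
    return (f_j - c) / (1 - c)


# Hypothetical parameter values.
theta_S, F_W, n = 0.04, 0.15, 20
for F_j in (0.05, 0.10, 0.30):
    f_j = within(F_j, theta_S)
    e1 = expected_f_hom_total(F_j, F_W, theta_S, n)
    e2 = expected_f_hom_within(f_j, within(F_W, theta_S), n)
    print(F_j, f_j, e1, e2)   # e1 and e2 agree and both lie below f_j
```

The expectation is an increasing linear function of \(f_j\) with the same constants for every individual, which is why the ranking is preserved despite the bias.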

A different outcome is found for the \(\hat{f}_{\mathrm{UNI}_j}\) estimator of Yengo et al. (2017) (i.e., \(\hat{f}^{III}\) of Yang et al. (2011); \(\hat{f}_{\mathrm{GCTA}3}\) of Gazal et al. (2014)). This estimator, with the weighted (w) ratio of averages over loci we recommend, as opposed to the unweighted (u) average of ratios over loci used in their papers, is

$$\hat{f}^{w}_{\mathrm{UNI}_j}=\frac{\sum_{l=1}^{L}[X_{jl}^{2}-(1+2\tilde{p}_l)X_{jl}+2\tilde{p}_l^{2}]}{\sum_{l=1}^{L}2\tilde{p}_l(1-\tilde{p}_l)}$$

(8)

In this equation \(X_{jl}\) is the reference allele dosage, the number of copies of the reference allele, at SNP \(l\) for individual \(j\). It is equivalent to the estimator given by Ritland (1996, eq. 5) and attributed by him to Li & Horvitz (1953).

Ochoa & Storey (2021) showed that \(\hat{f}^{w}_{\mathrm{UNI}_j}\) has expectation, for a large number of SNPs and a large sample size, of

$$\mathcal{E}_T(\hat{f}^{w}_{\mathrm{UNI}_j})=\frac{F_j-2\Psi_j+\theta_S}{1-\theta_S}=f_j-2\psi_j$$

(9)

where \(\Psi_j\) is the average coancestry of individual \(j\) with other members of the study sample: \(\Psi_j=\sum_{j'=1,j'\ne j}^{n}\theta_{jj'}/(n-1)\). We term \(\psi_j=(\Psi_j-\theta_S)/(1-\theta_S)\) the within-population individual-specific average kinship coefficient. The \(\Psi_j\) have an average of \(\theta_S\) over members of the sample, so the average of the \(\psi_j\)’s is zero and the expected value of the average of the \(\hat{f}^{w}_{\mathrm{UNI}_j}\) is \(f_W\), as is the case for \(\hat{f}_{\mathrm{AS}_j}\) below.

Equation (9) shows that the \(\hat{f}^{w}_{\mathrm{UNI}_j}\) have expected values with the same ranking as the \(F_j\) values only if every individual \(j\) in the sample has the same average kinship \(\psi_j\) with other sample members.

Finally, we mention another common estimator described by VanRaden (2008), termed \(f_{\mathrm{GCTA}1}\) by Gazal et al. (2014) and available from the GCTA software (Yang et al., 2011) with option --ibc. We referred to this as the “standard” estimator in WG17. The weighted version for multiple loci is

$$\hat{f}^{w}_{\mathrm{STD}_j}=\frac{\sum_l (X_{jl}-2\tilde{p}_l)^{2}}{\sum_l 2\tilde{p}_l(1-\tilde{p}_l)}-1$$

(10)

and it has the large-sample expectation of \(f_j-4\psi_j\), as is implied by WG17 (Eq. 13) and as was given by Ochoa & Storey (2021). We summarize the various measures of inbreeding and coancestry in Table 1, and we include sample sizes in the expectations shown in Table 2.

The \(\hat{f}_{\mathrm{HOM}}\), \(\hat{f}_{\mathrm{UNI}}\), \(\hat{f}_{\mathrm{STD}}\) and \(\hat{f}_{\mathrm{MLE}}\) estimators of individual or population inbreeding coefficients make explicit use of sample allele proportions. This means that all four have small-sample biases, and none of the four provide estimates of the ibd quantities \(F\) or \(F_j\). We showed that \(\hat{f}_{\mathrm{HOM}}\) is actually estimating the within-population inbreeding coefficients, that is, the total inbreeding coefficients relative to the average coancestry of pairs of individuals in the sample, whereas \(\hat{f}_{\mathrm{UNI}}\) and \(\hat{f}_{\mathrm{STD}}\) are estimating expressions that also involve average kinships \(\psi\).

### Allele sharing

In a genetic sampling framework, and with the ibd viewpoint, we consider within-individual allele sharing proportions \(A_{jl}\) for SNP \(l\) in individual \(j\) (we wrote \(M\) rather than \(A\) in WG17 and in Goudet et al. (2018)). These equal one for homozygotes and zero for heterozygotes, and sample values can be expressed in terms of allele dosages, \(\tilde{A}_{jl}=(X_{jl}-1)^{2}\). We also consider between-individual sharing proportions \(A_{jj'l}\) for SNP \(l\) and individuals \(j\) and \(j'\). These are equal to one for both individuals being the same homozygote, zero for different homozygotes, and 0.5 otherwise. Observed values can be written as \(\tilde{A}_{jj'l}=[1+(X_{jl}-1)(X_{j'l}-1)]/2\), with an average over all pairs of distinct individuals in a sample of \(\tilde{A}_{Sl}\). Astle & Balding (2009) introduced \(\tilde{A}_{jj'l}\) as a measure of identity in state of alleles chosen randomly from individuals \(j\) and \(j'\), and Ochoa & Storey (2021) used a simple transformation of this quantity. The allele sharing for an individual with itself is \(A_{jjl}=(1+A_{jl})/2\).

The same logic that led to Eq. (5) provides total expectations for allele-sharing proportions for all \((j,j')\):

$$\begin{array}{lll}\mathcal{E}_T(\tilde{A}_{jj'l})&=&1-2\pi_l(1-\pi_l)(1-\theta_{jj'})\\ \mathcal{E}_T(\tilde{A}_{Sl})&=&1-2\pi_l(1-\pi_l)(1-\theta_S)\end{array}$$

Note that \(\theta_{jj}=(1+F_j)/2\). The nuisance parameter \(2\pi_l(1-\pi_l)\) cancels out of the ratio \(\mathcal{E}_T(\tilde{A}_{jj'l}-\tilde{A}_{Sl})/\mathcal{E}_T(1-\tilde{A}_{Sl})\) and this motivates definitions of allele-sharing estimators of the inbreeding coefficient for individual \(j\) and the kinship coefficient for individuals \((j,j')\) as

$$\hat{f}_{\mathrm{AS}_j}=\frac{\sum_l(\tilde{A}_{jl}-\tilde{A}_{Sl})}{\sum_l(1-\tilde{A}_{Sl})},\quad \hat{\psi}_{\mathrm{AS}_{jj'}}=\frac{\sum_l(\tilde{A}_{jj'l}-\tilde{A}_{Sl})}{\sum_l(1-\tilde{A}_{Sl})}$$

(11)

For a large number of SNPs, these are unbiased for \(f_j\) and \(\psi_{jj'}\) for all sample sizes. We showed in WG17 there is no need to filter on minor allele frequency to preserve the lack of bias. Note that \(\hat{f}_{\mathrm{AS}_j}\) is a linear function of the form \(a_S+b_S\tilde{A}_j\), with \(\tilde{A}_j\) being the total homozygosity for \(j\) and constants \(a_S, b_S\) being the same for all individuals \(j\). Changing the scope of the study, from population to world for example, preserves linearity (with different values of \(a_S, b_S\)). The changed estimates are linear functions of the old estimates: old and new estimates are completely correlated and are rank invariant over all samples that include particular individuals, i.e., over all reference populations. Unlike the case for \(\hat{f}_{\mathrm{UNI}}\) or \(\hat{f}_{\mathrm{STD}}\), rank invariance is guaranteed for \(\hat{f}_{\mathrm{AS}_j}\) for any two individuals even if only one more individual is added to the study.
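The allele-sharing estimators of Eq. (11) and their rank invariance can be illustrated with a small sketch. It assumes a complete dosage matrix stored as nested lists, loops over all pairs (so it is only suitable for toy data), and uses invented genotypes; none of the names correspond to existing software.

```python
from itertools import combinations


def allele_sharing(X):
    """Allele-sharing estimates (Eq. 11) from dosages X[j][l] in {0, 1, 2}.

    Returns (f_as, psi_as): a list of f^_AS_j and a dict of psi^_AS_jj' keyed by pair.
    """
    n, L = len(X), len(X[0])
    # Between-individual sharing summed over loci, for each distinct pair.
    a_pair = {
        (j, k): sum((1 + (X[j][l] - 1) * (X[k][l] - 1)) / 2 for l in range(L))
        for j, k in combinations(range(n), 2)
    }
    a_s = sum(a_pair.values()) / len(a_pair)             # sum over loci of A~_Sl
    a_self = [sum((x - 1) ** 2 for x in row) for row in X]  # total homozygosity A~_j
    denom = L - a_s                                      # sum over loci of (1 - A~_Sl)
    f_as = [(a - a_s) / denom for a in a_self]
    psi_as = {pair: (a - a_s) / denom for pair, a in a_pair.items()}
    return f_as, psi_as


def ranking(values):
    """Indices of individuals ordered from smallest to largest estimate."""
    return sorted(range(len(values)), key=lambda j: values[j])


# Hypothetical data: 4 individuals, 6 SNPs; then the same 4 plus one extra individual.
X4 = [[0, 0, 2, 2, 0, 2],
      [1, 1, 1, 1, 1, 0],
      [2, 1, 0, 1, 2, 2],
      [1, 2, 1, 0, 1, 1]]
X5 = X4 + [[2, 2, 2, 2, 2, 2]]

f4, psi4 = allele_sharing(X4)
f5, _ = allele_sharing(X5)
print(f4, ranking(f4))
print(f5[:4], ranking(f5[:4]))   # estimates change, ranking of the first four does not
```

Adding the fifth individual changes \(\tilde{A}_{S}\) and hence every estimate, but each estimate remains the same increasing linear function of the individual's own homozygosity, so the ordering is unchanged.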

For large sample sizes, \((1-\tilde{A}_{Sl})\approx 2\tilde{p}_l(1-\tilde{p}_l)\). Under that approximation, \(\hat{f}_{\mathrm{AS}_j}\) is the same as \(\hat{f}_{\mathrm{Hom}_j}\) but the approximation is not necessary in computer-based analyses. Summing the large-sample estimates over individuals not equal to \(j\) gives an estimator for the average individual kinship \(\psi_j\):

$$\hat{\psi}_{\mathrm{AS}_j}=-\frac{\sum_l(X_{jl}-2\tilde{p}_l)(1-2\tilde{p}_l)}{\sum_l 4\tilde{p}_l(1-\tilde{p}_l)}$$

(12)

Adding \(2\hat{\psi}_{\mathrm{AS}_j}\) to \(\hat{f}^{w}_{\mathrm{UNI}_j}\) gives \(\hat{f}_{\mathrm{AS}_j}\), as expected, as does adding \(4\hat{\psi}_{\mathrm{AS}_j}\) to \(\hat{f}^{w}_{\mathrm{STD}_j}\). Similarly, \(\hat{\psi}_{\mathrm{AS}_{jj'}}\) is obtained by adding \(\hat{\psi}_{\mathrm{AS}_j}\) and \(\hat{\psi}_{\mathrm{AS}_{j'}}\) to \(\hat{\psi}_{\mathrm{STD}_{jj'}}\), where (Yang et al., 2011)

$$\hat{\psi}_{\mathrm{STD}_{jj'}}=\frac{\sum_l(X_{jl}-2\tilde{p}_l)(X_{j'l}-2\tilde{p}_l)}{\sum_l 4\tilde{p}_l(1-\tilde{p}_l)}$$

These are the elements of the first method for constructing the GRM given by VanRaden (2008).
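The relationships among the large-sample estimators stated above can be checked directly: with the same sample allele proportions in every formula, \(\hat{f}^{w}_{\mathrm{UNI}_j}+2\hat{\psi}_{\mathrm{AS}_j}\) and \(\hat{f}^{w}_{\mathrm{STD}_j}+4\hat{\psi}_{\mathrm{AS}_j}\) both reduce algebraically to the ratio-of-sums homozygosity estimator of Eq. (4). The sketch below assumes a complete dosage matrix and uses invented genotypes.

```python
def individual_estimators(X):
    """Weighted HOM (Eq. 4), UNI (Eq. 8), STD (Eq. 10) and average kinship (Eq. 12)."""
    n, L = len(X), len(X[0])
    p = [sum(X[j][l] for j in range(n)) / (2 * n) for l in range(L)]
    het2 = sum(2 * pl * (1 - pl) for pl in p)   # sum over loci of 2 p~ (1 - p~)
    results = []
    for row in X:
        hom = 1 - sum(1 for x in row if x == 1) / het2
        uni = sum(x * x - (1 + 2 * pl) * x + 2 * pl * pl for x, pl in zip(row, p)) / het2
        std = sum((x - 2 * pl) ** 2 for x, pl in zip(row, p)) / het2 - 1
        psi = -sum((x - 2 * pl) * (1 - 2 * pl) for x, pl in zip(row, p)) / (2 * het2)
        results.append({"hom": hom, "uni": uni, "std": std, "psi": psi})
    return results


# Hypothetical data: 4 individuals, 4 SNPs with differing allele proportions.
X = [[0, 1, 2, 0],
     [1, 1, 2, 0],
     [2, 0, 1, 1],
     [0, 2, 2, 0]]
for r in individual_estimators(X):
    print(r["hom"], r["uni"] + 2 * r["psi"], r["std"] + 4 * r["psi"])
```

The three printed columns agree for every individual, and the \(\hat{\psi}_{\mathrm{AS}_j}\) sum to zero over the sample, in line with the average of the \(\psi_j\) being zero.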

When inbreeding and coancestry coefficients are defined as ibd probabilities they are non-negative, but the within-population values \(f\) and \(\psi\) will be negative for individuals, or pairs of individuals, having smaller ibd allele probabilities than do pairs of individuals in the sample, on average. Individual-specific values of \(f\) always have the same ranking as the individual-specific \(F\) values, and they are estimable. Negative estimates can be avoided by the transformation to \((\hat{f}_{\mathrm{AS}_j}-\hat{f}^{\min}_{\mathrm{AS}_j})/(1-\hat{f}^{\min}_{\mathrm{AS}_j})\), where \(\hat{f}^{\min}_{\mathrm{AS}_j}\) is the smallest value over individuals of the \(\hat{f}_{\mathrm{AS}_j}\)’s. We do not see the need for this transformation, and we noted above the recognition of the utility of negative values. Ochoa & Storey (2021) wished to estimate \(F_j\) rather than \(f_j\) and, to overcome the lack of information about the ancestral population serving as a reference point for ibd, they assumed the least related pair of individuals in a sample have a coancestry of zero. We showed in WG17 that this brings estimates in line with path-counting predicted values when founders are assumed to be not inbred and unrelated, but we prefer to avoid the assumption. We stress that, absent external information or assumptions, \(F\) is not estimable. Instead, linear functions of \(F\) that describe ibd of target pairs of alleles relative to ibd in a specified set of alleles are estimable and have utility in empirical studies.

### Runs of homozygosity

Each of the inbreeding estimators considered so far has been constructed for individual SNPs and then combined over SNPs. Observed values of allelic state are used to make inferences about the unobserved state of identity by descent. Estimators based on runs of homozygosity (ROH), however, suppose that ibd for a region of the genome can be observed. Although \(F\) is the probability an individual has ibd alleles at any single SNP, in fact ibd occurs in blocks within which there has been no recombination in the paths of descent from common ancestor to the individual’s parents. Whereas a single SNP can be homozygous without the two alleles being ibd, if many adjacent SNPs are homozygous the most likely explanation is that they are in a block of ibd (Gibson et al., 2006). There can be exceptions, from mutation for example, and several publications give strategies for identifying runs of homozygotes for which ibd may be assumed (e.g., Gazal et al., 2014; Joshi et al., 2015). These strategies include adjusting the size of the blocks, the numbers of heterozygotes or missing values allowed per block, the minor allele frequency, and so on. These software parameters affect the size of the estimates (Meyermans et al., 2020). Some methods (e.g., Gazal et al., 2014; Narasimhan et al., 2016) use hidden Markov models where ibd is the hidden status of an observed homozygote. Model-based approaches necessarily have assumptions, such as HWE in the sampled population.

We provide more details elsewhere, but we note here that ROH methods offer a useful alternative to SNP-by-SNP methods even though they cannot completely compensate for lack of information on the ibd reference population. We note also that shorter runs of ibd result from more distant relatedness of an individual’s parents, and ROH procedures can be set to distinguish recent (familial) ibd from distant (evolutionary) ibd. SNP-by-SNP estimators do not make a distinction between these two time scales.

Source: Ecology - nature.com