Grey wolf genomic history reveals a dual ancestry of dogs
Sampling, DNA preparation and sequencingStockholmSamples LOW002, LOW003, LOW006, LOW007, LOW008 and PON012 were processed at the Archaeological Research Laboratory at Stockholm University, Sweden, following methods previously described8. In brief, this involved extracting DNA by incubating the bone powder for 24 h at 37 °C in 1.5 ml of digestion buffer (0.45 M EDTA (pH 8.0) and 0.25 mg ml–1 proteinase K), concentrating supernatant on Amicon Ultra-4 (30-kDa molecular weight cut-off (MWCO)) filter columns (MerckMillipore) and purifying on Qiagen MinElute columns. Double-stranded Illumina libraries were prepared using the protocol outlined in ref. 48, with the inclusion of USER enzyme and the modifications described in ref. 49.Samples 367, PDM100, Taimyr-1 and Yana-1 were processed at the Swedish Museum of Natural History in Stockholm, Sweden, following previously described methods8. In brief, this involved extracting DNA using a silica-based method with concentration on Vivaspin filters (Sartorius), according to a protocol optimized for recovery of ancient DNA50. Double-stranded Illumina libraries were prepared using the protocol outlined in ref. 48, with the inclusion of USER enzyme.Samples ALAS_024, VAL_033, ALAS_016, VAL_008, HMNH_007, HMNH_011, VAL_050, VAL_005, DS04, VAL_037, VAL_012, VAL_011, VAL_18A, IN18_016 and IN18_005 were processed at the Swedish Museum of Natural History in Stockholm, Sweden, following previously described methods for permafrost bone and tooth samples51. In brief, this involved DNA extraction using the methodology of ref. 52 and double-stranded Illumina library preparation as described in ref. 48, with dual unique indexes and the inclusion of USER enzyme. Between eight and ten separate PCR reactions with unique indexes were carried out for each sample to maximize library complexity. The libraries were sequenced alongside samples HOV4, AL2242, AL2370, AL2893, AL3272 and AL3284 across three Illumina NovaSeq 6000 lanes with an S4 100-bp paired-end set-up at SciLifeLab in Stockholm.PotsdamSamples JAL48, JAL65, JAL69, JAL358, AH574, AH575 and AH577 were processed at the University of Potsdam. Pre-amplification steps (DNA extraction and library preparation) were conducted in separated laboratory rooms specially equipped for the processing of ancient DNA. Amplification and post-amplification steps were performed in different laboratory rooms. DNA was extracted from bone powder (29–54 mg) following a protocol specially adapted to recover short DNA fragments52. Single-stranded double-indexed libraries were built from 20 µl of DNA extract according to the protocol in ref. 53. The libraries were sequenced on an HiSeq X platform at SciLifeLab in Stockholm.Tübingen/JenaSamples JK2174, JK2175, JK2179, JK2181, JK2183, TU144, TU148, TU839 and TU840 were processed at the University of Tübingen, with DNA extraction and pre-amplification steps undertaken in clean room facilities and post-amplification steps performed in a separate DNA laboratory. Both laboratories fulfil standards for work with ancient DNA54,55. All surfaces of tooth and bone samples were initially UV irradiated for 30 min, to minimize the potential risk of modern DNA contamination. Subsequently, DNA was extracted by applying a well-established guanidine silica-based protocol for ancient samples52. Illumina sequencing libraries were prepared by using 20 µl of DNA extract per library48; afterwards, dual barcodes (indexes) were chemically added to the prime ends of the libraries56. For the samples from Auneau (TU839 and TU840), five sequencing libraries each were prepared; for all other samples processed in Tübingen, three sequencing libraries each were prepared. To detect potential contamination of the chemicals, negative controls were conducted for extraction and library preparation. After preparation of the sequencing libraries, DNA concentration was measured with qPCR (Roche LightCycler) using corresponding primers48. The DNA concentration was given by the copy number of the DNA fragments in 1 µl of the sample.Amplification of the indexed sequencing libraries was performed using Herculase II Fusion under the following conditions: 1× Herculase II buffer, 0.4 µM IS5 primer and 0.4 µM IS6 primer48, Herculase II Fusion DNA polymerase (Agilent Technologies), 0.25 mM dNTPs (100 mM; 25 mM each dNTP) and 0.5–4 µl barcoded library as template in a total reaction volume of 100 µl. The applied amplification thermal profile was processed as follows: initial denaturation for 2 min at 95 °C; denaturation for 30 s at 95 °C, annealing for 30 s at 60 °C and elongation for 30 s at 72 °C for 3 to 20 cycles; and a final elongation step for 5 min at 72 °C. Thereafter, the amplified DNA was purified using a MinElute purification step and DNA was eluted in 20 µl TET. The concentration of the amplified DNA sequencing libraries was measured using a Bioanalyzer (Agilent Technologies) and a DNA1000 lab chip from Agilent Technologies.The sequencing libraries were sequenced on an Illumina HiSeq 4000 platform at the Max Planck Institute for Science of Human History in Jena. The samples from Auneau (TU839 and TU840) were paired-end sequenced applying 2 × 50 + 8 + 8 cycles. All other libraries prepared in Tübingen were single-end sequenced using 75 + 8 + 8 cycles.OxfordSamples AL2657, AL2541, AL2741, AL2744, AL3185, AL2350, CH1109, AL2370, AL3272 and AL3284 were processed at the dedicated ancient DNA facility at the PalaeoBARN laboratory at the University of Oxford, following methods described previously8. In brief, double-stranded libraries were constructed following the protocol in ref. 48. These libraries were sequenced on a HiSeq 2500 (AL2657, AL2541, AL2741, AL2744) or a HiSeq 4000 (AL3185, AL2350, CH1109) instrument at the Danish National Sequencing Center or on a NextSeq 550 instrument (AL2741) at the Natural History Museum of London. For samples AL2370, AL3272 and AL3284, between six and eight separate PCR reactions with unique indexes were carried out on their libraries and they were sequenced alongside samples HOV4, VAL_18A and IN18_016 on an Illumina NovaSeq 6000 lane with an S4 100-bp paired-end set-up at SciLifeLab in Stockholm.CopenhagenSamples CGG13, CGG17, CGG19, CGG20, CGG21, CGG25, CGG26, CGG27, CGG28, CGG34, Tumat1 and IRK were processed at the GLOBE Institute, University of Copenhagen. All pre-PCR work was performed in ancient DNA facilities following ancient DNA guidelines57. The details of extraction, library construction and sequencing for the samples with CGG codes are described in ref. 21, in relation to the publication of mitochondrial data from these specimens. The Tumat1 sample was processed following the exact same protocol. In brief, DNA extraction was performed using a buffer containing urea, EDTA and proteinase K50, double-stranded libraries were prepared with NEBNext DNA Sample Prep Master MixSet 2 (E6070S, New England Biolabs) and Illumina-specific adaptors48, and sequencing was performed on an Illumina HiSeq 2500 platform using 100-bp single-read chemistry. For the IRK sample, DNA was extracted from three subsamples and purified as described in ref. 21. The three DNA extracts and the purified pre-digest of one subsample were incorporated into double-stranded libraries following the BEST protocol58, with the modifications described in ref. 59, and sequenced on a BGISEQ-500 platform using 100-bp single-read chemistry.Santa CruzSamples SC19.MCJ017, SC19.MCJ015, SC19.MCJ010 and SC19.MCJ014 were processed at the UCSC Paleogenomics Lab and were provided by the Yukon Government Paleontology program. All pre-PCR work was performed in a dedicated ancient DNA facility at the University of California, Santa Cruz, following standard ancient DNA methods60. Subsamples (250–350 mg) were sent to the UCI KECK AMS facility for radiocarbon dating, and the remaining amounts were powdered in a Retsch MM400 for extraction. For each sample, ~100 mg of powder was treated with a 0.5% sodium hypochlorite solution before extraction to remove surface contaminants61 and then combined with 1 ml lysis buffer for extraction, following the protocol in ref. 52. Samples were processed in parallel with a negative control. We quantified the extracts using a Qubit 1× dsDNA HS Assay kit (Q33231) before preparing libraries. We prepared single-stranded libraries following the protocol in ref. 62 and amplified the libraries for 9–16 cycles as informed by qPCR. After amplification, we cleaned the libraries using a 1.2× SPRI bead solution and pooled them to an equimolar ratio for in-house shallow quality-control sequencing on a NextSeq 550 paired-end 75-bp run. We then sent the libraries to Fulgent Genetics for deeper sequencing on two paired-end 150-bp lanes on a HiSeq X instrument.ViennaSample HOV4 was processed at the Department of Anthropology, University of Vienna. The sample is a canine tooth, which after sequencing was determined to derive from a dhole (Cuon alpinus). DNA was extracted from its cementum using the methods described in ref. 63 with a modified incubation time of ~18 h. The library was prepared according to the protocol in ref. 48 with the modifications from ref. 64. Five separate PCR reactions with unique indexes were carried out on the library and were sequenced alongside samples VAL_18A, IN18_016, AL2242, AL2370, AL2893, AL3272 and AL3284 on an Illumina NovaSeq 6000 lane with an S4 100-bp paired-end set-up at SciLifeLab in Stockholm.An overview of all samples and their associated metadata is available in Supplementary Data 1.Genome sequence data processingFor paired-end data, read pairs were merged and adaptors were trimmed using SeqPrep (https://github.com/jstjohn/SeqPrep), discarding reads that could not be successfully merged. Reads were mapped to the dog reference genome canFam3.1 using BWA aln (v.0.7.17)65 with permissive parameters, including a disabled seed (-l 16500 -n 0.01 -o 2). Duplicates were removed by keeping only one read from any set of reads that had the same orientation, length and start and end coordinates. For sample Taimyr-1, previously published data13 were merged with newly generated data. Data from samples processed in Copenhagen were processed as described previously66 except that they were also mapped to canFam3.1. Post-mortem damage was quantified using PMDtools (v0.60)67 with the ‘–first’ and ‘–CpG’ arguments.Genotyping and integration with previously published genomesTo construct a comparative dataset for population genetic analyses, we started from a published variant call set compiling 722 modern dog, wolf and other canid genomes from multiple previous studies (NCBI BioProject accession PRJNA448733)40. To this, we added additional modern whole genomes from other studies: 4 African golden wolves and 15 Nigerian village dogs (Genome Sequence Archive (http://gsa.big.ac.cn/), accession PRJCA000335)68, 12 Scandinavian wolves (European Nucleotide Archive accession PRJEB20635)69, 9 North American wolves and coyotes (European Nucleotide Archive accession PRJNA496590)25 and 8 other canids (African hunting dog, dhole, Ethiopian wolf, golden jackal, Middle Eastern grey wolves) (European Nucleotide Archive accession PRJNA494815)22. Reads from these genomes were mapped to the dog reference genome using bwa mem (version 0.7.15)70, marked for duplicates using Picard Tools (v2.21.4) (http://broadinstitute.github.io/picard), genotyped at the sites present in the above dataset using GATK HaplotypeCaller (v3.6)71 with the ‘-gt_mode GENOTYPE_GIVEN_ALLELES’ argument and then merged into the dataset using bcftools merge (http://www.htslib.org/). The following filters were then applied to sites and genotypes across the full dataset: sites with excess heterozygosity (bcftools fill-tags ‘ExcHet’ P value 5.8×. For divergence time analyses, haploid X chromosomes from two different male genomes were combined and the point at which the inferred effective population size for this ‘pseudodiploid’ chromosome increased sharply upwards was taken to correspond to a population divergence. Results were scaled using a mutation rate of 0.4 × 10−8 mutations per site per generation13,87 (with a 25% lower rate for X-chromosome analyses) and a mean generational interval of 3 years13. For effective population size inferences, transition variants were ignored and results were scaled using a transversions-only mutation rate inferred from results on modern genomes. For more details on the MSMC2 analyses, see Supplementary Information section 3.Selection analysesSelection analysis was performed using PLINK (v1.90b5.2)88. This analysis used the 72 ancient wolf genomes and 68 modern wolf genomes (with the latter including a historical Japanese wolf genome73 treated as ancient for analysis purposes, with its age set to 200 bp). A list of the genomes used for this analysis is available in Supplementary Data 2 (“Used for selection scan” column). All SNPs, not only transversions, were used for this analysis. The age of each wolf was set as the phenotype, with values of 0 for modern wolves, and the ‘–linear’ argument was used to test for an association between SNP genotypes and age, also applying the ‘–adjust’ argument to correct P values using genomic control. The application of genomic control34 here aimed to use the magnitude of temporal allele frequency variance observed across the genome to account for what was observed from genetic drift alone given wolf demographic history. Only results for the following sets of sites were retained and included in the Manhattan plot: sites where at least 40 ancient genomes had a genotype call, sites with a minor allele frequency among the ancient wolves of ≥5% and sites that had at least 7 neighbouring sites within a 50-kb window with a P value that was at least 90% as large (on a log10 scale) as the P value of the site itself. The last ‘neighbourhood filter’ aimed to reduce false positives by requiring similar evidence across multiple nearby sites. As a P-value significance cut-off to correct for the genome-wide testing, we used 5 × 10−8, which is commonly used in genome-wide association studies in humans and also in dogs89. We excluded 15 regions where only a single variant reached significance. A detailed table with the 24 detected regions is available in Supplementary Data 3. To test the robustness of this analysis to false positives arising from genetic drift alone, we applied the same analysis to data from neutral coalescent simulations generated using ms90 and found no false positives. For more details, see Supplementary Information section 4.Ancestry modelling with qpAdm and qpWaveWe used the qpAdm and qpWave methods43 from ADMIXTOOLS (v5.0)84 to test ancestry models for wolf and dog targets postdating 23 ka. For the primary analyses, we used the following set of candidate source populations (age estimate in brackets, years bp): Armenia_Hovk1.HOV4 (ancient dhole), Siberia_UlakhanSular.LOW008 (70,772), Germany_Aufhausener.AH575 (57,233), Siberia_BungeToll.CGG29 (48,210), Germany_HohleFels.JK2183 (32,366), Siberia_BelayaGora.IN18_016 (32,020), Yukon_QuartzCreek.SC19.MCJ010 (29,943), Altai_Razboinichya.AL2744 (28,345), Siberia_BelayaGora.IN18_005 (18,148) and Germany_HohleFels.JK2179 (13,229). We used a rotating approach in which, for each target, we tested all possible one-, two- and three-source models that could be enumerated from the above set. Individuals from the set that were not used as a source in a given model served as thereference set (or the ‘right’ population in the qpAdm framework). This means that, in every model, each of the above individuals was always either in the source list or in the reference list. We ranked models on the basis of their P values, but prioritized models with fewer sources using a P-value threshold of 0.01: if a simpler model (meaning a model with fewer sources) had a P value above this threshold, it ranked above a more complex model (meaning a model with more sources) regardless of the P value of the latter. We also failed models with inferred ancestry proportions larger than 1.1 or smaller than −0.1. For single-source models, qpWave was run instead of qpAdm. Both programs were run with the ‘allsnps: YES’ option (without this option, there was very little power to reject models). We describe ancestry assigned to the ancient dhole source (Armenia_Hovk1.HOV4) as ‘unsampled’ ancestry; note that this does not imply that such ancestry is of non-wolf origin, only that it is not represented by (that is, diverged early from and lacks shared genetic drift with) the ancient wolf genomes in the reference set.To test whether any post-23 ka or modern wolf genome available might be a good proxy for the western Eurasian wolf-related ancestry identified in Near Eastern and African dogs, we added the 9,500-year-old Zhokhov dog17 to the rotating set of candidate source populations. Chosen for its high coverage, early date and easterly location, this makes the assumption that the Zhokhov dog is a good representative for the eastern dog ancestry component. Using the African Basenji dog as a target, models involving the Zhokhov dog plus another given wolf thus allowed us to test whether that wolf was a good match for the additional component of ancestry. For more details on the qpAdm and qpWave analyses, see Supplementary Information sections 2 (wolf targets) and 5 (dog targets).Reporting summaryFurther information on research design is available in the Nature Research Reporting Summary linked to this paper. More