Inversions maintain differences between migratory phenotypes of a songbird

The research in this study was performed in agreement with permission M45-14 issued by Malmö/Lund Ethical Committee for Animal Research, Sweden, which granted capture and blood sampling of wild birds

Samples

Nine willow warblers, determined to be males (based on a wing length > 69 mm), were caught opportunistically with mist nets during the time of autumn migration in September 2016 at Krankesjön, 15 km East of Lund, Southern Sweden. While most of the individuals were phenotypically similar to willow warblers breeding in Southern Scandinavia, some were slightly larger and had a greyer plumage, which is more commonly seen in Northern Scandinavia¹². The set of samples thus potentially contained willow warblers of each of the two major migratory phenotypes. Blood from each bird was collected through a puncture of the brachial vein and was stored in two vails containing SET buffer and 70% ethanol, respectively. An aliquot of the blood was used for DNA extraction with a phenol-chloroform protocol. From the extracted DNA, we genotyped the samples for two loci located on chromosomes 1 and 5, respectively (NBEA and FADS2)^45,46, and for a bi-allelic marker within the divergent region on chromosome 3 (AFLP-ww1)⁴⁷. Based on the genotyping results we selected two samples that were homozygous northern or homozygous southern for all three loci, respectively. We also included a sample from a chiffchaff Phylloscopus collybita (female) for de novo genome sequencing of a closely related outgroup species, as well as an additional willow warbler (DD81063, male) to confirm breakpoint differences with linked read sequencing. Both of these birds were opportunistically caught at the same site as above during autumn migration in 2019, and collection of blood followed the same approach as for the other birds.

Optical maps

DNA from the northern and southern willow warbler was extracted from blood stored in ethanol using a Plug Lysis protocol (v.30026D; Bionano Genomics, CA, USA). The blood was first separated from the ethanol through gentle centrifugation and embedded in molten 2% agarose plugs (DNA plug kit; Bio-Rad, CA, USA). The solidified plugs were submerged in Lysis Buffer solution (Bionano Genomics) and 66.8 µl per ml Buffer Puregene Proteinase K (Qiagen,MD, USA) for 2 h at 50 °C. The plugs were subsequently washed in 1× Wash buffer (Bio-Rad DNA plug kit) followed by TE buffer. In the following step, the plugs were treated with RNase (Qiagen, 20 µl in 1 ml TE buffer) for 1 h at 37 °C, followed by another washing step using the same buffers as in the previous step. Next, the plugs were melted for 2 min at 70 °C and treated with GELase (Epicenter, WI, USA) for 45 min at 43 °C. The DNA was then purified from digested agarose using drop dialysis against TE buffer on a 0.1 µm dialysis membrane (MF-Millipore, Merck KGaA, Germany) for 2.5 h.

Optical maps for each of the two samples were produced using Bionano Genomic’s commercial Irys system⁴⁸. BspQ1 was determined to be the most suitable nicking enzyme after using the software LabelDensityCalculator v.1.3.0 and Knickers v.1.5.5 to analyze a previous short-read assembly¹³. Bionano Genomic’s IrysPrep Labeling-NLRS protocol (v.30024) was used for the NLRS reaction. For this step, DNA was treated with Nt.BspQ1 (NEB, MA, USA) to create single-stranded nicks in a molecule-specific pattern. These were then labeled with Bionano Genomic’s (CA, USA) labeling mix (NLRS kit), aided by Taq Polymerase (NEB), and repaired using Bionano Genomics’s repair mix (NLRS kit), in the presence of Thermopol Rxn buffer, NAD+, and Taq DNA Ligase (NEB). Finally, the DNA backbone was stained using DNA stain from Bionano Genomics’s NLRS kit. Each sample was then loaded on two IrysChips (Bionano Genomics) each, and the DNA with stained BspQ1 nicks was visualized using an Irys instrument, following Bionano Genomics’s Irys user guide (v.30047). This resulted in 200 and 182 Gb of data for the northern and southern sample, respectively.

Genome maps were assembled de novo using Bionano Genomic’s in house software IrysView v.2.5.1, with noise parameter set to “autonoise” and using a human arguments xml file. The genome map was then further refined by re-assembling all data, but using the first assembly version as a reference. The final assemblies were both 1.3 Gb in total size, with an average coverage of 92.3 and 96.4×, and N50 of 0.93 and 0.95 Mb, for the northern and southern sample, respectively.

Linked read sequencing

For the southern sample and sample DD81063, DNA for chromium sequencing (10× Genomics, CA, USA) was extracted from blood stored in SET buffer using a MagAttract HMW DNAkit (Qiagen) at Scilifelab, Stockholm, Sweden. For the northern sample the extraction for bionano optical maps was used. The libraries of the northern and southern sample were each sequenced on a separate lane of a HiSeqX (Illumina, CA, USA) and the DD81063 sample was sequenced on a NovaSeq6000 (Illumina). For all samples sequencing was performed using a 2 × 150 bp setup.

Northern willow warbler de novo assembly

Library preparation for long read sequencing was done on DNA previously extracted for the optical map and followed Pacific Bioscience’s (CA, USA) standard protocol for 10–20 kb libraries. No shearing was performed prior to the library construction, but the library was size selected using the BluePippin pulse field size selection system (Sage Science, MA, USA), with a size cut-off >25 kb. The library was sequenced on eight SMRT cells on a Sequel platform (Pacific Biosciences). The sequencing yielded 63.66 Gbp of data comprised of 4,690,365 subreads with a mean length of 13,573 bp (range: 50–170,531 bp).

The Pacbio reads were assembled de novo in HGAP4⁴⁹ in the SMRT Link package with default settings except for specifying an expected genome size of 1.2 Gbp and setting the polishing algorithm to “Arrow”. We ran Falcon unzip⁵⁰ on the assembly to obtain partially phased primary contigs and fully phased haplotigs. Within the software, Arrow was used to polish the assembly using reads assigned to each haplotype. We evaluated two unzipped assemblies based on 30× or 40× coverage of seed reads in the preassembly step in HGAP4. A lower coverage threshold will lead to longer reads in the initial assembly step, which may increase the contiguity of the assembly, but will on the other hand, limit the number of reads that can be used in the phasing and polishing step. Although the unzipped assemblies were very similar, the 40× version was chosen for downstream analyses as it was slightly more contiguous and contained a higher number of single-copy bird orthologues as determined by BUSCO version 3.0.2⁵¹.

The assembly was further polished with Pilon 1.22⁵² with Illumina chromium reads from the same sample. The Illumina reads were mapped to the assembly using bwa version 0.7.17-r1188⁵³ and duplicated reads were marked using picardtools 2.10.3 (http://broadinstitute.github.io/picard). Pilon was run by only correcting indels and in total the software made 1,043,827 insertions and 275,457 deletions, respectively, of which the vast majority (94%) were single basepair changes. The Illumina polishing had a pronounced effect on the number of single-copy bird orthologues that could be detected in the primary contigs (Supplementary Table 1).

For further assembly steps, we extracted the Illumina-polished primary Pacbio contigs (N = 2737, N50 of 2.1 Mb and a length of 1.29 Gb). These contigs showed an unexpectedly high level of duplicated single-copy orthologues (7.4%), which suggested partial or complete overlap between some contigs. As a first step to reduce the redundancy and increase the contiguity of the assembly, we hybridized the primary contigs to the optical map of the same sample using bionano solve version 3.2.2 (BioNano Genomics) with default settings except for specifying aggressive scaffolding parameters. The hybrid scaffolding resulted in 19 cuts to the bionano maps and 259 cuts to the Pacbio contigs and created 363 super-scaffolds. Most of the gaps between the contigs in the super-scaffolds were estimated to be negative (i.e., some overlap between sequences). However, in the hybrid assembly, sequences on either side of these gaps were not collapsed and thus formed false segmental duplications. To remedy this problem we extracted 304 sets of overlapping contigs (“supercontigs”) and used GAP5 in the staden package 2.0.0.b11⁵⁴ to find potential joins between the contig ends. Using this approach, we merged contigs at 558 (87%) of the putative overlaps. The mean alignment length in the overlaps was 111 kb (range: 0.259–661 kb) with a mean sequence divergence of 3.28% (range: 0.31–15.55%). The highest divergence was caused by the presence of large indels. By trimming off one or both ends of the contigs at the gaps (mean 23 kb, range: 0.6–60 kb), we were able to close 23 further gaps. For the remainder of gaps, GAP5 failed to find potential joins between contigs or the ends supposed to be joined were considered to have too high divergence. The new assembly, including supercontigs consisted of 2401 contigs with an N50 of 6.5 Mb and had a considerably lower amount of duplicated single-copy genes (4.6% vs 7.4%).

To further reduce the redundancy, we used the purge haplotig pipeline⁵⁵ (downloaded 2019-02-15) to remove contigs that could be mapped over most of their length to larger contigs and that showed limited diploid coverage. We first estimated coverage by mapping the Pacbio subreads used for the de novo assembly with minimap2 version2.13-r860⁵⁶ using default settings for Pacbio reads (-x map-pb). To minimize the loss of repetitive sequences that could be separated and scaffolded by the bionano optical map, we used the first bionano hybrid assembly (363 superscaffolds and 1500 cut and unscaffolded contigs) as a reference for mapping. From the mapped data we detected a clear haploid and diploid peak and set a threshold of diploid coverage above 34× and below 85×. Any scaffold where less than 80% of its positions had diploid coverage was considered a putative haplotig and was mapped to other scaffolds using minimap2 within the software. We removed 1209 scaffolds (mean size: 107,655 bp, range: 598–495,788 bp) with a coverage to the best hit of at least 70% (mean: 97.4%). Using this approach, we specifically excluded contigs that could not be incorporated in superscaffolds. However, we also removed three contigs that each entirely made up short superscaffolds that could be uniquely assigned to larger superscaffolds and that had a high degree of haploid coverage. At this stage, we also removed five additional contigs shorter than 1000 bp that were the result of cutting the assembly with the bionano optical map. This led to an assembly with 1187 contigs, a length of 1.1 Gbp and a N50 of 7.9 Mb. The filtered assembly showed a large reduction in single-copy orthologue bird genes (1.3 vs 4.6%).

To provide an intermediate level of scaffolding to the optical map, we mapped the 10× chromium reads of the same sample to the assembly using bwa and used arcs version 1.0.5⁵⁷ and LINKS version 1.8.6⁵⁸ for scaffolding. Arcs was run with default settings except for enabling gap size estimation (–dist_est) and LINKS was run by setting the number of supporting links to at least 5 (-l = 5) and the maximum link ratio between the two best contig pairs to 0.3 (-a = 0.3). The scaffolding resulted in 739 scaffolds with a N50 of 16.4 Mb and a length 1.12 Gb.

As a final scaffolding step, we hybridized the 10× chromium-Pacbio scaffolds to the bionano optical map using the same settings as before. The hybrid scaffolding made 23 cuts to the optical map, 122 cuts to the scaffolds and resulted in 497 scaffolds with an N50 of 16.8 Mb. Two contigs representing the divergent region on chromosome 1 had been scaffolded together by arcs but were separated and not re-scaffolded with other sequences in the bionano hybrid assembly. Since the mismatched end of the optical map was short, located at a large gap, and the gene order is the same as seen in other bird genomes, we decided to keep the scaffold generated by arcs.

For this round of hybrid scaffolding, there were 52 gaps that were estimated to be negative. Using the same approach as when creating supercontigs, we were able to close 10 of these gaps. We additionally closed gaps using PBJelly⁵⁹ from PBSuite 15.8.24 with default settings except for specifying –spanOnly –capturedOnly”. The software filled 97 gaps, extended one end of 12 gaps, extended both ends of 18 gaps and overfilled 28 gaps (extended both ends but detected no overlap despite the extension is larger than the predicted gap).

We further checked for potential misjoins between scaffolds that originate from different chromosomes. To this end, we used SatsumaSynteny 2.0⁶⁰ to produce whole-genome alignments between the assembly and the genomes of chicken (version GRCg6a) and zebra finch (version taeGut3.2.4), both downloaded from Ensembl (www.ensembl.org). Using this approach, we detected a scaffold that showed good alignments to both chromosomes 10 and 23 in both of the other species. We considered this join unlikely and decided to split the scaffold.

Next, we performed a second round of polishing with the 10× chromium Illumina data from the same sample. For this round, since we had fewer than 500 scaffolds, we used the longranger 2.1.14 align pipeline⁶¹ to map reads in a barcode-aware way. Pilon was then run with the same settings as before and resulted in the correction of 417,032 indels, of which 78.7% were single-basepair changes. The second round of polishing considerably increased the number of single-copy bird orthologues that could be identified in the assembly (Supplementary Table 1).

The mitochondrial genome was not found in the original Pacbio genome assembly. We obtained this genome by adding the complete mitochondrial sequence from a previous short-read assembly¹³. We then used bwa to map the 10× chromium reads from the northern sample to the assembly and extracted alignments on the mitochondrial sequence. Next, freebayes was used with a haploid setting to detect differences present in the aligned reads. The raw variant file was filtered with vcftools for sites with a quality less than 30 and for two intervals with excessive read coverage (possibly reads from unassembled NUMTs). The filtered variant file contained 11 substitutions and three indels, and was used with bcftools version 1.14⁶² to create a new mitochondrial reference.

For the extraction and removal of sequences in the different assembly steps we used kentUtils 370 (https://github.com/ucscGenomeBrowser/kent). Summary statistics for each assembly (e.g., N50) were calculated using the assemblathon_stats.pl script⁶³.

Southern willow warbler and chiffchaff de novo assemblies

The southern willow warbler and the chiffchaff were each sequenced on two lanes on a Sequel II (Pacific Biosciences) using a high-fidelity (HiFi) setup. Sequencing libraries for the southern willow warbler was prepared from a previous extraction used for optical maps (see above), whereas for the chiffchaff, DNA was extracted from blood using a Nanobind extraction kit (Circulomics, MD, USA). The southern willow sample yielded 2,576,876 HiFi reads with a mean length 19,303 bp and representing 49.7 Gbp. The chiffchaff sample yielded 2,612,165 HiFi reads with a mean length of 19,829 bp and representing 51.8 Gbp.

The HiFi reads were assembled de novo using hifiasm version 0.15.5-r350⁶⁴ with default settings and primary contigs were selected for downstream analyses. For the chiffchaff hifiasm assembly, we removed the first 6 Mb part of a contig overlapping with another contig and removed a short interval at the end of a contig containing adaptor sequences. For the southern willow warbler, the primary contigs (N = 540, Supplementary Table 1) were hybridized to the optical map of the same sample using the same pipeline as for the northern sample. Although we had access to chromium data from the same sample, we did not include it to perform an intermediate scaffolding step (as we did for the northern willow warbler assembly) because the long-read assembly was already highly contiguous. The hybridization step made 39 cuts to the contigs and 20 cuts to the optical maps, resulting in an assembly with 111 superscaffolds and 439 non-scaffolded contigs. We decided to ignore an optical map-supported fusion of contigs that mapped to separate chromosomes in other bird species, as this fusion was made in a large repetitive region. We further excluded a 45 bp sequence resulting from the hybrid assembly cutting and masked four short intervals containing adaptor sequences. The assembly of the mitochondrion in the southern assembly followed the same pipeline as used for the northern assembly (see above). In this case, 10 substitutions and two indels were added to the mitochondrial sequence from the previous short-read assembly based on alignments of linked reads from the southern sample.

Repeat annotation

We used Repeatmodeler version 1.0.8⁶⁵ for de novo identification of repeats in the southern assembly. The repeats detected by repeatmodeler were combined with 1,023 bird-specific repeats into a custom library. Next, we used repeatmasker version 4.0.7⁶⁶ with the custom library and by using a more sensitive search (-s flag) to annotate repeats in the genome. Bedtools v2.29.2⁶⁷, together with the annotated repeats, was used to create a softmasked version of the southern assembly, which was used in the gene annotation step. The same repeat library was also used to annotate repeats in the de novo assembly of the northern sample. For the chiffchaff assembly we used the same annotation approach as for the southern willow warbler, but included a species-specific library generated with repeatmodeler, and also included a tandem-repeat associated sequence associated with the divergent regions on chromosomes 1 and 3 from the willow warbler library. Intervals with tandem repeats in divergent regions were also analyzed with tandem repeats finder version 4.0.9⁶⁸ using default settings except for specifying a maximum period size of 2000 bp.

Duplicated intervals within divergent scaffolds were identified with Minimap2 and subsequently aligned with EMBOSS Stretcher 6.6.0 (https://www.ebi.ac.uk/Tools/psa/emboss_stretcher/).

RNA sequencing

We used total RNA extracted from whole brain from six samples used in an earlier study quantifying differential expression in migratory and breeding willow warblers⁶⁹ (Supplementary Table 3). The quality of the RNA was checked with a Bioanalyzer version 2100 (Agilent, CA, USA). All of the extractions had a RNA Integrity Number (RIN) of at least > 7.10. RNA libraries for sequencing were prepared using a TruSeq Stranded mRNA Sample prep kit with 96 dual indexes (Illumina) according to the instructions of the manufacturer with the exception of automating the protocols using an NGS workstation (Agilent) and using purification steps as described in Lundin et al⁷⁰. and Borgström et al⁷¹. The raw RNA data was trimmed using cutadapt version 1.8⁷² within Trim Galore version 0.4.0 (https://github.com/FelixKrueger/TrimGalore) with default settings.

We used Stringtie version 1.3.3⁷³ to create transcripts from the RNAseq data. These transcripts were not used directly in the generation of gene models, but used in the manual curation step as potential alternative transcripts. For the software, we first mapped the reads with Hisat2 version 2.1.0⁷⁴ using default settings for stranded sequence libraries and downstream transcript analyses.

Gene annotation

We used Augustus version 3.2.3⁷⁵ to create gene models using hints provided from RNAseq data and protein data from other bird species. For the RNAseq data, we mapped the trimmed reads to the assembly using STAR version 2.7.9a⁷⁶. Accessory scripts in the Augustus package were used to filter the alignments for paired and uniquely mapped reads and for extracting intron hints. We additionally generated coverage wig files for each strand from the filtered alignment file using the software stranded-coverage (https://github.com/pmenzel/stranded-coverage) and used these as input for the august wig2hints.pl to generate exonpart hints.

For homology evidence, we downloaded a set of bird proteins from NCBI (https://www.ncbi.nlm.nih.gov/). This data set included 49,673 proteins from chicken, 41,214 proteins from zebra finch and 38,619 proteins from great tit. We also downloaded an additional dataset from Uniprot (www.uniprot.org) that consisted of 3175 manually reviewed bird proteins and 204 and 12,263 bird proteins that were not manually reviewed but supported by protein or transcript data, respectively. The protein data was mapped to the genome using exonerate version 2.4.0⁷⁷. We used the script align2hints.pl from braker 2.1.6⁷⁸ to generate CDSpart, intron, start and stop hints from the data.

Augustus was run with species-specific parameters (see training Augustus below) and with default settings except for specifying “softmasking=true”, “–alternatives-from-evidence=true”, “–UTR = on”, “–gff3=on” and “–allow_hinted_splicesites=atac”. In the extrinsic configuration file, we changed the malus for introns from 0.34 to 0.001, which increases the penalty for predicted introns that are not supported by the extrinsic data (RNAseq and protein hints). The prediction resulted in 28,491 genes and 35,389 transcripts.

The Augustus-derived gene models were assigned names based on overlap with synteny-transferred zebra finch genes. For this purpose, we used SatsumaSynteny with default settings to obtain whole-genome alignments between our assembly and the zebra finch genome version bTaeGut1.4.pri⁷⁹. Based on the alignment, we used kraken⁸⁰ (downloaded 2020-04-14) to transfer the zebra finch genome annotations (NCBI Release 106) to the willow warbler assembly. We then extracted the CDS from the Augustus gene models and the kraken genes and used bedtools intersect to quantify the overlap. The gene models were also searched against the longest translation of each of the chicken, zebra finch and great tit Parus major genes used as evidence for the gene prediction step and against 86,131 swissprot vertebrate proteins using blastp 2.5.0+⁸¹ with an E value threshold of 1e−5. Gene models that were not annotated through synteny were assigned a gene name based on the blast results. Protein domains in the gene models were annotated with interproscan v 5.30–69.0⁸². To reduce the number of false positive predictions we removed 5697 genes that were not supported by synteny to zebra finch genes, showed no significant similarity to vertebrate proteins or did not contain any annotated protein domains.

We used Webapollo 2.6.5⁸³ to manually curate gene models in the previously identified divergent chromosome regions and in other regions where differences were present. In the curation step, we specifically validated the support for the coding sequence and the UTR and also removed genes that were likely to be pseudogenes based on a truncated coding sequence compared to homologous genes in other vertebrates, had no support from synteny in other bird species and/or that were located in repeat-rich regions.

Training Augustus

We used a previous repeat-masked short-read assembly¹³ and the trimmed RNAseq data used in this study to obtain species-specific parameters for Augustus. The RNAseq data was assembled into transcripts using Trinity version 2.0.2⁸⁴ to create a de novo and a genome-guided assembly that together were comprised of 1,929,396 transcripts. The genome-guided transcript assembly was based on RNAseq mapped to the genome using GSNAP version 2016-07-11⁸⁵ with default settings. We used PASA version 2.0.2⁸⁶ to create high-quality transcripts, which were imported into Webapollo. To assess the completeness of the transcripts, we compared them to synteny-transferred models from the chicken genome using Kraken. We selected 1249 transcripts that appeared complete, were not overlapping with other genes and showed less than 80% amino acid similarity to another gene in the training set. From this set, we excluded 21 genes that were giving initial training errors, which gave us a training set of 1228 genes. This gene set was randomly split into 1028 training genes and 200 genes used for testing. For training, we used the optimize_augustus.pl script with default settings except for the flag –UTR = on.

Whole-genome resequencing and variant calling

We used the whole-genome resequencing data from nine samples of each migratory phenotype provided in Lundberg et al¹³. and sequenced an additional two high-coverage samples from each migratory phenotype (Supplementary Table 4). Sequencing libraries for the new samples were prepared with a TruSeq DNA PCR-Free kit (Illumina) with a targeted insert size of 670 bp or with a Truseq DNA nano (Illumina) with a targeted insert size of 350 bp. All of the new samples were sequenced on a HiSeqX (Illumina). The raw reads were trimmed with trimmomatic 0.36⁸⁷ with the parameters “ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:30”.

Quality-trimmed reads were mapped to the southern assembly using bwa mem with default settings except for specifying -M flag to ensure compatibility with the downstream duplicate removal steps and converted into binary alignment map (bam) files using samtools. For samples sequenced across multiple lanes, reads from each lane were mapped independently and the resulting bam files were merged with samtools. Read duplicates were removed with the markduplicates tool provided in picardtools.

From the aligned whole-genome resequencing data set, we called variants with freebayes v1.1.0 using default settings and parallelizing the analyses of separate scaffolds using GNU parallel⁸⁸. Vcflib version 2017-04-04⁸⁹ was used to filter the raw set of variants for sites with quality score >30 and for alternate alleles that were supported by at least one read on each strand (SAF > 0 & SAR > 0) and had at least one read balanced to the right and the left (RPL > 0 & RPR > 0). Next, we used vcftools 0.1.16⁹⁰ to filter genotypes with a coverage of at least 5x and removed sites a maximum of four genotypes missing in each of the populations. The variants were also filtered for collapsed repeats by removing sites with a mean coverage of more than twice the median mean coverage (30×). We next used vcflib to decompose haplotype calls and complex alleles into indels and SNPs and removed any variants that were overlapping with annotated repeats. This gave us a final of 51 million variants of which 45 million were bi-allelic SNPs. We used vcftools to calculate F_ST⁹¹ for each variant and for bi-allelic SNPs in non-overlapping windows of 10 kb. As many rare variants segregate in the willow warbler populations, which may downwardly bias differentiation estimates⁹², we focused on variants with a minor allele frequency of at least 0.1.

Coverage for each resequenced sample was calculated in non-overlapping 1 kb windows using bedtools and only included properly paired reads with a mapping quality of at least 1. The raw coverage values for each sample were normalized by its median coverage across all windows.

Structural variant calling

We used a combination of delly 0.9.1⁹³ and GraphTyper 2.7.4⁹⁴ to call structural variants in the resequenced samples. To identify a set of high confidence variants, we first mapped the long reads from the northern willow warbler to the southern assembly using minimap 2.22-r1101⁵⁶ with default settings for Pacbio reads and from the alignments called variants using delly. Next, GraphTyper was used to genotype the resequenced samples for the delly variants in the scaffolds containing the divergent chromosome regions. The raw set of variants were filtered to contain only sites with a “PASS” flag and, for each variant, the aggregated genotype, which is the genotype model out of breakpoint alignments and coverage that has the highest genotyping quality, was chosen for downstream analyses. Genetic differentiation (F_ST) was calculated in vcftools and variants with F_ST ≥ 0.7 between homozygotes in each divergent chromosome region were extracted and checked for overlap with genes and gene features using bedtools. To get more reliable differentiation estimates, we only included sites where at least 80% of the southern and northern homozygotes had genotypes.

Inversion genotypes for resequenced samples

The resequenced samples were assigned a genotype of southern and northern haplotypes for each of the divergent regions based on a multidimensional scaling (MDS)-based clustering in invclust⁹⁵ of SNP array genotypes in Lundberg et al.¹³. To obtain genotypes of the SNPs included on the array in the resequenced samples, we mapped the SNP array probe sequences to the northern assembly using gmap and from the alignments extracted the positions of the focal SNPs. Next, we used freebayes to genotype the resequenced samples for these positions and plink version 1.9⁹⁶ to combine the genotypes with the genotypes from the SNP array. In the genotyping step, we also included mapped 10× chromium libraries for the northern and southern reference samples and the additional willow warbler sample. From the combined dataset, we extracted genotypes for SNPs located in each of the divergent regions and used invclust to assign each sample a genotype of inverted and non-inverted haplotypes. The inverted and non-inverted haplotypes were recoded as southern or northern haplotypes based on their frequency in each subspecies.

Breakpoint analyses

We used MUMmer 4.0.0rc1⁹⁷ to align the genomes of the southern and northern willow warblers, and the southern willow warbler genome to the genomes of the chiffchaff, zebra finch (3.2.4) and collared flycatcher FicAlb (1.5)⁹⁸.

To provide further evidence of breakpoints, we mapped the 10× chromium reads of each sample to both the northern and the southern assembly and called structural variants using the longranger wgs pipeline. For the southern genome, we selected the 499 largest scaffolds and concatenated the rest into a single scaffold to make it compatible with the software. We also checked for differences in linked read molecule coverage between the samples. For this purpose, the raw reads of each sample were first processed with longranger basic for quality trimming and barcode processing. The trimmed reads were mapped to the assemblies using bwa mem using a -C flag to extract the barcode information of each read and alignments converted into bam files using samtools. To estimate coverage of barcodes, we first used the tigmint-molecule script from tigmint 1.1.2⁹⁹ to obtain positional information of barcodes (molecules) in each divergent region. The software was run with default settings except for only using reads with a mapping quality of at least 1 and only to report molecules that were estimated to be at least 10 kb. We next used bedtools to count the number of overlapping molecules in 1 kb windows.

We explored differences between optical maps by using the runSV.py script in bionano solve with the southern optical map as a query and the northern assembly as target and the reciprocal analysis with the northern optical map as a query and the southern assembly as a target. We also used the bionano solve hybrid assembly pipeline to visualize differences between the optical maps and the genome assemblies at breakpoint regions.

Functional annotation of differences

We used bedtools to quantify the distance between breakpoint intervals and annotated genes. To provide a functional annotation of the SNPs and short indels, we selected variants that showed a F_ST ≥ 0.7 between southern and northern homozygotes for each of the region and used these as input to Snpeff 5.0.0e¹⁰⁰ together with the annotation and reference genome. We used Snpsift 5.0.0e¹⁰¹ to select variants that were predicted to have a moderate to high effect on genes. Gene ontology terms for the genes were extracted from orthologous genes in other bird genomes in ensembl (www.ensembl.org) or through domain searches of the proteins with interproscan.

Age estimation and demographic analyses of divergent regions

In order to estimate the timing of the inversion events, we used high-coverage resequencing data from two southern samples, two northern samples and, as an outgroup, one dusky warbler Phylloscopus fuscatus (Supplementary Table 4). The willow warbler samples were chosen so that they were either homozygous southern or northern for all of three divergent regions. The dusky warbler library was prepared using a TruSeq Nano DNA library prep kit for Neoprep (Illumina) according to the instructions of the manufacturer and sequenced on a HiSeq X (Illumina). Quality-trimming of the raw reads and mapping of the trimmed reads to the northern reference genome followed the same approach as used for the willow warbler resequencing samples (see above).

Variants were called using freebayes and the raw set of variants were filtered using gIMble’s preprocess module (v0.6.0). Sample-specific callable sites were identified using gIMble preprocess and were defined as those with a minimum coverage of 8× and a maximum of 0.75 standard deviations above the mean coverage. Genic and repetitive regions of the genome were removed from the callable sites in order to limit downstream analyses to intergenic regions.

Summary statistics of genetic variation (π and d_xy) within the divergent regions were calculated using gIMble. Following this, net divergence (d_a) between northern and southern samples was calculated as d_north–_south − (π_north + π_south)/2. To convert the net divergence into years we used the germline mutation rate (4.6 × 10⁻⁹) estimated in the collared flycatcher²¹. Relative node depth (RND) using the dusky warbler (DW) as an outgroup was calculated as d_{north–south}/(d_DW-north + d_DW-south)/2. For each divergent region, a blockwise site frequency spectrum (bSFS) was generated with gIMble using blocks of 64 bp in length. This length refers to the number of callable sites within a block, while the physical length of blocks was allowed to vary due to missing data but was limited to 128 bp. Downstream analyses that relied on a bSFS used a kmax of 2, meaning that only marginal probabilities were calculated for mutation counts >2. The composite likelihood (CL) of a model, given the bSFS of one of the divergent regions, was optimized using the Nelder-Mead algorithm with the maximum number of iterations set to 1000. Within the software we evaluated three different population models. The first model was a strict isolation model (SI), with parameters ancestral effective population size, effective population sizes for southern and northern willow warblers and divergence time. The second model was an isolation with migration model (IM₁) that also included a migration rate from northern to southern samples, and the third model (IM₂) instead had a migration rate from southern to northern willow warblers.

Simulations were carried out by msprime 0.7.4¹⁰² through gIMble. The recombination rates used for these simulations were chromosome-specific estimates from a high-density recombination map of the collared flycatcher⁹⁸ and were 2.04, 1.95, and 2.63 cM/Mb for chromosomes 1, 3, and 5, respectively. A total of 100 replicates were simulated for the optimized SI parameters of each region. These simulated bSFSs were then optimized under both an SI model as well as the best fitting IM model for that region. The improvement in CL between these models was used as a null distribution for testing whether improvements in CL observed for the real data were greater than expected given a history of no migration. For each parameter, we calculated 95% CI as Maximum Composite Likelihood (MCL) estimate ± 1.96 * standard deviation of simulations (Supplementary Table 7). As a result, our estimates of uncertainty are affected by the recombination rates that we assumed for simulations. We also used the results of simulations to quantify the potential bias in MCL estimates due to intra-block recombination (Supplementary Table 7). However, we did not attempt to correct for this bias as it is relatively small (e.g., the MCL divergence times are estimated to be biased upwards by 7, 24, and 10%) and our estimation of the bias itself is largely dependent on the recombination rates we assumed.

MSMC2²⁴ was used to explore genome-wide changes in N_e through time. As input to the software, we used the callable intergenic bed file and filtered vcf file mentioned above, with the addition of further filtering the bed file to only include autosomal scaffolds ≥500 kb and excluding the divergent regions. The input files for MSMC2, i.e., an unphased set of heterozygous sites for each sample, were generated using the generate_multihetsep.py script from msmc-tools. MSMC2 was run with a starting ρ/μ of 1 for 30 expectation-maximum iterations. For both the demographic modeling and MSMC2, we used the collared flycatcher germline mutation rate²¹ and a generation time of 1.7 years¹¹ to convert divergence times into years.

To infer the effects of demographic events and selection, we also calculated several genetic summary statistics. To this end, we first imputed missing genotypes and inferred haplotypes for the filtered set of variants using beagle version 5.4¹⁰³. From the full set of samples, we selected 10 and seven samples that were homozygous southern or northern for the three divergent regions, respectively, as determined from the MDS analysis (see above), and extracted bi-allelic SNPs. To identify ancestral and derived alleles, we extracted genotypes for the focal SNP positions from the aligned chiffchaff and dusky warblers reads using bcftools 1.14⁶² with the mpileup command. As a conservative approach, we considered any site with the presence of both the reference and alternate allele as heterozygous (regardless of their frequencies) and only included sites where the coverage was at least one-third of the mean coverage among all sites for each outgroup species. We next used a customized script to extract the sites from the original vcf files, and, if necessary, switch the reference and alternate allele and swap the genotypes accordingly. With the polarized genotype data, we used PopGenome 2.7.5¹⁰⁴ to calculate Fay and Wu’s H and vcftools to get counts for the derived allele. We further used selscan 1.3.0¹⁰⁵ to calculate XP-nsl¹⁰⁶ between the southern and northern samples, Sweepfinder2¹⁰⁷ to calculate a composite likelihood ratio (CLR) between a model where a selective sweep has had an effect on the allele frequency and a model based on the genome-wide allele frequency spectrum and used vcftools to calculate nucleotide diversity, Tajima’s D and linkage disequilibrium (D’).

The use of the southern assembly as a reference could potentially lead to a mapping bias for reads from southern samples, particularly in regions of higher divergence between the subspecies. This, in turn, could have an effect on genetic summary statistics and demographic modeling estimates. To explore the effect of reference bias, we therefore also mapped the resequencing data to the northern assembly, performed variant calling and calculated nucleotide diversity and Tajima’s D in 10 kb windows. For the northern assembly, we also used the same demographic modeling as used for the southern assembly. Contrasting average genetic summary statistics and demographic parameter estimates, we found negligible differences between the two genome assemblies (Supplementary Table 10).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Source: Ecology - nature.com