DNA sample collection
To analyze the population structure of wild medaka populations, we selected samples from the DNA collection of Takehana et al.29, deposited in University of Shizuoka. The original DNA collection had been made throughout 1980s and 2000s. The selected samples covered the major mitotypes and contained more than three individuals of each population (Table S11, Fig. 3), which were collected from three collection sites for O. sakaizumii and 12 collection sites for O. latipes. We also examined several artificial strains: HNI and Hd-rR, which are inbred strains derived from O. sakaizumii and O. latipes, respectively, and four Himedaka individuals from commercial stock (Uruma city, Okinawa Prefecture, Japan).
In addition, samples were newly collected at Kunigami Village, Okinawa Prefecture. Live fish were anesthetized with MS-222 (aminobenzene methanesulfonate, FUJIFILM Wako Pure Chemical Corporation, Osaka, Japan) and then fixed in 99% ethanol. Genomic DNA was extracted using a DNeasy kit (Qiagen Inc., Hilden, Germany) from ethanol-fixed pectoral fin samples according to the manufacturer’s protocol. The DNA concentration was measured using a spectrophotometer (Nanodrop 1000, Thermo Fisher Scientific, Waltham, Massachusetts, USA), and the DNA was diluted with PCR-grade water to a concentration of c.a. 10 ng/µl (UltraPure™ DNase/RNase-Free Distilled Water, Thermo Fisher Scientific).
Ethic statement
All methods were carried out in accordance with the Regulation for Animal Experiments at University of the Ryukyus for handling live fish. All experiments were approved by the Animal Care Ethics Committee of University of the Ryukyus (R2019035). All experimental methods are reported in accordance with ARRIVE guidelines.
PCR primer design
The following steps were used to select primers for MAAS (Fig. 1). (1) All possible 10-mer sequence combinations (i.e., 410 = 1,048,576 sequences) were generated in silico. (2) The sequences containing simple sequence repeats, some of which had been used in the MIG-seq method17, were excluded. (3) Sequences containing a functional motif, such as a transcription factor-binding site, were also excluded because they may not be suitable for examining neutral genetic markers. We obtained a catalog of motifs from the JASPAR CORE40 (http://jaspar.genereg.net). (4) To avoid taxon-dependency in primer performance, we used information about the k-mer (k = 10) frequency of reference genomes from multiple phyla. Sequences that showed marked differences in frequency among taxa were excluded. The frequencies of each 10-mer sequence in the reference genomes of 17 species belonging to 12 phyla of metazoa were counted (Table S12) using the “oligonucleotideFrequency” function in the “Biostrings” package ver. 2.441. In each of these taxa, the frequencies of sequences were stratified into three grades (< 102, 102–103, and > 103). We then selected the sequences that showed the same grade in more than 80% (14/17) of the species. (5) To avoid synthesizing primer dimers, self-complementary sequences were excluded, taking Illumina adapter sequences (5′-CGCTCTTCCGATCT-3′ and 5′-TGCTCTTCCGATCT-3′) into account. Self-complementation of two bases at the 3′-end or every three continuous bases in primer sequences was then evaluated using a custom script in R ver. 3.5.0 (R Development Core Team, http://cran.r-project.org). Based on the selected 10-mer sequences (i.e., 129 sequences, Fig. 1), 7-mer primer sequences were designed by removing the 3 bases at the 3′ end. Finally, we selected 24 candidate sequences for both 10-mer and 7-mer primers for the subsequent step (Table S1).
The primer sequence consisted of three parts17: partial sequence of the Illimina adapter, 7 N bases, and a short priming sequence, e.g., 5′-CGCTCTTCCGATCTNNNNNNNGTCGCCC-3′. PCR amplification was performed using the candidate primers using the first PCR protocol described below (Table S1). Banding patterns were observed by electrophoresis on 1% agarose gels (agarose S; TaKaRa, Japan). Of the candidate primers, we selected four 7-mer primers and four 10-mer primers that each gave a smeared banding pattern with amplification products ranging from 500 to 2000 bp, indicating uniform amplification of multiple target sequences (Table S1).
Library construction and sequencing
The library was constructed by a two-step PCR approach using a modification of a MIG-seq protocol14. In the first PCR step, multiple regions of genomic DNA were amplified using a cocktail of primers with a Multiplex PCR Assay Kit Ver.2 (TaKaRa) (Table 1). The volume of the PCR reaction mixture was 7 μl, containing 1 μl of template DNA, 2 μM of each PCR primer, 3.5 μl of 2 × Multiplex PCR Buffer, and 0.035 μl of Multiplex PCR Enzyme Mix. PCR was performed under the following conditions: denaturation at 94 °C for 1 min; 25 cycles of 94 °C for 30 s, 38 °C for 1 min, and 72 °C for 1 min, followed by a final extension step at 72 °C for 10 min.
The primers in the second PCR step contained the Illumina sequencing adapter and an index sequence to identify each sample. Following the Truseq indexes, we used the combinations of eight forward indexes (i5) and 12 reverse indexes (i7), which resulted in a total of 96 combinations. To be used as a template for the second PCR, the first PCR product from each sample was diluted 50 times with PCR-grade water. The second PCR was performed in a 15-μl reaction mixture containing, 3 μl of diluted first PCR product, 3 μl of 5 × PrimeSTAR GXL Buffer, 200 μM of each dNTP, 0.2 μM of forward index primer and reverse index primer, 0.375 U of PrimeSTAR GXL DNA Polymerase (TaKaRa). The PCR conditions were as follows: 12 cycles at 98 °C for 10 s, 54 °C for 15 s, and 68 °C for 30 s.
The second PCR product of each sample was pooled by equal volume and size-selected from 600 to 1000 bp using solid phase reversible immobilization (SPRI) select beads (Beckman Coulter Inc, Brea, California, USA) according to the manufacturer’s protocol. The DNA concentration of the pooled library was measured using a Qubit fluorometer (Thermo Fisher Scientific). We sequenced the libraries using two NGS platforms, MiSeq (Illumina, MiSeq Reagent Kit v2 Micro, Paired-End (PE), 150 bp) and HiSeq X (Illumina, PE, 150 bp). Sequencing using the HiSeq X platform was performed by Macrogen Japan (Tokyo, Japan).
To compare primer performance, the DNA libraries constructed using the 7-mer and 10-mer primers for one individual were sequenced using MiSeq. Then, a 7-mer primer cocktail containing four sets of mixed primers was used for the subsequent analyses (Table 1). We also constructed DNA libraries using 7-mer and MIG-seq primer cocktails for three individuals and sequenced them using the HiSeq X platform. Finally, we constructed DNA libraries using 7-mer primer cocktails for 67 wild individuals and six artificial strain individuals for population genetics analyses (Table S11, Fig. 3).
Mapping and SNV calling
Genotyping was conducted using the following BWA-GATK best-practices pipeline for each sample42. Primer sequences were removed using cutadapt with the –b option selected43. The Illumina adapter sequences were also removed and quality filtering was performed using fastp ver. 0.20.0 with the “–detect_adapter_for_pe, –cut_front” option selected44. The remaining reads were mapped on the reference genome of medaka, Hd-rR strain, GCA_002234675.1; ASM223467v127 using Burrows-Wheeler Alignment tool, BWA mem ver. 0.7.1745. After mapping, output files were converted to Binary Alignment/Map (BAM) format using SAMtools ver. 1.746. SNVs and InDels in the sample were determined following the best practice guidelines set out in the Genome Analysis Tool Kit (GATK ver. 3.8.0)42. We then filtered out SNVs and InDels based on the following criteria: “QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < −12.5 || ReadPosRankSum < −8.0” for SNVs and “QD < 2.0 || FS > 200.0 || ReadPosRankSum < −20.0” for InDels. Based on the filtered SNVs and InDels, we recalibrated the mapping quality of each read and extracted sequences including variant and non-variant sites in vcf format using the “GenotypeGVCFs” command in GATK with the “–includeNonVariantSites” option selected.
Sequencing performance evaluation
To compare genotyping performance between primer cocktails (7-mer and MIG-seq), we evaluated the performance based on four indicators. (1) The proportion of properly sequenced reads were evaluated, which was taken as the read number in the raw fastq data that passed quality filtering with ‘fastp’; (2) The proportion of reads uniquely mapped on the reference genome was calculated. To extract uniquely mapped reads from the BAM file, we used the “samtools view” command with the “-q 1-F 4-F 256” option selected and grep with “-v -e XA:Z-v-e SA:Z”. (3). The sequencing output in genotyping should be evaluated after standardizing by read number. We extracted and analyzed 6 million reads from the uniquely mapped reads in each sample. The distribution of the coverage depth in the sequenced positions was then examined to evaluate the amplification bias. We stratified the depth thresholds (≥ 1, ≥ 3, ≥ 10, ≥ 30, ≥ 100, ≥ 500 and ≥ 1000) and calculated the proportion of each group for all of the bases in the reference genome (734,040,372 bp); (4) The saturation curve of the number of reads and the sufficiently sequenced bases was then calculated. To do this, we extracted subsets of the reads (0.5, 1, 2, and 6 million reads) from the original BAM file and considered only those sequences with depths of more than 10 using bcftools ver.1.7 (http://samtools.github.io/bcftools/); (5) The reduction curve of the commonly sequenced bases among randomly chosen individuals of O. latipes was then estimated using the following equation with the “nls” function in R: (C1 = 2,051,200, C2 = 325,000, α = 0.71):
$$Y = C_{1} times a^{x} + C_{2}$$
where Y is the total number of bases with no missing genotype calls, x is the number of individuals, α, C1 and C2 is the constants. Since a reference genome is often not available for population genetic analyses involving non-model organisms, we also performed de novo assembly and genotyping without the reference genome to simulate such a situation. De novo assembly and genotyping were conducted using “denovo_map.pl” in Stacks ver. 2.236,47 using the fastq of the 67 individuals, which were consistent with the population structure analysis described below in the next section. The parameter in the de-novo assembly was optimized following the manual47. We ran several iterations with different parameter (“ustacks –M”, 1–9). The optimized parameter was chosen as M = 7, which maximized the number of polymorphic loci where more than 80% of the samples were genotyped.
Analysis of population structure and admixture
The SNV data were prepared using bcftools and VCFtools ver.0.1.15 (http://vcftools.sourceforge.net/). For each individual, the sequenced positions with depths ≥ 10 were included in the analysis. Due to low read numbers and genotyped bases, eight individuals were excluded from the analysis (Table. S11). The individual vcf files were merged into a multi-sample vcf file and the sequence positions that were genotyped in all individuals were extracted using the “bcftools view” command with the option “NO_MISSING = 0”. As a result, 126,765 bases were sequenced including non-variant and variant positions from 67 individuals (8 O. sakaizumii and 59 O. latipes), and 6479 SNV and InDel sites were identified. Based on the total sequenced bases, we calculated nucleotide diversity π and pairwise genetic distances DXY. Weighted FST values of wild populations were also calculated using pixy ver. 1.0.048. Then, we excluded sites with three or more alleles and InDels. We also removed SNVs that showed strong linkage disequilibrium (r2 > 0.1) in each of the 50,000 base windows using plink ver. 1.9049. Sequences were deposited in the DDBJ Sequence Read Archive (DRA) database under the accession number DRA011342.
After filtering, 2987 SNVs remained. Phylogenetic relationships were then estimated by the neighbor-joining method using a genetic distance matrix by the “snpgdsDiss” function implemented in the “SNPRelate” package ver. 1.22.050,51. The distance matrix was converted to the nexus format using the “writeDist” function of the “phangorn” package ver. 5.052,53. The phylogenetic tree and network were drawn using SplitsTree ver. 4.15.154 and Figtree ver. 1.4.3 (https://github.com/rambaut/figtree/). The branch between O. latipes and O. sakaizumii was considered as the root. To estimate the population structure, we used ADMIXTURE ver. 1.3.037. ADMIXTURE was run for the number of clusters (K) from 1 to 15. Statistical support for the different number of clusters was evaluated based on the lowest cross-validation errors. We ran 50 replicates with random seeds for each K and calculated the mean of the cross-validation errors10 (Table. S7).
Reconstruction of admixture among wild populations and artificial strains
Himedaka is considered to be a cross-breeding strain derived from several wild populations55. To elucidate the genetic background of strains and populations, we used the F statistics framework24,56. To identify populations that are closely related to Himedaka, we used outgroup F3 statistics with the “qp3Pop” command in Admixtools ver. 7.056. We also performed an ABBA-BABA test and calculated Patterson’s D statistics using “qpDstat”24,56, which estimates admixture among four populations. For this analysis, SNVs that were polymorphic in the four populations were extracted using the “admixr” package57.
We reconstructed the admixture graph under an assumed demographic scenario using “qpGraph” command in the Admixtools24,56. We selected the model in which the F4 statistics for all the population combinations were less than three. To estimate the source populations of admixture, we performed analyses using MixMapper ver. 2.038. MixMapper semi-automatically detects source populations of admixture when we specify the number of admixture events and the target of the admixed population.
Source: Ecology - nature.com