Field sampling
Lake Masoko fish were chased into fixed gill nets and SCUBA by a team of professional divers at different target depths determined by diver depth gauge (12× male benthic, 12× male littoral). Riverine fish (11× Mbaka River and 1× Itupi river) were collected by local fishermen. On collection, all fish were euthanized using clove oil. Collection of wild fish was done in accordance with local regulations and permits in 2015, 2016, 2018 and 2019. On collection, fish were immediately photographed with color and metric scales, and tissues were dissected and stored in RNAlater (Sigma-Aldrich); some samples were first stored in ethanol. Only male specimens (showing bright nuptial coloration) were used in this study for the practical reason of avoiding any misassignment of individuals to the wrong population (only male individuals show clear differences in phenotypes and could therefore be reliably assigned to a population). Furthermore, we assumed that any epigenetic divergence relevant to speciation should be contributing to between-population differences in traits possessed by both sexes (habitat occupancy, diet). To investigate the role of epigenetics in phenotypic diversification and adaptation to different diets, homogenized liver tissue – a largely homogenous and key organ involved in dietary metabolism, hormone production and hematopoiesis – was used for all RNA-seq and WGBS experiments.
Common-garden experiment
Common-garden fish were bred from wild-caught fish specimens, collected and imported at the same time by a team of professional aquarium fish collectors according to approved veterinary regulations of the University of Bangor, UK. Wild-caught fish were acclimatized to laboratory tanks and reared to produce first-generation (G1) common-garden fish, which were reared under the same controlled laboratory conditions in separate tanks (light–dark cycles, diet: algae flakes daily, 2–3 times weekly frozen diet) for approximately 6 months (post hatching). G1 adult males showing bright nuptial colors were culled at the same biological stages (6 months post hatching) using MS222 in accordance with the veterinary regulations of the University of Bangor, UK. Immediately on culling, fish were photographed and tissues collected and snap-frozen in tubes.
Stable isotopes
To assess dietary/nutritional profiles in the three ecomorph populations, carbon (δ13C) and nitrogen (δ15N) isotope analysis of muscle samples (for the same individuals as RRBS; 12, 12 and 9 samples for benthic, littoral and riverine populations, respectively) was undertaken by elemental analyzer isotope ratio mass spectrometry by Iso-Analytical Limited. It is important to note that stable isotope analysis does not depend on the use of the same tissue as the ones used for the RRBS/WGBS samples45. Normality tests (Shapiro–Wilk, using the R package rstatix v.0.7.0), robust for small sample sizes, were performed to assess sample deviation from a Gaussian distribution. Levene’s test for homogeneity of variance was then performed (R package carData v.3.0-5) to test for homogeneity of variance across groups. Finally, Welch’s ANOVA was performed followed by Games–Howell all-pairs comparison tests with adjusted P value using Tukey’s method (rstatix v.0.7.0). Mean differences in isotope measurements and 95% CI mean differences were calculated using Dabestr v.0.3.0 with 5,000 bootstrapped resampling.
Throughout this manuscript, all box plots are defined as follows: centre line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range; points, outliers.
RNA-seq
Next-generation sequencing library preparation
Total RNA from liver tissues stored in RNAlater was extracted using a phenol/chloroform approach (TRIzol reagent; Sigma-Aldrich). Of note, when tissues for bisulphite sequencing samples were not available, additional wild-caught samples were used (Supplementary Table 3). The quality and quantity of RNA extraction were assessed using TapeStation (Agilent Technologies), Qubit and NanoDrop (Thermo Fisher Scientific). Next-generation sequencing (NGS) libraries were prepared using poly(A) tail-isolated RNA fraction and sequenced on a NovaSeq system (S4; paired-end 100/150 bp; Supplementary Table 3), yielding on average 32.9 ± 3.9 Mio reads.
Read alignment and differential gene expression analysis
Adaptor sequence in reads, low-quality bases (Phred score < 20) and reads that were too short (<20 bp) were removed using TrimGalore v.0.6.2 (options: –paired –fastqc; https://github.com/FelixKrueger/TrimGalore). Paired-end 150 bp sequencing read samples were trimmed to 100 bp (both read pairs) to account for read length differences using TrimGalore’s options: –three_prime_clip_R1 50 –three_prime_clip_R2 50. Paired-end reads were aligned to the M. zebra reference genome (GCF_000238955.4_M_zebra_UMD2a_genomic.fa) using kallisto46 v.0.46.0 (options: –bias -b 100), resulting in high mapping rates (83.9 ± 1.6%, mean ± s.d.). Using transcription levels at all annotated genes for all RNA-seq samples, unbiased hierarchical clustering was done using the R script pheatmap v.1.0.12 (Euclidean distances and complete-linkage clustering using Spearman’s correlation matrix). Differential gene expression analysis was then carried out with sleuth v.0.30.0 (ref. 47) using Wald’s test with FDR-adjusted two-sided P value (Benjamini–Hochberg method). Only genes with a q value < 0.05, log2 fold change ≥ 1.5 between any pairwise population comparison and showing high expression levels in ≥1 biological samples (maximal gene expression ≥10 transcripts per million (TPM) in any one sample, which represents the 91st percentile of gene expression in the benthic liver samples) were analysed further. Heatmaps of scaled gene expression values (z-score) for all DEGs were generated using pheatmap v.1.0.12 (Euclidean distances and complete-linkage clustering). Data for gene expression values (TPM) across different A. calliptera tissues were used from Vernaz et al.9.
High-molecular-weight genomic DNA extraction
High-molecular-weight genomic DNA from liver tissues stored in RNAlater was isolated using the QIAamp DNA Mini Kit (catalogue no. 51304; QIAGEN). The quality and quantity of the extracted DNA samples were assessed using TapeStation, Qubit and NanoDrop.
WGBS
NGS library preparation (wild and common-garden samples)
Unmethylated lambda phage genome (0.5% w/w) was first spiked in every sample (catalogue no. D1521; Promega Corporation). DNA samples were then fragmented to approximately 400 bp in length by sonication (E220 Focused-ultrasonicator, Covaris). The length and quality of DNA fragments were assessed using TapeStation. NGS libraries were prepared using approximately 400 ng sonicated DNA fraction using NEBNext Ultra II DNA Library Kit for Illumina (catalogue no. E7645; New England Biolabs) and methylated adaptors (catalogue no. E7535; New England Biolabs) according to the manufacturer’s instructions. DNA libraries were then treated with sodium bisulphite (catalogue no. MOD50; Sigma-Aldrich Imprint) according to the manufacturer’s instructions. Bisulphite-treated DNA libraries were then amplified by PCR (14 cycles) and sequenced as paired-end 150 bp reads on Illumina HiSeq 4000 and NovaSeq systems (the latter for 1 littoral and 1 benthic wild fish) to generate 322.02 ± 58.94 million paired-end reads per sample (mean ± s.d.).
WGBS read mapping
Adaptor sequence in reads, low-quality bases (Phred score ≤20) and reads that were too short (<20 bp) were removed using TrimGalore (options: –paired –fastqc). Sequencing reads (FASTQ) for the same sample generated on multiple lanes were merged. Paired-end reads were first mapped against the lambda genome (GenBank accession no. J02459) to assess bisulphite conversion (98.4 ± 1.0%, mean ± s.d. spike-in conversion rate for all wild and common-garden samples) and then to single-nucleotide polymorphism (SNP)-corrected version of the M. zebra reference genome (GCF_000238955.4_M_zebra_UMD2a_genomic.fa) to account for A. calliptera-specific genotype/SNP (following the same protocol as developed in Vernaz et al.9) using Bismark v.0.20.0 (options: -N 0 -p 4 -X 500)48. Mapping rates were similar across samples, yielding 54.7.1 ± 4.3% best unique read mapping (mean ± s.d., n = 13; Extended Data Fig. 2a), similar to the mapping rates observed in other WGBS studies49. Clonal paired-end reads (that is, PCR duplicates) were removed using Bismark’s deduplicate_bismark function (options: -p –bam). Methylation scores (read count supporting mC/total read count) at each CpG site genome-wide were extracted using Bismark’s bismark_methylation_extractor function (options: -p –multicore 6 –no_overlap –comprehensive –merge_non_CpG –bedGraph).
WGBS DMR prediction
DMR prediction was performed using DSS50 v.2.34.0 (smoothing=TRUE). First, Wald tests were performed on methylation difference at all CG sites between any two populations. Predicted DMRs consisted of CG sites showing significant methylation differences (Wald test, two-sided P < 0.05). Then, to identify putatively biologically relevant DMRs (that is, cis-regulatory elements of a typical size, possibly bound by DNA-binding proteins), the following stringent cut-off parameters were chosen based on previous methylome studies20,51,52: only DMRs that showed substantial methylation differences (≥25% average methylation difference at any one DMR), covering ≥4 CG sites and with a minimal length of ≥50 bp were analysed further. Overall, predicted DMRs showed on average approximately 45% methylation differences (Extended Data Fig. 3b,c), ranged in length from 50 to 3,000 bp (median length, 250 bp; Extended Data Fig. 3d) and covered 4–232 CG sites (median, 15 CG sites; Extended Data Fig. 3d).
WGBS methylome analysis
For subsequent analyses, only CpG sites with ≥4 ≤ 100 unique (non-clonal) paired-end read coverage were used (genome-wide CG site coverage across all samples: 9.14 ± 1.4, mean ± s.e.m.; Extended Data Fig. 2b). Methylation scores at single CpGs were calculated using Bismark output files as follows: number of methylated reads/total number of reads. Methylation levels in non-overlapping 50 bp-long genomic windows for each biological sample or each population (averaged mCG/CG levels) were generated with BEDTools v.2.27.1 (ref. 53) and visualized as BIGWIG files (bedGraphToBigWig v.4; https://genome.ucsc.edu/) in IGV genome browser v.2.9.2 (Broad Institute). PCA (centred and scaled) was carried out using R v.3.6.3 (prcomp) using all CG sites. Unbiased hierarchical clustering (complete-linkage clustering method) was carried out using R based on Euclidian distances (dist) of pairwise Spearman’s correlation scores (cor). Heatmaps were created using pheatmap (complete-linkage clustering method using Euclidean distances). Circos plots were generated on R using circlize v.0.4.12 to visualize DMR genomic distribution across LG chromosomes only (NC chromosomes). Transcription factor binding motif enrichment analysis within DMR sequences was performed on DMRs both located outside gene bodies (excluding the first 1 kbp downstream transcription start site [TSS]) and in promoters/intergenic regions using HOMER v.4.9.1-6 (‘findMotifs.pl’ to identify enriched motifs; scrambleFasta.pl on DMR FASTA sequence to generate background sequences [approximately 50,000 scrambled sequences]).
Fixed/reset DMRs
DMRs predicted between any pairwise comparisons of wild populations (from Fig. 1g) were considered fixed when statistically found between the respective common-garden populations, consistently across all samples and with the same methylation direction (Wald test two-sided P < 0.05), or reset if no longer significant (P ≥ 0.05) or if showing change in methylation direction. The within-population methylome variation arising from the common-garden experiment itself was excluded from this analysis. Wild DMRs found among wild riverine, littoral and benthic fish were merged when found in >1 pairwise comparison using the BEDTools mergeBed function (v.2.27.1). Methylation levels at all wild DMRs were then plotted using pheatmap (Fig. 3a).
RRBS
NGS library preparation and analysis
High-molecular-weight gDNA from liver tissue from 12 adult male fish per ecomorph (36 in total) was isolated using a modified version of the Wizard Genomic DNA Purification Kit (Promega Corporation). The quality and quantity of extracted DNA samples were assessed using Qubit and NanoDrop. Approximatively 100 ng of liver high-molecular-weight gDNA were used to make RRBS libraries according to the manufacturer’s instructions (Premium RRBS kit, catalogue no. C02030032; Diagenode). Each of the three RRBS sequencing libraries multiplexed 12 different samples, with ecomorph representation randomized among libraries. The quality and quantity of all libraries were assessed using TapeStation, Qubit and NanoDrop. RRBS libraries were sequenced on an Illumina NextSeq 500 system (to generate single-end 75 bp-long reads).
Due to poor read quality and low read counts, assessed using FastQC, one riverine ecomorph sample was excluded from further analysis. Analysis of spike-in controls gave a mean CpG bisulphite conversion efficiency across samples of 98.6%. Adaptor sequence in reads, low-quality end bases (Phred score ≤20), reads that were too short (<20 bp) and the first 5 bp (5′-end to avoid sequencing bias) were removed using TrimGalore (options: –rrbs –fastqc –clip_R1 5). In total, after quality trimming, there were 11.1 ± 3.4 Mio reads per RRBS sample (mean ± s.d.). Reads were then aligned to the same M. zebra reference genome (GCF_000238955.4_M_zebra_UMD2a_genomic.fa; see above) using Bismark (options: -N 1; ref. 48). Mapping rates were 83.8 ± 0.8%, 81.6 ± 1.7% and 82.9 ± 1.6%, for benthic, river and littoral populations, respectively, similar to what has been reported in other RRBS studies54. Differences in mapping rates between the WGBS and RRBS datasets stem from technical and sequencing differences, such as sequencing read length, single-/paired-end reads and genome coverage48. Methylation scores (read count supporting mC/total read count) at each CpG site genome-wide were extracted using Bismark’s bismark_methylation_extractor (options: -s –multicore 4 –comprehensive –merge_non_CpG –bedGraph). PCA of methylation levels at CpG sites found across all samples (common CG sites, n = 151,900) was carried out using R ‘prcomp’ (centered and scaled). MANOVA followed by post-hoc Games–Howell multiple comparison tests using Tukey’s correction were used to assess PC1 and PC2 score differences between populations using the R packages stats v.3.6.3 and rstatix.
RRBS DMR prediction
RRBS DMRs were identified using the same method as for WGBS DMRs (see above).
Genomic annotation
DMR localization
Since no functional annotation exists for Lake Malawi cichlid genomes, promoter regions were defined in silico as regions ±1 kbp around the TSS. Gene bodies comprise exon and intron minus the first 1 kbp downstream of the TSS to avoid any overlap with promoter regions. Intergenic regions were defined as regions outside promoters and gene bodies for DMR localization. Only transposon repeats (TE) were analysed (excluding simple repeats, low complexity repeats, ribosomal RNA repeats and satellite repeats) and were annotated using RepeatMasker v.4.0.9 according to Vernaz et al.9. The annotation of CGIs was defined as in Vernaz et al.9. Overlaps between DMR coordinates and each respective genomic annotation were counted using the BEDTools intersectBed function (v.2.27.1).
Enrichment for genomic features
Enrichment for methylome divergence (DMR) in different genomic features was performed by dividing the observed number of DMRs overlapping each genomic feature by the expected values (observed/expected ratio). The expected values were obtained by randomly shuffling the DMR coordinates genome-wide (1,000 iterations) for each genomic feature (BEDTools shuffleBed). One-sample t-tests were performed to test whether expected values were significantly different from the observed values. Chi-squared tests (R) were then performed for all observed/expected distributions among the three DMR comparison groups for each genomic feature.
Assignment of DMRs to genes and GO
DMRs were assigned to genes when located in gene promoters (that is, TSS ± 1 kbp [promoter DMRs]; BEDTools intersect -f 0.5, ≥50% DMR sequence overlap required), in their gene bodies (excluding the first 1 kbp downstream TSS [gene DMRs]; BEDTools intersect -f 0.5, ≥50% DMR sequence overlap). When located outside promoter and gene bodies, intergenic DMRs were assigned to the closest gene if located 1–5 kbp away from it (closestBed; v.2.27.1). DMRs were associated with DEGs following the same method. An exact hypergeometric test (and representation factor) for the overlap between promoter DMRs and DEGs was performed. GO enrichment analysis using the genes associated with each DMR category was then performed using g:Profiler (https://biit.cs.ut.ee/gprofiler/gost; version March 2021; ref. 55). Only annotated genes for M. zebra were used with a statistical cut-off of FDR < 0.05.
Colocalization with HDRs
The coordinates of HDRs from Malinsky et al.16 were translated to the UMD2a M. zebra reference genome (GCF_000238955.4_M_zebra_UMD2a_genomic.fa) using the UCSC liftOver tool (namely, axtChain and liftOver; kent source v.418), based on a whole-genome alignment between the original by Brawand et al.1 (https://www.ncbi.nlm.nih.gov/assembly/GCF_000238955.1) and the UMD2a M. zebra genome assemblies. The pairwise whole-genome alignment was generated using lastz v.1.02 (ref. 56) with the following parameters: “B = 2 C = 0 E = 150 H = 0 K = 4,500 L = 3,000 M = 254 O = 600 Q = human_chimp.v2.q T = 2 Y = 15,000”. This was followed by using the USCS genome utilities (https://genome.ucsc.edu/util.html) axtChain tool with -minScore = 5,000. Additional tools with default parameters were then used after the UCSC whole-genome alignment paradigm (http://genomewiki.ucsc.edu/index.php/Whole_genome_alignment_howto) to obtain a contiguous pairwise alignment and the ‘chain’ file input for liftOver. All 98 HDRs mapped to the new assembly, although some HDRs were split into more than 1 region in the UMD2a assembly, resulting in 141 regions. The distances between DMRs between littoral and benthic populations and the closest HDR were inferred using the BEDTools closestBed function v.2.27.1 (ref. 53).
RRBS-WGBS cross-validation
To cross-compare the RRBS and WGBS datasets and validate the use of the whole-genome unbiased methylome sequencing technique, methylation variation at the WGBS DMRs using the RRBS methylome data (n = 413 DMRs in total) was analysed. In detail, methylome levels for all RRBS samples were averaged over all DMRs predicted using the WGBS samples (BEDTools intersect). Unbiased hierarchical clustering and the heatmap of the Spearman’s correlation matrix using RRBS methylome variation at the WGBS DMRs were produced using pheatmap (Euclidean distances and complete-linkage clustering). The same clustering and heatmap approaches were used to plot methylation levels (averaged mCG/CG per population for both WGBS and RRBS samples) of RRBS samples at WGBS DMRs.
Correlation of DNA methylation and transcription activity
To assess the overall correlation between DNA methylation and transcriptional activity, all annotated genes were split into 5 categories based on their gene expression levels: from genes not expressed (TPM < 5; ‘OFF’, n = 24,598) to expressed genes (4 ‘ON’ categories; TPM ≥ 5), from lowest to highest gene expression activity (n = 1,269–1,270 genes for the ON categories) using the tidyverse v.1.3.1 function cut_number. For each gene expression category, the average methylome profile (average mCG/CG) from 2 kbp upstream of the TSS to 2 kbp downstream of the transcription end site including the entire gene bodies were plotted using deepTools v.3.2.1. Spearman correlation tests were performed between transcriptional activity and methylation levels at gene bodies and promoters using cor.test (‘stats’ R package v.4.2.0). Benthic individuals were used for this analysis and are representative of the other populations (n = 3 biological replicates for liver WGBS [average mCG/CG levels] and n = 5 biological replicates for liver RNA-seq).
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Source: Ecology - nature.com