Genomic insights into local adaptation and future climate-induced vulnerability of a keystone forest tree in East Asia

Plant materials and genome sequencing

Fresh leaves of a wild P. koreana plant in the Changbai Mountains of Jilin province in China were collected, and the total genomic DNA was extracted using the CTAB method. For the Illumina short-read sequencing, paired-end libraries with insert sizes of 350 bp were constructed and sequenced using an Illumina HiSeq X Ten platform. For the long-read sequencing, the genomic libraries with 20-kbp insertions were constructed and sequenced using the PromethION platform of Oxford Nanopore Technologies (ONT). For the Hi-C experiment, approximately 3 g of fresh young leaves of the same P. koreana accession was ground to powder in liquid nitrogen. A sequencing library was then constructed by chromatin extraction and digestion, DNA ligation, purification, and fragmentation⁵³ and was subsequently sequenced on an Illumina HiSeq X Ten platform.

Genome assembly and scaffolding

The quality-controlled reads were first corrected via a self-align method using the NextCorrect module in the software NextDenovo v2.0-beta.1 (https://github.com/Nextomics/NextDenovo) with parameters “reads_cutoff=1k (filter reads with length <1kbp) and seed_cutoff=32k (minimum seed length = 32kbp)”. Smartdenovo v1.0.0 (https://github.com/ruanjue/smartdenovo) was then used to assemble the draft genome with the default parameters. To improve the accuracy of the draft assembly, two-step polishing strategies were applied. The first step included three rounds of polishing by Racon v1.3.1⁵⁴ based on the corrected ONT long reads. The second step included four rounds of polishing by Nextpolish v1.0.5⁵⁵ based on cleaned Illumina short reads after removing adapters and low-quality reads using fastp v0.20.0⁵⁶ with parameters ‘-f 5 -F 5 -t 5 -T 5 -n 0 -q 20 -u 20’ (parameters ‘-f 5 –F 5 -t 5 –T 5’ were used to trim five bases in the front and tail for both read1 and read2; parameters ‘-n 0 -q 20 –u 20’ were used to keep reads with Phred quality >20, percent of unqualified bases <20, and with no N base). Finally, allelic haplotigs were removed using the purge_haplotigs v1.1.1⁵⁷ software with the options ‘-l 5 -m66 -h 170’ (set read depth between 5 and 170 and the low point between the haploid and diploid peaks as 66) to obtain the final contig-level assembly.

For chromosome-level scaffolding, the Hi-C reads were first filtered by fastp v0.20.0 with the parameters described above. Each pair of the clean reads was then aligned onto the contig-level assembly by Bowtie2 v2.3.2⁵⁸ with parameters ‘-end-to-end, -very-sensitive -L 30’. The quality of Hi-C data was evaluated by HiC-Pro v2.11.4⁵⁹, which further classified read pairs as valid or invalid interaction pairs. Only valid interaction pairs were retained for further analysis. Finally, scaffolds were clustered, ordered and oriented onto chromosomes using LACHESIS⁶⁰ with parameters: CLUSTER MIN RE SITES = 100; CLUSTER MAX LINK DENSITY = 2.5; CLUSTER NONINFORMATIVE RATIO = 1.4; ORDER MIN N RES IN TRUNK = 60; ORDER MIN N RES IN SHREDS = 60. The placement and orientation errors that exhibit obvious discrete chromosome interaction patterns were then manually adjusted.

The completeness of the genome assembly was assessed both by the representation of Illumina whole-genome sequencing short reads from mapping back read to the assembly using bwa v0.7.12⁶¹ and by Benchmarking Universal Single-Copy Orthologs (BUSCO) v4.0.5⁶² with the searching database of “embryophyte_odb10”.

Repeat and gene annotation

For repeat annotation, we used the Extensive de-novo TE Annotator (EDTA v1.9.3)⁶³, which incorporates well-performed structure-based and homology-based programs (including LTRharvest, LTR_FINDER, LTR_retriever, TIR-learner, HelitronScanner, and RepeatModeler) and subsequent filtering scripts, for a comprehensive repeat detection. Subsequently, TEsorter (v1.2.5, https://github.com/zhangrengang/TEsorter/)⁶⁴ was used to reclassify those TEs that were annotated as “LTR/unknown” by EDTA.

For gene annotation, we first used RepeatMasker v4.1.0⁶⁵ to mask the whole-genome sequences with the TE library constructed using EDTA. An integrated strategy that combined homology-based prediction, transcriptome-based prediction, and ab initio prediction was used to predict the protein-coding genes. For homology-based gene prediction, published protein sequences of six plant species, including Populus euphratica, Salix brachista, Salix purpurea, Populus trichocarpa, Arabidopsis thaliana, and Vitis vinifera, were downloaded and aligned onto the repat-masked genome by using the TBLASTN (ncbi-BLAST v2.2.28⁶⁶) program with E-value cutoff setting of 1e⁻⁵, and GeneWise v2.4.1⁶⁷ was then used to predict gene models with default settings. For transcriptome-based gene prediction, trimmed RNA sequencing reads from leaf, stem, and bud tissues were mapped to the reference genome using HISAT v2.2.1⁶⁸ with parameters “-max-intronlen (maximum intron length) 20000 –dta (report alignments tailored for transcript assemblers including StringTie) -score-min (set a function governing the minimum alignment score needed for an alignment to be considered “valid”) L, 0.0, -0.4”, and Trinity v2.8.4⁶⁹ was used for transcripts assembly with default parameters. Assembled transcripts were subsequently aligned to the corresponding genome to predict gene structure using PASA v2.4.1⁷⁰. For the ab initio prediction, Augustus v3.3.2⁷¹ was employed using default parameters after incorporating the transcriptome-based and homology-based evidence for gene model training. Finally, all predictions of gene models generated from these approaches were integrated into the final consensus gene set using EvidenceModelerv1.1.1⁷⁰. After prediction, PASA was again used to update alternatively spliced isoforms to gene models and to produce a final gff3 file with three rounds of iteration.

In addition, we also performed non-coding RNAs (ncRNAs) annotation. Transfer RNAs (tRNAs) were identified using tRNAscan-SE v2.0.7⁷² with default parameters. Ribosomal RNAs (rRNAs) were identified by aligning rRNA genes of P. trichocarpa_v3.1 to the assembly using blast. The other three types of ncRNA (microRNA, small nuclear RNA, and small nucleolar RNA) were identified using Infernal v1.1.4⁷³ by searching the Rfam database v12.0⁷⁴.

For functional annotation, our predicted protein-coding genes were aligned to multiple public databases including NR, Swiss-Prot, TrEMBL⁷⁵, COG, and KOG using NCBI BLAST + v.2.2.31 with an E-value of 1e−5 as the cutoff⁶⁶. Motifs and domains were annotated by searching against InterProScan (release 5.32-71.0)⁷⁶. Gene ontology (GO) terms and KEGG pathways of predicted sequences were assigned by InterProScan and KEGG Automatic Annotation Server, respectively⁷⁷.

Genome resequencing, read mapping, and variant calling

A total of 230 individuals were collected from 24 natural populations across the total distribution of the species. Within each population, individuals were sampled after ensuring that sampled individuals were at least 100 m apart from each other. Genomic DNA was extracted from leaves with a Qiagen DNeasy plant kit, and whole-genome paired-end sequencing was generated using the Illumina NovaSeq 6000 platform with a target coverage of 20× per individual.

For raw resequencing reads, we used Trimmomatic v0.36⁷⁸ to remove adapters and cut off bases from either the start or the end of reads if the base quality was <20. Trimmed reads shorter than 36 bases were further discarded. After quality control, all high-quality reads were mapped to our de novo assembled P. koreana genome using the BWA-MEM algorithm of bwa v.0.7.17⁶¹ with default parameters. The alignment results were then processed by sorting and PCR duplicate marking using SAMtools v.1.9⁷⁹ and Picard v.2.18.11 (http://broadinstitute.github.io/picard/). For genetic variant identification, SNP and indel calling were performed using the Genome Analysis Toolkit (GATK v.4.0.5.1)⁸⁰ and its subcomponents HaplotypeCaller, CombineGVCFs, and GenotypeGVCFs to form a merged VCF file with “all sites” (including non-variant sites) included using the ‘EMIT_ALL_SITES’ flag. SV calling was performed using the software DELLY v0.8.3⁸¹ with default parameters. We further performed multiple filtering steps to only retain high-quality variants for downstream analysis. For SNPs, SNPs with multi-alleles (>2) and those located at or within 5 bp from any indels were removed. In addition, after treating genotypes with read depth (DP) < 5 and genotype quality (GQ) < 10 as missing, SNPs with missing rate higher than 20% were filtered; for indels, those with muti-alleles (>2) and with QualByDepth (QD) < 2.0, strand bias estimated using Fisher’s exact test (FS) > 200.0, StrandOddsRatio (SOR) > 10.0, MappingQualityRankSumTest (MQRankSum) <-12.5, ReadPosRankSum < -8.0 were removed. Indels with missing rate >20% after treating genotype with DP < 5 and GQ < 10 as missing were further filtered out; for SVs, those with length <50 bp and with imprecise breakpoints (flag IMPRECISE) were removed. After treating genotypes with GQ < 10 as missing, we further filtered SVs with a missing rate >20%. Finally, we implemented the software SNPable (http://lh3lh3.users.sourceforge.net/snpable.shtml) to mask genomic regions where reads were not uniquely mapped, and we filtered out variants located in these regions. After these filtering steps, 16,619,620 SNPs, 2,663,202 indels, and 90,357 SVs were retained for subsequent analyses. The filtered variants were further phased and imputed using Beagle v4.1⁸² and the effects of individual variants were annotated using SnpEff v.4.3⁸³ with “-ud 2000” to define the length of upstream and downstream regions around genes, with other parameters being set to default.

Ecological niche modeling

To investigate the current (1970–2000) potential distribution range of P. koreana around China, we performed ecological niche modeling (ENM) using Maxent v.3.3.3⁸⁴ with 19 current bioclimatic variables (Supplementary Table 11). In addition to the geographic distribution data of our 24 natural populations (Supplementary Data 1), we also added another seven geographical data from the Chinese Virtual Herbarium (https://www.cvh.ac.cn/) and the Global Biodiversity Information Facility (https://www.gbif.org/zh/) into the ENM analyses, which was performed with default setting after excluding the highly correlated environmental variables (Spearman correlation coefficient >0.7).

Population structure analysis

We first used PLINK v1.90⁸⁵ with the parameters “indep-pairwise 50 10 0.2” to extract an LD-pruned SNP set with minor allele frequency (MAF) > 5%, which yielded 535,191 independent SNPs to be used in the population structure analysis. First, we used ADMIXTURE v.1.3.0⁸⁶ with default parameters to investigate population genetic structure across all individuals, with the number of clusters (K) being set from 1 to 8. We then used the rda function from the R package vegan 2.6-2⁸⁷ to perform the PCA on the pruned SNPs. To further assess the relatedness between individuals, the identify-by-state (IBS) genetic distance matrix was calculated using the “-distance 1-ibs” parameter in PLINK v1.90. We constructed the NJ phylogenetic tree based on the distance matrix using MEGAX⁸⁸ and displayed the tree using FigTree v.1.4.4. Finally, for the IBD analysis, we first used VCFtools v0.1.15⁸⁹ to calculate the population differentiation coefficient (F_ST). The matrix of F_ST (1−F_ST) and the matrix of geographic distance (km) among different groups of populations were then used for performing the Mantel tests using the R package vegan⁸⁷, with the significance being determined based on 999 permutations.

Genetic diversity, linkage disequilibrium and demographic history analysis

To estimate and compare genetic diversity across populations of P. koreana, we calculated both intra-population (π) and inter-population (d_xy) nucleotide diversity after taking into account both polymorphic and monomorphic sites using the program pixy v0.95.0⁹⁰ over 100-kbp non-overlapping windows. In addition, Tajima’s D statistics were calculated using VCFtools v0.1.15 in 100-kbp non-overlapping windows for the northern and southern groups of populations using the complete SNP dataset, respectively. To further estimate and compare the pattern of LD among different groups of populations, PopLDdecay v.3.40⁹¹ was used to calculate the squared correlation coefficient (r²) between pairwise SNPs with MAF > 0.1 in a 100-kbp window and then averaged across the whole genome.

PSMC⁹² was used to infer historical changes in the effective population size (N_e) of P. koreana using default parameters with the entire genomic dataset. We selected seven individuals from both the northern and southern groups of populations to run the PSMC analyses, and 100 bootstrap estimates were performed per individual. Assuming a generation time of 15 years and a mutation rate of 3.75 × 10⁻⁸ mutations per generation⁹³, we converted the scaled population parameters into N_e and years.

Identification of environment-associated genetic variants

We used two different approaches to identify environment-associated variants (SNPs, indels, and SVs) across the whole genome. We kept only common variants with MAF > 10%, including a total of 5,182,474 SNPs, 736,051 indels and 30,934 SVs, for these analyses. First, we used a univariate latent-factor linear mixed model (LFMM) implemented in the R package LEA v3.3.2⁹⁴ to search for associations between allele frequencies and the 19 BIOCLIM environmental variables⁹⁵. Based on the number of ancestry clusters inferred with ADMIXTURE v.1.3.0, we ran LFMM with three latent factors to account for population structure in the genotype data. For each environmental variable, we ran five independent MCMC runs using 5000 iterations as burn-in followed by 10,000 iterations. P values from all five runs were then averaged for each variant and adjusted for multiple tests using a false discovery rate (FDR) correction of 5% as the significance cutoff.

Second, we performed a redundancy analysis (RDA) to identify genetic variants showing an especially strong relationship with multivariate environmental axes^29,96. RDA has been shown to be one of the best-performing multivariant GEA approaches and exhibits low false-positive rates²⁹. After considering the ranked importance of the 19 environmental variables estimated using GF analyses with R package “gradientForest”⁴⁶ and correlations among the variables, six environmental variables (BIO1, BIO3, BIO5, BIO13, BIO15, and BIO19) with pairwise correlation coefficients |r | < 0.6 were selected for the RDA analyses using the R package vegan. Significant environment-associated variants were defined as those having loadings in the tails of the distribution using a standard deviation cutoff of 3 along one or more RDA axes.

To investigate and compare the role of geography and environment in shaping spatial genetic variation of adaptive (the 1779 adaptive variants identified by both LFMM and RDA) and neutral (the 535,191 LD-pruned SNPs as used for population structure analyses) variants, Mantel and partial Mantel tests were separately used to test for associations between F_ST(F_ST/1−F_ST) and geographic (IBD) and environmental (IBE) distance (after accounting for the geographic distance) with significance determined using 999 permutations in the R package vegan⁸⁷, where environmental distance was represented by Euclidean distance of all scaled environmental variables. In addition, we used partial RDA to quantify the relative contribution of geography, population structure, and environment in explaining the proportion of adaptive and neutral genetic variation. Three datasets were used: (1) six uncorrected environmental variables used as in the above RDA analysis (‘clim’); (2) three proxies of population structure (population scores along the first three axes of PCA, ‘struct’); and (3) population coordinates (latitude and longitude, ‘geog’) to characterize explanatory variables of climate, population structure, and geography. For the two RDA models, population allele frequencies of adaptive and neutral variants were used as the response variables, respectively. The significance of explanatory variables was assessed using 999 permutations with the function anova.cca of the R package vegan.

To further assess selection pressures acting on climate adaptive variants, we assessed the extended haplotype homozygosity (EHH) pattern for a selected set of strongly associated variants using the R package “rehh”⁹⁷ and calculated the standardized iHS across the genome for common variants using the software selscan v.1.3.0⁹⁸.

Stress treatment and expression analysis by qRT-PCR

Stem segments from wild genotypes of P. koreana were surface sterilized by soaking in 10% sodium hypochlorite solution and 70% ethyl alcohol for 5 min and then thoroughly washed five times with distilled water. The stem segments were inserted into MS medium (0.05 mg/L NAA) for 30 days at 25/20 °C (day 16 h/night 8 h), and, after rooting, the stem segments were transplanted to the soil for 40 days at 25/20 °C (day 16 h/night 8 h). To explore the effect of different genotypes of one candidate adaptive SNP located in the 5’ UTR of Pokor12247 (LG04:25159299) in mediating adaptation to extreme precipitation, we carried out a submergence treatment. For the submergence treatment, water was maintained at 2 cm above the soil surface and plants were maintained in the growth chamber providing 25 °C/20 °C (day 16 h/night 8 h) for 0, 3, 6, 9, and 12 h. In addition, we also carried out a heat stress treatment to explore the effect of one candidate adaptive SNP located in the intronic region of Pokor17228 (LG07: 4796402) in response to heat stress. For the heat stress treatment, plants were placed into a plant incubator at 42 °C/20 °C (day/night) with the illumination of 16/8 h (day/night) for 0, 1, 2, 3, and 24 h. At each time point, leaf tissues were collected from each plant at the same place and frozen immediately in liquid nitrogen for expression analyses.

Quantitative reverse transcription PCR (qRT-PCR)⁹⁹ was used to investigate the expression levels of selected genes in the abiotic treatments (Pokor12247 for submergence stress and Pokor17228 for heat stress). Total RNA was extracted from pooled leaf materials using a Plant RNA extract kit (Biofit, Chengdu, China), and the HiScript II RT SuperMix for qPCR kit (+gDNA wiper) (Vazyme, Nanjing, China) was used to obtain cDNA. qPCR was performed with gene-specific primers (Supplementary Table 14) using the Taq Pro Universal SYBR qPCR Master Mix (Vazyme, Nanjing, China) reaction system on the CFX96 Real-Time detection system (Bio-Rad). Each experiment was performed with three technical replicates, and the UBQ10 was used as the endogenous control for data analysis.

ATAC-seq analysis

For the ATAC experiment, fresh leaf tissues were collected from the same individual used for the genome assembly of P. koreana and prepared according to the experimental protocol following ref. 100. In brief, approximately 500 mg of flash-frozen leaves were immediately chopped and processed for ATAC-seq, followed by library construction, and were then subjected to sequencing on the Illumina HiSeq X-Ten platform. The raw reads generated were first trimmed using Trimmomatic v.0.36⁷⁸ with a maximum of two seed mismatches, and the adapters were trimmed by NexteraPE. Then, the clean reads were aligned to the reference genome using Bowtie v.2.3.2⁵⁸ using the following parameters: ‘bowtie2 -very-sensitive -N 1 -p 4 -X 2000 -q’ (the number of mismatches allowed for seed comparison was set to 1, the threads were set to 4, and the longest insertion clip length was set to 2000). Aligned reads were sorted using SAMtools v.1.1.1⁷⁹. The redundant reads from PCR amplification and reads that mapped to either chloroplast or mitochondria were removed using Picard v.2.18.11 (http://broadinstitute.github.io/picard/). Finally, only high-quality properly paired reads were retained for further analysis. ATAC-seq peak calling was done by MACS2¹⁰¹ with the ‘-keep dup all’ function.

Genomic offset assessment

For each sampling location, we downloaded future (2061–2080 and 2081–2100) environmental data for the 19 BIOCLIM variables from the WorldClim CMIP6 dataset of four different climate models (BCC-CSM2-MR model, ACCESS-CM2 model, CanESM5 model, and GISS-E2-1-G model; resolution 2.5 arcmin)⁹⁵. Each of the two future environmental datasets consists of two shared socioeconomic pathways (SSPs): SSP126 and SSP370. We used three different approaches to evaluate the genomic offset to future climate change.

First, we calculated the RONA¹⁸, which quantifies the theoretical average change in allele frequency needed to cope with climate change, under projected future climate scenarios. Following the method used in ref. 24, a linear relationship between allele frequencies at significantly associated loci (detected by both LFMM and RDA) and environmental variables was first established using linear regressions. For each locus, population, and environmental variable, the theoretical allele frequency change needed to cope with future climate conditions (RONA) was calculated for each of the four climate models and then combined, and the average RONA values were further weighted by the R² for each linear regression following ref. 45. In addition, to explore and compare the patterns of RONA calculated by different types of adaptive variants (SNPs, indels, and SVs), we further calculated RONA using the three separated datasets, respectively, for two representative environmental variables (BIO5 and BIO13).

Second, as a complementary approach to RONA, we used a non-parametric, machine-learning GF analysis to calculate genomic offset across the range of P. koreana using ‘gradientForest’ in R^17,46. For each given climate model, we built a GF model for estimating the genetic offset under the different future scenarios with both the 19 environmental variables and the six environmental variables. The genetic offset was calculated as a metric for the Euclidean distance of the genomic composition between the current and future projected climates and then mapped with ArcGIS 10.2 to display its geographical distribution. Given the high correlation of genetic offset that was estimated across models, the average values of genetic offset across models were used to predict the local maladaptation to future climate. Same as RONA, the genetic offset was also calculated and compared among the three different types of variants (SNPs, indels, and SVs).

Third, following the approach used in ref. 23, we integrated migration to predict potential maladaptation to future climate change and calculated three different formulations of genetic offset: the local, forward, and reverse offsets for each climate model. After quantifying the correlations of local, forward, and reverse offsets across the four different climate models, we predicted the three genetic offset metrics by calculating the average values across the climate models. For forward genetic offset, we further assessed its sensitivity to dispersal constraints and tested how forward offset varied when the maximum allowable migration was limited to different distance classes, including 100, 250, 500, and 1000 km, and unlimited to any location in the Eurasian continent. Furthermore, to visualize local, forward, and reverse offsets simultaneously, we mapped these three metrics as the red, green, and blue bands of an RGB image, respectively, as in ref. 23.

Finally, we explored whether there was an association between the genomic offset to future climate change and the accumulation of genetic load across populations of P. koreana. To assess the deleterious genetic load carried by each population, the effects of SNP variants on protein-coding gene sequences were first annotated and categorized as LOF, deleterious (SIFT score < 0.05), tolerated (SIFT score < 0.05), or synonymous based on the sorting intolerant from tolerant (SIFT) algorithm implemented in SIFT4G software using UniRef100 as the protein database⁴⁸. The derived versus ancestral allelic state was determined at each SNP position using the est-sfs software¹⁰² through comparison with P. trichocarpa sequences¹⁰³. Then, the ratio between the number of derived mutations at LOF, deleterious, and tolerated sites relative to the number of synonymous variants was calculated and used as proxies for genetic load per population. In addition, as SVs are, on average, deleterious, we further calculated the SV burden represented by the averaged ratio of heterozygous SV to heterozygous SNP across individuals for each population. Finally, we used the cor.test function in R to calculate the Spearman’s correlation coefficients between the three metrics of genetic offsets (local, forward, and reverse) under two future scenarios (SSP126 and SSP370) in 2061–2080 and the above proxies of genetic load across the 24 populations, respectively.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.