The genome of Shorea leprosula (Dipterocarpaceae) highlights the ecological relevance of drought in aseasonal tropical rainforests
Sequencing of Shorea leprosula genomeSample collectionLeaf samples of S. leprosula were obtained from a reproductively mature (diameter at breast height, 50 cm) diploid tree B1_19 (DNA ID 214) grown in the Dipterocarp Arboretum, Forest Research Institute Malaysia (FRIM).DNA extractionGenomic DNA was extracted from leaf samples using the 2% cetyltrimethylammonium bromide (CTAB) method90 and purified using a High Pure PCR Template Purification kit (Roche).Library preparation and sequencingPaired-end (170, 500, and 800 bp) and mate-pair (2 kb) genomic libraries were prepared using a TruSeq DNA Library Preparation kit (Illumina) and a Mate Pair Library Preparation kit (Illumina), respectively. Mate-pair libraries with larger insert sizes were constructed using a Nextera Mate Pair Library Preparation kit (Illumina). Ten micrograms of genomic DNA were tagmented in a 400 μl reaction and fractionated using SageELF, with the recovery of 11 fractions with 3–16+ kb. Each fraction was circularized and fragmented with a Covaris S2. Biotin-containing fragments were purified using Dynabeads M-280 streptavidin beads. Sequencing adapters (KAPA TruSeq Adapter kit) were attached using a KAPA Hyper Prep kit. The libraries were amplified for 10–13 cycles and purified with 0.8× AMpure XP. DNA libraries were then sequenced (~388× coverage) using Illumina HiSeq2000 (TruSeq libraries) and HiSeq2500 (Nextera libraries) at the Functional Genomics Center Zurich (FGCZ), University of Zurich, Switzerland (Supplementary Table 1).Genome assemblyAdapters and low-quality bases for all paired-end and mate-pair reads were removed using Trimmomatic91. The filtered paired-end reads of the 170 bp library were used to identify the genome size using k-mer distribution generated by Jellyfish92 that was implemented in the scripts by Joseph Ryan42. The raw R1 reads from paired-end 170 and 800 bp libraries (clipped at 95 bp, representing about 70 genome equivalents) were used to estimate the heterozygosity using KAT43 with a k-mer size of 23 nt. De novo genome assembly of all reads was performed using ALLPATHSLG assembler v5248840.Assembly verification and assessment of the assembled genomeAssembly validationTo validate the genome assembly, we mapped (i) the short reads used for the genome assembly, (ii) scanned the assembly for the presence of single-copy orthologs, and (iii) mapped transcriptome sequences obtained from seven organs.Assembly verification by mapping of short readsFor each library used for genome assembly, all trimmed reads were aligned to the assembled S. leprosula genome using Burrows–Wheeler Aligner (BWA) v0.7.1293. Then, mapping ratio was calculated for each BAM file using Samtools94 with “flagstat” command.Identification of highly conserved single-copy orthologsBUSCO v3.1.042 was run with the Embryophyta dataset and Arabidopsis as the species for AUGUSTUS prediction (see subsection below “Protein-coding gene prediction”).Assembly verification by mapping transcriptome sequencesFor mapping transcriptome sequences, samples of seven organs (leaf bud, flower bud, flower, inner bark, small seed, large seed, and calyx) were obtained from the S. leprosula individual used for the genome sequencing (Supplementary Table 2). Total RNA was extracted from each sample using RNeasy Plant Mini Kit (Qiagen) and it was treated with Turbo DNase I (Takara). Library preparation was carried out using a TruSeq RNA Library Preparation kit (Illumina). Paired-end sequencing was conducted for all the libraries using Illumina HiSeq2000 at the FGCZ, University of Zurich, Switzerland. Adapters and low-quality bases for all paired-end reads were removed using Trimmomatic. The trimmed sequences of each library were mapped onto the assembled genome using STAR aligner v2.4.2a95, and mapping ratio was obtained from the output file of STAR.Genome annotationRepeat sequence analysisBoth homology-based and de novo prediction analyses were used to identify the repeat content in the S. leprosula assembly. For the homology-based analysis, we used Repbase (version 20120418) to perform a TE search with RepeatMasker (4.0.5) and the WuBlast search engine. For the de novo prediction analysis, we used RepeatModeler to construct a TE library. Elements within the library were then classified by homology to Repbase sequences (see subsection below “Preparation of repeat sequences for evidence-based gene prediction”).Protein-coding gene predictionS. leprosula protein-coding genes were predicted by AUGUSTUS v3.245. For ab initio gene prediction, we used a pre-trained A. thaliana metaparameter implemented in AUGUSTUS. For the evidence-based gene prediction, we used the information of exon, intron and repeat sequences of S. leprosula as hints for the AUGUSTUS gene prediction. The details of the preparation of the hints were described in the following subsections.Preparation of repeat sequences for evidence-based gene predictionWe used RepeatModeler to construct a de novo library of repeated sequences in the S. leprosula assembly. Then, using RepeatMasker, we generated a file containing the information of the positions of repeat sequences in the S. leprosula genome based on the RepeatModeler library. Elements within the library were then classified by homology to Repbase sequences. Finally, the hint file for repeat sequences in GFF format was prepared using the two scripts, “10_makeGffRm.pl” and “12_makeTeHints.pl”, stored in https://gitlab.com/rbrisk/ahalassembly.Preparation of the exon and intron information for evidence-based gene predictionTo obtain the exon and intron hints, we used the mapping data of RNA-seq obtained from seven organs of the sequenced S. leprosula individual as described above. First, we merged all the mapping data stored in different BAM files into a single BAM file using SAMtools. Then, we prepared the intron hint file in GFF format using the, “bam2hints” script of AUGUSTUS. The exon hint file was also generated from the merged BAM file using the two AUGUSTUS scripts, “bam2wig” and “wig2hints.pl”. To conduct evidence-based gene prediction with AUGUSTUS, the three hint files (repeat sequences, intron and exon) described above were merged into a single file in GFF format.BUSCO analysisGenome annotation completeness were assessed with BUSCO v3.1.044 using the Embryophyta odb9 dataset composed of 1440 universal Embryophyta single-copy genes. We referred to these 1440 genes as core genes in the main text.Comparison with the proteome of Theobroma cacao
T. cacao’s gene models18 were downloaded from Phytozome 11 (https://phytozome.jgi.doe.gov/pz/portal.html). Then, comparison was conducted with BLASTP96 using the T. cacao proteomes as the BLAST database (E-value cutoff: 1.0E-10). Only the best hit was stored for each gene. We considered these best hits of the T. cacao genes as orthologs of the S. leprosula genes. When the T. cacao orthologs were identified by the BLASTP search, the orthologs of A. thaliana were defined based on the T. cacao-A. thaliana orthologous information provided by Phytozome 11 (Supplementary Table 4). When the T. cacao orthologs were not identified, the orthologs of A. thaliana were searched by BLASTP (E-value cutoff: 1.0E-10) using the A. thaliana proteomes obtained from TAIR 10 (https://www.arabidopsis.org) as the BLAST database.Synteny analysisBased on the result of the above BLASTP searches, we assessed synteny between the S. leprosula scaffolds and the T. cacao chromosomes using MCScanX97. Genome information of T. cacao in GFF format was also obtained from Phytozome 11 as described above, which was used as an input file for MCScanX.Assessment of the genome assemblyPopulation data and other dipterocarp speciesTo assess whether the genome assembly could be used as a reference for the S. leprosula individuals from various populations, we checked mapping ratio, SNP positions, and admixture using the distribution-wide S. leprosula samples. Similarly, to assess whether the S. leprosula assembly could be used as a reference for aligning data from closely related species and determining their mapping ratios. For interspecific analysis, the following three Dipterocarpoideae species: S. platycarpa, D. aromatica, and N. heimii were used (Supplementary Table 7).Sample collection and DNA extractionLeaf samples of 19 S. leprosula individuals from different populations and three other dipterocarp species (S. platycarpa, D. aromatica, and N. heimii) were used as described in Supplementary Tables 6 and 7. Genomic DNA was extracted using the same method as described above.Library preparation and sequencingPaired-end genomic libraries (200 bp) were prepared using a TruSeq DNA Library Preparation kit (Illumina). DNA libraries were then sequenced (~16× coverage each) using Illumina HiSeq2000.Mapping and SNP callingAdapters and low-quality bases from resequencing reads were removed using Trimmomatic. All trimmed reads were then mapped and aligned to the S. leprosula assembly using BWA. Variants were called using GATK v3.598. Duplicated reads were marked using Picard 2.6.0. Within GATK, HaplotypeCaller was used to identify variants for each sample by generating an intermediate genomic variant call format (gVCF). Subsequently, gVCF files were merged using GenotypeGVCFs to produce a raw VCF file containing SNPs and INDELs. Low-quality variants were removed from the raw VCF file by applying the hard filters implemented in GATK. Variants with genotype quality (GQ) More