Abstract
Berchemiella wilsonii (Schneid.) Nakai is one of the National Key Protected Wild Plants in China and one of the 120 Plant Species with Extremely Small Populations in China. By combining PacBio HiFi, DNBSEQ and Hi-C sequencing, we assembled a chromosome-scale, haplotype-resolved genome for B. wilsonii. The final haplotype A and haplotype B assemblies were 216.84 Mb and 217.69 Mb, respectively, anchored to 2n = 24 chromosomes, with contig N50 lengths of 18.24 Mb and 18.03 Mb. All chromosome ends contain the characteristic telomeric motif (TTTAGGG), and only haplotype A retains a single gap, so both assemblies are close to telomere-to-telomere (T2T) quality. BUSCO analysis indicated very high assembly completeness (98.9% complete BUSCO genes). The genome contains a total of 22,828 coding genes, of which 21,992 (96.34%) were functionally annotated. This is the first report of the genome of B. wilsonii, and this high-quality genome will provide new insights into the evolutionary history and taxonomic classification challenges of endangered plant species.
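For readers unfamiliar with the summary statistics quoted in the abstract, the following minimal Python sketch illustrates how a contig N50 is conventionally defined and checks the reported annotation percentage. It is an illustration only and not part of the authors' workflow: the function names `n50` and `percent` and the toy contig lengths are hypothetical, and only the gene counts (21,992 of 22,828) come from the abstract.

```python
def n50(lengths):
    """Return the N50: the length of the contig at which the running total,
    taken from longest to shortest, first reaches half of the summed length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Hypothetical toy contig lengths, NOT taken from the B. wilsonii assembly.
toy_contigs = [6, 5, 4, 3, 2]
print(n50(toy_contigs))

# Gene counts as reported in the abstract.
print(percent(21_992, 22_828))
```

Under this definition, an N50 of about 18 Mb means that at least half of the assembled bases lie in contigs of that length or longer.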
Data availability
The raw PacBio HiFi, DNBSEQ and Hi-C sequencing data generated in this study have been deposited at the NCBI Sequence Read Archive (SRA) under BioProject number PRJNA1229120. The PacBio HiFi, DNBSEQ and Hi-C sequencing data are publicly accessible under accession numbers SRR35779783 [67], SRR35779785 [68] and SRR35779784 [69], respectively. The raw data were also deposited at the National Genomics Data Center (NGDC, https://ngdc.cncb.ac.cn/) with accession number CRA024114 [70] under BioProject accession number PRJCA037665. RNA-seq data for root, stem and leaf tissues have been deposited at the NCBI SRA under accessions SRR32783994 [71], SRR32767770 [72] and SRR32783993 [73], respectively. The final genome assembly has been deposited in NCBI GenBank with accession numbers JBMIOF000000000 [74] and JBMIOG000000000 [75] for haplotype A and haplotype B, respectively. The annotation files of the genome are available at the Figshare database [76].
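As a convenience for locating the deposited records, the sketch below builds resolver links for the run and assembly accessions, using the accessions and the identifiers.org URL patterns as they appear in the reference list of this article. The dictionary layout and the helper name `resolver_url` are hypothetical, and the sketch only formats URLs; it does not download any data.

```python
# URL prefixes as used in the reference list of this article.
SRA_PREFIX = "https://identifiers.org/ncbi/insdc.sra:"
GENBANK_PREFIX = "https://identifiers.org/ncbi/insdc:"

# Accessions as listed in the reference list; the labels are illustrative.
ACCESSIONS = {
    "PacBio HiFi": "SRR35779783",
    "DNBSEQ": "SRR35779785",
    "Hi-C": "SRR35779784",
    "RNA-seq root": "SRR32783994",
    "RNA-seq stem": "SRR32767770",
    "RNA-seq leaf": "SRR32783993",
    "Haplotype A assembly": "JBMIOF000000000",
    "Haplotype B assembly": "JBMIOG000000000",
}


def resolver_url(accession):
    """Return the identifiers.org link for an SRA run or GenBank accession."""
    prefix = SRA_PREFIX if accession.startswith("SRR") else GENBANK_PREFIX
    return prefix + accession


for label, accession in ACCESSIONS.items():
    print(f"{label}: {resolver_url(accession)}")
```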
Code availability
We did not use any custom code in this study. The versions and parameters of the bioinformatic tools used are described in the Methods section, which also states any parameter that was set to a value other than its default.
References
Fu, L. K. & Jin, J. M. China plant red data book: rare and endangered plants (Science Press, 1992).
Li, J. et al. Rediscovery of Berchemiella wilsonii (Schneid.) Nakai (Rhamnaceae) an endangered species from Hubei, China. Acta Phytotaxon. Sin. 42, 86–88 (2004).
Pang, J. H. et al. Population structure and dynamic characteristics of endangered plant species (Berchemiella wilsonii) and its variety Berchemiella wilsonii var. pubipetiolata. Guihaia 45, 108–120 (2025).
Qian, H. A study on the genus Berchemiella Nakai (Rhamnaceae) endemic to eastern Asia. Bulletin of Botanical Research 8, 119–128 (1988).
Kang, M., Jiang, M. X. & Huang, H. W. Genetic diversity in fragmented populations of Berchemiella wilsonii var. pubipetiolata (Rhamnaceae). Ann. Bot. 95, 1145–1151 (2005).
Kang, M., Xu, F. H., Lowe, A. & Huang, H. W. Protecting evolutionary significant units for the remnant populations of Berchemiella wilsonii var. pubipetiolata (Rhamnaceae). Conserv. Genet. 8, 465–473 (2007).
Iwatsuki, K., Boufford, D. E. & Ohba, H. Flora of Japan, Vol. IIc (Kodansha Ltd., 1999).
Kang, M., Wang, J. & Huang, H. W. Demographic bottlenecks and low gene flow in remnant populations of the critically endangered Berchemiella wilsonii var. pubipetiolata (Rhamnaceae) inferred from microsatellite markers. Conserv. Genet. 9, 191–199 (2008).
Chang, C. S., Kim, H. & Park, T. Y. Patterns of allozyme diversity in several selected rare species in Korea and implications for conservation. Biodivers. Conserv. 12, 529–544 (2003).
Dang, H. S., Zhang, Y. J., Jiang, M. X., Huang, H. D. & Jin, X. A preliminary study on dormancy and germination physiology of endangered species Berchemiella wilsonii (Schneid.) Nakai var. pubipetiolata H. Qian seeds. Plant Science Journal 23, 327–331 (2005).
Yang, Y. Z. et al. Genomic effects of population collapse in a critically endangered ironwood tree Ostrya rehderiana. Nat. Commun. 9, 5449 (2018).
Zhao, Y. P. et al. Resequencing 545 ginkgo genomes across the world reveals the evolutionary history of the living fossil. Nat. Commun. 10, 4201 (2019).
Li, H. D., Ding, J. L. & He, X. Berchemiella wilsonii: a new plant record from Zhejiang discovered in Shengzhou. Journal of Zhejiang A&F University 29, 639–640 (2012).
Karbstein, K. et al. Species delimitation 4.0: integrative taxonomy meets artificial intelligence. Trends Ecol. Evol. 39, 771–784 (2024).
Hu, Y. B. et al. Genomic evidence for two phylogenetic species and long-term population bottlenecks in red pandas. Sci. Adv. 6, eaax5751 (2020).
Song, B. et al. Plant genome resequencing and population genomics: Current status and future prospects. Mol. Plant. 16, 1252–1268 (2023).
Li, H. & Durbin, R. Genome assembly in the telomere-to-telomere era. Nat. Rev. Genet. 25, 658–670 (2024).
Li, K. et al. Haplotype-resolved T2T reference genomes for wild and domesticated accessions shed new insights into the domestication of jujube. Hortic. Res. 11, uhae071 (2024).
Vilanova, S. et al. SILEX: a fast and inexpensive high-quality DNA extraction method suitable for multiple sequencing platforms and recalcitrant plant species. Plant Methods 16, 110 (2020).
Chen, S. F., Zhou, Y. Q., Chen, Y. R. & Gu, J. Fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770 (2011).
Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat. Commun. 11, 1432 (2020).
Cheng, H. Y., Concepcion, G. T., Feng, X. W., Zhang, H. W. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Zhang, X. T., Zhang, S. C., Zhao, Q., Ming, R. & Tang, H. B. Assembly of allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data. Nat. Plants 5, 833–845 (2019).
Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst. 3, 99–101 (2016).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Jain, C. et al. Weighted minimizer sampling improves long read mapping. Bioinformatics 36, i111–i118 (2020).
Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome Biol. 5, 1–9 (2004).
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
Fu, L. M., Niu, B. F., Zhu, Z. W., Wu, S. T. & Li, W. Z. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl. Acad. Sci. USA 117, 9451–9457 (2020).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, W265–W268 (2007).
Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9, 18 (2008).
Ou, S. & Jiang, N. LTR_retriever: A Highly Accurate and Sensitive Program for Identification of Long Terminal Repeat Retrotransposons. Plant Physiol. 176, 1410–1422 (2018).
Abrusán, G., Grundmann, N., DeMester, L. & Makalowski, W. TEclass-a tool for automated classification of unknown eukaryotic transposable elements. Bioinformatics 25, 1329–1330 (2009).
Tarailo-Graovac, M. & Chen, N. S. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics 4, Unit 4.10 (2009).
Beier, S., Thiel, T., Münch, T., Scholz, U. & Mascher, M. MISA-web: a web server for microsatellite prediction. Bioinformatics 33, 2583–2585 (2017).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019).
Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20, 278 (2019).
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).
Slater, G. S. C. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31 (2005).
Stanke, M., Diekhans, M., Baertsch, R. & Haussler, D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644 (2008).
Delcher, A. L. et al. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 23, 673–679 (2007).
Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491 (2011).
Dainat, J. Another Gtf/Gff Analysis Toolkit (AGAT): Resolve interoperability issues and accomplish more with your annotations. https://zenodo.org/records/13799920 (2024).
Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO Update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol. Biol. Evol. 38, 4647–4654 (2021).
Nevers, Y. et al. Quality assessment of gene repertoire annotations with OMArk. Nat. Biotechnol. 43, 124–133 (2025).
Sommer, M. J., Zimin, A. V. & Salzberg, S. L. PSAURON: a tool for assessing protein annotation across a broad range of species. NAR Genomics Bioinforma. 7, lqae189 (2025).
The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45, D158–D169 (2017).
Deng, Y. Y. et al. Integrated nr database in protein annotation system and its localization. Computer Engineering 32, 71–72 (2006).
Tatusov, R. L. et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4, 41 (2003).
Blum, M. et al. The InterPro protein families and domains database: 20 years on. Nucleic Acids Res. 49, D344–D354 (2020).
Aramaki, T. et al. KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics 36, 2251–2252 (2019).
Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P. & Huerta-Cepas, J. EggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol. Biol. Evol. 38, 5825–5829 (2021).
Chan, P. P., Lin, B. Y., Mak, A. J. & Lowe, T. M. tRNAscan-SE 2.0: improved detection and functional classification of transfer RNA genes. Nucleic Acids Res. 49, 9077–9096 (2021).
Lagesen, K. et al. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 35, 3100–3108 (2007).
Kalvari, I. et al. Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res. 49, D192–D200 (2020).
Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 20, 238 (2019).
Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).
Capella-Gutiérrez, S., Silla-Martínez, J. M. & Gabaldón, T. TrimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973 (2009).
Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313 (2014).
Yang, Z. H. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591 (2007).
Han, M. V., Thomas, G. W. C., Lugo-Martinez, J. & Hahn, M. W. Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3. Mol. Biol. Evol. 30, 1987–1997 (2013).
Wang, Y. P. et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40, e49, https://doi.org/10.1093/nar/gkr1293 (2012).
Frith, M. C., Hamada, M. & Horton, P. Parameters for accurate genome alignment. BMC Bioinformatics 11, 80 (2010).
Tang, H. B. et al. Synteny and collinearity in plant genomes. Science 320, 486–488 (2008).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR35779783 (2025).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR35779785 (2025).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR35779784 (2025).
NGDC Genome Sequence Archive https://ngdc.cncb.ac.cn/gsa/browse/CRA024114 (2025).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR32783994 (2025).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR32767770 (2025).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR32783993 (2025).
Pang, J. H. GenBank, https://identifiers.org/ncbi/insdc:JBMIOF000000000 (2025).
Pang, J. H. GenBank, https://identifiers.org/ncbi/insdc:JBMIOG000000000 (2025).
Pang, J. H. T2T genome assembly and annotation of Berchemiella wilsonii. figshare https://doi.org/10.6084/m9.figshare.28869281 (2025).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://doi.org/10.48550/arXiv.1303.3997 (2013).
Acknowledgements
This work was financially supported by the National Natural Science Foundation of China (U2571202; 32371653) and the National Key R&D Program of China (2024YFF1307400).
Author information
Contributions
X.W. and M.J. conceived and supervised the study. J.P., Z.X., Y.W., Y.C., S.S., H.W. and X.W. performed the experiments and analysed the data. J.P. and X.W. wrote the draft of the manuscript. All authors reviewed and contributed to the final version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Pang, J., Xiao, Z., Wang, Y. et al. Telomere-to-Telomere genome assembly of an endangered tree Berchemiella wilsonii (Rhamnaceae). Sci Data (2025). https://doi.org/10.1038/s41597-025-06433-3
