An array of initiatives is underway to compile reference-grade genome assemblies of life on Earth. Such assemblies can shed light on many aspects of biodiversity. As Hogg says, a reference genome helps scientists determine whether a sequence is a gene, see what it encodes and assess whether there is diversity at that gene. Conservation biologists might decide to move a population to improve gene flow. When one population clears a disease more quickly than another, “we can move animals with the specific genetic variant that helps deal with disease.” Unfortunately, most characteristics are polygenic, she says, but “in conservation we aim to maintain and promote as much genetic diversity as we can.” Reference genomes, she says, provide a “blueprint of life” and help researchers understand how species interact with their often rapidly changing environment.
A consortium has assembled the kākāpō reference genome, and Urban has been part of the team compiling one for the takahē. The takahē project involves the Takahē Recovery team, the DOC, a team at Rockefeller University and Māori members. A high-quality takahē genome can inform all the downstream conservation efforts for this species, says Urban. It was challenging to get the right kind of samples in adequate quality, she says, “but it was totally worth it because it told us a lot about the actual genomic architecture of the takahē.”
Takahē genomic information has been a crucial help in developing a computational method to assemble haplotype-resolved genomes when no parental data are available, which could prove helpful in many areas of biology. The quality of this phasing, says Urban, is comparable to that of phasing that draws on the parents’ genomes. The method combines two types of genomic information: HiFi reads from Pacific Biosciences instruments and Hi-C chromatin interaction data. Pacific Biosciences introduced circular consensus sequencing a few years ago, which builds consensus reads, or HiFi reads, from multiple passes over a DNA molecule.
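The idea of building a consensus read from multiple passes can be illustrated with a toy sketch. It assumes the passes have already been aligned to the same length and takes a simple per-position majority vote; real circular consensus software aligns the passes and uses a statistical error model, so the function name and data below are purely illustrative.

```python
from collections import Counter


def consensus_from_passes(passes):
    """Toy consensus: per-column majority vote over pre-aligned passes.

    Assumption: every pass is already aligned and has the same length.
    """
    if not passes:
        raise ValueError("need at least one pass")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("this toy assumes equal-length, pre-aligned passes")

    consensus = []
    for column in zip(*passes):
        # Pick the base seen most often at this position across passes.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Five hypothetical passes over the same molecule, three of them with one error.
example_passes = [
    "ACGTTGCA",
    "ACGTAGCA",  # error at index 4
    "ACCTTGCA",  # error at index 2
    "ACGTTGCA",
    "ACGTTGGA",  # error at index 6
]
print(consensus_from_passes(example_passes))
```

Because the errors fall at different positions in different passes, the vote at each position recovers the underlying base, which is why more passes tend to give a more accurate read.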
In developing this method, Heng Li at the Dana-Farber Cancer Institute, colleagues at the University of Otago in New Zealand including Lara Urban and Neil Gemmell, and several teams from other US institutions such as Rockefeller University’s Vertebrate Genome Project and the Center for Species Survival at the National Zoo, used data from the takahē and other animals, such as the critically endangered black rhinoceros.
When handling diploid and polyploid genomes, many long-read assembly tools collapse differing homologous haplotypes into a ‘consensus assembly’. Some tools avoid erasing heterozygous differences and phase genomic regions with low levels of heterozygosity, and then build contiguous sequence by stitching these blocks together. The final assembly tends to include those phased blocks as an ‘alternate assembly’.
With a method called trio-binning, which uses data from individuals and their parents, scientists can obtain a haplotype-resolved assembly with two sets of contiguous sequence: two haploid genomes. Other methods draw on additional data, such as chromatin interaction data from Hi-C or Strand-Seq, which applies single-cell sequencing and resolves homologs within a cell. In Strand-Seq, only the DNA template strand used during DNA replication is sequenced.
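As a rough illustration of the trio-binning idea, the sketch below sorts an offspring's reads by the short subsequences (k-mers) they share with only one parent. It is a minimal toy under stated assumptions: the parental sequences, the reads, the function names and the tiny k-mer size are all hypothetical, whereas real pipelines derive parent-specific k-mers from parental sequencing reads and use much longer k-mers.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bin_reads(reads, maternal_seqs, paternal_seqs, k=5):
    """Assign each read to the parent whose private k-mers it carries most."""
    maternal = set().union(*(kmers(s, k) for s in maternal_seqs))
    paternal = set().union(*(kmers(s, k) for s in paternal_seqs))
    maternal_only = maternal - paternal  # k-mers seen only in the mother
    paternal_only = paternal - maternal  # k-mers seen only in the father

    bins = {"maternal": [], "paternal": [], "unassigned": []}
    for read in reads:
        read_kmers = kmers(read, k)
        m = len(read_kmers & maternal_only)
        p = len(read_kmers & paternal_only)
        if m > p:
            bins["maternal"].append(read)
        elif p > m:
            bins["paternal"].append(read)
        else:
            # No parent-specific signal (or a tie): leave the read unassigned.
            bins["unassigned"].append(read)
    return bins


# Hypothetical parental sequences differing at a single base.
mother = ["ACGTACGTTAGC"]
father = ["ACGTACCTTAGC"]
offspring_reads = ["GTACGTTA", "GTACCTTA", "ACGTA"]
print(bin_reads(offspring_reads, mother, father))
```

Each bin can then be assembled separately, which is what yields two haploid genomes. Reads from regions where the parents are identical carry no distinguishing k-mers and stay unassigned in this toy.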
Li and colleagues developed the hifiasm algorithm (ref. 5) to address complications they saw in this area, such as lengthy computational pipelines. Hifiasm applies string overlap graphs, which represent different paths along the assembled genomes. In a hifiasm graph, each node is a contiguous sequence put together from ‘phased’ HiFi reads. Li and colleagues have extended hifiasm to combine HiFi reads and Hi-C data (ref. 6). First, hifiasm produces a phased assembly graph onto which Hi-C reads are mapped. The graph is made up of ‘unitigs’, contiguous sequence from heterozygous and from homozygous regions. Read coverage can be used to distinguish the two. Hifiasm further processes unitigs to build a haplotype-resolved assembly of a diploid organism.
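Two of the ideas in this description can be sketched in a few lines. The sketch is not hifiasm's actual algorithm; it is a simplified toy with hypothetical unitig names, coverage values, contact counts and cutoff. It assumes that a unitig from a heterozygous region is covered by reads from only one haplotype, so it shows roughly half the coverage of a homozygous unitig, and that Hi-C contacts are more frequent between unitigs on the same haplotype.

```python
def classify_unitigs(coverage, diploid_cov, cutoff=0.75):
    """Label unitigs by coverage relative to the homozygous (diploid) level.

    Assumption: heterozygous unitigs sit near half of diploid_cov, so a
    cutoff between 0.5 and 1.0 of diploid_cov separates the two classes.
    """
    return {
        name: "heterozygous" if cov < cutoff * diploid_cov else "homozygous"
        for name, cov in coverage.items()
    }


def phase_bubbles(bubbles, contacts):
    """Greedily split allelic unitig pairs into two haplotypes.

    bubbles:  list of (u, v) pairs, the two alleles of one region.
    contacts: dict mapping frozenset({x, y}) to a Hi-C contact count.
    """
    def links(unitig, group):
        return sum(contacts.get(frozenset((unitig, other)), 0) for other in group)

    hap1, hap2 = [], []
    for u, v in bubbles:
        keep = links(u, hap1) + links(v, hap2)
        swap = links(u, hap2) + links(v, hap1)
        if swap > keep:
            # More Hi-C support for the opposite orientation.
            u, v = v, u
        hap1.append(u)
        hap2.append(v)
    return hap1, hap2


coverage = {"utg1": 31, "utg2a": 16, "utg2b": 14, "utg3": 29, "utg4a": 15, "utg4b": 17}
print(classify_unitigs(coverage, diploid_cov=30))

contacts = {
    frozenset(("utg2a", "utg4b")): 12,
    frozenset(("utg2b", "utg4a")): 10,
    frozenset(("utg2a", "utg4a")): 1,
}
print(phase_bubbles([("utg2a", "utg2b"), ("utg4a", "utg4b")], contacts))
```

The greedy pass processes regions in order and never revisits a decision, which keeps the example short. A production assembler would need a more robust optimization to cope with noisy or sparse contact data.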
The method avoids the traditional consensus assembly approach for a diploid sample, in which half of the sequences are randomly discarded and sequences from the two parents are mixed, which is clearly not ideal, says Li. With people, parental data can be hard to obtain and ethical approval is needed. Meanwhile, with samples obtained from animals in the wild, as in biodiversity studies, scientists usually have little or no way to locate parents. Methods exist for haplotype-resolved assembly without parent data, but they have only been tested on human samples, he says. “Making a haplotype-resolved assembler robust to various species is a lot more challenging,” says Li. An algorithm designed for species of low sequence diversity, such as humans, may not work well for species of high diversity, such as insects. “Then there are species with mixed sequence diversity, which demands an algorithm can smoothly work with all these cases without users’ intervention,” he says. This motivated the team to extend hifiasm.
There are around 440 individual South Island takahē (Porphyrio hochstetteri) left. High-quality assemblies of the species’ genome—parents and offspring—were used to benchmark a new computational tool.
Credit: I. Warren
The takahē data from parents and chicks helped the researchers build a haplotype-resolved assembly that was a benchmark for their computational tool. “It is critical to have trio data as the ground truth,” says Li. Instead of using human ‘trios’, the team wanted to develop a robust algorithm that works for various diploid samples. Says Li, “Lara’s data is invaluable.”
The approach is applicable to many species, he says, but users should remember that the genomes of different species can vary dramatically in size, sequence diversity and repetitive sequence sections. “Although we have tried hard to make hifiasm work for various species, we may have overlooked cases or properties special to certain genomes,” he says. He recommends that researchers also evaluate their assemblies carefully based on what they know about the organisms they study. Users can raise a GitHub issue or contact him and colleagues if they can’t resolve something on their own. “We are still learning how to build better assemblies,” he says, and assembly algorithms keep evolving as data quality improves.
Whenua Hou, an island off New Zealand’s South Island, is a refuge for kākāpō, a critically endangered bird species.
Credit: L. Urban
Source: Ecology - nature.com