More stories

  • in

    Rapid evolution of an adaptive taste polymorphism disrupts courtship behavior

    Cockroach strains

    All cockroaches were maintained on rodent diet (Purina 5001, PMI Nutrition International, St. Louis, MO) and distilled water at 27 °C, ~40% RH, and a 12:12 h L:D cycle. The WT colony (Orlando Normal) was collected in Florida in 1947 and has served as a standard insecticide-susceptible strain. The GA colony (T-164) was collected in 1989, also in Florida, and shown to be aversive to glucose; continued artificial selection with glucose-containing toxic bait fixed the homozygous GA trait in this population (approximately 150 generations as of 2020).

    Generating recombinant lines and life history data

    To homogenize the genetic backgrounds of the WT and GA strains, two recombinant colonies were initiated in 2013 by crossing 10 pairs of WT♂ × GA♀ and 10 pairs of GA♂ × WT♀ (Fig. 3a). At the F8 generation (free bulk mating without selection), 400 cockroaches were tested in two-choice feeding assays (see below) that assessed their initial response to tastants, as described in previous studies11,26. The cockroaches were separated into glucose-accepting and glucose-rejecting groups by the rapid Acceptance-Rejection assay (described in Feeding Bioassays). These colonies were bred for three more generations, and 200 cockroaches from each group were assayed in the F11 generation and backcrossed to obtain homozygous glucose-accepting (aa) and glucose-averse (AA) lines. Similar results were obtained in both directions of the cross, confirming previous findings of no sex linkage of the GA trait27. These two lines were defined as WT_aa (homozygotes, glucose-accepting) and GA_AA (homozygotes, glucose-averse). To obtain heterozygous GA cockroaches, GA_Aa, a single intercross group was generated from crosses of 10 pairs of WT_aa♂ × GA_AA♀ and 10 pairs of GA_AA♂ × WT_aa♀.

    The GA trait follows Mendelian inheritance.
Therefore, we used backcrosses, guided by two-choice feeding assays and feeding responses in Acceptance-Rejection assays, to determine the homozygosity of WT and GA cockroaches. The cross of WT♂ × WT♀ produced homozygous F1 cockroaches showing maximal glucose-acceptance. The cross of GA♂ × GA♀ produced homozygous F1 cockroaches showing maximal glucose-aversion. The cross of WT × GA produced F1 heterozygotes with intermediate glucose-aversion. When the F1 heterozygotes were backcrossed with WT cockroaches, they produced F2 cockroaches with a 1:1 ratio of WT and GA phenotypes.

    The two-choice feeding assay assessed whether cockroaches accepted or rejected glucose (binary: yes-no). Insects were held for 24 h without water, or starved without food and water. Either 10 adults or 2 day-old first instar siblings (30–40) were placed in a Petri dish (either 90 mm or 60 mm diameter × 15 mm height). Each Petri dish contained two agar discs: one disc contained 1% agar and 1 mmol l−1 red food dye (Allura Red AC), and the second disc contained 1% agar, 0.5 mmol l−1 blue food dye (Erioglaucine disodium salt) and either 1000 mmol l−1 or 3000 mmol l−1 glucose. The assay duration was 2 h during the dark phase of the insects’ L:D cycle. After each assay, the color of the abdomen of each cockroach was visually inspected under a microscope to infer the genotype.

    We assessed whether the recombinant colonies had different traits from the parental WT and GA lines. We paired single newly eclosed females (day 0) with single 10–12 days-old males of the same line in a Petri dish (90 mm diameter, 15 mm height) with fresh distilled water in a 1.5 ml microcentrifuge tube and a pellet of rodent food, and monitored when they mated. When females formed egg cases, each gravid female was placed individually in a container (95 × 95 × 80 mm) with food and water until the eggs hatched. After removing the female, her offspring were monitored until adult emergence.
We recorded the time to egg hatch, first appearance of each nymphal stage, first appearance of adults and the end of adult emergence. The first instar nymphs and adults in each cohort were counted to obtain measures of survivorship. Although there were significant differences in some of these parameters across all four strains, we found no significant differences between the two recombinant lines, except mating success, which was significantly lower in GA_AA♀ than WT_aa♀ (Supplementary Table 11).

    Mating bioassays

    All mating sequences were recorded using an infra-red-sensitive camera (Polestar II EQ610, Everfocus Electronics, New Taipei City, Taiwan) coupled to a data acquisition board and analyzed by searchable and frame-by-frame capable software (NV3000, AverMedia Information) at 27 °C, ~40% RH and a 12:12 h L:D cycle. For behavioral analysis, tested pairs were classified into two groups: mated (successful courtship) and not-mated (failed courtship). Four distinct behavioral events (Fig. 1c, Contact, Wing raising, Nuptial feeding, and Copulation) were analyzed using seven behavioral parameters as shown in Supplementary Table 2.

    We extracted behavioral data from successful courtship sequences, defined as courtship that led to Copulation. For failed courtship sequences, we extracted the behavioral data from the first courtship of both mated and not-mated groups, because most pairs in both groups failed to copulate in their first encounter, and there were no significant differences in behavioral parameters between the two groups.

    To assay female choice, we conducted two-choice mating assays (Fig. 1a). A single focal WT♀ or GA♀ and two males, one WT and one GA, were placed in a Petri dish (90 mm diameter, 15 mm height) with fresh distilled water in a 1.5 ml microcentrifuge tube and a pellet of rodent food (n = 25 WT♀ and 27 GA♀). To assay male choice, a single focal WT♂ or GA♂ was given a choice of two females, one WT♀ and one GA♀ (n = 27 WT♂ and 18 GA♂).
Experiments were started using 0 day-old sexually unreceptive females and 10–12 days-old sexually mature males. Newly emerged (0 day-old) females were used to avoid the disruption of introducing a sexually mature female into the bioassay. B. germanica females become sexually receptive at 5–7 days of age, so the mating behavior of the focal insect was video-recorded for several days until they mated. Fertility of mated females was evaluated by the number of offspring produced. We assessed the gustatory phenotype of nymphs (either WT-type or GA-type) to determine which of the two adult cockroaches mated with the focal insect. Each gravid female was maintained individually in a container (95 × 95 × 80 mm) with food and water until the eggs hatched. Two day-old first instar nymphs were starved for one day without water and food, and then they were tested in Two-choice feeding assays using 1000 mmol l−1 glucose-containing agar with 0.5 mmol l−1 blue food dye vs. plain sugar-free agar with 1 mmol l−1 red food dye. If all the nymphs chose the glucose-containing agar, their parents were considered WT♂ and WT♀. When all the nymphs showed glucose-aversion, they were raised to the adult stage. Newly emerged adults were backcrossed with WT cockroaches, and their offspring were tested in the Two-choice assay. When the parents were both GA, 100% of the offspring exhibited glucose-aversion. When the parents were WT and GA, the offspring showed a 1:1 ratio of glucose-accepting and glucose-aversive behavior. Mate choice, mating success ratio and the number of offspring were analyzed statistically.

    We conducted no-choice mating assays using the WT and GA strains (Fig. 1b, d). A female and a male were placed in a Petri dish with fresh water and a piece of rodent food and video-recorded for 24 h. The females were 5–7 days-old and males were 10–12 days-old.
Four treatment pairs were tested: WT♂ × WT♀ (n = 20, 18 and 14 pairs for 5, 6 and 7 day-old females, respectively); GA♂ × GA♀ (n = 23, 22 and 35 pairs); GA♂ × WT♀ (n = 21, 14 and 17 pairs); and WT♂ × GA♀ (n = 33, 19 and 15 pairs).

    To confirm that gustatory stimuli guide nuptial feeding, we artificially augmented the male nuptial secretion and assessed whether the duration of nuptial feeding and mating success of GA♀ were affected (Fig. 2c). Before starting the mating assay with 5 day-old GA♀, 10–12 days-old WT♂ were separated into three groups: a control group did not receive any augmentation; a water control group received distilled water with 1 mmol l−1 blue dye (+Blue); and a fructose group received 3000 mmol l−1 fructose solution with blue dye (+Blue+Fru). Approximately 50 nl of the test solution was placed into the tergal gland reservoirs using a glass microcapillary. No-choice mating assays were carried out for 24 h. n = 20–25 pairs for each treatment.

    To evaluate the association between short nuptial feeding (Fig. 1c) and the GA trait, we conducted no-choice mating assays using females from the recombinant lines (Fig. 3c). Before starting each mating assay with 4 day-old females from the WT, GA and recombinant lines (WT_aa, GA_AA and GA_Aa), the EC50 for glucose was obtained by the instantaneous Acceptance-Rejection assay using 0, 10, 30, 100, 300, 1000 and 3000 mmol l−1 glucose (WT♀ and WT_aa♀, non-starved; GA♀, GA_AA♀ and GA_Aa♀, 1-day starved). After the Acceptance-Rejection assay, GA_Aa♀ were separated into two groups according to their sensitivity for rejecting glucose; the GA_Aa_high sensitivity group rejected glucose at 100 and 300 mmol l−1, whereas the GA_Aa_low sensitivity group rejected glucose at 1000 and 3000 mmol l−1. We paired these females with 10–12 days-old WT♂ (n = 15 WT_aa♀, n = 20 GA_AA♀, n = 20 GA_Aa_high♀ and n = 17 GA_Aa_low♀).

    Feeding bioassay

    We conducted two feeding assays: the Acceptance-Rejection assay and the Consumption assay.
The Acceptance-Rejection assay assessed the instantaneous initial responses (binary: yes-no) of cockroaches to tastants, as previously described7,22,27. Briefly, acceptance means that the cockroach started drinking. Rejection means that the cockroach never initiated drinking. The percentage of positive responders was defined as the Number of insects accepting tastants/Total number of insects tested. The effective concentration (EC50) for each tastant was obtained from dose-response curves using this assay. The Consumption assay was previously described27. Briefly, we quantified the amount of test solution females ingested after they started drinking. Females were observed until they stopped drinking, and we considered this a single feeding bout.

    We used the Acceptance-Rejection assay and Consumption assay, respectively, to assess the sensitivity of 5 day-old WT♀ and GA♀ for accepting and consuming the WT♂ nuptial secretion (Fig. 2a, b). The secretion was diluted with HPLC-grade water to 0.001, 0.01, 0.03, 0.1, 0.3 and 1 male-equivalents/µl (n = 20 non-starved females each). The amount of nuptial secretion consumed was tested at 0.1 male-equivalents/µl in the Consumption assay (n = 10 each).

    The Acceptance-Rejection assay was used to calculate the effective concentration (EC50) of glucose for females in the WT, GA and recombinant lines (Fig. 3a, b). A glucose concentration series of 0.1, 1, 10, 100 and 1000 mmol l−1 was tested with one-day starved 4-day old females (n = 65 GA_Aa♀, n = 50 GA_AA♀ and n = 50 GA♀) and non-starved females (n = 50 WT_aa♀ and n = 16 WT♀).

    The effects of female saliva on feeding responses of 5 day-old WT♀ and GA♀ were tested using the Acceptance-Rejection assay (Fig. 4a). Freshly collected saliva of WT♀ and GA♀ was immediately used in experiments. Assays were prepared as follows: 3 µl of 200 mmol l−1 maltose or maltotriose were mixed with 3 µl of either HPLC-grade water or saliva of WT♀ or GA♀.
The final concentration of each sugar was 100 mmol l−1 in a total volume of 6 µl. This concentration represented approximately the acceptance EC70 for WT♀ and GA♀27. Nuptial secretion (1 µl representing 10 male-equivalents) was mixed with 1 µl of either HPLC-grade water or saliva from WT♀ or GA♀, and 8 µl of HPLC-grade water was added to the mix. The final concentration of the nuptial secretion was 1 male-equivalent/µl in a total volume of 10 µl. This concentration also represented approximately the acceptance EC70 for WT♀ and GA♀ (Fig. 2a). The mix of saliva and either sugar or nuptial secretion was incubated for 300 s at 25 °C. Additionally, we tested the effect of only saliva in the Acceptance-Rejection assay. Either 1-day starved or non-starved females were tested with water only and then a 1:1 mixture of saliva and water. Saliva alone did not affect acceptance or rejection of stimuli. n = 20–33 females from each strain.

    To evaluate whether salivary enzymes are involved in the hydrolysis of oligosaccharides, the contribution of salivary glucosidases was tested using the glucosidase inhibitor acarbose in the Acceptance-Rejection assay (Fig. 4b), as previously described27. We first confirmed that the range of 0–125 mmol l−1 acarbose in HPLC-grade water did not disrupt the acceptance and rejection of tastants. Test solutions were prepared as follows: 2 µl of either HPLC-grade water or saliva of GA♀ was mixed with 1 µl of either 250 µmol l−1 of acarbose or HPLC-grade water, then the mixture was added to 1 µl of 400 mmol l−1 of either maltose or maltotriose solution. The total volume was 4 µl, with the final concentration of sugar being 100 mmol l−1. For assays with nuptial secretion, 1 µl of either HPLC-grade water or saliva from 5 day-old GA♀ was mixed with 0.5 µl of either 250 µmol l−1 of acarbose or HPLC-grade water. This mixture was added to 0.5 µl of 10 male-equivalents of nuptial secretion (i.e., 20 male-equivalents/µl).
HPLC-grade water was added for a total volume of 10 µl and a final concentration of 1 male-equivalent/µl. The mix of saliva and either sugars or nuptial secretion was incubated for 5 min at 25 °C. All test solutions contained blue food dye. Test subjects were 5 day-old GA♀ and 20–25 females were tested in each assay.

    Nuptial secretion and saliva collections

    The nuptial secretion of WT♂ was collected by the following method: Five 10–12 days-old males were placed in a container (95 × 95 × 80 mm) with 5 day-old GA♀. After the males displayed wing-raising courtship behavior toward the females, individual males were immediately decapitated and the nuptial secretion in their tergal gland reservoirs was drawn into a calibrated borosilicate glass capillary (76 × 1.5 mm) under the microscope. The nuptial secretions from 30 males were pooled in a capillary and stored at −20 °C until use. Saliva from 5 day-old WT♀ and GA♀ was collected by the following method: individual females were briefly anesthetized with carbon dioxide under the microscope and the side of the thorax was gently squeezed. A droplet of saliva that accumulated on the mouthparts was then collected into a microcapillary (10 µl, Kimble Glass). Fresh saliva was immediately used in experiments.

    GC-MS procedures for analysis of sugars

    Standards of D-(+)-glucose (Sigma-Aldrich), D-(+)-maltose (Fisher Scientific) and maltotriose (Sigma-Aldrich) were diluted in HPLC-grade water (Fisher Scientific) at 10, 50, 100, 500 and 1000 ng/µl to generate calibration curves.
Samples were vortexed for 20 s and a 10 μl aliquot of each sample was transferred to a Pyrex reaction vial containing a 10 μl solution of 5 ng/μl sorbitol (≥98%) in HPLC-grade water as internal standard and dried under a gentle flow of N2 for 20 min.

    Samples containing degradation products from nuptial secretions were prepared by adding 15 μl of HPLC-grade water to each sample in a 1.5 ml Eppendorf tube, vortexed for 30 s and centrifuged at 8000 rpm (5223 RCF) for 5 min to separate lipids from the water layer. The water phase was transferred to a reaction vial using a glass capillary. This procedure was repeated with the remaining lipid layer and the water layers were combined in the same reaction vial containing 10 μl of a solution of 5 ng/μl sorbitol and dried under N2 for 20 min.

    For derivatization of sugars and samples, each reaction vial received 12 μl of anhydrous pyridine under a constant N2 flow, then vortexed and incubated at 90 °C for 5 min. Three μl of N-methyl-N-(trimethylsilyl)trifluoroacetamide (MSTFA; Sigma-Aldrich) was added to each reaction vial and centrifuged at 1000 rpm (118 RCF) for 2 min. Vials were incubated in a heat block at 90 °C for 1.5 h and vortexed every 10 min for the first 30 min of incubation.

    The total volume of sample was ~10 μl, and 1 μl was injected into the GC-MS (6890 GC coupled to a 5975 MS, Agilent Technologies, Palo Alto, CA). The inlet was operated in splitless mode (17.5 psi) at 290 °C. The GC was equipped with a DB-5 column (30 m, 0.25 mm, 0.25 μm, Agilent), and helium was used as the carrier gas at an average velocity of 50 cm/s. The oven temperature program started at 80 °C for 1 min, increased at 10 °C/min to 180 °C, then increased at 5 °C/min to 300 °C, and held for 10 min. The transfer line was set at 250 °C for 24 min, ramped at 5 °C/min to 300 °C and held until the end of program. The ion source operated at 70 eV and 230 °C, while the MS quadrupole was maintained at 200 °C.
The MSD was operated in scan mode, starting after 9 min (solvent delay time) with a mass range of 33–650 AMU.

    For GC-MS data analysis, the sorbitol peak area was obtained from the extracted ion chromatograms with m/z = 205, the sorbitol base peak. The peak areas of glucose, maltose and maltotriose were obtained from the extracted ion chromatograms using m/z = 204, the base peak of the three sugars. The most abundant peaks of each sugar were selected for quantification36, and these peaks did not coelute with other peaks. Then, the peak areas of the three sugars were divided by the area of the respective sorbitol peak in each sample to normalize the data and to correct for technical variability during sample processing. This procedure was performed to obtain the calibration curves and quantification of sugars in our experiments.

    The results of sugar analysis using GC-MS are reported in Supplementary Figs. 1–4.

    Analysis of nuptial secretions

    We focused the GC-MS analysis on glucose, maltose and maltotriose in WT♂ nuptial secretion (Fig. 4c). To quantify the time-course of saliva-catalyzed hydrolysis of WT♂ nuptial secretion to glucose, 1 µl of GA♀ saliva was mixed with 1 µl of 10 male-equivalents/µl. We incubated the mixtures for 0, 5, 10 and 300 s at 25 °C, and added 4 µl of methanol to stop the enzyme activity (n = 5 each treatment). Each sample contained the nuptial secretions of 5 males to obtain a detectable amount of sugars. For the statistical analysis, the amounts of sugars were divided by 5 to obtain the amount of sugars in 1 male (1 male-equivalent). These amounts were also used for generating Fig. 4c and Supplementary Table 9. In calculations of the concentration of the three sugars (mmol l−1), the mass and volume of the nuptial secretion were measured using 70–130 male-equivalents of undiluted secretion of each strain (n = 3).
The mass and volume of the nuptial secretion/male, including both lipid and aqueous layers, were approximately 30–50 µg and 40–50 nl. Because it was difficult to separate the lipid layer from the water layer at this small scale, we roughly estimated that the tergal reservoirs of the four cockroach lines had 30 nl of aqueous layer that contained sugars.

    To quantify the time-course of saliva-catalyzed hydrolysis of maltose and maltotriose to glucose, 1 µl of GA♀ saliva was mixed with 1 µl of 200 mmol l−1 of either maltose or maltotriose (Fig. 4d, e). Incubation time points were 0, 5, 10 and 300 s at 25 °C and methanol was used to stop the enzyme activity. Controls without saliva were also prepared using HPLC-grade water instead of saliva and 300 s incubations. n = 5 for each treatment.

    Photomicroscopy

    The photographs of the tergal glands and mouthparts (Fig. 5) were obtained using an Olympus Digital camera attached to an Olympus CX41 microscope (Olympus America, Center Valley, PA).

    Statistics and reproducibility

    The sample size and number of replicates for each experiment are noted in the respective section describing the experimental details. In summary, the sample sizes were: Mating bioassays, n = 18–80; Feeding assays, n = 16–65; Sugar analysis, n = 5; Life history parameters, n > 14. All statistical analyses were conducted in R Statistical Software (v4.1.0; R Core Team 2021) and JMP Pro 15.2 software (SAS Institute Inc., Cary, NC). For bioassay data and sugar analysis data, we calculated the means and standard errors, and we used the Chi-square test with Holm’s method for post hoc comparisons, t-test, and ANOVA followed by Tukey’s HSD test (all α = 0.05), as noted in each section describing the experimental details, results, and in Supplementary Tables 1–11.

    Reporting summary

    Further information on research design is available in the Nature Research Reporting Summary linked to this article.
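
The final concentrations quoted for the saliva and acarbose test solutions in the feeding bioassays follow from simple dilution arithmetic (C1 × V1 = C2 × V2). The snippet below is a minimal sketch for checking those figures and is not part of the published analysis; the function name `final_concentration` is hypothetical, and the inputs are the stock concentrations and volumes stated in the protocol text.

```python
def final_concentration(stock_conc, stock_vol_ul, total_vol_ul):
    """Return the concentration after diluting a stock into a total volume.

    Applies C1 * V1 = C2 * V2. The result is in the same unit as stock_conc
    (e.g. mmol/l or male-equivalents/ul); both volumes are in microlitres.
    """
    if total_vol_ul <= 0:
        raise ValueError("total volume must be positive")
    if not 0 <= stock_vol_ul <= total_vol_ul:
        raise ValueError("stock volume must lie between 0 and the total volume")
    return stock_conc * stock_vol_ul / total_vol_ul


# Saliva assay: 3 ul of 200 mmol/l sugar plus 3 ul of water or saliva (6 ul total).
print(final_concentration(200, 3, 6))     # mmol/l

# Acarbose assay: 1 ul of 400 mmol/l sugar in a 4 ul mixture.
print(final_concentration(400, 1, 4))     # mmol/l

# Nuptial secretion: 1 ul at 10 male-equivalents/ul brought to 10 ul.
print(final_concentration(10, 1, 10))     # male-equivalents/ul

# Acarbose assay with secretion: 0.5 ul at 20 male-equivalents/ul brought to 10 ul.
print(final_concentration(20, 0.5, 10))   # male-equivalents/ul
```

Each call reproduces a value stated in the text: 100 mmol l−1 for both sugar mixtures and 1 male-equivalent/µl for both nuptial secretion mixtures.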

  • in

    We can have biodiversity and eat too


  • in

    Retinas revived after donor's death open door to new science

    Listen to the latest from the world of science, with Shamini Bundell and Benjamin Thompson.

    In this episode:

    00:57 Reviving retinas to understand eyes

    Research efforts to learn more about diseases of the human eye have been hampered as these organs degrade rapidly after death, and animal eyes are quite different to those from humans. To address this, a team have developed a new method to revive retinas taken from donors shortly after their death. They hope this will provide tissue for new studies looking into the workings of the human eye and nervous system.

    Research article: Abbas et al.

    08:05 Research Highlights

    A technique that simplifies chocolate making yields fragrant flavours, and 3D imaging reveals some of the largest-known Native American cave art.

    Research Highlight: How to make a fruitier, more floral chocolate

    Research Highlight: Cramped chamber hides some of North America’s biggest cave art

    10:54 Did life emerge in an ‘RNA world’?

    How did the earliest biochemical process evolve from Earth’s primordial soup? One popular theory is that life began in an ‘RNA world’ from which proteins and DNA evolved. However, this week a new paper suggests that a world composed of RNA alone is unlikely, and that life is more likely to have begun with molecules that were part RNA and part protein.

    Research article: Müller et al.

    News and Views: A possible path towards encoded protein synthesis on ancient Earth

    17:52 Briefing Chat

    We discuss some highlights from the Nature Briefing. This time, the ‘polarised sunglasses’ that helped astronomers identify an ultra-bright pulsar, and how a chemical in sunscreen becomes toxic to coral.

    Nature: A ‘galaxy’ is unmasked as a pulsar — the brightest outside the Milky Way

    Nature: A common sunscreen ingredient turns toxic in the sea — anemones suggest why

    Subscribe to Nature Briefing, an unmissable daily round-up of science news, opinion and analysis free in your inbox every weekday.

    Never miss an episode: Subscribe to the Nature Podcast on Apple Podcasts, Google Podcasts, Spotify or your favourite podcast app. Head here for the Nature Podcast RSS feed.

  • in

    Alpha and beta phylogenetic diversities jointly reveal ant community assembly mechanisms along a tropical elevational gradient

    Google Scholar 
    Almeida, R. P. S. et al. Induced drought strongly affects richness and composition of ground-dwelling ants in the eastern Amazon. BioRxiv (2020).Le Breton, J., Chazeau, J. & Jourdan, H. Immediate impacts of invasion by Wasmannia auropunctata (Hymenoptera: Formicidae) on native litter ant fauna in a New Caledonian rainforest. Austral Ecol. 28, 204–209 (2003).Article 

    Google Scholar 
    Vonshak, M., Dayan, T., Ionescu-Hirsh, A., Freidberg, A. & Hefetz, A. The little fire ant Wasmannia auropunctata: A new invasive species in the Middle East and its impact on the local arthropod fauna. Biol. Invasions 12, 1825–1837 (2010).Article 

    Google Scholar 
    Wheeler, W. M. Ants: Their Structure, Development and Behavior (Columbia University Press, 1910).
    Google Scholar 
    Cavender-Bares, J., Ackerly, D. D., Baum, D. A. & Bazzaz, F. A. Phylogenetic overdispersion in Floridian oak communities. Am. Nat. 163, 823–843 (2004).CAS 
    PubMed 
    Article 

    Google Scholar 
    Parr, C. L., Sinclair, B. J., Andersen, A. N., Gaston, K. J. & Chown, S. L. Constraint and competition in assemblages: A cross-continental and modeling approach for ants. Am. Nat. 165, 481–494 (2005).PubMed 
    Article 

    Google Scholar 
    Retana, J. & Cerdá, X. Patterns of diversity and composition of Mediterranean ground ant communities tracking spatial and temporal variability in the thermal environment. Oecologia 123, 436–444 (2000).ADS 
    CAS 
    PubMed 
    Article 

    Google Scholar 
    Hawkins, B. A. et al. Energy, water, and broad-scale geographic patterns of species richness. Ecology 84, 3105–3117 (2003).Article 

    Google Scholar 
    Graham, C. H., Parra, J. L., Rahbek, C. & McGuire, J. A. Phylogenetic structure in tropical hummingbird communities. Proc. Natl. Acad. Sci. 106, 19673–19678 (2009).ADS 
    CAS 
    PubMed 
    PubMed Central 
    Article 

    Google Scholar 
    Camacho, G. P., Loss, A. C., Fisher, B. L. & Blaimer, B. B. Spatial phylogenomics of acrobat ants in Madagascar—Mountains function as cradles for recent diversity and endemism. J. Biogeogr. 1, 1706–1719. https://doi.org/10.1111/jbi.14107 (2021).Article 

    Google Scholar 
    Lobo, J. M. & Halffter, G. Biogeographical and ecological factors affecting the altitudinal variation of mountainous communities of coprophagous beetles (Coleoptera: Scarabaeoidea): A comparative study. Ann. Entomol. Soc. Am. 93, 115–126 (2000).Article 

    Google Scholar 
    Halffter, G., Favila, M. & Arellano, L. Spatial distribution of three groups of Coleoptera along an altitudinal transect in the Mexican Transition Zone and its biogeographical implications. Elytron 9, 1–10 (1995).
    Google Scholar 
    Blaimer, B. B. et al. Phylogenomic methods outperform traditional multi-locus approaches in resolving deep evolutionary history: a case study of formicine ants. BMC Evol. Biol. 15, 1–14 (2015).Article 

    Google Scholar 
    Longino, J. T., Branstetter, M. G. & Colwell, R. K. How ants drop out: ant abundance on tropical mountains. PLoS ONE 9, e104030 (2014).ADS 
    PubMed 
    PubMed Central 
    Article 

    Google Scholar 
    Longino, J. T. & Branstetter, M. G. The truncated bell: An enigmatic but pervasive elevational diversity pattern in Middle American ants. Ecography 42, 272–283 (2019).Article 

    Google Scholar 
    Branstetter, M. G. Origin and diversification of the cryptic ant genus Stenamma Westwood (Hymenoptera: Formicidae), inferred from multilocus molecular data, biogeography and natural history. Syst. Entomol. 37, 478–496 (2012).Article 

    Google Scholar 
    Prebus, M. Insights into the evolution, biogeography and natural history of the acorn ants, genus Temnothorax Mayr (hymenoptera: Formicidae). BMC Evol. Biol. 17, 1–22 (2017).Article 

    Google Scholar 
    Kluge, J. & Kessler, M. Phylogenetic diversity, trait diversity and niches: Species assembly of ferns along a tropical elevational gradient. J. Biogeogr. 38, 394–405 (2011).Article 

    Google Scholar 
    Janzen, D. H. Why mountain passes are higher in the tropics. Am. Nat. 101, 233–249 (1967).Article 

    Google Scholar 
    Fernandes, G. W. et al. Cerrado to rupestrian grasslands: Patterns of species distribution and the forces shaping them along an altitudinal gradient. in Ecology and Conservation of Mountaintop Grasslands in Brazil 345–378 (2016). https://doi.org/10.1007/978-3-319-29808-5_15.Leibold, M. A. et al. The metacommunity concept: A framework for multi-scale community ecology. Ecol. Lett. 7, 601–613 (2004).Article 

    Google Scholar 
    Perrigo, A., Hoorn, C. & Antonelli, A. Why mountains matter for biodiversity. J. Biogeogr. 47, 315–325 (2020).Article 

    Google Scholar 
    Myers, N., Mittermeier, R. A., Mittermeier, C. G., Da Fonseca, G. A. B. & Kent, J. Biodiversity hotspots for conservation priorities. Nature 403, 853–858 (2000).ADS 
    CAS 
    PubMed 
    Article 

    Google Scholar 
    Colwell, R. K., Brehm, G., Cardelus, C. L., Gilman, A. C. & Longino, J. T. Global warming, elevational range shifts and lowland biotic attrition in the wet tropics. Science 322, 258–261 (2008).ADS 
    CAS 
    PubMed 
    Article 

    Google Scholar 
    Moreau, C. S. & Bell, C. D. Testing the museum versus cradle tropical biological diversity hypothesis: Phylogeny, diversification, and ancestral biogeographic range evolution of the ants. Evolution 67, 2240–2257 (2013).PubMed 
    Article 

    Google Scholar 
    Borowiec, M. L. Generic revision of the ant subfamily Dorylinae (Hymenoptera, Formicidae). Zookeys 1, 280 (2016).
    Google Scholar 
    Lapolla, J. S., Brady, S. G. & Shattuck, S. O. Phylogeny and taxonomy of the Prenolepis genus-group of ants (Hymenoptera: Formicidae). Syst. Entomol. 35, 118–131 (2010).Article 

    Google Scholar 
    Schmidt, C. A. & Shattuck, S. O. The higher classification of the ant subfamily Ponerinae (Hymenoptera: Formicidae), with a review of ponerine ecology and behavior. Zootaxa 3817, 1–242 (2014).CAS 
    PubMed 
    Article 

    Google Scholar 
    Revell, L. J. phytools: An R package for phylogenetic comparative biology (and other things). Methods Ecol. Evol. 3, 217–223 (2012).Article 

    Google Scholar 
    Arnan, X., Arcoverde, G. B., Pie, M. R., Ribeiro-Neto, J. D. & Leal, I. R. Increased anthropogenic disturbance and aridity reduce phylogenetic and functional diversity of ant communities in Caatinga dry forest. Sci. Total Environ. 631, 429–438 (2018).ADS 
    PubMed 
    Article 

    Google Scholar 
    Divieso, R., Silva, T. S. R. & Pie, M. R. Morphological evolution in the ant reproductive caste. BioRxiv https://doi.org/10.1101/2020.07.18.210302 (2020).Article 

    Google Scholar 
    Paradis, E. et al. Package ‘ape’. Anal. Phylogenet. Evol. 2, 1–10 (2019).
    Google Scholar 
    Faith, D. P. Conservation evaluation and phylogenetic diversity. Biol. Conserv. 61, 1–10 (1992).Article 

    Google Scholar 
    Tucker, C. M. et al. A guide to phylogenetic metrics for conservation, community ecology and macroecology. Biol. Rev. 92, 698–715 (2017).PubMed 
    Article 

    Google Scholar 
    Webb, C. O. Exploring the phylogenetic structure of ecological communities: An example for rain forest trees. Am. Nat. 156, 145–155 (2000).PubMed 
    Article 

    Google Scholar 
    Tucker, C. M. et al. Assessing the utility of conserving evolutionary history. Biol. Rev. 94, 1740–1760 (2019).PubMed 
    Article 

    Google Scholar 
    Kembel, S. W. et al. Picante: R tools for integrating phylogenies and ecology. Bioinformatics 26, 1463–1464 (2010).CAS 
    PubMed 
    Article 

    Google Scholar 
    R Core Team. A language and environment for statistical computing. R Found. Stat. Comput. 2, https://www.R-project.org (2021).Baselga, A. & Orme, C. D. L. Betapart: An R package for the study of beta diversity. Methods Ecol. Evol. 3, 808–812 (2012).Article 

    Google Scholar 
    Dobrovolski, R., Melo, A. S., Cassemiro, F. A. S. & Diniz-Filho, J. A. F. Climatic history and dispersal ability explain the relative importance of turnover and nestedness components of beta diversity. Glob. Ecol. Biogeogr. 21, 191–197 (2012).Article 

    Google Scholar 
    Peixoto, F. P. et al. Geographical patterns of phylogenetic beta-diversity components in terrestrial mammals. Glob. Ecol. Biogeogr. 26, 573–583 (2017).Article 

    Google Scholar 
    Körner, C. The use of ‘altitude’ in ecological research. Trends Ecol. Evol. 22, 569–574 (2007).PubMed 
    Article 

    Google Scholar 
    Sundqvist, M. K., Sanders, N. J. & Wardle, D. A. Community and ecosystem responses to elevational gradients: Processes, mechanisms, and insights for global change. Annu. Rev. Ecol. Evol. Syst. 44, 261–280 (2013).Article 

    Google Scholar 
    Cuervo-Robayo, A. P. et al. An update of high-resolution monthly climate surfaces for Mexico. Int. J. Climatol. 34, 2427–2437 (2014).Article 

    Google Scholar 
    Hijmans, R. J., Phillips, S., Leathwick, J., Elith, J. & Hijmans, M. R. J. Package ‘dismo’. Circles 9, 1–68 (2017).
    Google Scholar 
    Schwarz, G. Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978).MathSciNet 
    MATH 
    Article 

    Google Scholar 
    Guthery, F. S., Burnham, K. P. & Anderson, D. R. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach Vol. 67 (Springer, 2003).
    Google Scholar 
    Mazerolle, M. J. Improving data analysis in herpetology: Using Akaike’s information criterion (AIC) to assess the strength of biological hypotheses. Amphib. Reptil. 27, 169–180 (2006).Article 

    Google Scholar 
    Ferrier, S., Manion, G., Elith, J. & Richardson, K. Using generalized dissimilarity modelling to analyse and predict patterns of beta diversity in regional biodiversity assessment. Divers. Distrib. 13, 252–264 (2007).Article 

    Google Scholar 
    Fitzpatrick, M. C. et al. Environmental and historical imprints on beta diversity: Insights from variation in rates of species turnover along gradients. Proc. R. Soc. B Biol. Sci. 280, 20131201 (2013).Article 

    Google Scholar 
    Manion, G. et al. gdm: Generalized dissimilarity modeling. R Packag. version (2018).Wickham, H. Ggplot2. Wiley Interdiscip. Rev. Comput. Stat. 3, 180–185 (2011).Article 

    Google Scholar  More

  • in

    Conservation genomics in practice

    An array of initiatives are underway to compile reference-grade genome assemblies of life on Earth. Such assemblies can shed light on many aspects of biodiversity. As Hogg says, a reference genome helps scientists determine if a sequence is a gene, to see what it encodes and assess if there is diversity at that gene. Conservation biologists might decide to move a population to improve gene flow. When one population clears a disease quicker than another, “we can move animals with the specific genetic variant that helps deal with disease.” Unfortunately, most characteristics are polygenic, she says, but “in conservation we aim to maintain and promote as much genetic diversity as we can.” Reference genomes, she says, provide a “blueprint of life” and help researchers understand how species interact with their often rapidly changing environment.

    A consortium has assembled the kākāpō reference genome, and Urban has been part of the team compiling one for the takahē. It involves the Takahē Recovery team, the DOC, a team at Rockefeller University and Māori members. A high-quality takahē genome can inform all the downstream conservation efforts for this species, says Urban. It was challenging to get the right kind of samples in adequate quality, she says, “but it was totally worth it because it told us a lot about the actual genomic architecture of the takahē.”

    Takahē genomic information has been a crucial help in developing a computational method to assemble haplotype-resolved genomes when no parental data are available, which could prove helpful in many areas of biology. The quality of this phasing, says Urban, is comparable to that of one that involved parents’ genomes. The method combines two types of genomic information: HiFi reads from Pacific Biosciences instruments and Hi-C chromatin interaction data. 
Pacific Biosciences introduced circular consensus sequencing a few years ago, which builds consensus reads, or HiFi reads, from multiple passes over a DNA molecule.

    The computational genome assembly method hifiasm has been extended. HiFi reads and Hi-C data are combined into a graph assembly that ultimately leads to haplotype-resolved assembly of diploid genomes for which parental data are lacking.
    Credit: Adapted with permission from ref. 5.

    In developing this method, Heng Li at the Dana-Farber Cancer Institute, colleagues at University of Otago in New Zealand including Lara Urban and Neil Gemmell, and several teams from other US institutions such as Rockefeller University’s Vertebrate Genome Project and the Center for Species Survival at the National Zoo, used data from the takahē and other animals, such as the critically endangered black rhinoceros.

    When handling diploid and polyploid genomes, many long-read assembly tools collapse differing homologous haplotypes into a ‘consensus assembly’. Some tools avoid erasing heterozygous differences and phase genomic regions with low levels of heterozygosity, and then build contiguous sequence by stitching these blocks together. The final assembly tends to include those phased blocks as an ‘alternate assembly’.

    With a method called trio-binning, which uses data from individuals and their parents, scientists can obtain a haplotype-resolved assembly with two sets of contiguous sequence: two haploid genomes. Other methods draw on additional data, such as chromatin interaction data from Hi-C or Strand-Seq, which applies single-cell sequencing and resolves homologs within a cell. In Strand-Seq, only the DNA template strand used during DNA replication is sequenced.

    Li and colleagues developed the hifiasm algorithm5 to address complications they saw in this area, such as lengthy computational pipelines. Hifiasm applies string overlap graphs, which represent different paths along the assembled genomes. 
In a hifiasm graph, each node is a contiguous sequence put together from ‘phased’ HiFi reads. Li and colleagues have extended hifiasm to combine HiFi reads and Hi-C data6. First, hifiasm produces a phased assembly graph onto which Hi-C reads are mapped. The graph is made up of ‘unitigs’, contiguous sequence from heterozygous and from homozygous regions. Read coverage can be used to distinguish the two. Hifiasm further processes unitigs to build a haplotype-resolved assembly of a diploid organism.

    The method avoids the traditional consensus assembly approach for a diploid sample, in which half of the sequences are randomly discarded and sequences from the two parents are mixed, which is clearly not ideal, says Li. With people, parental data can be hard to obtain and ethical approval is needed. Meanwhile, with samples obtained from animals in the wild, as in biodiversity studies, scientists usually have little or no way to locate parents. Methods exist for haplotype-resolved assembly without parent data, but they have only been tested on human samples, he says. “Making a haplotype-resolved assembler robust to various species is a lot more challenging,” says Li. An algorithm designed for species of low sequence diversity, such as humans, may not work well for species of high diversity, such as insects. “Then there are species with mixed sequence diversity, which demands an algorithm can smoothly work with all these cases without users’ intervention,” he says. This motivated the team to extend hifiasm.

    There are around 440 individual South Island takahē (Porphyrio hochstetteri) left. High-quality assemblies of the species’ genome, from parents and offspring, were used to benchmark a new computational tool.
    Credit: I. Warren

    The takahē data from parents and chicks helped the researchers build a haplotype-resolved assembly that was a benchmark for their computational tool. “It is critical to have trio data as the ground truth,” says Li. Instead of using human ‘trios’, they wanted to develop a robust algorithm that works for various diploid samples. Says Li, “Lara’s data is invaluable.”

    The approach is applicable to many species, he says, but users should remember that the genomes of different species can vary dramatically in size, sequence diversity and repetitive sequence sections. “Although we have tried hard to make hifiasm work for various species, we may have overlooked cases or properties special to certain genomes,” he says. He recommends that researchers also evaluate their assemblies carefully based on what they know about the organisms they study. Users can raise a GitHub issue or contact him and colleagues if they can’t resolve something on their own. “We are still learning how to build better assemblies,” he says, and assembly algorithms keep evolving as data quality improves.

    Whenua Hou, an island off New Zealand’s South Island, is a refuge for kākāpō, a critically endangered bird species.
    Credit: L. Urban

  • in

    Urban blue–green space landscape ecological health assessment based on the integration of pattern, process, function and sustainability

    Study area

    Harbin is located in the centre of Northeast Asia, between 44°04′–46°40′ N and 125°42′–130°10′ E24,26. The site has a mid-temperate continental monsoon climate, with an average annual temperature of 3.6 °C and an average annual precipitation of 569.1 mm. Precipitation falls mainly from June to September, which accounts for about 60% of the annual total, and snow falls mainly from November to January24,25. The overall topography is high in the east and low in the west, with mountains and hills predominating in the east and plains predominating in the west27. In this study, we identified the central district of Harbin, where urban construction activities are frequent and the population is dense, as the study area. According to the “Harbin City Urban Master Plan (2011–2020)” (revised draft in 2017), the specific scope includes the administrative areas of Daoli, Daowai, Nangang, Xiangfang, Pingfang, and Songbei Districts and parts of Hulan and Acheng Districts, with a total area of 4187 km² (Fig. 2). The blue–green space in this study included woodland, grassland, cultivated land, wetland, and water that permeate inside and outside the construction sites. They all have integrated functions such as ecology, supply, beautification, culture, and disaster prevention and avoidance, and have a decisive influence on the urban ecological environment.

    Figure 2 Schematic of the study area. The figure was created using ArcGIS ver. 10.2 (https://www.esri.com/).

    Data sources

    The data used in this research were as follows. Land-cover data (30 m × 30 m) for two periods (2011 and 2020) were provided by the China Geographic National Conditions Data Cloud Platform (http://www.dsac.cn/). Meteorological datasets (1 km × 1 km), including air temperature, precipitation, and surface runoff, were obtained from the Resource and Environmental Science Data Center of the Chinese Academy of Sciences (http://www.resdc.cn/). ASTER GDEM elevation data (30 m × 30 m) came from the Geospatial Data Cloud (http://www.gscloud.cn), from which the slope was extracted. Soil data (1 km × 1 km) were from the Harmonized World Soil Database (HWSD) China Soil Data Set (v1.1). The normalized difference vegetation index (NDVI) and modified normalized difference water index (MNDWI) data (30 m × 30 m) came from the National Comprehensive Earth Observation Data Sharing Platform (http://www.chinageoss.org/). ET datasets (30 m × 30 m) were drawn from NASA-USGS (https://lpdaac.usgs.gov/). Social and economic data were mainly obtained from the Harbin statistical yearbook and the Harbin social and economic bulletin.

    Framework of urban blue–green space LEH assessment

    Urban blue–green space is a politically defined human–land coupled region composed of ecological, economic, and social systems, which is greatly disturbed by human activities11. The essence of urban blue–green space landscape ecological health (LEH) is that the landscape ecological function sustainably meets human needs28,29. The landscape ecological function reflects the value orientation of human beings toward blue–green space, and to a large extent affects the blue–green landscape ecological pattern and process. The interaction between the blue–green landscape ecological pattern and process drives the overall dynamics of blue–green space. 
Meanwhile, this interaction presents certain landscape ecological function characteristics, which provide ecological support for various human activities30,31,32. The pattern and process of blue–green space both profoundly influence and are influenced by human activities33,34. Because this influence is long-term, the standard of LEH should not be fixed on health at a single moment, but should fully consider the sustainability of the health state.

    In summary, the landscape ecological pattern, process, function, and sustainability are not separate, but form a mutually integrated, organic whole. In this study, we constructed an integrated assessment framework of blue–green space LEH that included four units: pattern, process, function, and sustainability (Fig. 3). In the assessment framework, the LEH of urban blue–green space involves two dimensions. The first is the health status of the urban blue–green space itself, emphasizing the maintenance of ecological conditions, thereby potentially satisfying a series of diversity goals. The second is that urban blue–green space, as a part of social and economic development, should be able to sustainably meet human (subject) needs and goals.

    Figure 3 Key units and interactions of urban blue–green space LEH.

    Landscape ecological pattern

    The landscape ecological pattern of urban blue–green space is a spatial mosaic of landscape elements at different levels or at the same level. Under interference from human activities31, the landscape ecological pattern shows a trend toward more complex landscape structure, more diverse landscape types, and greater landscape fragmentation. The assessment of the urban landscape ecological pattern should comprehensively reflect this trend1. Landscape pattern indexes, which reflect the structural composition and spatial configuration of the landscape, are the most frequently applied tools4,35. 
This study took landscape ecology as the entry point and selected landscape pattern indexes that can quantitatively reflect changes in landscape structural composition and spatial configuration under disturbance. Accordingly, the landscape disturbance index (U), landscape connectivity index (CON), and landscape restorability index (LRI) were used as the indexes for assessing landscape ecological pattern health.

    (1)

    Landscape disturbance index (U)

    There are two kinds of relationships between the landscape ecological pattern and the external disturbance: compatibility and conflict. As the landscape ecological pattern has accommodating characteristics, disturbance beyond the accommodating capacity will degrade the landscape ecological pattern36,37. The landscape disturbance index (U) can characterize the degree of fragmentation, dispersion, and morphological change in the landscape pattern38. It is a comprehensive index that reflects the health of the landscape pattern by quantifying the ability of ecosystems to accommodate external disturbances. It consists of the landscape fragmentation index, the inverse of the fractional dimension, and the dominance index. They measure the response of the landscape pattern to external disturbance from the perspective of different landscape types, the same landscape type, and landscape diversity, respectively36,38, and their weights were determined by the entropy weight method. The formula is as follows:
    $$ U = aN_{Fi} + bD_{Fi} + cD_{Oi} $$
    (1)
    where \(N_{Fi}\) is the landscape fragmentation index, \(D_{Fi}\) is the inverse of the fractional dimension, \(D_{Oi}\) is the dominance index, and a, b, and c are the corresponding weights, which were 0.20, 0.5, and 0.3 in this study, respectively.
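    As an illustration of Eq. (1), the following minimal Python sketch evaluates the weighted sum and shows one common formulation of the entropy weight method. The function names `entropy_weights` and `disturbance_index` are hypothetical, the sample values are made up, and the sketch assumes that the indicator values are non-negative and already normalized to comparable ranges; the study's actual preprocessing is not described here.

```python
import math

def entropy_weights(rows):
    """One common form of the entropy weight method (illustrative only).

    rows: list of samples, each a list of non-negative indicator values.
    Needs at least two samples, a non-zero total per indicator, and at
    least one indicator that varies. Returns weights that sum to 1.
    """
    n = len(rows)
    m = len(rows[0])
    divergences = []
    for j in range(m):
        column = [row[j] for row in rows]
        total = sum(column)
        # Share of each sample in indicator j; zero shares add no entropy.
        shares = [x / total for x in column]
        entropy = -sum(p * math.log(p) for p in shares if p > 0) / math.log(n)
        # Indicators that vary more across samples carry more information.
        divergences.append(1.0 - entropy)
    norm = sum(divergences)
    return [d / norm for d in divergences]

def disturbance_index(n_f, d_f, d_o, weights=(0.2, 0.5, 0.3)):
    """U = a*N_F + b*D_F + c*D_O, defaulting to the weights reported in the text."""
    a, b, c = weights
    return a * n_f + b * d_f + c * d_o

# Hypothetical normalized values for a single landscape type.
u = disturbance_index(0.4, 0.6, 0.5)
```

    With the reported weights, the fractional-dimension term contributes half of U, so changes in patch shape dominate the index in this sketch.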

    (2)

    Landscape connectivity index (CON)

    The most direct result of landscape ecological pattern degradation caused by external disturbance is that the flow of energy, material, and information among ecological patches is reduced or even blocked, and ultimately the stability of the landscape pattern decreases. Connectivity characterizes the ability of the landscape ecological pattern to mitigate risk transmission, which is significant for the dynamic stability of the landscape ecological pattern39,40. The landscape connectivity index (CON) measures the connectivity between ecosystem components through the aggregation or dispersion trend of patches41. The better the connectivity, the stronger the stability of the landscape ecological pattern. The formula is as follows:
    $$ CON = \frac{100\sum\limits_{s = 1}^{q} \sum\limits_{h \ne l}^{p} C_{shl}}{\sum\limits_{s = 1}^{q} \left[ q_{s} (q_{s} - 1)/2 \right]} $$
    (2)
    where \(q_{s}\) is the number of patches of patch type s, and \(C_{shl}\) is the link between patch h and patch l of type s within the delimited distance.
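    A minimal Python sketch of Eq. (2) follows. It assumes, for illustration only, that each patch is represented by a centroid, that two patches of the same type are linked when their centroids lie within a threshold distance, and that each unordered pair is counted once. The function name `connectivity_index`, the patch coordinates, and the threshold are hypothetical; the distance rule used in the study is not specified here.

```python
from itertools import combinations
from math import dist

def connectivity_index(patches_by_type, threshold):
    """CON = 100 * (linked same-type pairs) / (all possible same-type pairs).

    patches_by_type: dict mapping a patch type to a list of (x, y) centroids.
    threshold: maximum centroid distance at which two patches count as linked
               (a simplifying assumption for this sketch).
    """
    joined = 0
    possible = 0
    for centroids in patches_by_type.values():
        q_s = len(centroids)
        possible += q_s * (q_s - 1) // 2
        # Each unordered pair (h, l) of the same type is examined once.
        joined += sum(
            1 for h, l in combinations(centroids, 2) if dist(h, l) <= threshold
        )
    if possible == 0:
        return 0.0  # no pairs exist; returning 0 is a choice made for this sketch
    return 100.0 * joined / possible

# Hypothetical centroids in metres.
patches = {
    "woodland": [(0, 0), (0, 100), (0, 1000)],
    "water": [(0, 0), (500, 0)],
}
con = connectivity_index(patches, threshold=200)
```

    In this example one of the four possible same-type pairs is linked, so the index is 25.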

    (3)

    Landscape restorability index (LRI)

    The ability to recover its original structure when subjected to disturbance is an important criterion for the landscape ecological pattern42. Research has confirmed that the restorability of the landscape ecological pattern is closely related to its structure, function, diversity, and uniformity of distribution. The landscape restorability index (LRI) combines this landscape information and can indicate the restorability of the landscape ecological pattern in response to disturbance43. The index consists of the patch density, the Shannon diversity index, and the landscape evenness index. The patch density is the number of patches per square kilometer. The Shannon diversity index reflects the change in the proportion of landscape types. The landscape evenness index shows how evenly patches are distributed in terms of area. The larger the LRI, the more complex and evenly distributed the structure is, and the stronger the ability of the landscape pattern to recover from disturbance. The formula is as follows:
    $$ LRI = PD \times SHDI \times SHEI $$
    (3)
    where PD is the patch density, SHDI is the Shannon diversity index, and SHEI is the landscape evenness index.

    Landscape ecological process

    The landscape ecological process of urban blue–green space is extremely complex because it involves multiple factors such as natural ecology, economy, and culture. Landscape ecological process assessment measures the self-organizing capacity and the efficiency of ecological processes within and among patches44. A blue–green space with a healthy landscape ecological process should be able to adapt to conventional land use under human management and maintain physiological integrity while maintaining the balance of ecological components. Specifically, the landscape ecological process should quickly restore its balance after severe disturbances, with strong organization, suitability, and recoverability, and low sensitivity45,46. A single model can hardly capture the landscape ecological process at the urban scale, and the combined application of multidisciplinary methods is an effective way to address this problem. We therefore selected ecological indexes and models from four aspects (organization, suitability, recoverability, and sensitivity) to assess the landscape ecological process of urban blue–green space.

    (1)

    Organization index (O)

    The organization of the landscape ecological process is the ability to maintain stable and orderly material cycling and energy flow within and between landscapes47. The normalized difference vegetation index (NDVI) and the modified normalized difference water index (MNDWI) can reflect the efficiency and order of ecological processes such as accumulation of organic matter, fixation of solar energy, nutrient cycling, regeneration, and metabolism13. The indexes are the external expression of the internal dynamics and organizational capabilities of the ecological process. In recent years, they have been widely used in assessments related to landscape ecological processes. The formulas are as follows:
    $$ NDVI = \frac{NIR - R}{NIR + R} $$
    $$ MNDWI = \frac{p(green) - p(MIR)}{p(green) + p(MIR)} $$
    (4)
    where \(NDVI\) is the normalized difference vegetation index, \(MNDWI\) is the modified normalized difference water index, \(NIR\) is the reflectance value in the near-infrared band, \(R\) is the reflectance value in the visible (red) channel, and \(p(green)\) and \(p(MIR)\) are the normalized values in the green and mid-infrared bands.
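    Both indexes in Eq. (4) share the same normalized-difference form, which the following minimal Python sketch makes explicit. The helper name `normalized_difference`, the sample reflectance values, and the choice to return 0 when both bands are zero are assumptions made for illustration.

```python
def normalized_difference(band_a, band_b):
    """Return (a - b) / (a + b), the form shared by NDVI and MNDWI."""
    total = band_a + band_b
    if total == 0:
        return 0.0  # assumption: treat pixels with no signal as neutral
    return (band_a - band_b) / total

def ndvi(nir, red):
    """NDVI from near-infrared and red reflectance."""
    return normalized_difference(nir, red)

def mndwi(green, mir):
    """MNDWI from green and mid-infrared values."""
    return normalized_difference(green, mir)

# Hypothetical reflectance values for a vegetated pixel and a water pixel.
vegetation_ndvi = ndvi(nir=0.5, red=0.1)
water_mndwi = mndwi(green=0.3, mir=0.1)
```

    Both results fall between -1 and 1; higher NDVI indicates denser green vegetation, and higher MNDWI indicates open water.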

    (2)

    Suitability index (Q)

    The suitability of the landscape ecological process is a measure of the self-regulating ability of the landscape ecosystem, that is, its ability to keep the ecological process protected from disturbance during occasional changes caused by the external environment2. The water conservation amount index (Q) can measure the operating capacity of ecosystems to maintain ecological balance, water conservation, climate regulation, and other ecological processes by integrating the water balance of rainfall, surface runoff, and evaporation41. It can reflect the suitability of the landscape ecological process to regional environmental and developmental conditions. The formula is as follows:
    $$ Q = R - J - ET $$
    (5)
    where Q is the water conservation amount, R is the annual rainfall, J is the surface runoff, ET is the evapotranspiration.

    (3)

    Recoverability index (ECO)

    The recoverability of the landscape ecological process refers to the ability of an ecosystem to return to its original operating state after being subjected to external impacts. Land-use types play an essential role in landscape ecological recoverability48. The ecological recoverability index (ECO) uses the resilience coefficients of land-use types to reflect the level of ecosystem resilience38. Based on previous studies, the resilience coefficient of land-use types was assigned (Table 1).

    (4)

    Sensitivity index (A)

    Table 1 Resilience coefficients of different land use types.

    The sensitivity index (A) indicates the formation, change, and vulnerability to disturbance of the landscape ecological process31. Starting from the physical effects of blue–green space on sand production, water confluence, and sediment transport, we introduced the soil erosion modulus to characterize the sensitivity of landscape ecological processes to disturbance. The index combines landscape ecology, erosion mechanics, soil science, and sediment dynamics49. The formula is as follows:
    $$ \begin{gathered} A = R_{i} \cdot K \cdot LS \cdot C \cdot P \\ L = (l/22.1)^{m} \\ S = \begin{cases} 10.8\sin\theta + 0.03, & \theta < 5^{\circ} \\ 16.8\sin\theta - 0.50, & 5^{\circ} \le \theta < 10^{\circ} \\ 21.9\sin\theta - 0.96, & \theta \ge 10^{\circ} \end{cases} \\ C = \begin{cases} 1, & c = 0 \\ 0.6508 - 0.3436\lg c, & 0 < c \le 78.3\% \\ 0, & c > 78.3\% \end{cases} \end{gathered} $$
    (6)
    where A is the soil erosion modulus, Ri is the rainfall erosion factor, K is the soil erosion factor, L and S are the slope length factor and the slope factor, respectively, C is the vegetation coverage and management factor, P is the soil and water conservation factor, l is the slope length value, m is the slope length index, and θ is the slope value.

    Landscape ecological function

    The landscape ecological function determines the capacity for ecological service50,51,52, and the ecological service of urban blue–green space depends on human value orientation48. It includes four categories: supply, support, regulation, and culture. Based on Maslow’s Hierarchy of Needs and Alderfer’s ERG theory, scholars have summarized three major human needs for urban blue–green space: securing the living environment to meet survival needs, improving social relationships to meet interaction needs, and fostering cultural cultivation to meet development needs53. In terms of the landscape ecological function of urban blue–green space, supply plays only a subsidiary role, support is the basic guarantee, regulation is the basic need for urban environmental construction, and culture is an important element of high-quality social life. Ecosystem service value (ESV) measures the ecological service function by calculating the value of the life-support products and services produced by the ecosystem54,55,56. Considering the human value orientation of the landscape ecological function of urban blue–green space, weights were assigned by consulting 16 experts: 0.2, 0.3, 0.3, and 0.2 for supply, regulation, support, and culture, respectively. The formula is as follows:
    $$ ESV = \sum\limits_{k = 1}^{n} S_{k} \times V_{k} $$
    (7)
    where Sk is the area of landscape type k, and Vk is the value coefficient of the ecosystem service function of landscape type k.

    Landscape ecological sustainability

    Wu (2013) proposed a research framework for landscape sustainability based on a summary of related studies, stating that landscape ecological sustainability is the ability to provide ecosystem services in a long-term and stable manner34. The framework emphasized that landscape sustainability should focus on analyzing the trade-offs among ecosystem services34,57. As the urban blue–green space ecosystem changes dynamically, complex trade-offs arise among its ecosystem services. Analyzing these trade-offs is important for optimizing the overall benefits of the various ecosystem services and achieving sustainable development of urban ecology58. In addition, because urban blue–green space is a human-centered ecosystem developed by humans on the basis of nature, human well-being is also very important for its landscape ecological sustainability. For this reason, we introduced ecosystem service trade-offs (EST) and ecological construction input (ECI) as assessment indexes of landscape ecological sustainability.

    (1)

    Ecosystem service trade-offs (EST)

    This study applied the root mean square deviation of ecosystem services to quantify ecosystem service trade-offs (EST). The index measures the average deviation of the individual normalized ecosystem services from their mean, and it is a simple and effective way to evaluate the trade-offs among ecosystem services. The formula is as follows:
    $$ EST = \sqrt{\frac{1}{n - 1}\sum\nolimits_{i = 1}^{n} \left( ES_{std} - \overline{ES}_{std} \right)^{2}} $$
    (8)
    where \(ES_{std}\) is the normalized value of an ecosystem service, n is the number of ecosystem services, and \(\overline{ES}_{std}\) is the mean value of the normalized ecosystem services.
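    A minimal sketch of Eq. (8), assuming the normalized ecosystem service values for one spatial unit are supplied as a plain list. The function name is illustrative. With the n - 1 denominator, the quantity is the sample standard deviation of the normalized services.

```python
import math


def ecosystem_service_tradeoff(es_std):
    """Root mean square deviation of normalized ecosystem services (Eq. 8)."""
    n = len(es_std)
    if n < 2:
        raise ValueError("at least two ecosystem services are required")
    mean = sum(es_std) / n
    squared_dev = sum((value - mean) ** 2 for value in es_std)
    return math.sqrt(squared_dev / (n - 1))
```

    A value of zero means all services are equally well provided; larger values indicate that some services are provided at the expense of others.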

    (2)

    Ecological construction input (ECI)

    Human well-being is a premise for the landscape ecological sustainability of urban blue–green spaces, and it is closely related to government investment in ecological construction planning34. From an economic perspective, this study assessed the human well-being derived from urban blue–green space using the ratio of urban ecological construction investment to GDP, that is, the ecological construction input (ECI). The formula is as follows:
    $$ ECI = EI/G $$
    (9)
    where EI is the amount of ecological construction investment, and G is the gross regional product.

    Evaluation method

    The weight of an index determines its relative importance in the index system, and the choice of weighting method in multi-attribute decision-making has an important impact on the assessment results21. Traditional weighting methods fall into two categories: subjective and objective21,38. Subjective weighting, represented by the analytic hierarchy process (AHP) and the Delphi method, is simple, but it is prone to subjectivity and arbitrariness because it depends entirely on the knowledge and experience of decision makers. Objective weighting, represented by the entropy weighting method (EWM), principal component analysis, and the variation coefficient method, is widely recognized for reflecting the variability of assessment results18, but it is strongly influenced by the index values, so its results are not stable. Considering the limitations of any single weighting method, the weight of each assessment index in this study was determined by combining a subjective weight from the AHP with an objective weight from the EWM (Table 2). The formulas are as follows:
    $$ w_{j} = \alpha w_{j}^{AHP} + (1 - \alpha )w_{j}^{EWM} $$
    (10)
    $$ w_{j}^{EWM} = d_{j} /\sum\limits_{j = 1}^{m} d_{j} $$
    (11)
    $$ d_{j} = 1 - e_{j} $$
    (12)
    $$ e_{j} = - k\sum\limits_{i = 1}^{n} f_{ij} \ln (f_{ij} ),\;k = 1/\ln (n) $$
    (13)
    $$ f_{ij} = X^{\prime}_{ij} /\sum\limits_{i = 1}^{n} X^{\prime}_{ij} $$
    (14)
    where \(w_{j}\) is the combined weight, \(w_{j}^{AHP}\) is the AHP weight of the j-th index, \(w_{j}^{EWM}\) is the EWM weight of the j-th index, \(d_{j}\) is the degree of divergence (information redundancy) of the j-th index, \(e_{j}\) is the entropy value of the j-th index, \(f_{ij}\) is the proportion of the i-th sample under the j-th index, \(X^{\prime}_{ij}\) is the standardized value of the i-th sample for the j-th index, m is the number of indexes, n is the number of samples, and \(\alpha\) was taken as 0.5.

    Table 2 Weight of assessment index.

    Because the indexes have different dimensions, they must be made dimensionless before calculation; otherwise the evaluation results would be inaccurate. Range standardization was used to normalize the index data to the interval [0, 1], as follows15,23:
    $$ \text{Positive indicator}\left( + \right): A_{ij} = (X_{ij} - X_{j\min} )/(X_{j\max} - X_{j\min} ) $$
    (15)
    $$ \text{Negative indicator}\left( - \right): A_{ij} = (X_{j\max} - X_{ij} )/(X_{j\max} - X_{j\min} ) $$
    (16)
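    The weighting steps in Eqs. (10) to (16) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each index is passed as a list of sample values, at least two samples are required, a constant index is given zero divergence, and the AHP weights are taken as given inputs because the pairwise comparison step is not described here. All function names are hypothetical.

```python
import math


def range_standardize(column, positive=True):
    """Range standardization of one index to [0, 1] (Eqs. 15 and 16)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # assumption: constant index maps to 0
    if positive:
        return [(x - lo) / (hi - lo) for x in column]
    return [(hi - x) / (hi - lo) for x in column]


def entropy_weights(columns):
    """Entropy weights (Eqs. 11 to 14) from standardized index columns."""
    n = len(columns[0])          # number of samples, must be >= 2
    k = 1.0 / math.log(n)
    divergences = []
    for col in columns:
        total = sum(col)
        if total == 0:
            divergences.append(0.0)  # assumption: no information, no weight
            continue
        f = [x / total for x in col]
        # Terms with f = 0 contribute nothing (0 * ln 0 is taken as 0).
        e = -k * sum(p * math.log(p) for p in f if p > 0)
        divergences.append(1.0 - e)
    d_sum = sum(divergences)
    return [d / d_sum for d in divergences]


def combined_weights(w_ahp, w_ewm, alpha=0.5):
    """Linear combination of subjective and objective weights (Eq. 10)."""
    return [alpha * a + (1 - alpha) * e for a, e in zip(w_ahp, w_ewm)]
```

    With alpha = 0.5, as used in the text, the combined weight is simply the mean of the AHP and EWM weights for each index.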
    Additionally, we divided the LEH index into five levels from high to low using an equal-interval approach40: (0.8, 1] healthy, (0.6, 0.8] sub-healthy, (0.4, 0.6] moderately healthy, (0.2, 0.4] unhealthy, and [0, 0.2] pathological, corresponding to levels I–V. The level transfer of LEH between periods was divided into three types: improvement, degradation, and stabilization. For example, III-II denotes a transfer from level III to level II, which is the improvement type.

    Spatial autocorrelation analysis

    Spatial autocorrelation analysis is one of the basic methods in theoretical geography. It investigates the spatial correlation characteristics of data and includes global and local spatial autocorrelation23. Global spatial autocorrelation uses the global Moran’s I to evaluate the degree of spatial agglomeration or differentiation of an attribute value in the study area. Local spatial autocorrelation is a decomposed form of global spatial autocorrelation18,21 and distinguishes four cluster types: HH (High–High), LL (Low–Low), HL (High–Low), and LH (Low–High). In this study, spatial autocorrelation analysis was applied to the spatial correlation characteristics of blue–green space LEH. The calculation formulas are as follows:
    $$ I = \frac{N\sum\limits_{i} \sum\limits_{v} W_{iv} (Y_{i} - \overline{Y} )(Y_{v} - \overline{Y} )}{\left( \sum\limits_{i} \sum\limits_{v} W_{iv} \right)\sum\limits_{i} (Y_{i} - \overline{Y} )^{2}} $$
    (17)
    $$ I_{i} = \frac{Y_{i} - \overline{Y}}{S_{x}^{2}}\sum\limits_{v} \left[ W_{iv} (Y_{v} - \overline{Y} ) \right] $$
    (18)
    where N is the number of spatial units, \(W_{iv}\) is the spatial weight between units i and v, \(Y_{i}\) and \(Y_{v}\) are the attribute values of units i and v, \(\overline{Y}\) is the mean of the attribute, \(S_{x}^{2}\) is the variance, \(I\) is the global Moran’s I index, and \(I_{i}\) is the local Moran’s I index.
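    A minimal sketch of the global and local Moran's I statistics in their standard form, assuming the attribute values are given as a list and the spatial weights as a square list of lists. The function names are illustrative, and using the population variance (dividing by N) for the local statistic is an assumption of this sketch. Under that choice, the local values sum to the global index multiplied by the total weight.

```python
def _deviations(y):
    """Deviations of each value from the mean."""
    mean = sum(y) / len(y)
    return [value - mean for value in y]


def global_morans_i(y, w):
    """Global Moran's I for values y and spatial weight matrix w."""
    n = len(y)
    z = _deviations(y)
    cross = sum(w[i][v] * z[i] * z[v] for i in range(n) for v in range(n))
    w_total = sum(sum(row) for row in w)
    ss = sum(d * d for d in z)
    return n * cross / (w_total * ss)


def local_morans_i(y, w):
    """Local Moran's I for each unit (population variance assumed)."""
    n = len(y)
    z = _deviations(y)
    variance = sum(d * d for d in z) / n
    return [z[i] / variance * sum(w[i][v] * z[v] for v in range(n))
            for i in range(n)]
```

    Positive values indicate that similar values cluster together (HH or LL), and negative values indicate that dissimilar values are adjacent (HL or LH).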

  • in

    Deep learning of a bacterial and archaeal universal language of life enables transfer learning and illuminates microbial dark matter

    LookingGlass design and optimization
    Dataset generation
    The taxonomic organization of representative Bacterial and Archaeal genomes was determined from the Genome Taxonomy Database, GTDB51 (release 89.0). The complete genome sequences were downloaded via the NCBI GenBank ftp52. This resulted in 24,706 genomes, comprising 23,458 Bacterial and 1248 Archaeal genomes.
    Each genome was split into read-length chunks. To determine the distribution of realistic read lengths produced by next-generation short-read sequencing machines, we obtained the BioSample IDs52 for each genome, where they existed, and downloaded their sequencing metadata from the MetaSeek53 database using the MetaSeek API. We excluded samples with average read lengths less than 60 or greater than 300 base pairs. This procedure resulted in 7909 BioSample IDs. The average read lengths for these sequencing samples produced the read-length distribution (Supplementary Fig. 1) with a mean read length of 136 bp. Each genome was split into read-length chunks (with zero overlap in order to maximize information density and reduce data redundancy in the dataset): a sequence length was randomly selected with replacement from the read-length distribution and a sequence fragment of that length was taken from the genome, with a 50% chance that the reverse complement was used. The next sequence fragment was chosen from the genome starting at the end point of the previous read-length chunk, using a new randomly selected read length, and so on. These data were partitioned into a training set used for optimization of the model; a validation set used to evaluate model performance during parameter tuning and as a benchmark to avoid overfitting during training; and a test set used for final evaluation of model performance.
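    The chunking procedure described above can be sketched as follows. This is an illustration, not the authors' code: the function names are hypothetical, the read-length distribution is passed as a plain list of observed lengths, and letting the final chunk be shorter than the drawn length is an assumption, since the text does not say how the end of a genome is handled.

```python
import random

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def chunk_genome(genome, read_lengths, rng=None):
    """Split a genome into non-overlapping read-length chunks.

    Each chunk length is drawn with replacement from read_lengths, and each
    chunk has a 50% chance of being replaced by its reverse complement.
    """
    rng = rng or random.Random()
    chunks = []
    start = 0
    while start < len(genome):
        length = rng.choice(read_lengths)
        fragment = genome[start:start + length]  # last chunk may be shorter
        if rng.random() < 0.5:
            fragment = reverse_complement(fragment)
        chunks.append(fragment)
        start += length  # next chunk begins where this one ended
    return chunks
```

    Passing a seeded `random.Random` instance makes the chunking reproducible, which helps when regenerating a dataset split.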
To ensure that genomes in the training, validation, and test sets had low sequence similarity, the sets were split along taxonomic branches such that genomes from the Actinomycetales, Rhodobacterales, Thermoplasmata, and Bathyarchaeia were partitioned into the validation set; genomes from the Bacteroidales, Rhizobiales, Methanosarcinales, and Nitrososphaerales were partitioned into the test set; and the remaining genomes remained in the training set. This resulted in 529,578,444 sequences in the training set, 57,977,217 sequences in the validation set, and 66,185,518 sequences in the test set. We term this set of reads the GTDB representative set (Table 1).
    Table 1 Summary table of datasets used.
    The amount of data needed for training was also evaluated (Supplementary Fig. 2). Progressively larger amounts of data were tested by selecting at random 1, 10, 100, or 500 read-length chunks from each of the GTDB representative genomes in the GTDB representative training set. Additionally, the performance of smaller but more carefully selected datasets, representing the diversity of the microbial tree of life, was tested by selecting for training one genome at random from each taxonomic class or order in the GTDB taxonomy tree. In general, better accuracy was achieved in fewer epochs with a greater amount of sequencing data (Supplementary Fig. 2); however, a much smaller amount of data performed better if a representative genome was selected from each GTDB taxonomy class.
    The final LookingGlass model was trained on this class-level partition of the microbial tree of life. We term this dataset the GTDB class set (Table 1).
The training, validation, and test sets were split such that no classes overlapped across sets: the validation set included 8 genomes from each of the classes Actinobacteria, Alphaproteobacteria, Thermoplasmata, and Bathyarchaeia (32 total genomes); the test set included 8 genomes from each of the classes Bacteroidia, Clostridia, Methanosarcinia, and Nitrososphaeria (32 total genomes); and the training set included 1 genome from each of the remaining classes (32 archaeal genomes and 298 bacterial genomes for a total of 330 genomes). This resulted in a total of 6,641,723 read-length sequences in the training set, 949,511 in the validation set, and 632,388 in the test set (Supplementary Data 1).
    Architecture design and training
    Recurrent neural networks (RNNs) are a type of neural network designed to take advantage of the context dependence of sequential data (such as text, video, audio, or biological sequences), by passing information from previous items in a sequence to the current item in a sequence54. Long short-term memory networks (LSTMs)55 are an extension of RNNs, which better learn long-term dependencies by handling the RNN tendency to “forget” information farther away in a sequence56. LSTMs maintain a cell state which contains the “memory” of the information in the previous items in the sequence. LSTMs learn additional parameters which decide at each step in the sequence which information in the cell state to “forget” or “update”.
    LookingGlass uses a three-layer LSTM encoder model with 1152 units in each hidden layer and an embedding size of 104 based on the results of hyperparameter tuning (see below). It divides the sequence into characters using a kmer size of 1 and a stride of 1, i.e., it is a character-level language model. LookingGlass is trained in a self-supervised manner to predict a masked nucleotide, given the context of the preceding nucleotides in the sequence.
For each read in the training sequence, multiple training inputs are considered, shifting the nucleotide that is masked along the length of the sequence from the second position to the final position in the sequence. Because it is a character-level model, a linear decoder predicts the next nucleotide in the sequence from the possible vocabulary items “A”, “C”, “G”, and “T”, with special tokens for “beginning of read”, “unknown nucleotide” (for the case of ambiguous sequences), “end of read” (only “beginning of read” was tokenized during LookingGlass training), and a “padding” token (used for classification only).
    Regularization and optimization of LSTMs require special approaches to dropout and gradient descent for best performance57. The fastai library58 offers default implementations of these approaches for natural language text, and so we adopt the fastai library for all training presented in this paper. We provide the open source fastBio python package59 which extends the fastai library for use with biological sequences.
    LookingGlass was trained on a Pascal P100 GPU with 16GB memory on Microsoft Azure, using a batch size of 512, a back propagation through time (bptt) window of 100 base pairs, the Adam optimizer60, and utilizing a Cross Entropy loss function (Supplementary Table 1). Dropout was applied at variable rates across the model (Supplementary Table 1). LookingGlass was trained for a total of 12 days for 75 epochs, with progressively decreasing learning rates based on the results of hyperparameter optimization (see below): for 15 epochs at a learning rate of 1e−2, for 15 epochs at a learning rate of 2e−3, and for 45 epochs at a learning rate of 1e−3.
    Hyperparameter optimization
    Hyperparameters used for the final training of LookingGlass were tuned using a randomized search of hyperparameter settings.
The tuned hyperparameters included kmer size, stride, number of LSTM layers, number of hidden nodes per layer, dropout rate, weight decay, momentum, embedding size, bptt size, learning rate, and batch size. An abbreviated dataset consisting of ten randomly selected read-length chunks from the GTDB representative set was created for testing many parameter settings rapidly. A language model was trained for two epochs for each randomly selected hyperparameter combination, and those conditions with the maximum performance were accepted. The hyperparameter combinations tested and the selected settings are described in the associated Github repository61.
    LookingGlass validation and analysis of embeddings
    Functional relevance
    Dataset generation
    In order to assess the ability of the LookingGlass embeddings to inform the molecular function of sequences, metagenomic sequences from a diverse set of environments were downloaded from the Sequence Read Archive (SRA)62. We used MetaSeek53 to choose ten metagenomes at random from each of the environmental packages defined by the MIxS metadata standards63: built environment, host-associated, human gut, microbial mat/biofilm, miscellaneous, plant-associated, sediment, soil, wastewater/sludge, and water, for a total of 100 metagenomes. The SRA IDs used are available in (Supplementary Table 2). The raw DNA reads for these 100 metagenomes were downloaded from the SRA with the NCBI e-utilities. These 100 metagenomes were annotated with the mi-faser tool27 with the read-map option to generate predicted functional annotation labels (to the fourth digit of the Enzyme Commission (EC) number), out of 1247 possible EC labels, for each annotatable read in each metagenome. These reads were then split 80%/20% into training/validation candidate sets of reads. To ensure that there was minimal overlap in sequence similarity between the training and validation set, we compared the validation candidate sets of each EC annotation to the training set for that EC number with CD-HIT64, and filtered out any reads with >80% DNA sequence similarity to the reads of that EC number in the training set (the minimum CD-HIT DNA sequence similarity cutoff). In order to balance EC classes in the training set, overrepresented ECs in the training set were downsampled to the mean count of read annotations (52,353 reads) before filtering with CD-HIT. After CD-HIT processing, any underrepresented EC numbers in the training set were oversampled to the mean count of read annotations (52,353 reads). The validation set was left unbalanced to retain a distribution more realistic to environmental settings. 
The final training set contained 61,378,672 reads, while the validation set contained 2,706,869 reads. We term this set of reads and their annotations the mi-faser functional set (Table 1).
    As an external test set, we used a smaller number of DNA sequences from genes with experimentally validated molecular functions. We linked the manually curated entries of Bacterial or Archaeal proteins from the Swiss-Prot database65 corresponding to the 1247 EC labels in the mi-faser functional set with their corresponding genes in the EMBL database66. We downloaded the DNA sequences, and selected ten read-length chunks at random per CDS. This resulted in 1,414,342 read-length sequences in the test set. We term this set of reads and their annotations the Swiss-Prot functional set (Table 1).

    Fine-tuning procedure
    We fine-tuned the LookingGlass language model to predict the functional annotation of DNA reads, to demonstrate the speed with which an accurate model can be trained using our pretrained LookingGlass language model. The architecture of the model retained the 3-layer LSTM encoder and the weights of the LookingGlass language model encoder, but replaced the language model decoder with a new multiclass classification layer with pooling (with randomly initialized weights). This pooling classification layer is a sequential model consisting of the following layers: a layer concatenating the output of the LookingGlass encoder with min, max, and average pooling of the outputs (for a total dimension of 104*3 = 312), a batch normalization67 layer with dropout, a linear layer taking the 312-dimensional output of the batch norm layer and producing a 50-dimensional output, another batch normalization layer with dropout, and finally a linear classification layer that is passed through the log(Softmax(x)) function to output the predicted functional annotation of a read as a probability distribution of the 1247 possible mi-faser EC annotation labels. We then trained the functional classifier on the mi-faser functional set described above. Because the >61 million reads in the training set were too many to fit into memory, training was done in 13 chunks of ~5-million reads each until one total epoch was completed. Hyperparameter settings for the functional classifier training are seen in Supplementary Table 1.

    Encoder embeddings and MANOVA test
    To test whether the LookingGlass language model embeddings (before fine-tuning, above) are distinct across functional annotations, a random subset of ten reads per functional annotation was selected from each of the 100 SRA metagenomes (or the maximum number of reads present in that metagenome for that annotation, whichever was smaller). This also ensured that reads were evenly distributed across environments. The corresponding fixed-length embedding vector for each read was produced by saving the output from the LookingGlass encoder (before the embedding vector is passed to the language model decoder) for the final nucleotide in the sequence. This vector represents a contextually relevant embedding for the overall sequence. The statistical significance of the difference between embedding vectors across all functional annotation groups was tested with a MANOVA test using the R stats package68.
    Evolutionary relevance
    Dataset generation
    The OrthoDB database69 provides orthologous groups (OGs) of proteins at various levels of taxonomic distance. For instance, the OrthoDB group “77at2284” corresponds to proteins belonging to “Glucan 1,3-alpha-glucosidase at the Sulfolobus level”, where “2284” is the NCBI taxonomy ID for the genus Sulfolobus.
    We tested whether embedding similarity of homologous sequences (sequences within the same OG) is higher than that of nonhomologous sequences (sequences from different OGs). We tested this in OGs at multiple levels of taxonomic distance—genus, family, order, class, and phylum. At each taxonomic level, ten individual taxa at that level were chosen from across the prokaryotic tree of life (e.g., for the genus level, Acinetobacter, Enterococcus, Methanosarcina, Pseudomonas, Sulfolobus, Bacillus, Lactobacillus, Mycobacterium, Streptomyces, and Thermococcus were chosen). For each taxon, 1000 randomly selected OGs corresponding to that taxon were chosen; for each of these OGs, five randomly chosen genes within this OG were chosen.
    OrthoDB cross-references OGs to UniProt65 IDs of the corresponding proteins. We mapped these to the corresponding EMBL CDS IDs66 via the UniProt database API65; DNA sequences of these EMBL CDSs were downloaded via the EMBL database API. For each of these sequences, we generated LookingGlass embedding vectors.

    Homologous and nonhomologous sequence pairs
    To create a balanced dataset of homologous and nonhomologous sequence pairs, we compared all homologous pairs of the five sequences in an OG (total of ten homologous pairs) to an equal number of randomly selected out-of-OG comparisons for the same sequences; i.e., each of the five OG sequences was compared to 2 other randomly selected sequences from any other randomly selected OG (total of ten nonhomologous pairs). We term this set of sequences, and their corresponding LookingGlass embeddings, the OG homolog set (Table 1).
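    The balanced pairing scheme described above can be sketched as follows: five sequences give C(5, 2) = 10 homologous pairs, and two out-of-group partners per sequence give 5 × 2 = 10 nonhomologous pairs. The function name and the dictionary layout (OG identifier mapped to a list of sequence identifiers) are assumptions for illustration only.

```python
import itertools
import random


def build_pairs(og_to_seqs, og_id, rng, n_out=2):
    """Return (homologous, nonhomologous) sequence pairs for one OG."""
    members = og_to_seqs[og_id]
    # All unordered within-OG pairs: 10 pairs for five sequences.
    homologous = list(itertools.combinations(members, 2))
    other_ogs = [og for og in og_to_seqs if og != og_id]
    nonhomologous = []
    for seq in members:
        for _ in range(n_out):
            # Pick a random other OG, then a random sequence from it.
            other = rng.choice(other_ogs)
            nonhomologous.append((seq, rng.choice(og_to_seqs[other])))
    return homologous, nonhomologous
```

    Matching the two counts keeps the comparison between homologous and nonhomologous similarity distributions balanced.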

    Embedding and sequence similarity
    For each sequence pair, the sequence and embedding similarity were determined. The embedding similarity was calculated as the cosine similarity between embedding vectors. The sequence similarity was calculated as the Smith-Waterman alignment score using the BioPython70 pairwise2 package, with a gap open penalty of −10 and a gap extension penalty of −1. The IDs of chosen OGs, the cosine similarities of the embedding vectors, and sequence similarities of the DNA sequences are available in the associated Github repository61.
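    The embedding similarity above is the standard cosine similarity, sketched here with the standard library only. The function name is illustrative, and raising an error for a zero-length vector is an assumption of this sketch. The Smith-Waterman alignment step is not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

    The result lies in [-1, 1] and depends only on the direction of the two vectors, not on their magnitude.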

    Comparison to HMM-based domain searches for distant homology detection
    Distantly related homologous sequences that share domains (e.g., Pfam71 domains) can be identified using HMM-based search methods. We used hmmscan25 (e-value threshold = 1e−10) to compare homologous (at the phylum level) sequences in the OG homolog set for which the alignment score was less than 50 bits and the embedding similarity was greater than 0.62 (total: 21,376 gene pairs). Specifically, we identified Pfam domains in each sequence and checked whether the most significant (lowest e-value) domain of each sequence was shared by both members of each homologous pair.
    Environmental relevance
    Encoder embeddings and MANOVA test
    The LookingGlass embeddings and the environment of origin for each read in the mi-faser functional set were used to test the significance of the difference between the embedding vectors across environmental contexts. The statistical significance of this difference was evaluated with a MANOVA test using the R stats package68.
    Oxidoreductase classifier
    Dataset generation
    The manually curated, reviewed entries of the Swiss-Prot database65 were downloaded (June 2, 2020). Of these, 23,653 entries were oxidoreductases (EC number 1.-.-.-) of Archaeal or Bacterial origin (988 unique ECs). We mapped their UniProt IDs to both their EMBL CDS IDs and their UniRef50 IDs via the UniProt database mapper API. UniRef50 IDs identify clusters of sequences with >50% amino acid identity. This cross-reference identified 28,149 EMBL CDS IDs corresponding to prokaryotic oxidoreductases, belonging to 5451 unique UniRef50 clusters. We split these data into training, validation, and test sets such that each UniRef50 cluster was contained in only one of the sets, i.e., there was no overlap in EMBL CDS IDs corresponding to the same UniRef50 cluster across sets. This ensures that the oxidoreductase sequences in the validation and test sets are dissimilar to those seen during training. The DNA sequences for each EMBL CDS ID were downloaded via the EMBL database API. This data generation process was repeated for a random selection of non-oxidoreductase UniRef50 clusters, which resulted in 28,149 non-oxidoreductase EMBL CDS IDs from 13,248 unique UniRef50 clusters.
    Approximately 50 nucleotide read-length chunks (selected from the representative read-length distribution, as above) were selected from each EMBL CDS DNA sequence, with randomly selected start positions on the gene and a 50% chance of selecting the reverse complement, such that an even number of read-length sequences with “oxidoreductase” and “not oxidoreductase” labels were generated for the final dataset. This procedure produced a balanced dataset with 2,372,200 read-length sequences in the training set, 279,200 sequences in the validation set, and 141,801 sequences in the test set. We term this set of reads and their annotations the oxidoreductase model set (Table 1). In order to compare the oxidoreductase classifier performance to an HMM-based method, reads with “oxidoreductase” labels in the oxidoreductase model test set (71,451 reads) were 6-frame translated and searched against the Swiss-Prot protein database using phmmer25 (reporting e-val threshold = 0.05, using all other defaults).

    Fine-tuning procedure
    Since our functional annotation classifier addresses a closer classification task to the oxidoreductase classifier than LookingGlass itself, the architecture of the oxidoreductase classifier was fine-tuned starting from the functional annotation classifier, replacing the decoder with a new pooling classification layer (as described above for the functional annotation classifier) and with a final output size of 2 to predict “oxidoreductase” or “not oxidoreductase”. Fine tuning of the oxidoreductase classifier layers was done successively, training later layers in isolation and then progressively including earlier layers into training, using discriminative learning rates ranging from 1e−2 to 5e−4, as previously described72. The fine-tuned model was trained for 30 epochs, over 18 h, on a single P100 GPU node with 16GB memory.

    Model performance in metagenomes
    Sixteen marine metagenomes from the surface (SRF, ~5 meters) and mesopelagic (MES, 175–800 meters) from eight stations sampled as part of the TARA expedition37 were downloaded from the SRA62 (Supplementary Table 3, SRA accession numbers ERR598981, ERR599063, ERR599115, ERR599052, ERR599020, ERR599039, ERR599076, ERR598989, ERR599048, ERR599105, ERR598964, ERR598963, ERR599125, ERR599176, ERR3589593, and ERR3589586). Metagenomes were chosen from a latitudinal gradient spanning polar, temperate, and tropical regions and ranging from −62 to 76 degrees latitude. Mesopelagic depths from four out of the eight stations were sampled from oxygen minimum zones (OMZs, where oxygen

  • in

    Taking metagenomics under the wings

    Affiliations
    Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK: Physilia Ying Shi Chua
    Laboratory of Genomics and Molecular Medicine, Department of Biology, University of Copenhagen, Copenhagen, Denmark: Jacob Agerbo Rasmussen
    Center for Evolutionary Hologenomics, Globe Institute, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark: Jacob Agerbo Rasmussen
    Authors: Physilia Ying Shi Chua, Jacob Agerbo Rasmussen
    Corresponding author: Physilia Ying Shi Chua