Data collection
We hypothesized that the evolution of underground species affected protein networks in a unique manner in which various types of protein domains served as building blocks of protein evolution. To study the evolution of protein networks, we collected genomic, proteomic, and protein domain classification data, namely, fully sequenced genomes with coding sequences and annotated proteomes, together with protein ortholog assignments, from 32 species living in three broad ecological niches, namely subterranean, fossorial, and aboveground (Table 1, and listed in Materials and Methods). We first sought overall statistics regarding the number of proteins and the number of corresponding orthologous protein families. Overall PPI statistics were calculated, including those predicting PPIs in organisms for which experimentally verified PPI data are missing. We used the KEGG orthologs (KO) group of orthologous proteins in KEGG (Kyoto Encyclopaedia of Genes and Genomes)17 to reproduce gain and loss of protein domains in orthologous proteins. We collected 1,350,898 proteins from the studied organisms that belong to 624,787 KO groups (10,314 are unique ortholog groups). The matching number of interactors and networks for every organism were exhaustively calculated for all these proteins (Fig. 1). We found that 361,615 of the 1,350,898 proteins are distributed among 5,879,879 (predicted and real) PPIs. The mean number of interactors per protein within each habitat, namely, aboveground (A), fossorial (F), and subterranean (S) were 32.07, 32.48, and 32.67, respectively (see details in the supplementary results and in Tables S1–S3). This shows that the number of interactors per protein is similar for organisms from different ecologies.
The study overview. Fully sequenced genomes with coding sequences and annotated proteomes were collected from 32 species living in three broad ecological niches: subterranean, fossorial, and aboveground. For collected proteins (1,350,898), protein domains, protein disordered regions, and KEGG orthologous annotation (624,787) were predicted using the Pfam search tool53 along with HMMER60 , IUPred2A44, and the KEGG database17, respectively. Next, 5,879,879 PPIs were evaluated using our previously developed ChiPPI tool15. Briefly, ChiPPI uses a domain-domain co-occurrence table. When a certain domain is missing, ChiPPI evaluates the corresponding missing interactors in the PPI network15, based on real PPI data (363,816) as obtained from BioGrid (release 3.4.163)16. Finally, PPI data are organized in PASTORAL, a dedicated database.
Additional analysis of PPI features for orthologous proteins (516 KOs) common to all organisms were similar across ecologies. These features included the number of interactors, the number of PPIs, and global/individual clustering coefficients (supplementary results, Figures S1, S2, Table S4). Thus, we studied PPI properties of genes encoding products related to stresses that differ across the ecologies considered, such as hypoxia. Our findings confirm our hypothesis that the design principles of the evolution of underground species involve various types of protein domains serving as building blocks of protein evolution.
Analysis of the PPIs of stress-response proteins cluster organisms according to habitat
To examine how organisms might have adapted to the various stresses in each habitat, we analyzed mutations and changes in the PPIs encoded by stress response genes. Heat-shock, hypoxia, and circadian stresses differ considerably between aboveground and underground environments, and are likely to drive evolutionary selection of proteins that provide optimal function in each niche1,9. We assumed that organisms subject to a shared ecological experience would face similar environmental stresses. PPI networks of stress-related proteins would thus be expected to differ substantially according to ecology.
To test our hypothesis, we performed clustering analysis of all the organisms included in our study, based on mutations and PPI network features, and compared the results for each classification. Such analysis included all orthologous stress-response, hypoxia, heat-shock, and circadian stress proteins (Table 1). In total, 85,173 PPIs related to stress-response proteins were found to be distributed among 1,103 proteins. These comprised of 730 heat shock proteins in 71,940 PPIs, 254 hypoxia-related proteins in 10,256 PPIs, and 119 circadian proteins in 2,977 PPIs (Table 1, Tables S1–S7). All orthologous stress-response genes (KO groups) were obtained by querying the KEGG database with the terms “heat-shock”, “hypoxia”, and “circadian” terms. The results are listed in Table 2, while the corresponding lists of proteins are found in Tables S5, S6 and S7, respectively.
Next, we performed clustering analysis based on sequence mutations and PPI features for the full set of heat-shock, hypoxia, and circadian stress proteins (Table 2). Remarkably, proteins related to hypoxia, heat-shock, and circadian stresses in the 32 organisms studied did not all cluster according to shared ecology based on sequence mutations (Fig. 2A) but significantly did so on the basis of “PPI network clustering coefficient” (Fig. 2B–D; p value (AU) < 0.02, p value = 0.0018, and p value = 0.0013, respectively, Pearson’s χ2-test). Moreover, the observed clustering of organisms according to ecological niches reflects adaptation towards a specific stress, rather than to the particular identity of the environment, such as a cave or within soil. Interestingly, we observed that bat clustered with other subterranean organisms based on hypoxia-related proteins. As hypoxia has been associated with spill-over, i.e., transmission of virulent viruses to other species18, other subterranean organisms may also have innate protection from virulent viruses. Moreover, the little brown bat (Myotis lucifugus) is associated with the emergence of SARS-CoV-2 responsible for the current COVID-19 pandemic19. Additional contributors to the spill-over of virulent viruses from bats include arousal from hibernation and the fact that hundreds of these bats hibernate in caves18,20. Taken together, these results showed better assignment of organisms to broad ecological niches based on their cellular PPI networks than on sequence mutations, and supports the hypothesis that organisms adapt to their specific ecologies by modulating PPI networks rather than by mutation of protein sequences.
Hierarchical clustering. (A) All studied organisms were assessed for hierarchal clustering using an identity matrix from multiple alignment of PAS domain sequences from hypoxia-related orthologous proteins. (B) Clustering of the hypoxia-related proteins (in the 32 organisms studied (Table 2). (C) Clustering of the heat-shock proteins (Table 2). (D) Clustering of circadian proteins (Table 2). AU (approximately unbiased) p values and BP (bBootstrap probability) values are shown59.
Additional analysis of PPI networks involving hypoxia-related proteins (e.g. HIF2A) revealed that distribution of central proteins within PPI network discriminates between PPIs of different ecologies, such as DMAD3, XPO1 and EWSR1, were unique to subterranean animals (supplementary results, Figure S3). This finding indicates that adaption to ecology via PPI modulation could rely on “shuffling” of protein domains, resulting in global changes in PPI networks in an ecology-specific manner.
Genes encoding common orthologous proteins of subterranean animals adopted non-optimal codons
Due to redundancy of the genetic code, amino acids are encoded by multiple synonymous codons. Moreover, the use of synonymous codons is non-uniform, such that there is a strong preference for certain codons in highly expressed genes21,22,23. According to the strength of affinity of codon-anticodon interactions, codons with high and low affinities are referred to as optimal and non-optimal, respectively24,25. We previously showed that subterranean animals adopted non-optimal codon usage as part of their adaptation to their stressful environment26. We now hypothesized that orthologous proteins of subterranean animals adopted different codon usage preferences than those of fossorial and aboveground species.
To examine differences in codon usage preferences, we considered 516 orthologous proteins from KO groups common to the 32 organisms of study by developing a tendency score to estimate codon usage preferences (codon usage preference score (CUPS), defined by Eq. (3)) from a codon usage table (CUT). Accordingly, we classified codons as optimal and non-optimal25. For the 516 common KO proteins, we computed the probability of subterranean animals adopting non-optimal and optimal codon usage, as calculated from the area under the density distribution curve, relative to aboveground animals (Fig. 3). Using the bootstrapping procedure described below, we found that subterranean and fossorial animals adopted 75.0% (p value = 0.0019) and 58.8% (p value = 0.076) more non-optimal codon usage, respectively, compared with aboveground animals (Fig. 3). Briefly, 10,000 random groups of 516 KOs were generated (as bootstrap replicates) and codon usage was calculated. p values were defined as the frequency of bootstrap replicates, with calculated values equaling or exceeding observed values (see Materials and Methods). We found that subterranean animals adopted 50.05%, on average, less optimal codon usage (CUPS: (subterranean (S) = 11.20, aboveground (A) = 22.42, A vs. S, p value < 2.2 × 10–16. Wilcoxon rank sum test with continuity correction; Table S4).
Codon usage preference score (CUPS) density plots for 516 KOs common to all studied organisms. (A) Density plot for all 516 KOs in each habitat (ab.ground (aboveground), fossorial, and subterranean). Proportions (%) of KOs, relative to those seen in aboveground dwellers are indicated in terms of optimal (positive) and non-optimal (negative) codon usage presences (CUPS). Subterranean organisms adopt 75.0% (p value = 0.0008) more non-optimal codon usage than do aboveground species.
Proteins with disordered regions are encoded by genes that adopted non-optimal codon usage and form fewer PPIs
Traditionally, proteins realize their function based on their 3-dimensional structure. However, in recent years, protein segments (> 30 residues) lacking stable secondary and/or tertiary structure, referred to as intrinsically disordered regions (IDRs) or intrinsically disordered protein regions (IDPRs), have been shown to exhibit functional capabilities within core molecular processes27,28,29,30,31,32. The tendency of a protein region to exhibit structure can be represented on a spectrum33. At one extreme, proteins without IDRs are considered as structured, while at the other end, proteins without structure over the entire sequence are referred to as intrinsically disordered proteins (IDPs)27,28,29,30,31,32. Differential inclusion of IDRs via alternative splicing was found to increase protein function capabilities. IDRs contain sequence motifs which mediate interactions, and can contain post-translational modification sites34,35,36. Differential inclusion of IDRs was also found to modulate PPIs in an tissue-specific manner by including or excluding IDRs that interact directly with protein partners34,35. IDR composition, length and position were, moreover, shown to affect protein half-life, in addition to expanding protein functional capabilities37,38,39,40. Misregulation and mutations within IDRs affect molecular function41,42,43. The presence of a high proportion of missense disease mutations within IDRs indicates the importance of IDRs to proper molecular function, as well as to the development of disease. Therefore, we expanded the 516 KO groups common to all organisms addressed in this study to consider all KO groups and intrinsically disordered regions in proteins, defined as a continuous stretch longer than seven residues with an IUPRED SCORE > = 0.544 that do not overlap with Pfam domains. We thus hypothesized that disordered segments would affect ecological adaptation; and examined this by systematic analysis of multiple data sets that describe the sequences of various ordered and disordered domains, as well as proteins with both ordered and disordered regions, roughly corresponding to structured proteins, IDPs and IDRs respectively.
Once again, we calculated the total number of PPIs and CUPS and generated scatter plots (Fig. 4). These plots were generated from orthologous proteins, with the total number of PPIs differing significantly, at least by 1.2-fold, between ecologies. We found that proteins with disordered regions generally form fewer PPIs and are encoded by more genes showing non-optimal codon usage preferences to higher degree (Fig. 4A), relative to their counterparts containing mixed (Fig. 4B) and ordered (Fig. 4C) regions. On average, proteins with disordered regions formed 35.2%, 36.92%, and 35.6% fewer PPIs than did proteins with ordered regions within aboveground, fossorial, and subterranean ecologies, respectively (p value < 2.2e−16, Wilcoxon rank sum test with continuity correction;Table S2). Moreover, proteins with disordered regions adopt, on average, 11.2%, 12.8%, and 7.6% less optimal codon usage (CUPS) than do proteins with mixed regions from aboveground, fossorial, and subterranean ecologies, respectively (p value < 0.024, Wilcoxon rank sum test with continuity correction, Table S8). These results indicate that proteins with disordered regions form fewer PPIs and are encoded by genes that adopted fewer optimal codon usage preferences than do counterpart proteins with ordered and mixed regions.
Total PPIs and codon usage preference score (CUPS) with density plots for all proteins with disordered (A), mixed (B), and ordered (C) regions, stratified by ecology (aboveground, lime green (#66c2a5) *; fossorial, soft orange (#fc8d62) *; and subterranean, light blue (#8da0cb)*). Proteins with disordered, mixed, and ordered regions formed different clusters, with varying degrees of extreme values of PPIs and CUPS. Additionally, proteins with disordered regions that adopted non-optimal CUPS [− 50, − 20] were more numerous for subterranean animals than for their counterpart proteins in fossorial and aboveground animals. *Hexadecimal color number.
Collectively, our findings are consistent and extend observations made with the fungus Neurospora, namely that non-optimal codons are used more often in intrinsically disordered regions, while optimal codons are preferentially used in structured (ordered) domains45. Moreover, experimentally optimizing codon usage of the circadian clock gene was found to impair gene function45, thus demonstrating the functional role of IDRs in protein function, in general32,46, and the functional role of non-optimal codons, in particular. The results were similar when proteins with disordered regions were compared across ecologies (supplementary results, Tables S2, S8 and S9).
We observed a higher proportion in the mean number of interactors among aboveground than subterranean animals (93.1% (ordered), 97.7% (mixed), and 147.8% (disordered), p value 1.15e−11, Pearson’s χ2-test; Table S8). This result indicates higher connectivity in the PPI networks of aboveground animals. Additionally, PPIs in subterranean, fossorial, and aboveground species displayed significant enrichment, compared with the 17,266 instances of loss of protein domains in 10,000 random PPI networks (125,956; 172,613; and 212,941, compared to 81,622 PPIs; V = 52, p value = 0.009766, V = 40, p value = 0.03906, and p = V = 91, p value = 0.0002441, Wilcoxon signed rank test with continuity correction), respectively.
These observed interactions involved 9,429; 10,676; and 13,077 proteins, on average, in subterranean, fossorial, and aboveground species (V = 36, p value = 0.4316, V = 21, p value = 0.9102, V = 89, p value = 0.0007324, respectively, Wilcoxon signed rank test with continuity correction), respectively. These values are thus significantly higher than the average 10,000 random PPI networks only for aboveground species. This is possibly due to the low number of proteins in PPIs belonging to fossorial and subterranean insects considered. Indeed, the average numbers of interactors per protein as a function of habitat (i.e., 32.67 (S), 32.48 (F) and 32.07 (A)) were significantly higher compared with the random value (16.5) (V = 55, p value = 0.001953, V = 45, p value = 0.003906, V = 91, p value = 0.0002441, respectively, Wilcoxon signed rank test with continuity correction). To confirm our results regarding codon usage preferences and PPIs, we collected such information from 61 proteins from the DisProt47,48,49 database with over 98% disorder content from Rattus norvegicus, Mus musculus, Homo sapiens, Drosophila melanogaster, Danio rerio, Sus scrofa and Bos taurus (Table S10).
Orthologous proteins were found in aboveground, fossorial and subterranean animals, and CUPS and PPI analysis were performed. We found that subterranean animals adopt extreme non-optimal codon usage preferences and form less PPIs that are on average relative to aboveground and fossorial ecologies (CUPS ( PPIs)): − 17.37 (43.57), − 14.71 (45.74) and − 14.76 (44.86) respectively (p value < 0.041, Wilcoxon rank sum test with continuity correction, Tables S11, S12, S13, respectively). These patterns are apparent in a scatter plot showing density distributions (Fig. S4). The results replicated the observations obtained from our classification of proteins as ordered, disordered or mixed. Moreover, as this analysis was performed without consideration of our ecology-based classification, our results are independent of our domain classification method. Furthermore, the results confirm that our classification method captures many aspects of the disordered nature of proteins, at least in relation to their adaptation to a subterranean environment.
The user-friendly interface of the PASTORAL server
Finally, we organized all our data in a dedicated resource, PASTORAL (Protein–Protein Interactions of Stress-Response Genes in Subterranean and Fossorial Animals). The PASTORAL database interface is user-friendly and accepts the following parameters for a selected animal as input query: Gene symbols, NCBI Entrez identifiers (NCBI_ID), protein ID, chromosomes, and gene descriptions. Upon an identified match of a search query, the user is directed to the entry webpage. From this page, all PPI data can be obtained (particularly for the heat-shock and hypoxia-related proteins) using annotations and the corresponding KEGG orthologs17 (see Fig. 5). Querying PASTORAL for two protein names (interactors) at most, or their NCBI_IDs, returns interactions for bi-level PPIs. Querying for three or more identifiers (maximum 380) returns interactions between these entities (single-level PPIs). The interactors can also be downloaded as a file in tab-delimited format. PASTORAL, written in mySQL, enables users to study proteins and their interactions in an intuitive workflow, as displayed in Fig. 5 and Figures S5–S7). Here, PASTORAL was used in an analysis involving NCBI_IDs for input proteins from 23 organisms listed in Table S1.
(A) The PASTORAL interface, showing querying and analysis features. (B) The pop-up window of protein orthologs. (C) An example of a partial codon usage table (CUT) for Nannospalax galilli. The codon usage table provides an overall %GC content and number of CDS from which the table is computed. AA, amino acid; Fraction, the proportion of usage of the codon among its degenerate set, i.e. the set of codons that code for an AA; Frequency, the expected number of codons, given the input sequence(s), per 1,000 bases; and Number, the raw number of occurrences of the codon in the input sequences.
Source: Ecology - nature.com