LookingGlass design and optimization
Dataset generation
The taxonomic organization of representative Bacterial and Archaeal genomes was determined from the Genome Taxonomy Database, GTDB51 (release 89.0). The complete genome sequences were downloaded via the NCBI GenBank FTP site52, resulting in 24,706 genomes: 23,458 Bacterial and 1248 Archaeal.
Each genome was split into read-length chunks. To determine the distribution of realistic read lengths produced by next-generation short-read sequencing machines, we obtained the BioSample IDs52 for each genome, where they existed, and downloaded their sequencing metadata from the MetaSeek53 database using the MetaSeek API. We excluded samples with average read lengths less than 60 or greater than 300 base pairs. This procedure resulted in 7909 BioSample IDs. The average read lengths of these sequencing samples produced the read-length distribution (Supplementary Fig. 1), with a mean read length of 136 bp. Chunks were generated with zero overlap, in order to maximize information density and reduce data redundancy in the dataset: a sequence length was randomly selected with replacement from the read-length distribution and a fragment of that length was subset from the genome, with a 50% chance that the reverse complement was used. The next fragment was chosen from the genome starting at the end point of the previous chunk, using a new randomly selected read length, and so on. These data were partitioned into a training set used for optimization of the model; a validation set used to evaluate model performance during parameter tuning and as a benchmark to avoid overfitting during training; and a test set used for final evaluation of model performance. To ensure that genomes in the training, validation, and test sets had low sequence similarity, the sets were split along taxonomic branches: genomes from the Actinomycetales, Rhodobacterales, Thermoplasmata, and Bathyarchaeia were partitioned into the validation set; genomes from the Bacteroidales, Rhizobiales, Methanosarcinales, and Nitrososphaerales were partitioned into the test set; and the remaining genomes stayed in the training set. This resulted in 529,578,444 sequences in the training set, 57,977,217 sequences in the validation set, and 66,185,518 sequences in the test set. We term this set of reads the GTDB representative set (Table 1).
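For illustration, the chunking procedure can be sketched in a few lines of Python. This is a minimal sketch assuming Biopython is installed, that `read_lengths` holds samples from the empirical read-length distribution, and that `genome.fasta` is a hypothetical input path:

```python
# Minimal sketch of non-overlapping read-length chunking; `read_lengths`
# is assumed to hold the empirical MetaSeek read-length distribution.
import random
from Bio import SeqIO

def chunk_genome(fasta_path, read_lengths):
    """Split a genome into non-overlapping read-length fragments."""
    chunks = []
    for record in SeqIO.parse(fasta_path, "fasta"):
        seq, pos = record.seq, 0
        while pos < len(seq):
            length = random.choice(read_lengths)  # sampled with replacement
            fragment = seq[pos:pos + length]
            if random.random() < 0.5:             # 50% reverse complement
                fragment = fragment.reverse_complement()
            chunks.append(str(fragment))
            pos += length                         # zero overlap
    return chunks

# Usage (hypothetical path): chunks = chunk_genome("genome.fasta", read_lengths)
```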
The amount of data needed for training was also evaluated (Supplementary Fig. 2). Progressively larger amounts of data were tested by selecting at random 1, 10, 100, or 500 read-length chunks from each genome in the GTDB representative training set. Additionally, the performance of smaller but more carefully selected datasets, representing the diversity of the microbial tree of life, was tested by selecting for training one genome at random from each taxonomic class or order in the GTDB taxonomy tree. In general, better accuracy was achieved in fewer epochs with a greater amount of sequencing data (Supplementary Fig. 2); however, a much smaller amount of data performed better when a representative genome was selected from each GTDB taxonomy class.
The final LookingGlass model was trained on this class-level partition of the microbial tree of life. We term this dataset the GTDB class set (Table 1). The training, validation, and test sets were split such that no classes overlapped across sets: the validation set included 8 genomes from each of the classes Actinobacteria, Alphaproteobacteria, Thermoplasmata, and Bathyarchaeia (32 total genomes); the test set included 8 genomes from each of the classes Bacteroidia, Clostridia, Methanosarcinia, and Nitrososphaeria (32 total genomes); and the training set included 1 genome from each of the remaining classes (32 archaeal genomes and 298 bacterial genomes for a total of 330 genomes). This resulted in a total of 6,641,723 read-length sequences in the training set, 949,511 in the validation set, and 632,388 in the test set (Supplementary Data 1).
Architecture design and training
Recurrent neural networks (RNNs) are a type of neural network designed to take advantage of the context dependence of sequential data (such as text, video, audio, or biological sequences), by passing information from previous items in a sequence to the current item in a sequence54. Long short-term memory networks (LSTMs)55 are an extension of RNNs, which better learn long-term dependencies by handling the RNN tendency to “forget” information farther away in a sequence56. LSTMs maintain a cell state which contains the “memory” of the information in the previous items in the sequence. LSTMs learn additional parameters which decide at each step in the sequence which information in the cell state to “forget” or “update”.
LookingGlass uses a three-layer LSTM encoder model with 1152 units in each hidden layer and an embedding size of 104, based on the results of hyperparameter tuning (see below). It divides the sequence into characters using a kmer size of 1 and a stride of 1, i.e., it is a character-level language model. LookingGlass is trained in a self-supervised manner to predict a masked nucleotide, given the context of the preceding nucleotides in the sequence. For each read in the training set, multiple training inputs are considered, shifting the nucleotide that is masked along the length of the sequence from the second position to the final position. Because it is a character-level model, a linear decoder predicts the next nucleotide in the sequence from the possible vocabulary items “A”, “C”, “G”, and “T”, with special tokens for “beginning of read”, “unknown nucleotide” (for the case of ambiguous sequences), “end of read”, and “padding” (the latter used for classification only); of these special tokens, only “beginning of read” was used during LookingGlass training.
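The architecture described here can be approximated in plain PyTorch as below. This is a simplified sketch (the released model uses fastai's AWD-LSTM-style regularization, omitted here), and it assumes the fastai convention that the final LSTM layer projects back down to the embedding size, consistent with the 104-dimensional embeddings used in the classifiers below:

```python
# Simplified PyTorch sketch of the LookingGlass architecture: character
# vocabulary, 3-layer LSTM encoder (1152 hidden units), 104-d embeddings.
import torch
import torch.nn as nn

VOCAB = ["<bos>", "<unk>", "<eos>", "<pad>", "A", "C", "G", "T"]

class CharLSTMEncoder(nn.Module):
    def __init__(self, vocab_size=len(VOCAB), emb_size=104, hidden=1152):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.rnns = nn.ModuleList([
            nn.LSTM(emb_size, hidden, batch_first=True),
            nn.LSTM(hidden, hidden, batch_first=True),
            nn.LSTM(hidden, emb_size, batch_first=True),  # back to 104-d
        ])

    def forward(self, tokens):                # (batch, seq_len) token ids
        x = self.embed(tokens)
        for rnn in self.rnns:
            x, _ = rnn(x)
        return x                              # (batch, seq_len, 104)

class CharLSTMLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = CharLSTMEncoder()
        self.decoder = nn.Linear(104, len(VOCAB))  # next-nucleotide logits

    def forward(self, tokens):
        return self.decoder(self.encoder(tokens))
```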
Regularization and optimization of LSTMs require special approaches to dropout and gradient descent for best performance57. The fastai library58 provides default implementations of these approaches for natural language text, and so we adopted the fastai library for all training presented in this paper. We also provide the open-source fastBio Python package59, which extends the fastai library for use with biological sequences.
LookingGlass was trained on a Pascal P100 GPU with 16 GB memory on Microsoft Azure, using a batch size of 512, a back-propagation-through-time (bptt) window of 100 base pairs, the Adam optimizer60, and a cross-entropy loss function (Supplementary Table 1). Dropout was applied at variable rates across the model (Supplementary Table 1). LookingGlass was trained for a total of 75 epochs over 12 days, with progressively decreasing learning rates based on the results of hyperparameter optimization (see below): 15 epochs at a learning rate of 1e−2, 15 epochs at a learning rate of 2e−3, and 45 epochs at a learning rate of 1e−3.
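Continuing the sketch above, a single self-supervised training step might look as follows: inputs are all but the last token of each read and targets are all but the first, so every position's next nucleotide supervises the model simultaneously. The batch size and initial learning rate follow the text; the data pipeline is hypothetical.

```python
# One next-nucleotide training step with Adam and cross-entropy loss.
model = CharLSTMLanguageModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def training_step(batch):                   # batch: (512, <=100) token ids
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits = model(inputs)                  # (batch, seq_len-1, vocab)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```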
Hyperparameter optimization
Hyperparameters used for the final training of LookingGlass were tuned using a randomized search of hyperparameter settings. The tuned hyperparameters included kmer size, stride, number of LSTM layers, number of hidden nodes per layer, dropout rate, weight decay, momentum, embedding size, bptt size, learning rate, and batch size. An abbreviated dataset, consisting of ten randomly selected read-length chunks from the GTDB representative set, was created so that many parameter settings could be tested rapidly. A language model was trained for two epochs for each randomly selected hyperparameter combination, and the combination with the best performance was accepted. The hyperparameter combinations tested and the selected settings are described in the associated Github repository61.
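A randomized search of this kind can be sketched as follows. The candidate values shown are illustrative only (the actual grids and selected settings are in the Github repository), and `train_two_epochs` is a hypothetical stand-in for the real training routine:

```python
# Hedged sketch of randomized hyperparameter search: sample settings,
# train briefly, keep the best-performing combination.
import random

SEARCH_SPACE = {                    # candidate values are illustrative
    "kmer_size": [1, 3, 5],
    "stride": [1, 3],
    "n_layers": [2, 3, 4],
    "n_hidden": [576, 1152, 2304],
    "dropout": [0.1, 0.3, 0.5],
    "emb_size": [52, 104, 208],
    "bptt": [50, 100, 200],
    "lr": [1e-2, 2e-3, 1e-3],
    "batch_size": [256, 512],
}

def random_search(n_trials=50):
    best_acc, best_params = 0.0, None
    for _ in range(n_trials):
        params = {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
        acc = train_two_epochs(params)   # hypothetical training routine
        if acc > best_acc:
            best_acc, best_params = acc, params
    return best_params
```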
LookingGlass validation and analysis of embeddings
Functional relevance
Dataset generation
In order to assess the ability of the LookingGlass embeddings to inform the molecular function of sequences, metagenomic sequences from a diverse set of environments were downloaded from the Sequence Read Archive (SRA)62. We used MetaSeek53 to choose ten metagenomes at random from each of the environmental packages defined by the MIxS metadata standards63: built environment, host-associated, human gut, microbial mat/biofilm, miscellaneous, plant-associated, sediment, soil, wastewater/sludge, and water, for a total of 100 metagenomes. The SRA IDs used are available in Supplementary Table 2. The raw DNA reads for these 100 metagenomes were downloaded from the SRA with the NCBI e-utilities. These 100 metagenomes were annotated with the mi-faser tool27 with the read-map option to generate predicted functional annotation labels (to the fourth digit of the Enzyme Commission (EC) number), out of 1247 possible EC labels, for each annotatable read in each metagenome. These reads were then split 80%/20% into candidate training and validation sets. To minimize overlap in sequence similarity between the training and validation sets, we compared the candidate validation reads for each EC annotation to the training reads for that EC number with CD-HIT64, and filtered out any reads with >80% DNA sequence similarity (the minimum DNA sequence similarity cutoff CD-HIT allows) to the training reads for that EC number. To balance EC classes in the training set, overrepresented ECs were downsampled to the mean count of read annotations (52,353 reads) before filtering with CD-HIT. After CD-HIT processing, any underrepresented EC numbers in the training set were oversampled to the mean count of read annotations (52,353 reads). The validation set was left unbalanced to retain a distribution more realistic to environmental settings. The final training set contained 61,378,672 reads, and the validation set contained 2,706,869 reads. We term this set of reads and their annotations the mi-faser functional set (Table 1).
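The class-balancing step can be sketched as below. This is an illustrative pandas version that collapses the paper's two-stage downsample-then-oversample procedure into one resampling pass; `train` is assumed to be a DataFrame with one row per read and an `ec` label column:

```python
# Balance EC classes by resampling each class to the mean reads-per-EC.
import pandas as pd

def balance_ec_classes(train: pd.DataFrame) -> pd.DataFrame:
    target = int(train["ec"].value_counts().mean())  # ~52,353 in the paper
    balanced = [
        group.sample(n=target, replace=len(group) < target)
        for _, group in train.groupby("ec")          # oversample small ECs,
    ]                                                # downsample large ones
    return pd.concat(balanced, ignore_index=True)
```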
As an external test set, we used a smaller number of DNA sequences from genes with experimentally validated molecular functions. We linked manually curated Swiss-Prot database65 entries of Bacterial or Archaeal proteins matching the 1247 EC labels in the mi-faser functional set to their corresponding genes in the EMBL database66. We downloaded the DNA sequences and selected ten read-length chunks at random per CDS. This resulted in 1,414,342 read-length sequences in the test set. We term this set of reads and their annotations the Swiss-Prot functional set (Table 1).
Fine-tuning procedure
We fine-tuned the LookingGlass language model to predict the functional annotation of DNA reads, demonstrating the speed with which an accurate classifier can be trained from the pretrained language model. The model retained the 3-layer LSTM encoder and the weights of the LookingGlass language model encoder, but replaced the language model decoder with a new multiclass classification layer with pooling (with randomly initialized weights). This pooling classification layer is a sequential model consisting of the following layers: a layer concatenating the output of the LookingGlass encoder with min, max, and average pooling of the outputs (for a total dimension of 104 × 3 = 312), a batch normalization67 layer with dropout, a linear layer taking the 312-dimensional output of the batch norm layer and producing a 50-dimensional output, another batch normalization layer with dropout, and finally a linear classification layer that is passed through a log(Softmax(x)) function to output the predicted functional annotation of a read as a probability distribution over the 1247 possible mi-faser EC annotation labels. We then trained the functional classifier on the mi-faser functional set described above. Because the >61 million reads in the training set were too many to fit into memory, training was done in 13 chunks of ~5 million reads each until one full epoch was completed. Hyperparameter settings for the functional classifier training are listed in Supplementary Table 1.
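A PyTorch sketch of this head is shown below, following the min/max/average pooling described in the text (fastai's concat pooling instead concatenates the final hidden state with max and average pools); the dropout rate and the ReLU between the linear layers are assumptions:

```python
# Sketch of the pooling classification head: pooled 312-d input,
# two batch-norm/dropout blocks, log-softmax over 1247 EC labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoolingClassifier(nn.Module):
    def __init__(self, emb_size=104, n_classes=1247, dropout=0.1):
        super().__init__()
        self.bn1 = nn.BatchNorm1d(emb_size * 3)   # 104 * 3 = 312
        self.fc1 = nn.Linear(emb_size * 3, 50)
        self.bn2 = nn.BatchNorm1d(50)
        self.fc2 = nn.Linear(50, n_classes)
        self.drop = nn.Dropout(dropout)

    def forward(self, encoder_out):               # (batch, seq_len, 104)
        pooled = torch.cat([
            encoder_out.min(dim=1).values,        # min pooling
            encoder_out.max(dim=1).values,        # max pooling
            encoder_out.mean(dim=1),              # average pooling
        ], dim=1)                                 # (batch, 312)
        x = self.drop(self.bn1(pooled))
        x = self.drop(self.bn2(F.relu(self.fc1(x))))
        return F.log_softmax(self.fc2(x), dim=1)
```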
Encoder embeddings and MANOVA test
To test whether the LookingGlass language model embeddings (before fine-tuning, above) are distinct across functional annotations, a random subset of ten reads per functional annotation was selected from each of the 100 SRA metagenomes (or the maximum number of reads present in that metagenome for that annotation, whichever was fewer). This also ensured that reads were evenly distributed across environments. The fixed-length embedding vector for each read was produced by saving the output of the LookingGlass encoder (before it is passed to the language model decoder) at the final nucleotide in the sequence. This vector represents a contextually relevant embedding for the overall sequence. The statistical significance of the difference between embedding vectors across all functional annotation groups was tested with a MANOVA test using the R stats package68.
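In the terms of the encoder sketch above, extracting the fixed-length embedding reduces to keeping the encoder output at the last position of the read:

```python
# Fixed-length read embedding: the encoder output at the final nucleotide.
import torch

def read_embedding(encoder, tokens):
    """tokens: (1, seq_len) tensor of token ids for a single read."""
    with torch.no_grad():
        outputs = encoder(tokens)     # (1, seq_len, 104)
    return outputs[0, -1]             # 104-d contextual embedding
```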
Evolutionary relevance
Dataset generation
The OrthoDB database69 provides orthologous groups (OGs) of proteins at various levels of taxonomic distance. For instance, the OrthoDB group “77at2284” corresponds to proteins belonging to “Glucan 1,3-alpha-glucosidase at the Sulfolobus level”, where “2284” is the NCBI taxonomy ID for the genus Sulfolobus.
We tested whether the embedding similarity of homologous sequences (sequences within the same OG) is higher than that of nonhomologous sequences (sequences from different OGs). We tested this for OGs at multiple levels of taxonomic distance—genus, family, order, class, and phylum. At each taxonomic level, ten individual taxa at that level were chosen from across the prokaryotic tree of life (e.g., for the genus level, Acinetobacter, Enterococcus, Methanosarcina, Pseudomonas, Sulfolobus, Bacillus, Lactobacillus, Mycobacterium, Streptomyces, and Thermococcus were chosen). For each taxon, 1000 OGs corresponding to that taxon were randomly selected; for each of these OGs, five genes within the OG were randomly chosen.
OrthoDB cross-references OGs to UniProt65 IDs of the corresponding proteins. We mapped these to the corresponding EMBL CDS IDs66 via the UniProt database API65; DNA sequences of these EMBL CDSs were downloaded via the EMBL database API. For each of these sequences, we generated LookingGlass embedding vectors.
Homologous and nonhomologous sequence pairs
To create a balanced dataset of homologous and nonhomologous sequence pairs, we compared all homologous pairs of the five sequences in an OG (a total of ten homologous pairs) to an equal number of randomly selected out-of-OG comparisons for the same sequences; i.e., each of the five OG sequences was compared to two randomly selected sequences from other randomly selected OGs (a total of ten nonhomologous pairs). We term this set of sequences, and their corresponding LookingGlass embeddings, the OG homolog set (Table 1).
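An illustrative construction of this balanced pair set, assuming `ogs` maps each OG identifier to its five sequence identifiers:

```python
# Build 10 homologous (within-OG) and 10 nonhomologous (out-of-OG)
# pairs per orthologous group.
import random
from itertools import combinations

def build_pairs(ogs: dict):
    homologous, nonhomologous = [], []
    for og, seqs in ogs.items():
        homologous += list(combinations(seqs, 2))    # C(5, 2) = 10 pairs
        for seq in seqs:                             # 5 seqs x 2 partners
            for _ in range(2):
                other = random.choice([o for o in ogs if o != og])
                nonhomologous.append((seq, random.choice(ogs[other])))
    return homologous, nonhomologous
```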
Embedding and sequence similarity
For each sequence pair, the sequence and embedding similarity were determined. The embedding similarity was calculated as the cosine similarity between embedding vectors. The sequence similarity was calculated as the Smith-Waterman alignment score using the BioPython70 pairwise2 package, with a gap open penalty of −10 and a gap extension penalty of −1. The IDs of chosen OGs, the cosine similarities of the embedding vectors, and sequence similarities of the DNA sequences are available in the associated Github repository61.
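The two similarity measures can be computed as below. This sketch assumes 104-d numpy embedding vectors; the 1/0 match/mismatch scoring implied by `localxs` is an assumption, while the gap penalties follow the text:

```python
# Cosine similarity of embeddings and Smith-Waterman alignment score.
import numpy as np
from Bio import pairwise2

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def alignment_score(seq1: str, seq2: str) -> float:
    # Local alignment with gap open penalty -10 and gap extension -1
    return pairwise2.align.localxs(seq1, seq2, -10, -1, score_only=True)
```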
Comparison to HMM-based domain searches for distant homology detection
Distantly related homologous sequences that share, e.g., Pfam71, domains can be identified using HMM-based search methods. We used hmmscan25 (e-val threshold = 1e−10) to compare homologous (at the phylum level) sequences in the OG homolog set, for which the alignment score was less than 50 bits and the embedding similarity was greater than 0.62 (total: 21,376 gene pairs). Specifically, we identified Pfam domains in each sequence and compared whether the most significant (lowest e-value) domain for each sequence was identified in common for each homologous pair.
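The domain comparison can be sketched as a thin wrapper around the HMMER command line; the `-E 1e-10` reporting threshold follows the text, while the file paths and output handling are hypothetical:

```python
# Run hmmscan against Pfam and return the target name of the hit with
# the lowest full-sequence e-value, or None if nothing passes.
import subprocess

def top_pfam_domain(protein_fasta, pfam_db="Pfam-A.hmm"):
    subprocess.run(
        ["hmmscan", "-E", "1e-10", "--tblout", "hits.tbl",
         pfam_db, protein_fasta],
        check=True,
    )
    hits = []
    with open("hits.tbl") as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            fields = line.split()
            hits.append((float(fields[4]), fields[0]))  # (e-value, target)
    return min(hits)[1] if hits else None
```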
Environmental relevance
Encoder embeddings and MANOVA test
The LookingGlass embeddings and the environment of origin for each read in the mi-faser functional set were used to test the significance of the difference between the embedding vectors across environmental contexts. The statistical significance of this difference was evaluated with a MANOVA test using the R stats package68.
Oxidoreductase classifier
Dataset generation
The manually curated, reviewed entries of the Swiss-Prot database65 were downloaded (June 2, 2020). Of these, 23,653 entries were oxidoreductases (EC number 1.-.-.-) of Archaeal or Bacterial origin (988 unique ECs). We mapped their UniProt IDs to both their EMBL CDS IDs and their UniRef50 IDs via the UniProt database mapper API; UniRef50 IDs identify clusters of sequences with >50% amino acid identity. This cross-reference identified 28,149 EMBL CDS IDs corresponding to prokaryotic oxidoreductases, belonging to 5451 unique UniRef50 clusters. We split these data into training, validation, and test sets such that each UniRef50 cluster was contained in only one of the sets, i.e., there was no overlap in EMBL CDS IDs corresponding to the same UniRef50 cluster across sets. This ensures that the oxidoreductase sequences in the validation and test sets are dissimilar to those seen during training. The DNA sequences for each EMBL CDS ID were downloaded via the EMBL database API. This data generation process was repeated for a random selection of non-oxidoreductase UniRef50 clusters, which resulted in 28,149 non-oxidoreductase EMBL CDS IDs from 13,248 unique UniRef50 clusters.
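The cluster-disjoint split can be sketched as follows. This is an illustrative version assuming a pandas DataFrame with one row per EMBL CDS and a `uniref50` cluster column; the split fractions are assumptions:

```python
# Assign whole UniRef50 clusters, not individual sequences, to splits.
import numpy as np
import pandas as pd

def split_by_cluster(df: pd.DataFrame, frac=(0.8, 0.1), seed=0):
    clusters = df["uniref50"].unique()
    rng = np.random.default_rng(seed)
    rng.shuffle(clusters)
    n_train = int(frac[0] * len(clusters))
    n_val = int(frac[1] * len(clusters))
    train_c = set(clusters[:n_train])
    val_c = set(clusters[n_train:n_train + n_val])
    train = df[df["uniref50"].isin(train_c)]
    val = df[df["uniref50"].isin(val_c)]
    test = df[~df["uniref50"].isin(train_c | val_c)]
    return train, val, test
```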
Approximately 50 read-length chunks (with lengths drawn from the representative read-length distribution, as above) were selected from each EMBL CDS DNA sequence, with randomly selected start positions on the gene and a 50% chance of selecting the reverse complement, such that an even number of read-length sequences with “oxidoreductase” and “not oxidoreductase” labels were generated for the final dataset. This procedure produced a balanced dataset with 2,372,200 read-length sequences in the training set, 279,200 sequences in the validation set, and 141,801 sequences in the test set. We term this set of reads and their annotations the oxidoreductase model set (Table 1). In order to compare the oxidoreductase classifier performance to an HMM-based method, reads with “oxidoreductase” labels in the oxidoreductase model test set (71,451 reads) were 6-frame translated and searched against the Swiss-Prot protein database using phmmer25 (reporting e-val threshold = 0.05, all other settings default).
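The 6-frame translation used to prepare reads for the phmmer comparison can be sketched with Biopython; trailing nucleotides are trimmed so each frame is a whole number of codons:

```python
# Translate a read in all six frames (three per strand).
from Bio.Seq import Seq

def six_frame_translations(read: str):
    frames = []
    for strand in (Seq(read), Seq(read).reverse_complement()):
        for offset in range(3):
            sub = strand[offset:]
            sub = sub[:len(sub) - len(sub) % 3]  # trim to codon boundary
            frames.append(str(sub.translate()))
    return frames                                # frames +1..+3, -1..-3
```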
Fine-tuning procedure
Since the functional annotation classifier addresses a classification task closer to oxidoreductase prediction than LookingGlass itself does, the oxidoreductase classifier was fine-tuned starting from the functional annotation classifier, replacing its decoder with a new pooling classification layer (as described above for the functional annotation classifier) with a final output size of 2, to predict “oxidoreductase” or “not oxidoreductase”. Fine-tuning of the oxidoreductase classifier layers was done successively, training later layers in isolation and then progressively including earlier layers in training, using discriminative learning rates ranging from 1e−2 to 5e−4, as previously described72. The fine-tuned model was trained for 30 epochs, over 18 h, on a single P100 GPU node with 16 GB memory.
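Gradual unfreezing with discriminative learning rates can be sketched in plain PyTorch as below; the layer grouping (`model.head`, `model.encoder.rnns`, following the earlier sketches), the per-group rates within the 1e−2 to 5e−4 range, and the use of one epoch per stage are illustrative assumptions:

```python
# Progressively unfreeze layer groups, later layers at larger rates.
import torch

def fine_tune(model, train_one_epoch):
    groups = [model.head, model.encoder.rnns[2],
              model.encoder.rnns[1], model.encoder.rnns[0]]
    lrs = [1e-2, 2e-3, 1e-3, 5e-4]          # discriminative learning rates
    for p in model.parameters():
        p.requires_grad = False             # start fully frozen
    for depth in range(1, len(groups) + 1):
        param_groups = []
        for group, lr in zip(groups[:depth], lrs[:depth]):
            for p in group.parameters():
                p.requires_grad = True
            param_groups.append({"params": group.parameters(), "lr": lr})
        optimizer = torch.optim.Adam(param_groups)
        train_one_epoch(model, optimizer)   # hypothetical training routine
```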
Model performance in metagenomes
Sixteen marine metagenomes from the surface (SRF, ~5 m) and mesopelagic (MES, 175–800 m) at eight stations sampled as part of the TARA expedition37 were downloaded from the SRA62 (Supplementary Table 3; SRA accession numbers ERR598981, ERR599063, ERR599115, ERR599052, ERR599020, ERR599039, ERR599076, ERR598989, ERR599048, ERR599105, ERR598964, ERR598963, ERR599125, ERR599176, ERR3589593, and ERR3589586). Metagenomes were chosen from a latitudinal gradient spanning polar, temperate, and tropical regions, ranging from −62 to 76 degrees latitude. The mesopelagic samples from four of the eight stations were drawn from oxygen minimum zones (OMZs, where oxygen <20 μmol/kg). Each metagenome was rarefied to twenty million randomly selected sequences. We term this set of reads the oxidoreductase metagenome set (Table 1 and Supplementary Table 3). Predictions of “oxidoreductase” or “not oxidoreductase” were made for these sequences with the oxidoreductase classifier. To compare model predictions to alternative functional annotation methods, reads in the oxidoreductase metagenome set were annotated with mi-faser27 with the read-map option, and with the MG-RAST functional annotation pipeline28 using default settings.
Reading frame classifier
Dataset generation
For each taxonomic order, the CDS files of one of the genome IDs in the GTDB representative set were downloaded from NCBI52. These were split into read-length chunks as described above. Because each sequence is a CDS, the true frame of translation for each read-length chunk was known; the translation frame label (1, 2, 3, −1, −2, or −3) was recorded for each read-length input61. We term this set of reads the reading frame set (Table 1).
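One way to assign the frame label, given a chunk's start offset within the CDS and its strand, is sketched below; the exact labeling convention used in the dataset is an assumption here:

```python
# Illustrative frame-label assignment for a chunk cut from a CDS.
def frame_label(start: int, length: int, cds_len: int, revcomp: bool) -> int:
    if not revcomp:
        return (start % 3) + 1               # forward frames 1, 2, 3
    end_offset = cds_len - (start + length)  # offset from the 3' end
    return -((end_offset % 3) + 1)           # reverse frames -1, -2, -3
```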
Fine-tuning procedure
The reading frame classifier was fitted with a pooling classification layer with an output size of six, one for each possible translation frame label. Fine-tuning was performed over successive layers with discriminative learning rates ranging from 1e−3 to 5e−5, as described for the oxidoreductase classifier. Training of the fine-tuned model for 24 epochs took a total of 72 h on a single P100 GPU node.
Optimal temperature classifier
Dataset generation
The optimal growth temperature for 19,474 microorganisms was manually curated from multiple sources: BacDive73, DSMZ74, the Pasteur Institute (PI), the National Institute for Environmental Studies (NIES)75, and a curated list from a previous work76. BacDive data are available through its API, which provides calls to retrieve the species list and to get all data about a specific species. For the DSMZ, PI, and NIES databases, we used previously published77 data files (for DSMZ and PI) or scripts and methods (for NIES) to query optimal growth temperature information (accessed July 2020). Finally, we cross-referenced the optimal growth temperatures of these organisms to their NCBI taxonomy IDs78.
Previous studies have shown a strong correlation between enzyme optimal temperature and organism optimal growth temperature77. We assumed that core housekeeping enzymes, such as those involved in transcription and translation, would have the same optimal functional temperature as the organism itself. Thus, we cross-referenced the 19,474 microorganisms identified above to the UniProt IDs belonging to those taxa for the housekeeping genes: RNA polymerase (EC 2.7.7.6), RNA helicase (EC 3.6.4.13), DNA polymerase (EC 2.7.7.7), DNA primase (EC 2.7.7.101 for Bacteria, EC 2.7.7.102 for Archaea), DNA helicase (EC 3.6.4.12), DNA ligase (ECs 6.5.1.1, 6.5.1.2, 6.5.1.6, and 6.5.1.7), and topoisomerase (ECs 5.6.2.1 and 5.6.2.2). Finally, we linked these UniProt IDs to the corresponding EMBL CDS IDs, downloaded the gene sequences, and split them into read-length chunks as described above.
The optimal temperature label for each read was derived from the optimal growth temperature of its source organism (range: 4–104.5 °C). The optimal temperature labels were converted to categorical labels of “psychrophilic” for optimal temperatures <15 °C, “mesophilic” for 20–40 °C, and “thermophilic” for >50 °C. The training, validation, and test sets were split by EC number such that only sequences from EC 3.6.4.13 were in the validation set, only sequences from EC 6.5.1.2 were in the test set, and all other EC numbers were in the training set. Finally, the inputs for each label category were either downsampled or upsampled (as described above for the mi-faser functional set) to a balanced number of inputs per class. This resulted in 5,971,152 inputs in the training set with ~2,000,000 reads per label; 597,136 inputs in the validation set with ~200,000 reads per label; and 296,346 inputs in the test set with ~100,000 reads per label. We term this set of reads and their annotations the optimal temp set (Table 1).
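The categorical encoding is a direct mapping of the thresholds above; organisms with optimal growth temperatures in the gaps (15–20 °C and 40–50 °C) fall outside all three categories and are presumably excluded:

```python
# Map an optimal growth temperature (degrees C) to its categorical label.
def temperature_label(opt_temp_c: float):
    if opt_temp_c < 15:
        return "psychrophilic"
    if 20 <= opt_temp_c <= 40:
        return "mesophilic"
    if opt_temp_c > 50:
        return "thermophilic"
    return None   # falls in an excluded gap
```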
Fine-tuning procedure
The optimal temperature classifier was fitted with a pooling classification layer with an output size of three, one for each possible optimal temperature label, as described above. Fine-tuning was performed over successive layers with discriminative learning rates ranging from 5e−2 to 5e−4, as described for the oxidoreductase classifier, for a total of 15 epochs spanning 22 h on a single P100 GPU node.
Metrics
Model performance metrics for accuracy (all classifiers) and for precision, recall, and F1 score (binary classifiers only) are defined as follows:
$$\mathrm{Accuracy}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FP}+\mathrm{TN}+\mathrm{FN}}$$ (1)

$$\mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$$ (2)

$$\mathrm{Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$ (3)

$$\mathrm{F1\;score}=2\cdot\frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$ (4)
where TP is a true positive (correct positive label prediction), FP is a false positive (incorrect prediction of the positive label), TN is a true negative (correct negative label prediction), and FN is a false negative (incorrect prediction of the negative label).
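Equations (1)–(4) translate directly into code:

```python
# Compute the four metrics from confusion-matrix counts.
def metrics(tp: int, fp: int, tn: int, fn: int):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```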
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.