Proteomic traits vary across taxa in a coastal Antarctic phytoplankton bloom

Field sampling

We collected samples once per week over four weeks at the Antarctic sea ice edge, in McMurdo Sound, Antarctica (December 28, 2014 “GOS-927”; January 6 “GOS-930”, 15 “GOS-933”, and 22 “GOS-935”, 2015; as previously described in [27]). Sea water (150–250 l) was pumped sequentially through three filters of decreasing size (3.0, 0.8, and 0.1 μm, 293 mm Supor filters). Separate filter sets were acquired for metagenomic, metatranscriptomic, and metaproteomic analyses, over the course of ∼3 h, each week (36 filters in total). Filters for nucleic acid analyses were preserved with a sucrose-based buffer (20 mM EDTA, 400 mM NaCl, 0.75 M sucrose, 50 mM Tris-HCl, pH 8.0) with RNAlater (Life Technologies, Inc.). Filters for protein analysis were preserved in the same sucrose-based buffer but without RNAlater. Filters were flash frozen in liquid nitrogen in the field and subsequently stored at −80 °C until processed in the laboratory.

Metagenomic and metatranscriptomic sequencing

We used metagenomics and metatranscriptomics to obtain reference databases of potential proteins for metaproteomics. We additionally used a database assembled from a similarly processed metatranscriptomic incubation experiment [28], conducted with source water from the January 15, 2015 time point (these samples were collected on a 0.2 μm Sterivex filter and processed as previously described).

For samples from the GOS-927, GOS-930, GOS-933, and GOS-935 filters, RNA was purified from a DNA and RNA mixture [29]. In total, 2 µg of the DNA and RNA mixture was treated with 1 µl of DNase (2 U/µl; TURBO DNase, Thermo Fisher Scientific), followed by processing with an RNA Clean and Concentrator kit (Zymo Research). An Agilent TapeStation 2200 was used to observe and verify the quality of RNA. In total, 200 ng of total RNA was used as input for rRNA removal using Ribo-Zero (Illumina) with a mixture of plant, bacterial, and human/mouse/rat Removal Solution in a ratio of 2:1:1. An Agilent TapeStation 2200 was used to subsequently observe and verify the quality of rRNA removal from total RNA. rRNA-deplete total RNA was used for cDNA synthesis with the Ovation RNA-Seq System V2 (TECAN, Redwood City, USA). DNA was extracted for metagenomics from the field samples (GOS-927, GOS-930, GOS-933, and GOS-935) according to [29]. RNase digestion was performed with 10 µl of RNase A (20 mg/ml) and 6.8 µl of RNase T1 (1000 U/µl), which were added to 2 µg of genomic DNA and RNA mixture in a total volume of 100 µl, followed by 1 h incubation at 37 °C and subsequent ethanol precipitation at −20 °C overnight.

Samples of double-stranded cDNA and DNA were fragmented using a Covaris E210 system with a target size of 400 bp. In total, 100 ng of fragmented cDNA or DNA was used as input into the Ovation Ultralow System V2 (TECAN, Redwood City, USA), following the manufacturer’s protocol. Ampure XP beads (Beckman Coulter) were used for final library purification. Library quality was analyzed on a 2200 TapeStation System with Agilent High Sensitivity DNA 1000 ScreenTape System (Agilent Technologies, Santa Clara, CA, USA). Twelve DNA and 18 cDNA libraries were combined into two pools with concentrations of 4.93 and 4.85 ng/µl, respectively. Resulting library pools were subjected to one lane of 150 bp paired-end HiSeq 4000 sequencing (Illumina). Prior to sequencing, each library was spiked with 1% PhiX (Illumina) control library. Each lane of sequencing resulted in between 106,000 and 111,000 Mbp total and 6900–12,000 Mbp and 4800–6900 Mbp for individual DNA or cDNA libraries, respectively.

Metagenomic and metatranscriptomic bioinformatics

Metagenomic and metatranscriptomic data were annotated with the same pipelines. Briefly, adapter and primer sequences were filtered out from the paired reads, and then reads were quality trimmed to Phred33. rRNA reads were identified and removed with riboPicker [30]. We then assembled reads into transcript contigs using CLC Assembly Cell, and then we used FragGeneScan to predict open reading frames (ORFs) [31]. ORFs were functionally annotated using Hidden Markov models and blastp against PhyloDB [32]. Annotations which had low mapping coverage were filtered out (less than 50 reads total over all samples), as were proteins with no blastp hits and no known domains. For each ORF, we assigned a taxonomic affiliation based on Lineage Probability Index taxonomy [32, 33]. Taxa were assigned using two different reference databases: NCBI nt and PhyloDB [32]. Unless otherwise specified, we used taxonomic assignments from PhyloDB, because of the good representation of diverse marine microbial taxa.

ORFs were clustered by sequence similarity using Markov clustering (MCL) [34]. Sequences were assigned MCL clusters by first running blastp for all sequences against each other, where the query was the same as the database. The MCL algorithm was subsequently used with the input as the matrix of E-values from the blastp output, with default parameters for the MCL clustering. MCL clusters were then assigned consensus annotations based on KEGG, KO, KOG, KOG class, Pfam, TIGRFAM, EC, GO, annotation enrichment [28, 32, 35,36,37,38,39]. Proteins were assigned to coarse-grained protein pools (ribosomal and photosynthetic proteins) based on these annotations. For assignment, we used a greedy approach, such that a protein was assigned a coarse-grained pool if at least one of these annotation descriptions matched our search strings (we also manually examined the coarse grains to ensure there were no peptides that mapped to multiple coarse-grained pools). For photosynthetic proteins, we included light harvesting proteins, chlorophyll a-b binding proteins, photosystems, plastocyanin, and flavodoxin. For ribosomal proteins, we just included the term “ribosom*” (where the * represents a wildcard character), and excluded proteins responsible for ribosomal synthesis.
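The greedy assignment of proteins to coarse-grained pools can be sketched as follows. This is a minimal illustration, not the authors' code: the regular expressions are assumed approximations of the search strings listed above, and the exclusion terms for ribosome-synthesis proteins (`biogenesis`, `synthesis`, `assembly`) are hypothetical.

```python
import re

# Assumed search strings, approximating the terms named in the text.
POOL_PATTERNS = {
    "ribosomal": re.compile(r"ribosom", re.IGNORECASE),  # stands in for "ribosom*"
    "photosynthetic": re.compile(
        r"light.harvesting|chlorophyll a-b binding|photosystem|plastocyanin|flavodoxin",
        re.IGNORECASE,
    ),
}
# Hypothetical exclusion terms for proteins involved in ribosomal synthesis.
RIBOSOME_EXCLUDE = re.compile(r"biogenesis|synthesis|assembly", re.IGNORECASE)


def assign_pools(annotations):
    """Greedy assignment: a protein joins a pool if at least one
    annotation description matches that pool's search strings."""
    pools = set()
    for text in annotations:
        for pool, pattern in POOL_PATTERNS.items():
            if not pattern.search(text):
                continue
            if pool == "ribosomal" and RIBOSOME_EXCLUDE.search(text):
                continue
            pools.add(pool)
    return pools


# A protein matching more than one pool would be flagged for manual inspection.
example = assign_pools(["50S ribosomal protein L2", "K02886"])
```

Returning a set rather than a single label makes proteins that match both pools easy to spot, which mirrors the manual check described above.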

Sample preparation and LC-MS/MS

We extracted proteins from the samples by first performing a buffer exchange from the sucrose-buffer to an SDS-based extraction buffer, after which proteins were extracted from each filter individually (as previously described) [27]. After extraction and acetone-based precipitation, we prepared samples for liquid chromatography tandem mass spectrometry (LC-MS/MS). Precipitated protein was first resuspended in urea (100 µl, 8 M), after which we measured the protein concentration in each sample (Pierce BCA Protein Assay Kit). We then reduced, alkylated, and enzymatically digested the proteins: first with 10 µl of 0.5 M dithiothreitol for reduction (incubated at 60 °C for 30 min), then with 20 µl of 0.7 M iodoacetamide (in the dark for 30 min), diluted with ammonium bicarbonate (50 mM), and finally digested with trypsin (1:50 trypsin:sample protein). Samples were then acidified and desalted using C-18 columns (described in detail in ref. [40]).

To characterize each metaproteomic sample, we employed one-dimensional liquid chromatography coupled to the mass spectrometer (VelosPRO Orbitrap, Thermo Fisher Scientific, San Jose, California, USA; detailed in [40]). For each injection, protein concentrations were equivalent across sample weeks, but different across filter sizes. We had higher amounts of protein on the largest filter size (3.0 µm) and less on the smaller filters, so we performed three replicate injections per 3.0 µm filter sample, and two replicate injections for 0.8 and 0.1 µm filters. We used a non-linear LC gradient totaling 125 min. For separation, peptides eluted through a 75 µm by 30 cm column (New Objective, Woburn, MA), which was self-packed with 4 µm, 90 Å, Proteo C18 material (Phenomenex, Torrance, CA), and the LC separation was conducted with a Dionex Ultimate 3000 UHPLC (Thermo Scientific, San Jose, CA).

LC-MS/MS bioinformatics—database searching, configuration, and quantification

Metaproteomics requires a database of potential protein sequences to match observed mass spectra with known peptides. Because we had sample-specific metagenome and metatranscriptome sequencing for each metaproteomic sample, we assessed various database configurations, including those that we predict would be suboptimal, to examine potential options for future metaproteomics researchers. We used five different configurations, described below. In each case, we appended a database of common contaminants (Global Proteome Machine Organization common Repository of Adventitious Proteins). We evaluated the performance of different database configurations based on the number of peptides identified (using a peptide false discovery rate of 1%).

In order to make these databases (Table 1), we performed three separate assemblies on (1) the metagenomic reads (from samples GOS-927, GOS-930, GOS-933, and GOS-935), (2) metatranscriptomic reads (from samples GOS-927, GOS-930, GOS-933, and GOS-935), and (3) metatranscriptomic reads from a concurrent metatranscriptomic experiment, started at the location where GOS-933 was taken [28]. Database configurations were created by subsetting from these assemblies. The first configuration was the “one-sample database”, constructed to represent the scenario where only one sample was used for metagenomic and metatranscriptomic sequencing (we chose the first sampling week). Specifically, this was done by subsetting and including ORFs from the metagenomic and metatranscriptomic assemblies if reads from this time point were present in that sample (reads mapped as in [28]), and then removing redundant protein sequences (P. Wilmarth, fasta utilities). The second configuration was the “sample-specific database”, where each metaproteomic sample had one corresponding database (prepared from both metagenome and metatranscriptome sequencing completed at the same sampling site), also done by subsetting ORFs from the metagenomic and metatranscriptomic assemblies as described above. The third configuration was pooling databases across size fractions, such that all metagenomic and metatranscriptomic sequences across the same filter sizes (e.g., 3.0 µm) were combined. ORFs were subsetted from the metagenomic and metatranscriptomic assemblies as above. The fourth and fifth configurations are from the concurrent metatranscriptomic experiment [28]. The fourth configuration (“metatranscriptome experiment (T0)”) was the metatranscriptome of the in situ microbial community (i.e., at the beginning of the experiment). This database was created by subsetting from the “metatranscriptome experiment (all)” assembly. Finally, the fifth configuration was the metatranscriptome of all experimental treatments pooled together (two iron levels, three temperatures; “metatranscriptome experiment (all)”). The overlap between databases (potential tryptic peptides) in different samples is presented graphically in Supplementary Figs. S1–S3.
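One way to compare databases by their potential tryptic peptides is an in silico digest followed by a set overlap. The sketch below is illustrative only: the cleavage rule (after K or R, except before P), the absence of missed cleavages, the `min_length` cutoff, and the Jaccard index as the overlap measure are all assumptions, not details taken from the study.

```python
import re

# Assumed trypsin rule: cleave after K or R unless the next residue is P.
CLEAVAGE_SITE = re.compile(r"(?<=[KR])(?!P)")


def tryptic_peptides(sequence, min_length=6):
    """Return the set of fully cleaved tryptic peptides of a protein."""
    pieces = CLEAVAGE_SITE.split(sequence)
    return {p for p in pieces if len(p) >= min_length}


def database_peptides(proteins, min_length=6):
    """Union of potential tryptic peptides over all proteins in a database."""
    peptides = set()
    for seq in proteins:
        peptides |= tryptic_peptides(seq, min_length)
    return peptides


def jaccard_overlap(db_a, db_b, min_length=6):
    """Shared potential peptides divided by all potential peptides."""
    a = database_peptides(db_a, min_length)
    b = database_peptides(db_b, min_length)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy sequences, used only to exercise the functions.
toy_peptides = tryptic_peptides("MKRPAKSTR", min_length=1)
```

Splitting on a zero-width pattern relies on `re.split` handling empty matches, which Python supports from version 3.7.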

Table 1 Characteristics of the five different database configurations we used for metaproteomic database searches.


After matching mass spectra with peptide sequences for each database configuration (MS-GF+ with OpenMS, with a 1% false discovery rate at the peptide level; [41, 42]), we used MS1 ion intensities to quantify peptides. Specifically, we used the FeatureFinderIdentification approach, which cross-maps identified peptides from one mass spectrometry experiment to unidentified features in another experiment, increasing the number of peptide quantifications [43]. This approach requires a set of experiments to be grouped together (i.e., which samples should use this cross-mapping?). We grouped samples based on their filter sizes (including those samples that are replicate injections). First, mass spectrometry runs within each group were aligned using MapAlignerIdentification [44], and then FeatureFinderIdentification was used for obtaining peptide quantities.

After peptides were identified and quantified, we mapped them to proteins or MCL clusters of proteins, which have corresponding functional annotations (KEGG, KO, KOG, Pfam, TIGRFAM; [28, 32, 35,36,37,38,39]). Functional annotations were used in three separate analyses. (1) To explore the overall functional changes in microbial community metabolism, we mapped peptides to MCL clusters (groups of proteins with similar sequences). These clusters have consensus annotations based on the annotations of proteins found within the clusters (described in detail in [28]). For this analysis, we only used peptides that uniquely map to MCL clusters. (2) We restricted the second analysis to two protein groups: ribosomal and photosynthetic proteins. For this analysis, we mapped peptides to one of these protein groups if at least one annotation mapped to the protein group (via string matching with keywords). This approach is “greedy” because it does not exclude peptides if they also correspond with other functional groupings, but this is necessary because of the difficulties in comparing various annotation formats. (3) The last analysis for functional annotations was for targeted proteins, and we only mapped functions to peptides where the peptides uniquely identify a specific protein (e.g., plastocyanin).
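The restriction to peptides that map uniquely to one MCL cluster can be illustrated with a short sketch. The input structures (`protein_peptides`, `protein_cluster`) and all identifiers in the example are hypothetical; the repository linked below contains the code actually used.

```python
from collections import defaultdict


def uniquely_mapping_peptides(protein_peptides, protein_cluster):
    """Map each peptide to its cluster, keeping only peptides whose
    parent proteins all belong to a single cluster.

    protein_peptides: dict of protein ID -> iterable of peptide sequences
    protein_cluster:  dict of protein ID -> cluster ID
    """
    peptide_clusters = defaultdict(set)
    for protein, peptides in protein_peptides.items():
        for peptide in peptides:
            peptide_clusters[peptide].add(protein_cluster[protein])
    return {
        peptide: next(iter(clusters))
        for peptide, clusters in peptide_clusters.items()
        if len(clusters) == 1
    }


# Hypothetical toy input: PEPB occurs in proteins from two clusters and is dropped.
mapping = uniquely_mapping_peptides(
    {"orf1": ["PEPA", "PEPB"], "orf2": ["PEPA"], "orf3": ["PEPB", "PEPC"]},
    {"orf1": "cluster_1", "orf2": "cluster_1", "orf3": "cluster_2"},
)
```

A peptide shared by several proteins is still retained when all of those proteins fall in the same cluster, which is why clusters typically keep more peptides than individual proteins do.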

Code for the database setup and configuration, database searching, and peptide quantification is open source (https://github.com/bertrand-lab/ross-sea-meta-omics).

LC-MS/MS bioinformatics—normalization

Normalization is an important aspect of metaproteomics: it influences all inferred peptide abundances. Typically, the abundance of a peptide is normalized by the sum of all identified peptide abundances. We use the term normalization factor for the inferred sum of peptide abundances. Note that the apparent abundance of observed peptides is dependent on the database chosen. In theory, if fewer peptides are observed because of a poorly matching database, this will decrease the normalization factor, and those peptides that are observed will appear to increase in abundance. It is not known how much this influences peptide quantification in metaproteomics.
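A small worked example makes the dependence on database choice concrete. The peptide names and intensities are invented for illustration; only the arithmetic follows the text.

```python
def normalize(intensities):
    """Divide each peptide intensity by the sum of all observed intensities
    (the normalization factor)."""
    factor = sum(intensities.values())
    return {peptide: value / factor for peptide, value in intensities.items()}


# Hypothetical intensities observed with a well-matching database.
full_database = {"PEPA": 50.0, "PEPB": 30.0, "PEPC": 20.0}
# The same sample searched with a poorer database that misses PEPC.
poor_database = {p: v for p, v in full_database.items() if p != "PEPC"}

abundance_full = normalize(full_database)["PEPA"]   # 50 / 100
abundance_poor = normalize(poor_database)["PEPA"]   # 50 / 80
```

Here the raw intensity of PEPA is unchanged, yet its normalized abundance rises from 0.5 to 0.625 solely because the normalization factor shrank.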

For each database configuration, we separately calculated normalization factors. We then correlated these sums of observed peptide abundances across database configurations. To get a database-independent normalization factor, we used the sum of total ion current (TIC) for each mass spectrometry experiment (using pyopenms; [45]), and also examined the correlation with database-dependent normalization factors. If normalization factors are highly correlated with each other, that would indicate database choice does not impact peptide quantification. Using TIC for normalization may have drawbacks, particularly if there are differences in contamination, or amounts of non-peptide ions across samples.
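The comparison of normalization factors can be sketched as a correlation across mass spectrometry runs. The use of Pearson correlation and the example values are assumptions for illustration; in the study, TIC values were obtained with pyopenms, which is not reproduced here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical normalization factors for four runs: one list per source.
factors_sample_specific = [1.0e9, 2.0e9, 1.5e9, 3.0e9]
factors_one_sample = [0.8e9, 1.7e9, 1.1e9, 2.6e9]
tic = [4.0e9, 8.1e9, 6.2e9, 11.9e9]

r_between_databases = pearson(factors_sample_specific, factors_one_sample)
r_with_tic = pearson(factors_sample_specific, tic)
```

A coefficient near one would suggest that runs are ranked and scaled similarly regardless of which normalization factor is used.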

Defining proteomic mass fraction

Protein abundance can be calculated in two ways: (1) the number of copies of a protein (independent of a protein’s mass), or (2) the total mass of the protein copies (the sum of peptides). We refer to the latter as a proteomic mass fraction. For example, to calculate a diatom-specific, ribosomal mass fraction, we sum all peptide abundances that are diatom- and ribosome-specific, and divide by the sum of peptide abundances that are diatom-specific. Note that this is slightly different to other methods, like the normalized spectral abundance factor, which normalizes for total protein mass (via protein length; [46]).
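The diatom ribosomal example can be written as a short calculation. The record layout (`taxon`, `pool`, `intensity`) and the values are hypothetical; the ratio itself follows the definition above.

```python
def mass_fraction(peptides, taxon, pool):
    """Sum of intensities for peptides specific to both the taxon and the
    protein pool, divided by the sum for all peptides specific to the taxon."""
    taxon_total = sum(p["intensity"] for p in peptides if p["taxon"] == taxon)
    if taxon_total == 0:
        raise ValueError(f"no peptides observed for taxon {taxon!r}")
    pool_total = sum(
        p["intensity"]
        for p in peptides
        if p["taxon"] == taxon and p["pool"] == pool
    )
    return pool_total / taxon_total


# Hypothetical peptide-level observations.
observations = [
    {"taxon": "diatom", "pool": "ribosomal", "intensity": 2.0},
    {"taxon": "diatom", "pool": "photosynthetic", "intensity": 6.0},
    {"taxon": "diatom", "pool": None, "intensity": 2.0},
    {"taxon": "haptophyte", "pool": "ribosomal", "intensity": 5.0},
]
diatom_ribosomal = mass_fraction(observations, "diatom", "ribosomal")
```

With these toy values the diatom ribosomal mass fraction is 2 / 10 = 0.2; the haptophyte peptide does not enter either sum.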

Combining estimates across filter sizes

Organisms should separate according to their sizes when using sequential filtration with decreasing filter pore sizes. In practice, however, organisms can break because of pressure during filtration, and protein is typically present for large phytoplankton on the smallest filter size and vice versa. We used a simple method for combining observations across filter sizes, weighted by the number of observations per filter. We began with the abundance of a given peptide, which was only considered present if it was observed across all injections of the same sample. We calculated the sum of observed peptide intensities (i.e., the normalization factor), and divided all peptide abundances by this normalization factor. Normalized peptide abundances were then averaged across replicate injections. To estimate, for example, the ribosomal mass fraction of the diatom proteome, we first normalized the diatom-specific peptide intensities as a proportion of diatom biomass (i.e., divided all diatom-specific peptides by the sum of all diatom-specific peptides). We then summed all diatom-normalized peptide intensities that are unique to both diatoms and ribosomal proteins, which gives the ribosomal proportion of the diatom proteome. Yet, we typically would obtain multiple estimates of, for example, ribosomal mass fraction of diatoms, on different filters. We combined the three values by multiplying each by a coefficient that represents a weight for each observation (specific to a filter size). These coefficients sum to one, and are calculated by summing the total number of peptides observed at a time point for a filter, and dividing by the total number of peptides observed across filters (but within each time point). For example, if we observed 100 peptides that are diatom- and ribosome-specific, and 90 of these peptides were on the 3.0 µm filter and only ten were on the 0.8 µm filter, we would multiply the 3.0 µm filter estimate by 0.9 and the 0.8 µm filter by 0.1. This method uses all available information about proteome composition across different filter sizes (similar to [47]).
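The weighting across filter sizes described above can be sketched with the 90/10 peptide split from the example. The per-filter mass fraction estimates (0.20 and 0.10) are invented for illustration.

```python
def filter_weights(peptide_counts):
    """Weight per filter: peptides observed on that filter divided by
    peptides observed across all filters (within one time point)."""
    total = sum(peptide_counts.values())
    return {f: n / total for f, n in peptide_counts.items()}


def combine_across_filters(estimates, peptide_counts):
    """Weighted sum of per-filter estimates; the weights sum to one."""
    weights = filter_weights({f: peptide_counts[f] for f in estimates})
    return sum(estimates[f] * weights[f] for f in estimates)


# Peptide counts from the worked example in the text.
counts = {"3.0um": 90, "0.8um": 10}
# Hypothetical per-filter ribosomal mass fractions for diatoms.
per_filter = {"3.0um": 0.20, "0.8um": 0.10}

weights = filter_weights(counts)                        # 0.9 and 0.1
combined = combine_across_filters(per_filter, counts)   # 0.9*0.20 + 0.1*0.10
```

With these values the combined estimate is 0.19, dominated by the filter on which most diatom ribosomal peptides were observed.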

When we estimate the proteomic mass fraction of a given protein pool, we do not need to adjust for the total protein on each filter. This is because this measurement is independent of total protein. However, for merging estimates of total relative abundance of different organisms across filters, we needed to additionally weight the abundance estimate by the amount of protein on each filter. Therefore, in addition to the weighting scheme described above, we multiplied taxon abundance estimates by the total protein on each filter divided by the total protein across filters on a given day.
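For taxon abundances, the additional protein weighting can be sketched as below. All numbers are hypothetical, and the sketch assumes the two weights are simply multiplied per filter and summed; whether the result was renormalized afterwards is not stated in the text.

```python
def combine_taxon_abundance(estimates, peptide_counts, protein_mass):
    """Combine per-filter taxon abundance estimates using both the
    peptide-count weight and the share of total protein on each filter."""
    total_peptides = sum(peptide_counts[f] for f in estimates)
    total_protein = sum(protein_mass[f] for f in estimates)
    combined = 0.0
    for f, estimate in estimates.items():
        peptide_weight = peptide_counts[f] / total_peptides
        protein_weight = protein_mass[f] / total_protein
        combined += estimate * peptide_weight * protein_weight
    return combined


# Hypothetical values for one sampling day.
taxon_estimates = {"3.0um": 0.4, "0.8um": 0.2}   # relative abundance per filter
taxon_counts = {"3.0um": 90, "0.8um": 10}        # taxon-specific peptides
protein_on_filter = {"3.0um": 300.0, "0.8um": 100.0}  # total protein, e.g. in µg

taxon_abundance = combine_taxon_abundance(
    taxon_estimates, taxon_counts, protein_on_filter
)
```

Here the result is 0.4 × 0.9 × 0.75 + 0.2 × 0.1 × 0.25 = 0.275, so a filter carrying little protein contributes little to the merged abundance.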

LC-MS/MS simulation

We used simulations of metaproteomes and LC-MS/MS to (1) quantify biases associated with inferring coarse-grained proteomes from metaproteomes, and (2) to mitigate these biases in our inferences. Specifically, we asked the question: how does sequence diversity impact quantification of coarse-grained proteomes from metaproteomes? Consider a three-organism microbial community. If two organisms are extremely similar, there will be very few peptides that can uniquely map to those organisms, resulting in underestimated abundance. The third organism would also be underestimated, but to a lesser degree, unless it had a completely unique set of peptides. A similar outcome is anticipated with differences in sequence diversity across protein groups, such that highly conserved protein groups will be underestimated.
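The three-organism thought experiment can be made concrete with a toy calculation of how many peptides remain taxon-specific. This is not the simulation model of [48]; the peptide labels and proteomes are invented, and the fraction of unique peptides is used here only as a rough proxy for how much of an organism's signal is recoverable.

```python
from collections import Counter


def unique_peptide_fraction(proteomes):
    """For each organism, the fraction of its distinct peptides that occur
    in no other organism of the community."""
    occurrences = Counter(
        peptide for peptides in proteomes.values() for peptide in set(peptides)
    )
    return {
        organism: sum(1 for p in set(peptides) if occurrences[p] == 1)
        / len(set(peptides))
        for organism, peptides in proteomes.items()
    }


# Hypothetical community: A and B are near-identical, C is more distinct.
community = {
    "A": ["p1", "p2", "p3", "p4"],
    "B": ["p1", "p2", "p3", "p5"],
    "C": ["p1", "p6", "p7", "p8"],
}
fractions = unique_peptide_fraction(community)
```

In this toy community only a quarter of the peptides from A and B are organism-specific, compared with three quarters for C, matching the expectation that similar organisms are underestimated more strongly.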

Our mass spectrometry simulations offer a unique perspective on this issue: we know the “true” metaproteome, and we can compare this with an “inferred” metaproteome. We simulated variable numbers of taxonomic groups, each with different protein pools of variable sequence diversity. From this simulated metaproteome, we then simulated LC-MS/MS-like sampling of peptides. Complete details of the mass spectrometry simulation are available in [48] and the Supplementary materials. The only difference between this model and that presented in [48] is here we include dynamic exclusion. The ultimate outcomes from these simulations were (1) identifying which circumstances lead to biased inferences about proteomic composition, and (2) determining the underpinnings of these biases.

Cofragmentation bias scores for peptides

We recently developed a computational model (“cobia”) that predicts a peptide’s risk for interference by sample complexity (more specifically, by cofragmentation of multiple peptides; [48]). This study showed that coarse-grained taxonomic and functional groupings are more robust to bias, and that this model can also be used to estimate bias. We ran cobia with the sample-specific databases, which produces a “cofragmentation score”, a measure of risk of being subject to cofragmentation bias. Specifically, the retention time prediction method used was RTPredict [49] with an “OLIGO” kernel for the support vector machine. The parameters for the model were: 0.008333 (maximum injection time); 3 (precursor selection window); 1.44 (ion peak width); and 5 (degree of sparse sampling). Code for running this analysis, as well as the corresponding input parameter file, is available at https://github.com/bertrand-lab/ross-sea-meta-omics.

Description of previously published datasets analyzed

We leveraged several previously published datasets to compare our metaproteomic results. Specifically, we used proteomic data of phytoplankton cultures of Phaeocystis antarctica and Thalassiosira pseudonana [27, 50], and of cultures of Escherichia coli under 22 different culture conditions [51]. Coarse-grained proteomic estimates were also compared with previously published targeted metaproteomic data [27].

Source: Ecology - nature.com