Development of a time-series shotgun metagenomics database for monitoring microbial communities at the Pacific coast of Japan

Database construction

Collection of metagenomic data

Between March 2012 and May 2016, seawater samples were collected from five different locations around Sendai Bay, Ofunato Bay, and A-line. Samples (N = 142) were collected from the surface and the subsurface chlorophyll-a maximum (SCM) layers (Fig. 1). DNA was extracted from microorganisms trapped on filters (pore sizes: 0.2, 0.8, 5, 20, and 100 μm). Shotgun metagenomic sequence data (N = 454) were acquired (Supplementary Information 1), comprising 3.57 × 10⁹ reads and 3.56 × 10¹¹ bases in total (Supplementary Information 2). 16S rRNA gene sequences in the 0.2-μm fraction were amplified by PCR using universal primers to obtain 111 amplicon metagenomic sequences (6.92 × 10⁶ reads, 1.69 × 10⁹ bases) (Supplementary Information 2).

Figure 1

Location, changes in water temperature, and changes in chlorophyll-a (Chl-a) concentrations at the sampling points. (a) Sampling points along the Pacific coast of northeastern Japan (Sendai Bay, Ofunato Bay, and A-line). (b) Sampling points C5 and C12 in Sendai Bay. The map was generated using Ocean Data View with data imported from the NOAA server (accessed on 22 February 2021). (c) Changes in water temperature and chlorophyll concentrations at Sendai Bay and A-line sampling points. Red circles indicate the depth of the sampled water. X-axis: dates from 2012 to 2014. Y-axis: water depth from the surface.

Time-series analysis of microbial species composition

We performed microbial taxonomic assignments by analyzing the amplicon and shotgun metagenomic sequence data against the SILVA and NCBI NT databases. For C5 and C12 samples from Sendai Bay and A4 and A21 samples from A-line, sufficient time-series data were available to plot changes in microbiota over time (Fig. 2). Clicking a displayed taxon on the Ocean Monitoring Database website reveals the microbiota composition of its lower taxa. For example, Fig. 2 shows that the cyanobacteria community increased in abundance during the summer.

Figure 2

Time-series analysis of microbial communities along the Pacific coast of northeastern Japan. For each sampling point, the number of ribosomal sequences is normalized to 1000 (excluding no hits). Clicking on the graph on the Ocean Monitoring Database website displays the next taxonomic level. This figure shows an example of the change in Cyanobacteria communities over time from April 2012 to May 2014. SUF: surface layer (1 m), SCM: the subsurface chlorophyll-a maximum layer.
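
The normalization mentioned in the caption can be illustrated with a minimal sketch. It assumes that "normalized to 1000" means rescaling the assigned counts so that they sum to 1000 after discarding unassigned sequences; the function name and the "no hit" label are hypothetical and are not taken from the database itself.

```python
def normalize_per_thousand(counts, no_hit_key="no hit"):
    """Rescale taxon counts so that assigned sequences sum to 1000.

    `counts` maps taxon name -> number of ribosomal sequences.
    The `no_hit_key` entry (an assumed label) is dropped before scaling.
    """
    assigned = {taxon: n for taxon, n in counts.items() if taxon != no_hit_key}
    total = sum(assigned.values())
    if total == 0:
        return {taxon: 0.0 for taxon in assigned}
    return {taxon: 1000 * n / total for taxon, n in assigned.items()}


# Illustrative counts only, not values from the study.
sample = {"Cyanobacteria": 50, "Proteobacteria": 150, "no hit": 300}
normalized = normalize_per_thousand(sample)
print(normalized)
```

Excluding the unassigned sequences before scaling keeps samples with different proportions of unclassified reads comparable.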

We acquired substantial 16S rRNA gene amplicon sequencing data at the C5 fixed point in Sendai Bay between 2012 and 2014. We generated a 3D graph to simultaneously display the date, water depth, and species composition at this site (Fig. 3). Clicking on a taxon shown in the graph on the Ocean Monitoring Database website displays the composition of the microbiota within that taxon. The 3D display provides a bird’s-eye view of the metagenomic data, which is useful for visualizing and understanding the relationships among microbial communities across sampling points. This function was incorporated into the metagenomic database. Figure 3 shows a contour map of chlorophyll concentrations on the xy plane and the proportion of microbial communities on the z-axis. The proportion of flavobacteria may increase following an increase in chlorophyll concentration during a spring bloom. For example, Buchan et al. reported that the proportion of flavobacteria increases late in a spring bloom [16]; our results show similar patterns (Fig. 3).

Figure 3

Three-dimensional (3D) display of microbial communities. 3D display of bacterial communities identified using 16S rRNA gene amplicon analysis of the Sendai Bay C5 samples from 2012 to 2014. The x-axis indicates the date, the y-axis indicates the water depth, and the z-axis indicates the percentage abundance of bacterial genera. The contour plot on the xy plane indicates the chlorophyll concentration. The composition of Flavobacteriaceae is shown as an example.

Digital DNA chip (DDC) database

A DDC analysis (DDCA) system visualizes the characteristics of shotgun metagenomic data in the form of a microarray [17] whose in silico probes are associated with environmental factors such as filter size, water sampling point, water sampling time, temperature, salinity, and nitrate and phosphate concentrations (Fig. 4a). By mapping sequence data against these probe sets, we expected the sequence data to become enriched with environmental information. Figure 4b displays the DDCA of the shotgun metagenomic data of the 0.2–0.8-μm fraction of the C5 sample collected from Sendai Bay on December 1, 2013. The sample is positive for a bacterial-fraction DNA marker for the 0.2–0.8-µm filter sizes and for a DNA marker specific to December in Sendai Bay. Thus, even when only NGS data are available and no environmental information is provided, the digital DNA chip alone suggests that this sample was taken from the 0.2–0.8-µm fraction and from Sendai Bay (Fig. 4b).
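
The idea of digital hybridization can be sketched as follows. This is only an illustration under strong assumptions: an exact substring match stands in for the mapping step, reverse-complement matches are ignored, and the probe names and sequences are invented for the example rather than taken from the 748-probe set.

```python
def digital_hybridization(reads, probes):
    """Return {probe_name: bool}: True if any read contains the probe sequence.

    Exact substring matching is an assumed stand-in for the actual mapping step.
    """
    return {name: any(seq in read for read in reads) for name, seq in probes.items()}


def render_chip(spots, columns=4):
    """Draw the chip as text: '#' for a positive spot, '.' for a negative spot."""
    cells = ["#" if hit else "." for hit in spots.values()]
    rows = ("".join(cells[i:i + columns]) for i in range(0, len(cells), columns))
    return "\n".join(rows)


# Hypothetical probes, each tied to one environmental factor.
probes = {
    "filter_0.2-0.8um": "ACGTAC",
    "SendaiBay_December": "TTGACA",
    "A-line_June": "GGGCCC",
}
reads = ["AAACGTACTT", "CCTTGACAGG"]

spots = digital_hybridization(reads, probes)
chip = render_chip(spots, columns=3)
print(chip)
```

Reading the positive spots back against the probe labels is what allows the sampling conditions to be inferred from sequence data alone.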

Figure 4

Visualization of metagenomics data using digital DNA chips. (a) Overview of in silico probes associated with the environmental factors on a digital DNA chip (see Supplementary Information 8 for details). (b) Digital DNA chip of shotgun metagenomics data of the 0.2–0.8-μm fraction of the Sendai Bay C5 sample collected on December 1, 2013. There are 748 probes; spots that are positive for digital hybridization are shown in red, and negative spots are shown in black. The hybridization-positive probes indicate the environmental information associated with the sequence data.

Development of a shotgun metagenomic database

We assembled the shotgun metagenomic sequence data using Megahit version 1.0.2 [18]. The assembly yielded 57.95 M contigs, with an N50 of 995 bp, a maximum length of 307,212 bp, and a total length of 12.39 Gbp (Supplementary Information 3). We calculated the abundance pattern of each contig, and contigs whose abundance patterns matched with a Pearson correlation coefficient of ≥ 0.95 were clustered into a metagenome-assembled genome (MAG). We next annotated the assembled contigs using both a BLAST search of the NCBI NT database and classification by clustering with Pfam (CCP), a novel annotation method described below. We developed a database that uses BLAST to find contigs homologous to a query sequence and displays their abundance patterns for each sampling point and filter size (Fig. 5). For example, we found novel PolD families using this database.
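
The correlation-based clustering can be illustrated with a small sketch. The text gives only the threshold (Pearson r ≥ 0.95), so the grouping rule used here, connected components over all contig pairs that reach the threshold, is an assumption, as are the function names and the toy abundance profiles.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile has no defined correlation
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(sxx * syy)


def cluster_contigs(abundance, threshold=0.95):
    """Group contigs whose profiles correlate at or above `threshold`.

    Assumed rule: contigs linked directly or through a chain of
    qualifying pairs end up in the same group (union-find).
    """
    names = list(abundance)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(abundance[a], abundance[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Toy abundance profiles over four sampling dates (illustrative values).
profiles = {
    "contig_1": [1, 2, 9, 3],
    "contig_2": [2, 4, 18, 6],  # proportional to contig_1
    "contig_3": [9, 3, 1, 2],
}
mags = cluster_contigs(profiles)
print(mags)
```

The all-pairs comparison is quadratic in the number of contigs, so an assembly with tens of millions of contigs would need a more scalable approach than this sketch.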

Figure 5

Search for homologous contigs to a query sequence and display of temporal variation patterns. Using nucleotide and amino acid sequences as queries, contigs homologous to the query sequence are identified using BLAST, and the temporal variation patterns and taxonomy information of the hit contigs are displayed.

Development of a new annotation method for metagenome contigs

We annotated the assembled contigs using BLAST searches against the NCBI NT database. However, we were unable to annotate > 50% of the contigs (Fig. 6). Therefore, we developed a novel method, CCP, to annotate contigs with species names. A single contig generally does not provide sufficient information for assigning an annotation. CCP, however, aggregates the Pfam information of all contigs in a MAG and annotates the MAG by comparing its similarity to reference genomes, which allows it to assign an appropriate annotation. Direct comparison of the nucleotide sequences of a MAG against the NCBI NT database using blastn shows relatively low homology for de novo virus sequences.

Figure 6

Comparison between BLAST and CCP annotation results at the super-kingdom level. Comparison of classification results using BLAST to annotate contigs and classification by clustering with Pfam (CCP); the percentage of unknowns was 57% for BLAST and 8% for CCP.

However, a Pfam domain search using HMMER, which employs a different principle [19] than BLAST, often detects more informative sequences, even those of viruses. For example, phylogenetic trees constructed according to the type and number of Pfam domains of individual bacterial genomes and those of higher eukaryotes such as humans closely approximate existing phylogenetic trees [20]. Thus, genomes with similar Pfam domain compositions may represent phylogenetically closely related species. We, therefore, searched for Pfam domains in the reference genomes of viruses, bacteria, archaea, and eukaryotes included in RefSeq (as of August 31, 2015). We next calculated the number of domains for each species and constructed a CCP database. The types and numbers of Pfam domains contained in the contigs obtained from the metagenome were summarized in MAG units. We compared these MAG profiles with the CCP database and annotated each MAG with the known genome showing the highest correlation coefficient (Fig. 7).
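
The matching step of CCP can be sketched as a comparison of domain-count vectors. The sketch assumes that the correlation coefficient is Pearson's r computed over the union of observed domains; the domain labels and genome names are placeholders, and the HMMER search that would produce the counts is not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length count vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(sxx * syy)


def annotate_mag(mag_counts, reference_db):
    """Return (best_reference_name, correlation) for one MAG.

    `mag_counts` and each reference entry map Pfam domain -> count.
    Domains absent from a profile are counted as 0.
    """
    domains = sorted(set(mag_counts).union(*reference_db.values()))
    mag_vector = [mag_counts.get(d, 0) for d in domains]
    best_name, best_score = None, float("-inf")
    for name, ref_counts in reference_db.items():
        ref_vector = [ref_counts.get(d, 0) for d in domains]
        score = pearson(mag_vector, ref_vector)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Placeholder domain labels and reference genomes (not real Pfam accessions).
reference_db = {
    "hypothetical_phage": {"domain_A": 2, "domain_B": 1},
    "hypothetical_bacterium": {"domain_A": 1, "domain_C": 5, "domain_D": 3},
}
mag_profile = {"domain_A": 4, "domain_B": 2}

best_name, best_score = annotate_mag(mag_profile, reference_db)
print(best_name, round(best_score, 3))
```

Because the comparison uses domain counts pooled over a whole MAG, a reference genome can be matched even when no single contig carries enough signal on its own.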

Figure 7

Overview of CCP. Flowchart of the search of the Pfam domain against known genomes of viruses, bacteria, archaea, and eukaryotes included in RefSeq to create a Pfam hit database. The Pfam domains are searched in metagenome-assembled genome (MAG) units, and each MAG is annotated with the known genome whose type and number of Pfam domains are closest to those of the MAG.

When we annotated the 10,000 most abundant contigs in our database using CCP, > 90% of the contigs were annotated (Fig. 6). In contrast, the BLAST species search (the existing method) annotated < 50% of the contigs (Fig. 6), indicating that CCP classified the contigs more comprehensively. In particular, the proportion of contigs annotated as viruses increased from 2% to > 15%, indicating that CCP is particularly effective for identifying viral sequences. We next compared the agreement between CCP and BLAST annotations using contigs annotated by both methods (Supplementary Information 4). The virus-level agreement between CCP and BLAST was 89.6%, and the kingdom-level agreement was 76.9%. It was difficult to determine whether a contig represented a virus using BLAST; however, CCP showed higher accuracy.

Shotgun metagenomic analysis

Periodicity of metagenomic data

For the bacterial fraction (0.2–0.8 µm) of the shotgun metagenomic data from Sendai Bay, we generated a multidimensional scaling (MDS) plot according to the abundance patterns of the assembled contigs (Fig. 8). The MDS plot shows similarities among samples collected in the same month of different years. Furthermore, the plot reveals that the shotgun metagenomic data exhibit an annual seasonal cycle like that of the 18S rRNA amplicon data [10]. However, we did not observe the same annual cycle among all contigs. We, therefore, extracted contigs included in the 20 most abundant MAGs from the bacterial fractions of Sendai Bay samples collected from March 2012 to April 2014 and plotted their fluctuation patterns. Only one MAG showed a complete annual cycle throughout the sampling period (Fig. 9a), and four MAGs showed an incomplete annual cycle, with peaks in March 2012 and 2013 but not in 2014 (Fig. 9b). Furthermore, 13 MAGs showed a transient pattern (Fig. 9c), and two MAGs showed peaks with irregular patterns (Fig. 9d). These results suggest that marine microbial communities generally undergo an annual cycle.

Figure 8

Multidimensional scaling (MDS) plot as a function of the abundance of contigs. MDS plots of bacterial fractions (0.2–0.8 µm) of shotgun metagenomic data from 2012 to 2015 acquired from Sendai Bay according to the pattern of abundance of assembled contigs.

Figure 9

Variation patterns of contigs in the top 20 most abundant metagenome-assembled genomes (MAGs). The top 20 MAGs in the bacterial fractions of Sendai Bay C5 and C12 from March 13, 2012, to April 2, 2014, were classified as follows: (a) complete 1-year cycle for 2.5 years, (b) incomplete 1-year cycle for 2.5 years, (c) transient peaks, and (d) irregular peaks. A peak within 1 month of ≥ 25% relative to the previous year’s peak was considered cyclical.
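
One possible reading of the cyclicity criterion in the caption is sketched below. The interpretation is an assumption: the first-year peak is located, and the series counts as cyclical if a sample taken within about 1 month (31 days here) of the same date one year later reaches at least 25% of that peak. The function name, window length, and abundance values are illustrative.

```python
from datetime import date, timedelta


def has_repeat_peak(series, window_days=31, min_ratio=0.25):
    """Check whether the first-year peak recurs one year later.

    `series` is a list of (sampling_date, abundance) pairs.
    Assumed rule: some sample within `window_days` of the anniversary
    of the first-year peak must reach `min_ratio` times that peak.
    """
    series = sorted(series)
    start = series[0][0]
    first_year = [p for p in series if (p[0] - start).days < 365]
    peak_day, peak_value = max(first_year, key=lambda p: p[1])
    target = peak_day + timedelta(days=365)
    nearby = [v for d, v in series if abs((d - target).days) <= window_days]
    return bool(nearby) and max(nearby) >= min_ratio * peak_value


# Illustrative abundance series, not measurements from the study.
annual = [
    (date(2012, 3, 13), 100.0),
    (date(2012, 9, 10), 5.0),
    (date(2013, 3, 20), 40.0),
    (date(2013, 9, 12), 4.0),
]
transient = [
    (date(2012, 3, 13), 100.0),
    (date(2012, 9, 10), 5.0),
    (date(2013, 3, 20), 10.0),
]

print(has_repeat_peak(annual), has_repeat_peak(transient))
```

Applying such a check to each consecutive pair of years would separate complete cycles from incomplete ones, although the exact procedure used for Fig. 9 is not specified beyond the caption.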

Identification of repeat sequences in the metagenomes

During the collection and analysis of the DNA sequencing data, we identified a number of repeat sequences in the metagenomes, namely (TAG)n, (TGA)n, (GAA)n, and (ACA)n microsatellites. We then determined the frequencies and highest numbers of (TAG)n repeats as a function of filter size. We found that (TAG)n repeats accounted for up to 7.5% of the 5–20-μm fraction (Supplementary Information 5a,b). To investigate whether this was a characteristic feature of the northeastern coastal region of Japan, we analyzed the shotgun metagenomic sequence data of Tara Oceans [21,22]. As shown in Supplementary Information 5c,d, the Tara Oceans data contained up to 1.9% of TAG repeats.
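
A repeat content of this kind can be estimated with a short sketch. How the content was actually calculated is not stated in this section, so the definition used here is an assumption: the fraction of reads containing at least `min_units` consecutive TAG units (a hypothetical parameter). Reads from the opposite strand, which would carry (CTA)n, are not counted in this simplified version.

```python
import re


def tag_repeat_fraction(reads, min_units=5):
    """Fraction of reads containing a run of at least `min_units` TAG units."""
    pattern = re.compile(r"(?:TAG){%d,}" % min_units)
    if not reads:
        return 0.0
    hits = sum(1 for read in reads if pattern.search(read.upper()))
    return hits / len(reads)


# Toy reads for illustration only.
reads = [
    "ACGT" + "TAG" * 6 + "CC",   # six units: counted
    "ACGTACGTACGT",              # no repeat
    "TAGTAGAC",                  # two units: below the threshold
    "gg" + "tag" * 5,            # lower case, five units: counted
]
fraction = tag_repeat_fraction(reads)
print(f"{100 * fraction:.1f}% of reads contain (TAG)n")
```

The same function can be applied per filter fraction to compare repeat content across size classes.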

To determine whether these (TAG)n repeats represented artifacts of the NGS method, we performed Southern blot and dot-blot hybridization analyses of the DNA samples extracted from seawater (Fig. 10). The dot-blot hybridization experiment analyzed 13 different samples with various (TAG)n contents (Fig. 10a). Using a labeled 54-mer probe carrying the (TAG)18 repeat, we detected signals from the eight samples with a (TAG)n content of > 0.9%. In contrast, six samples with a low content (< 0.2%) were negative (Fig. 10b). To determine whether these repeated sequences originated from a single locus or multiple loci, we performed Southern blot analysis (Fig. 10c, Supplementary Information 6) using two samples with high contents of (TAG)n repeats. A (TAG)n repeat originating from a single locus would be detected as a discrete band, whereas the two samples with a high content of (TAG)n repeats exhibited diffuse bands. The data (Fig. 10c) suggest that the (TAG)n repeats were derived from multiple loci of distinct genomes. Samples with low numbers of (TAG)n repeats were negative.

Figure 10

Detection of TAG repeats using Southern blot and dot-blot hybridization analyses. (a) Contents of the TAG repeats of the samples according to next-generation sequencing analysis. (b) Dot-blot analyses. The sample numbers and their amounts (right side) correspond to the signals of each dot in the left panels. pTV(0) indicates the intact pTV119N plasmid without an insert. The calculated contents of TAG repeats (%) are indicated in parentheses. (c) Southern blot analysis of EcoRI-digested samples subjected to 0.8% agarose gel electrophoresis. The plasmid pTV (TAG) (0.35 ng and 1 ng) served as a positive control. E. coli genomic DNA served as a negative control. The calculated contents of TAG repeats (%) are indicated at the bottom of each graph. The length (nt) of the TAG-repeated fragment excised from pTV (TAG) is shown on the right.

These results reveal for the first time that such repeat sequences are abundant in the genomes of marine microorganisms. However, their species of origin and functional roles were not identified here. The repeat sequences found in Escherichia coli [23], subsequently called CRISPR, led to fundamental discoveries that are essential in the field of genetic engineering [24]. Thus, understanding the biological significance of trinucleotide repeats in marine microorganisms is of particular importance and may reveal a new research frontier.
