in

Extensive gut virome variation and its associations with host and environmental factors in a population-level cohort

Sample collection and metagenomic sequencing

Written informed consent was obtained prior to participation in the project. The study protocol for the Japanese (Disease, Drug, Diet, Daily life) microbiome project was approved by the medical ethics committees of the Tokyo Medical University (Approval No: T2019-0119), National Center for Global Health and Medicine (Approval No: 1690), the University of Tokyo (Approval No: 2019185NI), Waseda University (Approval No: 2018-318), and the RIKEN Center for Integrative Medical Sciences (Approval No: H30-7). We conducted a prospective cross-sectional study of 4198 individuals participating in the Japanese 4D microbiome project, which commenced in January 2015 and is ongoing20.

Participants registered in the project were those who visited hospitals in the area for disease diagnosis or a health checkup. Faecal samples are collected from both healthy and diseased participants. The eligibility criteria for participants are as follows: (1) born and raised in Japan; (2) age >15 years; (3) written informed consent provided; and (4) having an endoscopic diagnosis on colonoscopy; either having undergone a colonoscopy within the last 3 years or planning to undergo colonoscopy for colorectal cancer screening, surveillance, and diagnosis of various gastrointestinal symptoms. The exclusion criteria were as follows: (1) suspected acute infectious disease based on clinical findings (e.g., acute enterocolitis, pneumonia, tuberculosis etc.); (2) acute bleeding; (3) hearing loss; (4) unable to understand written documents; (5) unable to write and (6) limited ability to perform activities of daily living. No compensation was paid to participants.

Participants collected faecal samples using a Cary–Blair medium-containing tube60 at home, and the samples were refrigerated for up to 2 days before the hospital visit. Immediately after participants arrived at the hospital, their faecal samples were frozen at −80 °C until DNA extraction. We avoided collecting samples within 1 month of administering bowel preparation for colonoscopy because it has a profound effect on the gut microbiome and metabolome61. Health professionals checked that the amount of stool was sufficient for analysis. Shotgun metagenomic sequencing was performed for 4241 faecal samples and quality controls were conducted20, from which 43 samples were excluded from further analyses due to the low number of high-quality reads (<5 million reads) as described in detail previously20 (Supplementary Data 1). To explore the viral profiles of VLPs and whole metagenomes from the same samples, we collected additional faecal samples from 24 individuals in the same manner as described earlier.

Metadata collection

Details for metadata collection were described previously20. Briefly, the participants completed a self-reported questionnaire on body weight, height, alcohol consumption, smoking, dietary habits, physical activity, and Bristol Stool Scale score62. Health professionals checked the entries to correct obvious inaccuracies and obtain any missing data. BMI was categorised into five groups according to the standard World Health Organization (WHO) classification and considering the threshold value of mortality risk63 (0, underweight, <18.5 kg/m2; 1, normal weight [low], 18.5–20.0 kg/m2; 2, normal weight [high], 20.1–24.9 kg/m2; 3, overweight, 25.0–29.9 kg/m2; and 4, obese [≥ 30.0 kg/m2]). Dietary habits were assessed using a 7-point Likert scale (1, never or rarely; 2, 1–3 times/month; 3, 1–3 times/week; 4, 4–6 times/week; 5, 1 time/day; 6, 2 times/day; and 7, ≥3 times/day). Physical activity was evaluated with the International Physical Activity Questionnaire–Short Form64. Exercise reported as vigorous intensity, moderate intensity, or walking was denoted as 1, and <60 min/week was denoted as 0. The total sitting hours per day and the total metabolic equivalent of tasks65 were divided into four groups based on quartiles for the entire dataset. For the diagnosis of gastrointestinal diseases, an electronic high-resolution video endoscope was used. Comorbidities, or a history of hypertension, dyslipidemia, and any component of the Charlson comorbidity index66, were evaluated. The definite diagnosis of the disease was based on histopathological or cytological examinations or imaging modalities (e.g. computed tomography, magnetic resonance imaging and ultrasound). For medication, health professionals evaluated entries in the participant’s medication pocketbook (the Okusuri-techo) made by pharmacists when filling prescriptions20 to ensure that there were no omissions or discrepancies with the self-reported data. Electronic medical records were also checked to identify medications used. Drug use was defined as oral or self-injected administration within the previous month. All medications with pharmaceutical brand names were grouped according to the WHO’s Anatomical Therapeutic Chemical classification system (4th level)67. In total, 232 metadata were assessed and used in this study.

Preparation of VLP DNA and sequencing

Frozen faecal samples (30–500 mg) were suspended in a 2.5 mL SM buffer with 0.01% gelatine by vortexing and centrifuged at 5000 × g for 10 min at 4 °C to remove debris. The supernatant was filtered with 5.0 μm and 0.45 μm PVDF pore membrane filters (Millex-HP Syringe Filter; Merck Millipore) to remove bacterial cells. An equal volume of 20% polyethylene glycol solution (PEG-6000-2.5 M NaCl) was added to the filtrate and stored overnight at 4 °C. The solution was centrifuged at 20,000 × g for 45 min at 4 °C, and the supernatant was discarded to collect VLPs. The VLP pellet was suspended in 1 mL SM buffer with lysozyme (10.0 mg/reaction; Sigma Aldrich) and incubated for 60 min at 37 °C with gentle shaking to degrade unfiltered bacterial cells. The lysate was incubated with 10 U DNase (NIPPON GENE), 5 U TURBO DNase (Thermo Fisher Scientific), 5 U Baseline-ZERO DNase (Epicentre), 25 U Benzonase (Sigma Aldrich), and RNase (25 g/sample; NIPPON GENE) in DNase buffer (1× concentration) for 1 h at 37 °C with gentle shaking. To inactivate the DNases, EDTA (final concentration 20 mM) was added to the DNase-treated lysate and heated for 15 min at 70 °C. Proteinase K (0.5 mg/reaction; Sigma Aldrich) and SDS (final concentration 0.1%) were added to the VLPs and gently mixed at 55 °C for 30 min. An equal volume of phenol/chloroform/isoamyl alcohol (Life Technologies Japan, Ltd) was added to the lysate and gently mixed for 10 min at room temperature (20–25 °C). The lysate was centrifuged at 9000 × g for 10 min at 25 °C, and the aqueous phase was collected. Sodium acetate (final concentration 0.3 M) and an equal volume of isopropanol with Dr. GenTLE precipitation carrier (Takara Bio) were added to the DNA solution and pelleted by centrifugation at 12,000 × g for 15 min at 4 °C. The DNA pellet was rinsed with 75% ethanol and dissolved in TE buffer (10 mM Tris-HCl, 10 mM EDTA). An equal volume of polyethylene glycol solution (20% PEG6000-2.5 M NaCl) was added and kept on ice for at least 10 min, and the DNA was pelleted by centrifugation at 12,000 × g for 10 min at 4°C. Finally, the DNA was rinsed with 75% ethanol, dried, and dissolved in TE buffer. For NovaSeq shotgun metagenomic sequencing, libraries were constructed from 2.5 ng VLP DNA using a KAPA HyperPrep Kit (KAPA Biosystems) with 12 cycles of amplification. The libraries were subjected to 150-bp paired-end sequencing on a NovaSeq platform.

Whole metagenomic DNA was also prepared from the same faecal samples (10 to 250 mg faeces) with an enzymatic lysis method as described previously68. Libraries were constructed from 100 ng whole metagenomic DNA and sequenced by NovaSeq using the same method as for VLP DNA.

Identification of phages in the metagenomic data

To construct a high-quality double-stranded DNA (dsDNA) phage catalogue with minimum contamination of bacterial chromosome and plasmid sequences, we developed a custom pipeline and applied it to the 4198 whole gut metagenomes as described below. The metagenomic reads of each individual were assembled into contigs using the MEGAHIT assembler (v1.2.9)69. The circularity of the assembled contigs (>10 kb) was assessed using the check_circularity.pl script, included in the sprai assembler package (https://sprai-doc.readthedocs.io/en/latest/index.html), by modifying the threshold for terminal redundancy as follows: >97% identity and >130 bp. Encoded genes in the contigs were predicted by MetaGeneMark (3.38)70. Assembled contigs were defined as phages if they passed all of the following six criteria.

  1. 1.

    A genome size threshold was applied, and contigs less than 10 Kb were excluded, as typical dsDNA phages have genomes larger than >10 Kb71.

  2. 2.

    Viral-specific k-mer patterns were checked by DeepVirFinder (v1.0)22. Contigs with p-values >0.05 were excluded from further analysis.

  3. 3.

    To detect viral hallmark genes (VHGs) and plasmid hallmark genes, we performed a highly sensitive HMM-HMM search against the Pfam database72. First, the encoded genes were aligned to the viral protein database, collected from complete (circular) viral genomes (n = 13,628) in the IMG/VR v2 database30 using JackHMMER. The obtained HMM profiles were searched against the Pfam database using hhblits73 with a >95% probability cut-off. These procedures were performed using the pipeline_for_high_sensitive_domain_search script (https://github.com/yosuken/pipeline_for_high_sensitive_domain_search)74,75. Contigs with plasmid hallmark genes or those without VHGs were excluded. The hallmark genes used in this analysis are summarised in Supplementary Data 3.

  4. 4.

    The presence of housekeeping marker genes of prokaryotic species was checked by fetchMG (v1.0)76, and ribosomal RNA genes (5 S, 16 S and 23 S) were identified by barrnap (0.9) (https://github.com/tseemann/barrnap). Contigs with the marker genes and ribosomal RNA genes were excluded from further analysis.

  5. 5.

    The encoded genes of each contig were aligned to the viral protein database and a plasmid protein database constructed from the reference plasmids in RefSeq (n = 16,136, in April 2020) using DIAMOND (v0.9.29.130)77 with the more-sensitive option. The number of genes aligned to each database was compared, and contigs with more genes aligned to the plasmid protein database were excluded from further analysis.

  6. 6.

    The proportion of provirus regions was assessed by CheckV (v0.7)24, and contigs estimated with <80% of provirus regions were excluded.

First, we screened complete phage genomes from the circular contigs using these six criteria (Supplementary Fig. 1a). To identify phage genomes that were not complete but were of high or medium quality, we next screened possible phage contigs in the linear contigs. We aligned genes identified in the linear contigs to gene sets obtained from the complete phage genomes identified in this study (n = 1125) and the IMG/VR database (n = 13,628). The alignment was performed using DIAMOND with the more-sensitive option and e-value <1E-5 as a threshold. Contigs were defined as possible phage contigs if >40% of the genes were aligned to genes from a complete phage genome and the size of the contig was >70% and <120% of the complete genome. For these possible phage contigs, the above six criteria were applied, and those that did not pass were excluded. Finally, CheckV was used to screen for excess host bacterial genomes and exclude linear contigs defined as low quality or having >10% contamination.

To evaluate the performance of this custom pipeline, we applied the pipeline to reference phage genomes (n = 2609, as positive data) and plasmid sequences (n = 16,136, as negative data) in Refseq. The true positive rate was defined as the number of phages detected as phages by the pipeline divided by the number of reference phages. The false positive rate was defined as the number of plasmids detected as phages by the pipeline divided by the number of reference plasmids. DeepVirFinder22, VirSorter (v1.0.3)23 Virsorter2 (2.2.3)25, VIBRANT (v1.2.1)26, Seeker (v1.0.3)27 and ViralVerify (v1.1)28 were also applied to the same datasets with the default parameters, and the performance was compared among them.

Analysis of phage genomes

Viral operational taxonomic units (vOTUs) were constructed by clustering phage genomes with a > 95% identity29 using dRep (v2.2.3)78 with the default options. Representative sequences of each vOTU selected by dRep were further clustered with reference sequences in RefSeq, IMG/VR30, gut virome database (GVD)15, gut phage database (GPD)9, and metagenomic gut virus (MGV) database31 with >95% identity and >85% length coverage using aniclust.py script in the CheckV package to identify common sequences among the databases.

To further construct broader viral clusters (VC), proportions of protein clusters shared between phages were assessed. First, to define protein clusters, similarity searches of all protein sequences from all the phages identified in this study were performed using DIAMOND with the more-sensitive option (e-value <1E-5). Based on the similarities between proteins, protein clusters were defined by MCL (v14-137)79 with an inflation factor of 2. The percentage of shared protein clusters was calculated for each phage pair, and phages sharing >20% of clusters were grouped as a VC, which corresponds approximately to family- or subfamily-level clusters7,37. Rarefaction curves of the vOTUs and VCs were estimated with the iNEXT function in the iNEXT package (v2.0.20)80. The similarity matrix of the phages based on the percentage of shared protein clusters was further projected by tSNE using the tsne function in the Rtsne package (v0.16).

Taxonomy annotation of phages was performed with a voting approach described previously16 with minor modifications. First, the protein sequences of each phage were aligned to viral proteins detected from phage genomes in RefSeq (n = 2609, in April 2020) using DIAMOND with the more-sensitive option. Then, the best-hit taxonomy of each protein (family levels) was counted, and the most common taxonomy was assigned to the phage if >20% of proteins in the phage were aligned to the same taxonomy.

Phage lifestyles (i.e. virulent or temperate) were predicted by BACPHLIP40 and alignments to reference bacterial genomes in the RefSeq. Phages were defined as temperate if the BACPHLIP score was >0.8 or the phage genome was aligned to any reference genomes with >1000 bp alignment length with >95% identity.

Host prediction

Bacterial and archaeal genomes were downloaded from the RefSeq database (in April 2019). To reduce the redundancy of genomes from closely related strains in the same species (e.g. Escherichia coli), 10 genomes were selected randomly for species with more than 10 genomes, and other genomes were excluded from the dataset. The reference dataset consisted of 33,215 bacterial and 822 archaeal genomes.

Host prediction of the identified phages was performed using CRISPR spacers81. CRISPR spacers were predicted from the reference microbial genomes and assembled contigs (>10,000 bp) from the 4198 metagenomic datasets using PILER-CR (1.06)82. Short (<25 bp) or long (>100 bp) spacers were discarded. In total, 679,323 and 283,619 spacers were identified from the reference microbial genomes and assembled contigs, respectively. Taxonomy information was assigned to the assembled contigs if they were aligned to the microbial reference genomes with >90% identity and >70% length coverage thresholds using MiniMap283. The CRISPR spacers were mapped to the phage genomes using BLASTN with the option for short sequences: -a20 -m9 -e1 -G10 -E2 -q1 -W7 -F F81. CRISPR spacers, which were mapped with 100% identity or 1 mismatch/indel with >95% sequence alignment, were used for host assignment at the genus level. Assignments of host species were checked manually, and if any of the following non-human intestinal species were assigned, the host was excluded: Dickeya, Anaerobutyricum, Rubellimicrobium, Eisenbergiella, Harryflintia, Leucothrix, Photorhabdus, Spirosoma, Syntrophobotulus, Thermincola, Algoriphagus, Franconibacter, Kandleria, Lawsonibacter, Methylomonas, Provencibacterium, Pseudoruminoccoccus, Rhodanobacter, Romboutsia, Sharpea, Varibaculum and Thioalkalivibrio.

Quantification of viral abundance and analysis of the virome profile

To quantify the viral abundances in each sample, metagenomic reads were mapped to the gene set of VHGs (Supplementary Data 3) of each representative vOTU using Bowtie2 with a > 95% identity threshold, and reads per kilobase million (RPKM) were calculated for each vOTU. The reason for using only VHGs in the analysis was to avoid over-counting of viral reads, which could be caused by spurious mapping of reads from horizontally transferred genes of other phages or bacterial species. The α-diversity (Shannon diversity) of the vOTU-level viral profile was calculated using the diversity function in the vegan package. The β-diversity (Bray-Curtis distance) between individuals was assessed using the vegdist function, and the average distance against other individuals was calculated for each individual. The VC-level viral profile was obtained by summing all the RPKM of vOTUs for each VC.

Phylogenetic analysis of novel VCs

To construct phylogenetic trees for the vOTUs and reference genomes, protein sequences of large terminases, portal proteins, and major capsid proteins (Supplementary Data 3), which are often used to construct phage phylogenetic trees7,9, were extracted from the vOTUs in the 10 most abundant VCs (VC_19, 1, 2, 24, 12, 15, 3, 44, 18, 6), and their homologues were searched for in the reference phage genomes in RefSeq using DIAMOND with the more-sensitive option (e-value <1E-5). The collected protein sequences were aligned by MAFFT (v7.458)84 with the linsi option, and the alignments were trimmed by Trimal (v1.4.rev15)85 with the automated1 option. Phylogenetic trees were constructed by FastTree (2.1.10)86. The phylogenetic trees were visualised with iTOL (v5)87. For each VC, vOTUs with the highest number of genomes were selected, and their genomic structures were visualised by the circlize package (v0.4.15)88.

Taxonomic and functional analysis of the bacteriome

Taxonomic and functional profiles of the bacteriome were obtained as described previously20. Briefly, bacterial profiles at the species and genus levels were obtained with the single copy marker gene-based method using mOTUs (v2.1.1)89. Functional profiles at the Kyoto Encyclopaedia of Genes and Genomes (KEGG) orthology (KO) level were obtained by mapping the metagenomic reads to a non-redundant gene set constructed from the 4198 subjects’ metagenomic data20. Functional annotation of the non-redundant genes was performed using eggNOG-mapper90, in which DIAMOND was used for alignment to the eggNOG orthology database (version 4.5)91.

KOs involved in prokaryotic defence mechanisms, such as CRISPR-Cas and RM (Supplementary Data 8)58, were collected, and their total relative abundance in each system was calculated. Since functional annotation for the Abi system is not included in the KEGG database, we collected genes annotated as ‘abortive phage infection’ and ‘abortive phage resistance’ in the eggNOG annotation and calculated the total abundance. The 4198 individuals were classified into three groups (high, middle, and low) based on tertiles of the total abundance, and Shannon diversity of the virome was compared among the three groups by the Wilcoxon rank-sum test.

Phage-host correlation analysis

To explore the phage-host association in the community, Spearman correlations between relative abundances of vOTUs and microbial species at the genus level were evaluated. If the vOTU was predicted to infect more than one genus (i.e. generalist phage), the correlation was calculated for every predicted host. If a phage-host pair was absent (0 abundance) in a sample, the sample was excluded from the correlation analysis. vOTUs with average relative abundance >0.01% (n = 865) and genera with average relative abundance >0.5% (n = 32) were included in the analysis.

Analysis of VLPs and whole metagenomes from 24 faecal samples

Quality filtering of sequenced reads from the 24 VLPs and whole metagenomes was performed using fastp (version 0.20.1)92 with the default parameters. Contamination with human (hg38) or phiX genomes was excluded by mapping the reads to the genomes using Bowtie2.

To exclude bacterial DNA contamination in the VLP dataset, we performed further filtering. First, the VLP reads were assembled into contigs using MEGAHIT and the contigs were checked for virus or not. Contigs were defined as viral contigs if they were predicted as viruses by DeepVirFinder (P-value <0.05) and did not encode rRNA and marker genes checked by barrnap and fetchMG, respectively. Then, VLP reads were mapped to the viral contigs using Bowtie2, and those not mapped to the viral contigs were excluded from the VLP dataset. Viral profiles at the vOTU and VC levels for the de-contaminated VLP and whole metagenomic datasets were obtained with the same methodology for the 4198-subject metagenomic dataset described earlier.

Association analysis between the virome and various host/environmental factors

The association between each vOTU/VC and age/sex was assessed by multivariable regression analysis considering the effects of other covariates as described before20. Briefly, the relative abundance of each vOTU/VC was log10-transformed, and single linear-regression analysis was performed using the transformed abundance as a response variable and metadata as an explanatory variable. This single linear-regression analysis was performed for age, sex, and other metadata (n = 230, Supplementary Data 2), and metadata significantly associated with the vOTU/VC were determined (FDR < 0.05) by taking into account the total number of single regression analyses (number of vOTU/VCs multiplied by number of metadata). Then, multiple regression analysis was performed including all the significant metadata in the single regression analysis as explanatory variables. To exclude confounding factors, stepwise variable selection was performed based on Akaike’s information criterion with the step function. Metadata was defined as significantly associated with the vOTU/VCs if they remained in the model with a P-value <0.05. All regression models were constructed using the glm2 function in the glm2 package (v1.2.1). In total, 390 vOTUs and 112 VCs, whose average relative abundances in the 4198 metagenomic dataset were >0.05% and >0.1%, respectively, were included in the analysis. For visualisation, individuals younger than 20 years (n = 2) and older than 80 years (n = 6) were excluded due to the low numbers of such individuals (Fig. 5a, b).

Stepwise redundancy analysis was performed to evaluate the total variance of the virome and bacteriome (relative abundance data) explained by each metadata category using the ordriR2step function in the vegan package (v2.5.7)93. To investigate the associations between the virome/bacteriome and each single metadata item, permutational analysis of variance was performed using the adonis function in the vegan package based on the Bray–Curtis distance with 10,000 permutations. P-values were corrected for multiple comparisons by the Benjamini–Hochberg method94.

Statistics

All statistical analyses were conducted using R (v3.5.0) with two-sided test and the Benjamini–Hochberg method for multiple comparisons unless otherwise stated. Of the 4211 metagenomic samples sequenced, 43 samples were excluded due to less reads (5 million) than the others. No statistical method was used to predetermine sample size.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.


Source: Ecology - nature.com

Increased drought effects on the phenology of autumn leaf senescence

MIT students contribute to success of historic fusion experiment