Extensive gut virome variation and its associations with host and environmental factors in a population-level cohort
Sample collection and metagenomic sequencingWritten informed consent was obtained prior to participation in the project. The study protocol for the Japanese (Disease, Drug, Diet, Daily life) microbiome project was approved by the medical ethics committees of the Tokyo Medical University (Approval No: T2019-0119), National Center for Global Health and Medicine (Approval No: 1690), the University of Tokyo (Approval No: 2019185NI), Waseda University (Approval No: 2018-318), and the RIKEN Center for Integrative Medical Sciences (Approval No: H30-7). We conducted a prospective cross-sectional study of 4198 individuals participating in the Japanese 4D microbiome project, which commenced in January 2015 and is ongoing20.Participants registered in the project were those who visited hospitals in the area for disease diagnosis or a health checkup. Faecal samples are collected from both healthy and diseased participants. The eligibility criteria for participants are as follows: (1) born and raised in Japan; (2) age >15 years; (3) written informed consent provided; and (4) having an endoscopic diagnosis on colonoscopy; either having undergone a colonoscopy within the last 3 years or planning to undergo colonoscopy for colorectal cancer screening, surveillance, and diagnosis of various gastrointestinal symptoms. The exclusion criteria were as follows: (1) suspected acute infectious disease based on clinical findings (e.g., acute enterocolitis, pneumonia, tuberculosis etc.); (2) acute bleeding; (3) hearing loss; (4) unable to understand written documents; (5) unable to write and (6) limited ability to perform activities of daily living. No compensation was paid to participants.Participants collected faecal samples using a Cary–Blair medium-containing tube60 at home, and the samples were refrigerated for up to 2 days before the hospital visit. Immediately after participants arrived at the hospital, their faecal samples were frozen at −80 °C until DNA extraction. We avoided collecting samples within 1 month of administering bowel preparation for colonoscopy because it has a profound effect on the gut microbiome and metabolome61. Health professionals checked that the amount of stool was sufficient for analysis. Shotgun metagenomic sequencing was performed for 4241 faecal samples and quality controls were conducted20, from which 43 samples were excluded from further analyses due to the low number of high-quality reads (130 bp. Encoded genes in the contigs were predicted by MetaGeneMark (3.38)70. Assembled contigs were defined as phages if they passed all of the following six criteria.
1.
A genome size threshold was applied, and contigs less than 10 Kb were excluded, as typical dsDNA phages have genomes larger than >10 Kb71.
2.
Viral-specific k-mer patterns were checked by DeepVirFinder (v1.0)22. Contigs with p-values >0.05 were excluded from further analysis.
3.
To detect viral hallmark genes (VHGs) and plasmid hallmark genes, we performed a highly sensitive HMM-HMM search against the Pfam database72. First, the encoded genes were aligned to the viral protein database, collected from complete (circular) viral genomes (n = 13,628) in the IMG/VR v2 database30 using JackHMMER. The obtained HMM profiles were searched against the Pfam database using hhblits73 with a >95% probability cut-off. These procedures were performed using the pipeline_for_high_sensitive_domain_search script (https://github.com/yosuken/pipeline_for_high_sensitive_domain_search)74,75. Contigs with plasmid hallmark genes or those without VHGs were excluded. The hallmark genes used in this analysis are summarised in Supplementary Data 3.
4.
The presence of housekeeping marker genes of prokaryotic species was checked by fetchMG (v1.0)76, and ribosomal RNA genes (5 S, 16 S and 23 S) were identified by barrnap (0.9) (https://github.com/tseemann/barrnap). Contigs with the marker genes and ribosomal RNA genes were excluded from further analysis.
5.
The encoded genes of each contig were aligned to the viral protein database and a plasmid protein database constructed from the reference plasmids in RefSeq (n = 16,136, in April 2020) using DIAMOND (v0.9.29.130)77 with the more-sensitive option. The number of genes aligned to each database was compared, and contigs with more genes aligned to the plasmid protein database were excluded from further analysis.
6.
The proportion of provirus regions was assessed by CheckV (v0.7)24, and contigs estimated with 70% and 10% contamination.To evaluate the performance of this custom pipeline, we applied the pipeline to reference phage genomes (n = 2609, as positive data) and plasmid sequences (n = 16,136, as negative data) in Refseq. The true positive rate was defined as the number of phages detected as phages by the pipeline divided by the number of reference phages. The false positive rate was defined as the number of plasmids detected as phages by the pipeline divided by the number of reference plasmids. DeepVirFinder22, VirSorter (v1.0.3)23 Virsorter2 (2.2.3)25, VIBRANT (v1.2.1)26, Seeker (v1.0.3)27 and ViralVerify (v1.1)28 were also applied to the same datasets with the default parameters, and the performance was compared among them.Analysis of phage genomesViral operational taxonomic units (vOTUs) were constructed by clustering phage genomes with a > 95% identity29 using dRep (v2.2.3)78 with the default options. Representative sequences of each vOTU selected by dRep were further clustered with reference sequences in RefSeq, IMG/VR30, gut virome database (GVD)15, gut phage database (GPD)9, and metagenomic gut virus (MGV) database31 with >95% identity and >85% length coverage using aniclust.py script in the CheckV package to identify common sequences among the databases.To further construct broader viral clusters (VC), proportions of protein clusters shared between phages were assessed. First, to define protein clusters, similarity searches of all protein sequences from all the phages identified in this study were performed using DIAMOND with the more-sensitive option (e-value 20% of clusters were grouped as a VC, which corresponds approximately to family- or subfamily-level clusters7,37. Rarefaction curves of the vOTUs and VCs were estimated with the iNEXT function in the iNEXT package (v2.0.20)80. The similarity matrix of the phages based on the percentage of shared protein clusters was further projected by tSNE using the tsne function in the Rtsne package (v0.16).Taxonomy annotation of phages was performed with a voting approach described previously16 with minor modifications. First, the protein sequences of each phage were aligned to viral proteins detected from phage genomes in RefSeq (n = 2609, in April 2020) using DIAMOND with the more-sensitive option. Then, the best-hit taxonomy of each protein (family levels) was counted, and the most common taxonomy was assigned to the phage if >20% of proteins in the phage were aligned to the same taxonomy.Phage lifestyles (i.e. virulent or temperate) were predicted by BACPHLIP40 and alignments to reference bacterial genomes in the RefSeq. Phages were defined as temperate if the BACPHLIP score was >0.8 or the phage genome was aligned to any reference genomes with >1000 bp alignment length with >95% identity.Host predictionBacterial and archaeal genomes were downloaded from the RefSeq database (in April 2019). To reduce the redundancy of genomes from closely related strains in the same species (e.g. Escherichia coli), 10 genomes were selected randomly for species with more than 10 genomes, and other genomes were excluded from the dataset. The reference dataset consisted of 33,215 bacterial and 822 archaeal genomes.Host prediction of the identified phages was performed using CRISPR spacers81. CRISPR spacers were predicted from the reference microbial genomes and assembled contigs ( >10,000 bp) from the 4198 metagenomic datasets using PILER-CR (1.06)82. Short (100 bp) spacers were discarded. In total, 679,323 and 283,619 spacers were identified from the reference microbial genomes and assembled contigs, respectively. Taxonomy information was assigned to the assembled contigs if they were aligned to the microbial reference genomes with >90% identity and >70% length coverage thresholds using MiniMap283. The CRISPR spacers were mapped to the phage genomes using BLASTN with the option for short sequences: -a20 -m9 -e1 -G10 -E2 -q1 -W7 -F F81. CRISPR spacers, which were mapped with 100% identity or 1 mismatch/indel with >95% sequence alignment, were used for host assignment at the genus level. Assignments of host species were checked manually, and if any of the following non-human intestinal species were assigned, the host was excluded: Dickeya, Anaerobutyricum, Rubellimicrobium, Eisenbergiella, Harryflintia, Leucothrix, Photorhabdus, Spirosoma, Syntrophobotulus, Thermincola, Algoriphagus, Franconibacter, Kandleria, Lawsonibacter, Methylomonas, Provencibacterium, Pseudoruminoccoccus, Rhodanobacter, Romboutsia, Sharpea, Varibaculum and Thioalkalivibrio.Quantification of viral abundance and analysis of the virome profileTo quantify the viral abundances in each sample, metagenomic reads were mapped to the gene set of VHGs (Supplementary Data 3) of each representative vOTU using Bowtie2 with a > 95% identity threshold, and reads per kilobase million (RPKM) were calculated for each vOTU. The reason for using only VHGs in the analysis was to avoid over-counting of viral reads, which could be caused by spurious mapping of reads from horizontally transferred genes of other phages or bacterial species. The α-diversity (Shannon diversity) of the vOTU-level viral profile was calculated using the diversity function in the vegan package. The β-diversity (Bray-Curtis distance) between individuals was assessed using the vegdist function, and the average distance against other individuals was calculated for each individual. The VC-level viral profile was obtained by summing all the RPKM of vOTUs for each VC.Phylogenetic analysis of novel VCsTo construct phylogenetic trees for the vOTUs and reference genomes, protein sequences of large terminases, portal proteins, and major capsid proteins (Supplementary Data 3), which are often used to construct phage phylogenetic trees7,9, were extracted from the vOTUs in the 10 most abundant VCs (VC_19, 1, 2, 24, 12, 15, 3, 44, 18, 6), and their homologues were searched for in the reference phage genomes in RefSeq using DIAMOND with the more-sensitive option (e-value 0.01% (n = 865) and genera with average relative abundance >0.5% (n = 32) were included in the analysis.Analysis of VLPs and whole metagenomes from 24 faecal samplesQuality filtering of sequenced reads from the 24 VLPs and whole metagenomes was performed using fastp (version 0.20.1)92 with the default parameters. Contamination with human (hg38) or phiX genomes was excluded by mapping the reads to the genomes using Bowtie2.To exclude bacterial DNA contamination in the VLP dataset, we performed further filtering. First, the VLP reads were assembled into contigs using MEGAHIT and the contigs were checked for virus or not. Contigs were defined as viral contigs if they were predicted as viruses by DeepVirFinder (P-value More