Diverse viruses in deep-sea hydrothermal vent fluids have restricted dispersal across ocean basins

In the ocean, viruses impact microbial mortality, regulate biogeochemical cycling, and alter the metabolic potential of microbial lineages. At deep-sea hydrothermal vents, abundant viruses infect a wide range of hosts among the archaea and bacteria that inhabit these dynamic habitats. However, little is known about viral diversity, host range, and biogeography across different vent ecosystems, which has important implications for how viruses manipulate microbial function and evolution. Here, we examined viral diversity, viral and host distribution, and viral-host interactions in microbial metagenomes generated from venting fluids from several vent sites within three geochemically and geographically distinct hydrothermal systems: the Piccard and Von Damm vent fields at the Mid-Cayman Rise in the Caribbean Sea, and Axial Seamount in the Pacific Ocean. Analysis of viral sequences and Clustered Regularly InterSpaced Palindromic Repeats (CRISPR) spacers revealed highly diverse viral assemblages and evidence of active infection. Network analysis revealed that viral host range was relatively narrow, with very few viruses infecting multiple microbial lineages. Viruses were largely endemic to individual vent sites, indicating restricted dispersal, and in some cases viral assemblages persisted over time. Thus, we show that hydrothermal vent fluids are home to novel, diverse viral assemblages that are highly localized to specific regions and taxa.

Importance
Viruses play important roles in manipulating microbial communities and their evolution in the ocean, yet not much is known about viruses in deep-sea hydrothermal vents. However, viral ecology and evolution are of particular interest in hydrothermal vent habitats because of their unique nature: previous studies have indicated that most viruses in hydrothermal vents are temperate rather than lytic, and it has been established that rates of horizontal gene transfer (HGT) are particularly high among thermophilic vent microbes; viruses are common vectors for HGT. If viruses have broad host range or are widespread across vent sites, they have increased potential to act as gene-sharing “highways” between vent sites. By examining viral diversity, distribution, and infection networks across disparate vent sites, this study provides the opportunity to better characterize and constrain the viral impact on hydrothermal vent microbial communities. We show that viruses in hydrothermal vents are diverse and apparently active, but most have restricted host range and are not widely distributed among vent sites. Thus, the impacts of viral infection are likely to be highly localized and constrained to specific taxa in these habitats.

To calculate relative viral abundance, the reads in each metagenome were mapped against all of the viral contigs from the corresponding geographic region using bowtie2 (56) v2.2.9. We mapped against all of the viral contigs from the region, rather than solely the viral contigs assembled from that sample, because some samples contained viral reads that did not assemble into viral contigs. This method therefore allowed the identification of more viral sequences. The number of reads that mapped to viral contigs was normalized by the number of merged reads as a measure of relative viral abundance. This measure of relative viral abundance reflects only the proportion of viruses that were retained on the filter as viral capsids or prophages. We used the number of spacers per read and CRISPR direct repeat types per read as measures of spacer and CRISPR relative abundance, respectively. Paired rather than merged reads were used for these analyses. Relative abundance and diversity of viruses, microbes, CRISPR spacers and CRISPR loci were visualized using the Seaborn library within Python (60).
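As a minimal sketch of the normalizations described above, the following Python snippet computes the three relative-abundance measures from per-sample counts. The function names, dictionary keys, and all counts are hypothetical and purely illustrative; they are not taken from the study's data or pipeline.

```python
def per_read(count: int, n_reads: int) -> float:
    """Normalize a raw count by the number of reads in a metagenome."""
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    return count / n_reads


def relative_abundances(sample: dict) -> dict:
    """Compute the three relative-abundance measures for one sample.

    Viral abundance is normalized by merged reads; spacer and CRISPR
    (direct repeat type) abundances are normalized by paired reads.
    """
    return {
        "viral": per_read(sample["reads_mapped_to_viral_contigs"],
                          sample["merged_reads"]),
        "spacer": per_read(sample["n_spacers"], sample["paired_reads"]),
        "crispr": per_read(sample["n_direct_repeat_types"],
                           sample["paired_reads"]),
    }


# Hypothetical counts (illustrative only, not values from the study).
example_sample = {
    "reads_mapped_to_viral_contigs": 12_000,
    "merged_reads": 4_000_000,
    "n_spacers": 900,
    "n_direct_repeat_types": 45,
    "paired_reads": 5_000_000,
}

if __name__ == "__main__":
    for name, value in relative_abundances(example_sample).items():
        print(f"{name}: {value:.2e}")
```

Because each measure is a simple per-read ratio, samples sequenced to different depths can be compared directly, which is presumably the motivation for normalizing in this way.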

Relative compositions and abundances of viral populations, spacer assemblages, and hosts
The relative compositions of the microbial community, the viral assemblage, and CRISPR spacers were compared between vent sites within Von Damm, Piccard, and Axial Seamount. To calculate the relative abundance of each vOTU in each sample, the number of reads in the sample that mapped to the viral contigs in the vOTU was determined using bowtie2 (56) v2.2.9. The number of reads in each metagenome that mapped to the vOTU was normalized by the total length of the viral contigs in the vOTU and the number of merged reads in the metagenome. We defined the most common vOTUs as the six clusters with the highest relative abundance in each sample. The relative abundance of each CRISPR spacer cluster in each sample was calculated as the percent of spacers in the sample that were part of the spacer cluster. For spacer clusters, we defined the [...] for computing distances between samples containing compositional data. For hierarchical clustering, we did not normalize; we performed analyses on the number of reads that mapped to each vOTU, the number of CRISPR spacers in each spacer cluster, and the number of reads that mapped to 16S rRNA gene sequences of each microbial host in each sample; for microbial hosts, we did not include reads that mapped to unclassified sequences or sequences classified as eukaryotes. We replaced zero counts with estimates using the count zero multiplicative method for vOTUs and hosts and the
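The per-vOTU and per-spacer-cluster calculations described in this section can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: dividing by the product of vOTU length and merged read count is one plausible reading of "normalized by" both quantities, and every function name and input value here is hypothetical.

```python
def votu_relative_abundance(mapped_reads: int,
                            votu_length_bp: int,
                            merged_reads: int) -> float:
    """Reads mapped to a vOTU, normalized by vOTU length and sequencing depth.

    Assumption: the two normalizations are combined by dividing by their
    product (total contig length in bp times number of merged reads).
    """
    if votu_length_bp <= 0 or merged_reads <= 0:
        raise ValueError("length and read count must be positive")
    return mapped_reads / (votu_length_bp * merged_reads)


def most_common_votus(abundance: dict, n: int = 6) -> list:
    """Return the n vOTU IDs with the highest relative abundance in a sample."""
    return sorted(abundance, key=abundance.get, reverse=True)[:n]


def spacer_cluster_percent(cluster_sizes: dict) -> dict:
    """Percent of a sample's spacers that belong to each spacer cluster."""
    total = sum(cluster_sizes.values())
    if total == 0:
        return {cluster: 0.0 for cluster in cluster_sizes}
    return {cluster: 100.0 * size / total
            for cluster, size in cluster_sizes.items()}


if __name__ == "__main__":
    # Hypothetical (mapped reads, total contig length in bp) per vOTU.
    counts = {"vOTU_1": (500, 10_000), "vOTU_2": (50, 2_000),
              "vOTU_3": (900, 60_000)}
    merged_reads = 1_000_000
    abundance = {v: votu_relative_abundance(m, length, merged_reads)
                 for v, (m, length) in counts.items()}
    print(most_common_votus(abundance))
    print(spacer_cluster_percent({"cluster_a": 1, "cluster_b": 3}))
```

Normalizing by contig length keeps long vOTUs from appearing more abundant simply because they offer more sequence for reads to map to.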

Identification of putative viral contigs
We used VirFinder to identify putative viral contigs because it is a k-mer frequency-based method that has higher potential to identify novel viruses (Ren et al., 2017). However, this method allows for the possibility of false positives. VirFinder assigns scores between 0 and 1 to indicate the likelihood that a contig is viral, with higher values reflecting a higher likelihood that the sequence is

Taxonomy of viruses and hosts
All analyses were carried out on diffuse flow fluids sampled directly from the vent orifice, as well as vent plume waters, which were sampled up to 100 m above the vent orifice and had much lower temperatures (Supplementary Table 1 [...] Table 2). We also quantified the diversity and abundance of loci and spacers within CRISPRs in each of the metagenomes. CRISPR loci are a microbial immune system found in archaeal and bacterial genomes, consisting of direct repeats interspersed with "spacers," which match foreign DNA (predominantly viruses, but also including plasmids and other forms of foreign DNA) that the cell has been exposed to previously. The relative abundance of CRISPR loci serves as an indication of how many microbial lineages use CRISPR as a mechanism for viral immunity. It is also important to note that CRISPR loci vary in number across microbial genomes, and we did not distinguish between CRISPR loci with the same direct repeat type. Therefore, our measure of CRISPR relative abundance is not a direct proxy for the abundance of viruses. Instead, it gives an indication of how commonly CRISPR loci are used as an antiviral mechanism within the community. Moreover, CRISPRs serve as a record of past infections, and thus while viral diversity reflects the diversity of viral particles sequenced at the time of sampling, CRISPR spacer diversity reflects the diversity of past viral infections.

Our results indicate that the relative abundance of viral sequences (Figure 2A) was similar both across and within each of the hydrothermal vent regions we studied (p = 0.42, t-test). There was a higher relative abundance of CRISPR loci in samples collected at Axial Seamount compared to both Piccard and Von Damm vent fields (Figure 2B; p = 1.49e-05, t-test). However, there was no difference in the relative abundance of spacers within CRISPR loci per read (Figure 2C; p = 0.10, t-test).

Within the Mid-Cayman Rise, we compared samples from mafic-hosted (Piccard) versus ultramafic-hosted (Von Damm) hydrothermal systems. We did not observe significant differences in the relative abundance of viral sequences, CRISPR loci, or CRISPR spacers between Piccard and Von Damm (viral sequences: p = 0.66, t-test; CRISPR loci: p = 0.40, t-test; spacers: p = 0.37, t-test) (Figure 2A-C). Similar results emerged from comparisons among vent fields within Axial Seamount: we did not observe significant differences in the relative abundance of viral sequences, CRISPR loci, or CRISPR spacers between vent fields at Axial Seamount (viral sequences: p = 0.17, t-test; CRISPR loci: p = 0.41, t-test; spacers: p = 0.11, t-test). Finally, within Axial Seamount, we also compared the relative abundance of viruses in samples taken from plume and diffuse flow hydrothermal fluid at Anemone vent. The Anemone diffuse flow samples had a higher relative abundance of CRISPR spacers and CRISPR loci compared to the Anemone plume sample, which was sampled above the vent. However, the relative abundance of viral sequences did not differ between the Anemone plume and diffuse flow samples (Supplementary Table 2).

Diversity of viral assemblage and microbial community
Overall, viral diversity analyses revealed that the viral assemblages within these vents had high richness and were not dominated by specific viral strains. With the exception of a few samples, the rarefaction curves for the viruses and microbes did not reach saturation (Supplementary Fig. 2), indicating that the vOTUs recovered in this study did not capture the total diversity in the samples. [...] to samples from Von Damm and Piccard vent fields. However, we did not observe meaningful correlations between vOTU, spacer cluster, or host diversity. Moreover, we did not observe significant differences in either viral or host diversity between the Piccard and Von Damm vent fields at the Mid-Cayman Rise (viral diversity: p = 0.053, t-test; host diversity: p = 0.66, t-test). We also examined the relative diversity of CRISPR spacers, which provides an indication of past viral infections rather than the current virus pool. We did not observe a significant difference in the diversity of CRISPR spacers between samples from Piccard, Von Damm, and Axial Seamount (Figure 2F, p = 0.093, t-test).

We observed significant differences in viral diversity between vent sites within Axial Seamount (p = 0.022, t-test). However, no significant differences emerged in terms of the diversity of microbial hosts and CRISPR spacers (host diversity: p = 0.74, t-test; spacer diversity: p = 0.11, t-test). At Anemone vent, the diffuse flow samples had higher CRISPR spacer and microbial diversity than the plume sample, but viral diversity did not differ between these samples (Supplementary [...]

In order to characterize viral distribution and community similarity across hydrothermal vent fluids, we evaluated the extent to which viral sequences and CRISPR spacers were distributed across samples, then compared these results to the host microbial community. As before, we clustered [...] (Table 1). Viral and CRISPR spacer clusters were shared more widely among vent sites within Von Damm and Piccard compared to Axial Seamount, but microbial lineages were shared more widely among vent sites at Axial Seamount (Table 1).

We created hierarchical dendrograms to assess the similarity of samples based on their viral content. Viral assemblages in samples from the Mid-Cayman Rise and Axial Seamount grouped separately (Figure 3A). Within Axial Seamount, samples taken in successive years from the same vent tended to have similar viral assemblage compositions (Figure 3A). In contrast, at the Mid-Cayman Rise, samples taken from the same site in two different years did not cluster together, and we observed weak clustering of samples by location (Figure 3A [...]

In contrast to the viral and microbial assemblages, hierarchical dendrograms based on CRISPR spacer compositions showed little clustering by location (Figure 3C). Based on spacer assemblages, samples from either Axial Seamount or the Mid-Cayman Rise did not cluster together, and samples taken from the same vent sites in different years did not cluster together. Very few CRISPR spacer clusters were found at multiple vent sites (Table 1).
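The notion of clusters being "shared" among vent sites can be illustrated with a short Python sketch that tallies, for each vOTU or spacer cluster, the set of sites where it was observed. The function names, site labels, and cluster IDs are hypothetical, and the two-site threshold is an assumption made for illustration; the study's exact criteria for Table 1 are not reproduced here.

```python
from collections import defaultdict


def sites_per_cluster(observations):
    """Map each cluster ID to the set of vent sites where it was observed.

    observations: iterable of (site, cluster_id) pairs.
    """
    sites = defaultdict(set)
    for site, cluster_id in observations:
        sites[cluster_id].add(site)
    return dict(sites)


def fraction_shared(observations, min_sites: int = 2) -> float:
    """Fraction of clusters observed at min_sites or more distinct sites."""
    per_cluster = sites_per_cluster(observations)
    if not per_cluster:
        return 0.0
    shared = sum(1 for s in per_cluster.values() if len(s) >= min_sites)
    return shared / len(per_cluster)


if __name__ == "__main__":
    # Hypothetical observations (illustrative only).
    obs = [("site_A", "c1"), ("site_B", "c1"),
           ("site_A", "c2"), ("site_A", "c2"),
           ("site_C", "c3")]
    print(sites_per_cluster(obs))
    print(fraction_shared(obs))
```

Applying the same tally to vOTUs, spacer clusters, and microbial lineages would allow the three distributions to be compared on a common footing.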

Networks of viral infection
Networks of viral infection generated from viral sequences, CRISPRs, and MAGs were used [...] MAGs (Figure 4A). We did not observe any connections between MAGs and vOTUs in any of the non-diffuse flow samples (Background, CTD1200 and the Anemone plume sample). The number of viral connections appeared to be related to, but was not significantly correlated with, the microbial hosts' relative abundance (Supplementary Table 3 [...] (Figure 4). However, in some cases the MAGs were linked to these shared vOTUs through the same CRISPR direct repeat type and were in the same sample, suggesting that these specific connections may have been due to matching CRISPR direct repeat types rather than true cross-infection. These have been indicated in Figure 4 (red lines).

[...] on microbial community similarity and distribution in hydrothermal systems has shown that vents in close proximity are often more similar to one another in terms of microbial community structure than to geographically distant vents (4, 27, 40, 83). This may result from subseafloor plumbing that restricts fluid flux between sites, creating "islands" of microbial diversity that are distinct from one vent site to the next (26). Here, we show that these barriers to dispersal apply to viruses as well, and that viruses may be even more spatially restricted than their microbial hosts. Microbial lineages that spread between vent sites may face infection from novel viral strains not found in other vent sites. These endemic viral populations thus further shape distinct microbial community structure at individual vent sites.

Although the viral assemblages at hydrothermal vents had high diversity and restricted dispersal between vent sites and vent fields, our results show that the viral assemblages in vents persist over time, particularly at Axial Seamount where the same vents were sampled over a 3-year period. Viral assemblages from samples from the same vent sites at Axial across three years clustered together in the hierarchical dendrograms (Figure 3A), and our viral infection networks revealed that a number of viruses at Axial Seamount were linked to specific microbial hosts at the same vent over multiple years (Figure 4A). These patterns match those observed in the microbial communities [...]

Matches between CRISPR loci and viruses provide an indication of which viruses are being targeted by the CRISPR immune response. We found that while some CRISPR spacers target relatively abundant viruses, most CRISPRs target relatively rare viruses in these systems. This is consistent with previous observations in an archaea-dominated hypersaline lake (88), where the vast majority of CRISPRs were found to target viruses with populations too small to allow for the assembly of contigs. The relative scarcity of viruses targeted by CRISPRs may result from an evolutionary arms race: CRISPRs limit the abundance of the viral populations they target, while concurrently, viruses undergo mutations, limiting the ability of CRISPR spacers to target them. Alternatively, these observations could result from an abundance of inactive spacers inherited over multiple generations. However, we do not believe this to be the case because CRISPR spacer clusters were infrequently present across multiple samples, suggesting that spacers were integrated on sub-generational timescales.
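To make the host-range reasoning behind the infection networks in this section concrete, the sketch below summarizes a list of vOTU-to-host links as a host range per vOTU. It assumes the links have already been inferred (for example, from spacer matches) and does not reproduce how the study derived them; the function names, vOTU IDs, and lineage labels are all hypothetical.

```python
from collections import defaultdict


def host_range(links):
    """Map each vOTU to the set of host lineages it is linked to.

    links: iterable of (votu_id, host_lineage) pairs, i.e. network edges.
    """
    hosts = defaultdict(set)
    for votu_id, lineage in links:
        hosts[votu_id].add(lineage)
    return dict(hosts)


def multi_lineage_votus(links):
    """Return sorted vOTU IDs linked to more than one host lineage."""
    return sorted(v for v, lineages in host_range(links).items()
                  if len(lineages) > 1)


if __name__ == "__main__":
    # Hypothetical edges (illustrative only).
    edges = [("vOTU_1", "lineage_X"), ("vOTU_1", "lineage_X"),
             ("vOTU_2", "lineage_X"), ("vOTU_2", "lineage_Y"),
             ("vOTU_3", "lineage_Z")]
    ranges = host_range(edges)
    print({v: len(h) for v, h in ranges.items()})
    print(multi_lineage_votus(edges))
```

A narrow host range in this representation corresponds to most vOTUs mapping to a single lineage. Links that may reflect shared direct repeat types rather than true cross-infection, as noted above, would need to be flagged before such a summary is interpreted.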

Patterns recorded in microbial CRISPR loci do not reflect the contemporary viral assemblage
In contrast to the vOTUs, the CRISPR spacers did not demonstrate any clear biogeographic patterns. We expected CRISPR spacers to be more widespread than viruses since viral composition represents the virus community at the time of sampling, while spacers represent a history of viral infection. Contrary to our predictions, while both had limited distributions, we found viruses to be more widespread than CRISPR spacers at all scales examined (Table 1). These results contrast with previous studies of CRISPR spacer biogeography in terrestrial hot springs, where both viruses and CRISPR spacers showed clear biogeographic structure (89). This suggests that there is selective pressure for CRISPR spacer composition to evolve more rapidly than viral sequences via the loss or mutation of CRISPR spacers. However, it is also possible that we found viruses to be more widespread because of undersampling of CRISPR spacers, or our use of different clustering algorithms for viruses and spacers: spacers may have been clustered at a finer resolution, resulting in a narrower distribution for each spacer.

Given that CRISPR spacers provide a history of infection, comparing the record of past viral infections via CRISPR arrays with viral sequences in the metagenomes can provide insights into whether CRISPR arrays provide an accurate representation of the viral assemblage at the time of sampling, as well as the rate at which CRISPR spacers are accumulated. We did not find that the most common or cosmopolitan viruses and CRISPR spacers matched each other. This may arise from a temporal mismatch: it takes time for spacers to be incorporated into and lost from CRISPR loci as virus abundances shift and viruses evolve. Additionally, just one SNP (which we allowed for when aligning spacers to viruses) can prevent a CRISPR spacer from providing resistance against a virus (66). This could explain the discrepancy between viruses and spacers: once resistance of a spacer to a particular virus is suppressed, the population of the virus is freed to shift independently from the spacer in the host population. The discrepancy we observed between viruses and spacers is important to note when attempting to use CRISPR spacers to study viral populations or vice versa.

Examination of the prophage abundance in MAGs revealed a relatively high abundance of prophage encoded in each MAG, confirming previous results indicating that lysogeny is a common lifestyle in hydrothermal systems (14, 15). However, while examination of the gene content of viral contigs revealed several ORFs encoding a range of functions including cell membrane function and energy production, it was difficult to prove conclusively that these genes were viral-encoded AMGs rather than potential cellular contamination, and thus no concrete conclusions were provided here (see Supplementary Materials). Previous research has suggested that vent viruses encode AMGs (18, 19), and thus it is likely that many of these genes represent viral-encoded AMGs, but further research is necessary to determine AMG diversity and prevalence across vent systems.

[...] classified as eukaryotes were excluded. Rare taxonomic groups represent taxonomies that comprised less than 1% of reads that mapped to 16S rRNA gene sequences in a given sample.