Introduction

Whole genome sequencing provides a powerful lens for the investigation of fungal pathogens. In providing a comprehensive snapshot of the gene content and thereby the functional potential, the genomic studies of human pathogenic fungal species have revealed the repertoire of proteins that contribute to host interactions [1•, 2, 3•, 4], predicted metabolic capabilities and requirements [1•, 5•], uncovered the potential for sexual reproduction in some species [6, 7], and have been used as the platform to study specific genes as well as for systematic functional genomic approaches [8,9,10]. For diagnostic purposes, the complete sequence of clinical isolates can type the infecting species and the subgroup or lineage with high accuracy. In addition, the identification of specific mutations that are clinically actionable, including those that confer or promote resistance to antifungal drugs, has the potential to guide treatment decisions for individual patients.

The ease and falling cost of generating whole genome sequence have greatly increased the use of this data. The ability to sequence genomes on demand has dramatically expanded the scope of sequenced species, and importantly allows rapid response to new or emerging pathogens. Further, comparing the genome-wide variation of clinical isolates from an outbreak with isolates from environmental or other associated samples can precisely determine how clonal an outbreak is across patients, measure the similarity to isolates from other sources, and establish transmission chains. For recurrent infections or in the context of prolonged outbreaks, genome sequence can trace how pathogens evolve over time, tracking the emergence of new genotypes and the spread of particularly virulent or drug-resistant groups.

Whole Genome Sequencing Approaches

There are two general approaches for genomic analysis of fungal pathogens. One involves generation of a genome assembly de novo, such as for a species that has not been previously sequenced and assembled. In the other approach, commonly termed re-sequencing, variants are identified between an existing reference assembly and a sequenced isolate via alignment of sequence reads to the reference. Both methods rely on generating high depth whole genome shotgun sequence, necessary to achieve a high-quality consensus across the genome. However, the choice of technology used to generate the sequence is influenced both by the approach selected and by the goals of the study. In the initial years of fungal genome sequencing [11], a small number of prioritized species were sequenced using long, paired-end reads produced with Sanger technology from multiple libraries. By contrast, current approaches both for assembly and for variant detection leverage short read sequence such as Illumina and often only a single small insert library. While this approach is sufficient for many goals, incorporating reads from multiple libraries, including those with larger inserts, and longer reads from technologies such as Pacific Biosciences or Oxford Nanopore is particularly important for accurate assembly of repetitive genomes and for examining structural variation (Fig. 1).

Fig. 1 Overview of whole genome sequencing approaches. a De novo assembly approach; while draft assemblies are more fragmented than finished or chromosomal assemblies, both can be annotated for gene structures and repetitive elements. b Re-sequencing approach; this approach starts with initial alignment of sequence reads to a reference assembly, with separate processes required to identify copy number variants, structural variants, and SNPs and indels.

Generation of a new genome assembly for a species, also known as de novo sequencing and assembly (Fig. 1), has been applied to generate reference genome assemblies for all of the major human fungal pathogens (Table 1). Many of these species have now achieved chromosome-scale assemblies, although there may still be gaps at subtelomeres or within each chromosomal sequence, often in repetitive or low complexity regions. The choice of a genome sequencing strategy depends both on the properties of the genome and on the goal of the analysis. While the assembly of short reads from a single library can provide a good overview of gene content, repetitive sequences sharing high identity may not be resolved, resulting in gaps in the genome assembly [12]. This issue can be overcome either by the generation of larger insert mate pair libraries for short read sequencing or by the incorporation of longer reads to provide linkage information; highly repetitive genomes may benefit from a strategy of exclusively long read sequencing and assembly. Similar to the prior use of physical or genetic maps in validating and anchoring draft assemblies to chromosomes, technologies such as Hi-C that map the three-dimensional space of a genome can also be used for higher order scaffolding of assemblies [13, 14]. In diploid genomes, heterozygosity may also impact genome assembly; most methods seek to generate a haploid version of the reference, and the generation of a consensus sequence from heterozygous regions can incorrectly merge haplotypes. A phased diploid assembly for Candida albicans was constructed using a panel of strains homozygous for specific chromosomes [15•]; a more general approach using long reads was recently reported [16]. De novo assemblies of either haploid or diploid genomes can then be annotated by predicting the structure and function of protein coding genes, using de novo, homology-based, and evidence-based prediction algorithms [17]. The ability to generate deep coverage of RNA-Seq from multiple conditions enables a higher level of accuracy and validation of gene structures, and has been applied to systematically improve gene structures and predict alternatively spliced transcripts.

With the decreased cost of whole genome sequencing, de novo assemblies have been generated on demand to examine new species. These include both rarely observed pathogens and many nonpathogenic species related to common pathogens, with the goal of using comparative genomic approaches to identify differences that could contribute to pathogenesis. The ability to rapidly generate genomes for new species is also important for the response to emerging pathogens and in the context of recent fungal outbreaks. Recent studies have also expanded our view within a single species by examining the genome of more than one “reference” isolate for some species, which can characterize differences in gene content between isolates of the same species. From a larger perspective, the increasing number and diversity of sequenced genomes enable a wide range of studies focused on comparisons of specific genes, and provide a set of references for alignment-based approaches including both metagenomics and sequencing of single isolates.

For species for which a high-quality reference assembly is available, re-sequencing is an alternative approach to identify genome-wide variants. Typically, short read sequence is generated from one or more isolates of the same species, reads are aligned to a reference assembly, and high-quality variants are identified from the alignments (Fig. 1). These methods have been applied to both haploid and diploid fungal genomes. The full-genome resolution and scalability of this approach make it ideal for examining transmission links in the context of an outbreak and pathogen evolution during the course of an infection, settings in which few variants may be expected. Variants can also be mapped to genes on the reference genome to infer changes in important genes involved in drug resistance (see below).
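
The last step, mapping variants to genes, can be illustrated with a minimal Python sketch. It assumes that variant positions have already been called against the reference and that gene annotations are available as sorted, non-overlapping intervals on one chromosome; the gene names and coordinates are placeholders invented for illustration, not real loci.

```python
from bisect import bisect_right

# Hypothetical annotations on one chromosome: (start, end, name), 1-based inclusive,
# sorted by start and non-overlapping. These coordinates are placeholders.
GENES = [
    (1_000, 2_500, "geneA"),
    (4_000, 5_200, "geneB"),
    (9_000, 9_900, "geneC"),
]
STARTS = [gene[0] for gene in GENES]


def gene_at(position):
    """Return the name of the gene containing a 1-based position, or None if intergenic."""
    # Index of the last gene whose start is <= position (-1 if there is none).
    i = bisect_right(STARTS, position) - 1
    if i >= 0 and GENES[i][0] <= position <= GENES[i][1]:
        return GENES[i][2]
    return None


def annotate(variant_positions):
    """Pair each variant position with the gene it falls in (None for intergenic sites)."""
    return [(pos, gene_at(pos)) for pos in variant_positions]


if __name__ == "__main__":
    for pos, gene in annotate([1_500, 3_000, 5_200, 9_950]):
        print(pos, gene or "intergenic")
```

A real analysis would read variants and annotations from standard files and would also predict the effect of each variant on the encoded protein; the binary search above only shows the coordinate lookup.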

In addition to these approaches that rely on sequencing of individual isolates, the increase of metagenomic sequencing has driven the development of methods to look directly at populations within a single sample. Also using a shotgun sequencing approach, the sequence of a pool of samples can be used to determine the species within a single sample, to categorize the gene content to suggest functional capacity, and recently to examine species level variation.

Recent Genome Sequencing Findings

With the advent of highly multi-parallel sequencing, the increased ease and low cost of generating whole genome sequence led to a dramatic expansion of the number of fungal genomes available. Notably, the 1000 fungal genomes project at the US Department of Energy Joint Genome Institute (http://1000.fungalgenomes.org) aims to provide a comprehensive representation of the fungal kingdom, where each family level division would be represented by at least two genomes. The pace of sequencing is already eclipsing the scale of this project, with over 2100 fungal genome assemblies available in NCBI (https://www.ncbi.nlm.nih.gov/genome/browse/#). Of these, only 812 have gene annotations deposited in NCBI, highlighting the more limited scope of easily available gene sets for comparative analysis.

Recent years have seen advances in the generation or improvement of reference genomes for the major human fungal pathogens (Table 1). The assembly of Cryptococcus neoformans var. grubii (serotype A) into 14 chromosomes incorporated deep RNA-Seq for annotation, allowing assessment of the structures of nearly all genes to provide a comprehensive view of coding and noncoding transcripts [24]. Building on the initial genome sequence of two lineages of Cryptococcus gattii [25], genome comparison across 16 genome assemblies representing all four predominant lineages of C. gattii identified variation in gene content, including RNAi, iron-binding, and stress-related genes, and selection pressure on transporters [25, 26]. Sequencing of the obligate pathogenic Pneumocystis species required optimization of strategies to purify fungal material from host pulmonary tissue and iteration of sequencing strategy to improve assemblies [1•, 5•]. The high copy major surface glycoprotein gene family, which encodes the most abundant cell surface protein in Pneumocystis, was not well represented in assemblies of short read data [5•] but was improved using longer PacBio reads [1•]. Building on single genome studies of the species causing mucormycosis [32], a comparative genomic study of 30 species of Rhizopus and Mucor revealed that the CotH invasin gene family is a unique feature of all invasive Mucorales and that copy number appears to correlate with species prevalence [3•]. These and other recent studies collectively provide reference genomes for the study of all the major human fungal pathogens (Table 1).

Table 1 Genome assemblies for human fungal pathogens

In addition to studies focusing on a single genome as representative of a species, multiple studies have used re-sequencing to examine variation across multiple isolates of a single species. Large studies of C. gattii have characterized the relationship of global isolates [34, 35] and identified a loss of function mutation in the mismatch repair gene MSH2 in one sublineage, VGIIa [35]. One of the largest studies to date in C. neoformans var. grubii compared the sequence of 387 isolates of clinical and environmental origin; a genome-wide association study (GWAS) of variants associated with the isolation source identified virulence factors and stress response genes [36•]. A parallel GWAS of melanization in these isolates identified loss of function mutations in clinical isolates in the BZP4 transcription factor required for melanin production [36•]; while melanin is a virulence factor in Cryptococcus, the presence of multiple loss of function mutations in clinical isolates suggests that loss of melanin production is observed clinically. A study of a panel of 20 clinical isolates of C. albicans characterized frequent loss of heterozygosity and pinpointed a loss of function of EFG1, a gene required for filamentous growth, in one isolate; this isolate further showed a competitive advantage during gastrointestinal growth over isolates that were isogenic except for the addition of a wild type copy of this gene, suggesting that this change could have provided an advantage during commensal growth [37]. Additional studies in dimorphic fungi have defined new population subdivisions and the level of genetic exchange between these groups [2, 38].

Refining Phylogenetic Relationships

Genome sequence is also utilized to assess phylogenetic relationships between species and has helped resolve inconsistencies in species naming. The Phylogenetic Species Concept requires consistency across multiple gene trees [39], as single genes could be subject to recombination or introgression and not reflect the true species relationships. This can highlight conflicts in the naming of genera or species grouped by morphological and phenotypic information and suggest how to refine species boundaries. However, where to set species boundaries and the decision of what evidence justifies changes in species naming is debated [40, 41]. More straightforward cases are those where phylogenetic analysis highlighted inconsistencies in the current genus naming; a re-assessment of the Emmonsia genus including many newly reported clinical cases [42] led to a proposed re-organization of the taxonomy of this group including a new genus name [43]. Such assessments incorporate analysis of the support for phylogenetic subdivisions and the genetic distance between groups.

While phylogenies based on whole genome data may capture the same major phylogenetic relationships and subdivisions as those observed in phylogenies based on small numbers of loci, analysis of whole genome data provides a more comprehensive view of genetic exchange between subdivisions. For example, in C. neoformans, four well-separated lineages (VNI, VNII, VNB-I, and VNB-II) in whole genome phylogenies appear similar to those identified in multi-locus phylogenies, and at a finer scale, subdivision of VNI into three subgroups is also strongly supported by phylogenetic analysis of whole genome data. However, while recombination is limited between the four lineages, the level of recombination across VNI appears similar to that within each subgroup, suggesting that the phylogenetic subdivisions within VNI do not reflect genetic isolation [36•]. Such analyses of the level of genetic exchange and separation can help support or question subdivisions made based only on multi-gene phylogenies. In addition, these studies can highlight unusual phylogeographic patterns, which can motivate further population sampling to evaluate rare or unexpected subgroups.

Outbreaks and Emerging Species

Multiple species of fungi have been responsible for major outbreaks of infections in the USA within the last 10 years. In contrast to the predominant species that cause human fungal infections, many outbreaks have resulted from organisms that are not a common cause of infection, and consequently some of these species are not well studied or previously sequenced. For such cases, a primary goal for whole genome sequencing has been to generate a reference genome that could be used for identification of genome-wide variants across outbreak samples as well as for further genomic and transcriptomic studies of pathogenesis. Comparing the genomes of patient and environmental isolates from populations of these pathogens can help trace the origin and transmission patterns in an outbreak; if isolates from an outbreak and potential source show very few genome-wide differences, this supports a clonal outbreak mechanism with a strong link to the potential source. In addition, the gene set predicted from the genome of outbreak isolates can help develop biomarkers and new diagnostics, and can potentially guide our understanding of what enabled a strain to cause a suddenly high rate of severe infections.

One major treatment-acquired fungal outbreak in the USA resulted from injection of methylprednisolone, as a treatment for pain management, contaminated with the phaeoid fungus Exserohilum rostratum. As of January 2013, E. rostratum had caused more than 750 cases of phaeohyphomycotic meningitis and at least 61 deaths in 19 US states [44]. A very similar but smaller fungal outbreak occurred 10 years previously, caused by a steroid contamination with Wangiella (Exophiala) dermatitidis [45]. Both species of phaeoid fungi (black or dark brown pigmented) are infrequently the cause of superficial infections; however, in rare cases they result in systemic neurotropic infections. Whole genome sequencing of E. rostratum purified from clinical samples from patients injected with contaminated steroids and from steroid lots established that a clonal fungus was present in both patients and steroid lots [46]. This analysis incorporated both de novo and re-sequencing approaches (Fig. 1): a reference assembly was generated for one of the outbreak strains, and SNPs were identified by aligning reads from all other samples to this assembly. Analysis of variants revealed that genomes from outbreak isolates, from both 19 patients and 6 steroid lots, were nearly identical; only two SNPs were found between any isolate from a patient and a compounding vial. By contrast, over 136,000 SNPs differentiated the outbreak isolates from other environmental isolates, though these were collected in years prior to the outbreak in different geographic regions. This genomic analysis provided strong evidence that the fungal strains found in all patients and in the suspected steroid vials were identical.
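
The comparison underlying this kind of analysis is a count of SNP differences between every pair of isolates. The following Python sketch illustrates that idea on toy data; it assumes each isolate has already been reduced to a string of base calls at the same set of variable sites, with "N" marking a site that could not be called. The isolate names and sequences are invented for illustration and are not taken from the study.

```python
from itertools import combinations


def snp_distance(a, b, missing="N"):
    """Count sites where two isolates have different called bases, skipping missing calls."""
    return sum(1 for x, y in zip(a, b) if x != y and missing not in (x, y))


def pairwise_distances(genotypes):
    """Return {(name1, name2): SNP distance} for every pair of isolates, names sorted."""
    return {
        (n1, n2): snp_distance(genotypes[n1], genotypes[n2])
        for n1, n2 in combinations(sorted(genotypes), 2)
    }


# Toy data: made-up base calls at ten variable sites.
GENOTYPES = {
    "patient_1": "ACGTACGTAC",
    "patient_2": "ACGTACGTAT",
    "vial_1": "ACGTACNTAC",
    "soil_1": "TCGAACTTGC",
}

if __name__ == "__main__":
    for (n1, n2), dist in pairwise_distances(GENOTYPES).items():
        print(f"{n1}\t{n2}\t{dist}")
```

In this toy example the patient and vial isolates differ from one another at no more than one site, while the unrelated soil isolate differs from each of them at several sites. How few differences are needed to call isolates clonal depends on the species, genome size, and variant calling method, so no threshold is built into the sketch.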

One recent report demonstrated how whole genome sequencing can pinpoint isolate relationships and suggest new transmission patterns. To determine whether clinical cases of coccidioidomycosis in Washington State were the first reports of local exposure or resulted from transmission during travel to the southwestern USA, patient isolates were compared with environmental isolates from the local area in Washington and from the southwestern USA. Remarkably, whole genome sequencing revealed that Coccidioides immitis isolates from these patient cases were nearly identical to local soil isolates, differing by only three SNPs across the entire 28 Mb genome [47], suggestive of local transmission and potentially an expanded endemic area for this pathogen.

More recently, genomic analysis of drug-resistant Candida auris established that patient isolates from specific geographic regions are nearly identical [48•, 49]. In one study, a de novo genome assembly was generated for one isolate of C. auris, and SNPs were identified in other isolates using the re-sequencing approach [48•]. While isolates from a given geographic area appear closely related, there is more variation between regions; in addition, drug-resistant isolates show candidate resistance mutations in ERG11, based on mapping such sites from C. albicans.

Outbreaks may also occur when there is a change in a pathogen that increases resistance to stress conditions or enables survival in a new environment. Mutations in the mismatch repair gene MSH2 initially identified in the outbreak lineage of C. gattii may enable more rapid adaptation to stressful conditions or new environments by allowing a higher rate of mutation [35]. Loss of MSH2 has also been detected in Candida glabrata, where this appears to accelerate the acquisition of drug resistance [50]; however, another recent study failed to find a correlation between MSH2 mutation and azole resistance [Dellière S et al. Fluconazole and Echinocandin Resistance of Candida glabrata Correlates Better with Antifungal Drug Exposure Rather than with MSH2 Mutator Genotype in a French Cohort of Patients Harboring Low Rates of Resistance. Front Microbiol. 2016 Dec 23;7:2038.], suggesting that other factors in the genetic background could play a role in drug resistance. These studies suggest that loss of DNA repair genes in fungi could result in a higher mutation rate that may provide an advantage under adverse conditions.

Evolution Within Patients

Where fungal infections persist in patients, studies of how the genome changes during chronic infection can highlight mechanisms of adaptation. Recent studies of C. albicans and Cryptococcus have used whole genome re-sequencing to identify how isolates of these species change during infection. One recent study examined serial isolates of C. albicans from 11 patients with oral candidiasis; this revealed that during clinical passage, isolates acquired new mutations, including some linked to host adaptation [51•]. In addition, genome regions showing loss of heterozygosity during passage include genes implicated in drug resistance. Another study of serial isolates of C. neoformans and C. gattii compared isolates collected at initial presentation of disease and after 120 days or more during a relapse [52]. These cases were also highly clonal, demonstrating that a second independently infecting strain was not the origin of the relapse. The low rate of change in these isolates is consistent with a prior report; however, higher rates of change in some isolates in a separate study were suggested to result from changes in mismatch repair proteins [53]. Analysis of wider sets of isolates is needed to validate whether such mutations are common in Cryptococcus.
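
For a diploid species such as C. albicans, loss of heterozygosity between serial isolates can be illustrated with a short Python sketch. It assumes diploid genotypes have already been called for an earlier and a later isolate from the same patient, and that each is stored as a mapping from position to a pair of alleles; the positions and alleles shown are invented for illustration. This is a simplified view and not the method of the cited studies.

```python
def loh_sites(early, late):
    """Return positions heterozygous in the early isolate and homozygous in the late one.

    Both arguments map position -> genotype as a pair of alleles, e.g. ("A", "G").
    Only positions genotyped in both isolates are compared, and the retained allele
    must be one of the two alleles seen in the early isolate.
    """
    sites = []
    for pos in sorted(early.keys() & late.keys()):
        e1, e2 = early[pos]
        l1, l2 = late[pos]
        if e1 != e2 and l1 == l2 and l1 in (e1, e2):
            sites.append(pos)
    return sites


# Toy genotypes for two serial isolates (made-up positions and alleles).
EARLY = {100: ("A", "G"), 250: ("C", "C"), 400: ("T", "C"), 900: ("G", "A")}
LATE = {100: ("A", "A"), 250: ("C", "C"), 400: ("T", "C"), 900: ("G", "G")}

if __name__ == "__main__":
    print(loh_sites(EARLY, LATE))
```

Because loss of heterozygosity typically affects long tracts rather than single sites, a fuller analysis would go on to merge adjacent sites into regions and then ask which genes those regions contain.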

Antifungal Resistance

Genome sequencing can type known drug resistance mutations, in some cases suggesting whether particular drugs will fail to control an infection. Whole genome variants could be screened for point mutations in specific drug targets that are highly correlated with resistance. For example, specific mutations in the target of azole drugs [54] or in the transcription factors that control the expression of drug efflux transporters [55, 56] can be identified from whole genome sequence data only in isolates that display drug resistance [37]. In addition, copy number variation of both drug targets and transporters can also lead to drug resistance; genomic regions showing higher sequencing read depth for such genes are also associated with drug resistance [57]. If this method could be applied to metagenomic population sequencing, it may be possible to detect early arising drug-resistant mutants that have not yet swept through the population and that precede treatment failure.
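
The read depth signal for copy number variation can be sketched in a few lines of Python. The example assumes that a mean sequencing depth has already been computed for each gene from the alignments, and it uses the median depth across genes as the single-copy baseline; the gene names, depth values, and the 1.5-fold cutoff are all assumptions chosen for illustration rather than values from the cited work.

```python
from statistics import median


def copy_number_estimates(gene_depth, ploidy=1):
    """Estimate copies per gene as ploidy * (gene depth / median depth across genes)."""
    baseline = median(gene_depth.values())
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    return {gene: ploidy * depth / baseline for gene, depth in gene_depth.items()}


def amplified_genes(gene_depth, ploidy=1, fold=1.5):
    """Return genes whose estimated copy number is at least `fold` times the ploidy."""
    estimates = copy_number_estimates(gene_depth, ploidy)
    return sorted(gene for gene, cn in estimates.items() if cn >= fold * ploidy)


# Toy per-gene mean depths (made-up values); geneC has roughly double the typical depth.
DEPTH = {"geneA": 100, "geneB": 98, "geneC": 205, "geneD": 102, "geneE": 101}

if __name__ == "__main__":
    for gene, cn in copy_number_estimates(DEPTH).items():
        print(f"{gene}\t{cn:.2f}")
    print("amplified:", amplified_genes(DEPTH))
```

Real depth profiles are also affected by GC content, mappability, and whole-chromosome aneuploidy, so the median baseline used here is a deliberate simplification.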

Diagnostics

Direct genome sequencing of microbial samples can provide precise diagnostic information. With approaches that meet the need of rapid turnaround for clinical samples, similar methods can be applied to complex samples that contain multiple microbes in addition to human DNA. While analyzing the sequence of a set of organisms requires different approaches than single isolate sequencing, specialized methods can identify the potential pathogen sequence in metagenomic data from a mixed population [58, 59]. Metagenomic data can also be used to examine how much the sequence of a specific species varies within a single sample, separating out the contribution of different strains involved in mixed infections [60, 61]. With sufficient coverage or target enrichment, metagenomic sequence can also provide precise typing of specific mutations of clinical impact, such as those that promote or provide drug resistance. In this way, such approaches can tailor treatments to patients, particularly those at high risk of invasive infections, by using genomic data to predict how a patient will respond to a specific treatment.

Conclusions

Genome sequencing is becoming an increasingly common approach to study human fungal pathogens. Current studies seek to compare the genomes of hundreds of isolates of a given species and to examine pathogen populations at unprecedented scale. Sequencing is no longer a bottleneck; however, analysis approaches also need to continue to scale. Genome data offers the resolution needed to examine microevolution of isolates during the course of infection and to pinpoint the source and transmission networks involved in outbreaks. The use of genomic approaches in diagnosis may become more routine and offers the potential to provide additional clinically actionable information, such as predictions of the level or potential for drug resistance. Even prior to some treatments, metagenomic information may identify fungi and other microbes that are a cause for concern. While real challenges exist to the wide implementation of microbial sequencing methods in the clinic, including turnaround time, ease of use and interpretation, and cost, these approaches enable a high-resolution view of the specific fungal isolates causing disease.