Population-level distribution and putative immunogenicity of cancer neoepitopes
Tumor neoantigens are drivers of cancer immunotherapy response; however, current prediction tools produce many candidates requiring further prioritization. Additional filtration criteria and population-level understanding may assist with prioritization. Herein, we show neoepitope immunogenicity is related to measures of peptide novelty and report population-level behavior of these and other metrics.
We propose four peptide novelty metrics to refine predicted neoantigenicity: tumor vs. paired normal peptide binding affinity difference, tumor vs. paired normal peptide sequence similarity, tumor vs. closest human peptide sequence similarity, and tumor vs. closest microbial peptide sequence similarity. We apply these metrics to neoepitopes predicted from somatic missense mutations in The Cancer Genome Atlas (TCGA) and a cohort of melanoma patients, and to a group of peptides with neoepitope-specific immune response data using an extension of pVAC-Seq (Hundal et al., pVAC-Seq: a genome-guided in silico approach to identifying tumor neoantigens. Genome Med 8:11, 2016).
We show neoepitope burden varies across TCGA diseases and HLA alleles, with surprisingly low repetition of neoepitope sequences across patients or neoepitope preferences among sets of HLA alleles. Only 20.3% of predicted neoepitopes across TCGA patients displayed novel binding change based on our binding affinity difference criteria. Similarity of amino acid sequence was typically high between paired tumor-normal epitopes, but in 24.6% of cases, neoepitopes were more similar to other human peptides, or bacterial (56.8% of cases) or viral peptides (15.5% of cases), than their paired normal counterparts. Applied to peptides with neoepitope-specific immune response, a linear model incorporating neoepitope binding affinity, protein sequence similarity between neoepitopes and their closest viral peptides, and paired binding affinity difference was able to predict immunogenicity (AUROC = 0.66).
Our proposed prioritization criteria emphasize neoepitope novelty and refine patient neoepitope predictions for focus on biologically meaningful candidate neoantigens. We have demonstrated that neoepitopes should be considered not only with respect to their paired normal epitope, but to the entire human proteome, and bacterial and viral peptides, with potential implications for neoepitope immunogenicity and personalized vaccines for cancer treatment. We conclude that putative neoantigens are highly variable across individuals as a function of cancer genetics and personalized HLA repertoire, while the overall behavior of filtration criteria reflects predictable patterns.
KeywordsNeoantigens Neoepitopes Immunogenicity Immunotherapy TCGA
Area Under the Receiver Operating Characteristic Curve
Differential Agretopicity Index
Human Leukocyte Antigen
Mutation Annotation Format
Merkel Cell Carcinoma
Major Histocompatibility Complex
National Center for Biotechnology Information
personalized Variant Antigens by Cancer Sequencing
Receiver Operating Characteristic
The Cancer Genome Atlas
Variant Call Format
Variant Effect Predictor
Neoepitopes are novel peptides that correspond to tumor-specific mutations, are presented on the surface of tumor cells, and have the potential to elicit an immune response (denoting a “neoantigen”). When targeted by cytotoxic T-cells, tumor-associated neoantigens may be associated with increased survival among some cancer patients (e.g. melanoma, cholangiocarcinoma) [1, 2], and tumor neoepitope burden seems to correlate with patient survival [3, 4, 5]. Increasingly, immune checkpoint inhibitor therapies have been successful at stimulating anti-tumor immune responses in several cancer types . However, this ability to leverage the immune system against tumors remains predicated on the immune system’s ability to recognize tumor neoepitopes as “non-self” . Increased tumor neoepitope burden is associated with response to immune checkpoint inhibitor therapies [8, 9], and recent attempts to treat melanoma with personalized neoantigen vaccines have shown preliminary success [10, 11].
Importantly, not all tumor mutations produce neoepitopes. First, the mutation must result in a change in the amino acid sequence of the tumor peptide relative to the normal peptide. The resulting peptide must also be expressed within cancer cells and bind with high affinity to one or more of the patient’s major histocompatibility complexes (MHC) , the polyprotein complexes predominantly encoded by the polymorphic Human Leukocyte Antigen (HLA) loci, which are responsible for presenting peptides to the surface of both normal and cancer cells for detection by the immune system in a patient-specific manner [12, 13, 14]. With little variation, these are the criteria applied by all computational tools for neoepitope prediction from tumor genomic sequencing data, including Epi-Seq , EpiToolKit , pVAC-Seq , INTEGRATE-neo , TSNAD , MuPeXI , and CloudNeo .
A neoepitope with strong affinity for MHC (< 500 nM ) may be a more robust neoantigen candidate if the paired normal epitope has a poor affinity for MHC (> 500 nM). This concept was already implemented in the CloudNeo tool, but the effects of filtering epitope calls in this fashion were not addressed . A greater difference in MHC binding affinity between tumor and normal epitopes can increase neoepitope immunogenicity, as shown by Duan et al. using a “differential agretopicity index” .
While a tumor neoepitope may bind differently from its paired normal epitope, decreased peptide-peptide similarity of the pair at non-MHC-anchoring residues (e.g. amino acid positions 2 and 9 for a 9mer peptide ) is likely to increase neoepitope immunogenicity, per the criterion of continuity hypothesis proposed by Pradeu and Carosella, which suggests that epitopes discontinuous with those that the immune system normally encounters are more likely to trigger an immune response . In fact, Yadav et al. found that neoepitopes with amino acid changes at solvent-exposed positions elicited strong T-cell responses .
Though most approaches only consider the paired normal epitope as a counterpart to its neoepitope, the tumor neoepitope may actually be highly similar to other normal peptides; this emphasizes the importance of considering sequence homology of a neoepitope not just to its normal counterpart, but to all normal peptides that the immune system may encounter in the body. This idea has been previously addressed by the tool MuPeXI , but only by searching for exact sequence matches of neoepitopes to the reference proteome. Others have investigated the importance of neoepitope sequence similarity to known antigen sequences in predicting response to immunotherapy .
It is important to consider the sequence homology of candidate neoepitopes to bacteria and viruses, as a) immunotherapy response has shown dependence upon commensal bacteria [27, 28, 29], b) peptides of bacterial and viral pathogens can be cross-reactive with tumor peptides and recognized by the same tumor-specific T cells , and c) virus-derived oncoproteins from virus-associated cancers such as head and neck cancer , cervical cancer , and Merkel cell carcinoma (MCC)  have been shown to elicit T-cell responses .
Tumor vs. paired normal peptide binding affinity difference: the degree to which the change in predicted MHC binding affinity between tumor and normal epitopes may be immunogenic.
Tumor vs. paired normal peptide sequence similarity: the similarity between paired tumor and normal epitopes, based on protein sequence similarity measures as computed from a BLOSUM62 matrix .
Tumor vs. closest human peptide sequence similarity: the similarity between the neoepitope and other normal, unrelated human peptides.
Tumor vs. closest microbial peptide sequence similarity: the similarity between the neoepitope and known bacterial and viral peptides (from commensals or other infectious pathogens).
Herein, we apply these metrics to neoepitopes predicted for somatic mutations identified in a cohort of melanoma patients and across 18 diseases in The Cancer Genome Atlas (TCGA), with the aim of understanding how these metrics influence and stratify neoepitope predictions, both for an individual and at a population level. To our knowledge, our analysis of neoepitope predictions from TCGA represents the broadest study of this kind to date, describing variation across not only a large patient cohort, but across HLA alleles encompassing 99% of the variation in the population at these loci. We also apply our metrics to a small cohort of individual peptides to assess their efficacy of immunogenicity prediction.
pVAC-Seq analysis of the cancer genome atlas patients
For our analyses, we used somatic mutations identified with MuTect  from Mutation Annotation Format (MAF) files for 18 cancer types (see Additional file 1: Table S1) in TCGA, retrieved using gdc-scan (v1.0.0) . The MAF files were then converted to tumor-normal pair variant call format (VCF) files using the maf2vcf tool in the vcf2maf software package , with the GRCh38/hg38 genome build available from the Broad Institute resource bundle  used as the reference genome. Because these VCFs still contained data for both tumor and normal samples, they were then manipulated to remove data from the paired normal sample, leaving final, tumor-only VCF files for compatibility with pVAC-Seq, which accepts only single-sample VCFs. Each disease type consisted of a variable number of patients (see Additional file 1: Figure S1).
We then annotated these VCF files using Variant Effect Predictor (VEP, v88) . VEP was run according to pVAC-Seq’s recommendations  with the Downstream and Wildtype VEP plugins  used, gene symbols added to output where available, and mutation consequence terms based on Sequence Ontology annotation guidelines . We also used VEP’s GRCh38 annotation cache (rather than querying remotely) for efficiency.
As the TCGA data we obtained did not allow us to calculate patient-specific HLA types, we assumed each tumor could occur in the setting of any HLA allele type, allowing us to explore neoepitope distributions among a broader theoretical population. To do this, we generated a list of HLA alleles to use for subsequent analysis based on allele frequencies originating from the Allele Frequency Net Database  and summarized for use in the software POLYSOLVER (v1.0) . The average frequencies across races (Asian, Black, and Caucasian) of alleles for each HLA gene (HLA-A, HLA-B, and HLA-C) were calculated and normalized to sum to 100%. We then selected the top 145 average-frequency HLA alleles for subsequent analysis (see Additional file 1: Tables S2 – S4), encompassing all HLA alleles among 99% of individuals in the general population.
pVAC-Seq (v4.0.8) was run for each patient and allele combination using 9mer epitopes generated from 17-mer peptides surrounding each missense mutation, and using MHC binding predictions generated by NetMHCpan (v2.8) . For each resulting neoepitope from pVAC-Seq, additional metrics were applied as described below. Note that for the purposes of this study, only epitopes resulting from missense mutations were considered for further analysis, and all peptides were considered to be expressed at equal levels. Note also that neoepitopes from breast cancer (BRCA), cervical cancer (CESC) and melanoma (SKCM) were not assessed for protein sequence similarity against peptides other than their paired normal epitopes.
To assess the degree to which HLA alleles might have overlapping preference for putatively novel binding neoepitopes predicted for mutations across TCGA (see Neoepitope Prioritization Metrics), 1000 random sets of six of the previously described set of HLA alleles (two HLA-A, two HLA-B, and two HLA-C alleles) were chosen using the random.sample function (without replacement) from the random module in Python 2.7.13  (the combinations tested are available in Additional file 2: Table S5). All unique amino acid sequences of neoepitopes that bound to one or more alleles within each random allele set were counted; separate counts were kept for neoepitopes that bound to one, two, three, four, five, or six of the six alleles (i.e. increasing levels of overlap). The script for randomly sampling allele sets and determining overlap is available in our GitHub repository .
For comparison, we assessed recurrence rates among 2,813,809 simulated neoepitopes (9mers) mirroring the size of the TCGA data set. These neoepitopes were drawn randomly from the GRCh38 peptidome, with subsequent introduction of a random single amino acid substitution at a random position along each 9mer. These simulated peptides were labeled by patient and disease site to produce a random set of peptides for each patient equivalent in size to that patient’s predicted neoepitope repertoire. We repeated this process again for a smaller set of 1000 simulated neoepitopes to assess trends in peptide similarity scores. The gene of origin of the random peptide and the gene corresponding to its closest peptide match in the human proteome were retained for protein sequence similarity analysis (see “Tumor vs. closest human peptide sequence similarity” below).
Analysis of neoepitopes in melanoma patients
We identified patient-specific neoepitopes in whole exome sequencing data from 12 patients selected from a study exploring genomic features of response to immunotherapy in melanoma patients . Reads were aligned against the GRCh37d5 reference genome using the Sanger cgpmap workflow . This workflow uses bwa-mem (v0.7.15–1140)  and biobambam2 (v2.0.69)  to generate genome coordinate-sorted alignments with duplicates marked. Realignment around indels and base recalibration were performed using Genome Analysis Toolkit (v3.6) . Variants were called using VarScan (v2.3.9)  in accordance with the methods outlined in the workflow . VCF files were annotated using VEP (v88)  as described above. For all missense single nucleotide variants identified, the tumor and normal protein epitopes of 8, 9, 10, and 11 amino acids in length were produced by reconstructing the nucleotide sequence surrounding the mutation using its coordinates from the VCF file and the CDS in the hg19 gene transfer format file , and translating this sequence into amino acids. Each patient’s HLA type was determined from FASTQ files using Optitype (v1.3.1) , and the binding affinity of all predicted tumor and normal epitopes was predicted with NetMHCpan (v2.8)  for each epitope and patient-specific HLA allele combination. Additional prioritization metrics were applied as described below.
Neoepitope prioritization metrics
Tumor vs. paired normal binding affinity difference
The difference in MHC binding affinity was calculated using NetMHCpan (described above) as the tumor peptide binding affinity subtracted from the normal peptide binding affinity. A novel binding change was defined as a case in which the tumor neoepitope had an MHC binding affinity below 500 nM (tighter association), while the corresponding normal epitope had at least a 5-fold weaker MHC binding affinity (minimum 500 nM). Note that if an unrelated human peptide was found to be similar to the neoepitope (see “Tumor vs. closest human peptide sequence similarity” below), binding affinity difference was also calculated using this peptide’s sequence.
Tumor vs. paired normal peptide sequence similarity
Using a BLOSUM62 matrix, the amino acids at each position along the paired tumor and normal epitopes were given an aggregate similarity score, with higher scores indicating higher similarity. We modified the process described by Henikoff and Henikoff  to remove known MHC anchor residues (the second residue and last residue of each 9mer epitope) from scoring in order to remove redundancy with the binding affinity difference metrics, and to place emphasis upon residues that may be more accessible for recognition by T-cells . However, because these scores vary depending on amino acid composition of the proteins tested, we performed a normalization: we divided the similarity score for a neoepitope compared to another peptide by the similarity score of the neoepitope tested against itself to produce percent similarity scores. Note that if an unrelated human, bacterial, or viral peptide was found to be similar to the neoepitope (see “Tumor vs. closest human peptide sequence similarity” and “Tumor vs. closest microbial peptide sequence similarity” below), paired sequence similarity was also calculated using this peptide’s sequence instead of the paired normal epitope.
Tumor vs. closest human peptide sequence similarity
Using BLAST+ , a protein-protein, local, ungapped alignment search of all known human proteins was performed to find the closest matching peptide to each tumor peptide. This was performed using blastp (v2.4.0+) and a peptide database constructed with makeblastdb (v2.4.0+) using Ensembl’s set of all GRCh38 peptides . The BLOSUM62 scoring matrix, ungapped alignments, and an E value cap of 200,000 (to capture blast hits for as many epitopes as possible) were applied, and composition-based statistics were turned off. The top scoring alignment (i.e. lowest E value) with an alignment length of 9 was used as the best match for each neoepitope. If more than one alignment shared the top score, the names of all matching peptides were retained. If a top match was the neoepitope’s normal counterpart, a status of “matching” was assigned to the neoepitope, otherwise a status of “nonmatching” was assigned.
Tumor vs. closest microbial peptide sequence similarity
Using BLAST+ , a protein-protein, local, ungapped alignment search of all known bacterial and viral peptides was performed to find the closest matching bacterial and viral peptides to each tumor peptide. This was performed using blastp (v2.4.0+) and peptide databases made using makeblastdb (v2.4.0+). The bacterial database was assembled using the National Center for Biotechnology Information (NCBI)‘s nonredundant bacterial FASTA releases from RefSeq , while the viral database was assembled using NCBI’s nonredundant viral FASTA releases from RefSeq . The BLOSUM62 scoring matrix, ungapped alignments, and an E value cap of 200,000 were applied, and composition-based statistics were turned off. The top scoring alignment (i.e. lowest E value) with an alignment length of 9 was used as the best match for each neoepitope for both the bacterial and viral alignments. If more than one alignment shared the top score, the names of all matching peptides were retained for both the bacterial and viral alignments.
Features associated with immunogenicity
To assess how well our prioritization metrics reflect a neoepitope’s ability to elicit an immune response, we applied our criteria to predicted neoepitopes from six studies in which peptide-specific immune responses were measured [3, 11, 20, 62, 63, 64]. For data from all studies, we used only peptides which had complete information regarding the neoepitope and its paired normal peptide, as well as complete data regarding epitope-level immune response, providing a total cohort of 419 peptides. Because only binary immune response data was available from Ott et al.  and Bjerregaard et al. , we generated binary response data from the Carreno et al.  dataset for compatibility: a neoepitope was considered to have elicited an immune response if it had a percent neoantigen-specific T-cell in lymph+/CD8+ gated cells of greater than 10%. Of the seven peptides from the Le et al.  dataset that were tested for clonal T cell expansion, the three peptides that demonstrated clonal T-cell expansion were considered to have elicited an immune response, while those that only demonstrated immune reactivity in an ELISpot assay were considered not to have elicited an immune response. Among peptides evaluated in co-culture experiments from Tran et al.  and Gros et al. , those that were T-cell reactive were considered to have elicited an immune response. We produced a linear model to determine the relationship between our neoepitope novelty criteria and peptide-specific immune response (see “Statistical analysis” below). Using Scikit-learn for Python , SVM and Random Forest models were also trained with 10 fold cross validation for comparison.
Statistical analysis was performed using R (v3.3.2) in RStudio. To test the relationship between per-allele neoepitope burden and neoepitope frequency across the TCGA dataset, we obtained the Pearson’s product-moment correlation and associated p-value using a two-sided test. To determine whether a difference in tumor vs. paired normal peptide binding affinity difference exists between epitopes with and without an amino acid change at an anchor position, we applied a Wilcoxon rank sum test. We also used the Wilcox test to compare tumor vs. paired normal peptide sequence similarity scores between epitopes with novel vs. non-novel binding changes, and to compare the difference in tumor vs. paired normal peptide sequence similarity scores for the neoepitope with its paired normal epitope and its closest matching human peptide from BLAST for matching versus non-matching genes. A Welch’s two sample t-test was used to compare the similarity of neoepitopes to bacterial vs. viral peptides. We used the package pROC  to obtain AUROC scores and the lm function to determine the relationship between our continuous predictors and observed peptide-specific immune response data. Our analyses are available as an R script on our GitHub repository .
We next sought to explore the repertoire of shared neoepitopes across TCGA as a function of HLA subtypes. Among the ten HLA alleles with the greatest number of high-affinity epitopes, there was surprisingly little repetition of epitopes binding to that allele across TCGA (mean 1.06 epitopes repeated); however, each allele had at least one epitope encountered multiple times across patients and diseases (31–168 occurrences). The most frequently repeated neoepitope for HLA-B*15:03, KQMNDARHG, was found most often in breast carcinoma patients. In all cases, this neoepitope originated from a H1047R substitution caused by a single nucleotide variant in the gene PIK3CA, a known driver mutation . Two other recurrent epitopes, LSKITEQEK and STRDPLSKI, were identified 168 times for HLA-A*30:01 and HLA-B*15:17, respectively, with both originating from the same oncogenic mutation in PIK3CA (E542K substitutions) , and were found exclusively in breast carcinoma, cervical squamous cell carcinoma, and prostate adenocarcinoma patients. There were 5175 other occurrences of PIK3CA neoepitopes, as well as 7139 TP53 (a known cancer driver gene ) neoepitopes, and 35,872 occurrences of neoepitopes originating from MUC16, the gene encoding the CA-125 cancer biomarker . On average, 4.7% of patient epitopes were repeated across patients within their own disease site and 7.9% of patient epitopes were repeated across all of TCGA, a rate significantly higher than that anticipated by random chance alone (see Additional file 1: Figure S3 and Additional file 1: Table S6), and likely attributable to common cancer mutations. It is, finally, important to note that the true number of shared neoepitopes among cancers is likely to be smaller due to the random assortment of actual HLA alleles across the population.
We were also interested in the degree to which an overlap of neoepitope preferences existed between the HLA alleles studied, as an epitope that binds strongly to more than one of a patient’s suite of HLA alleles would likely be a better candidate for applications such as peptide vaccines. For 1000 randomly sampled sets of six HLA alleles (see Methods), on average, 3.1% of neoepitopes across TCGA had affinity for at least one of the six alleles in each set, emphasizing the importance of a patient’s unique HLA repertoire in neoepitope presentation. Of these epitopes, the majority (87.5% on average) only had affinity for one allele, and 11.0% on average had affinity for two alleles, but there were some cases where epitopes had affinity for all six alleles (0.0007% of epitopes on average, and up to 0.2% of epitopes for one allele set; see Additional file 1: Figure S4).
Tumor vs. paired normal peptide binding affinity difference
We next assessed novelty of MHC tumor vs. paired normal binding affinity change across TCGA, noting that a minority of neoepitopes (20.3% on average) met this criterion, with a similar distribution across all TCGA cohorts (average proportion per patient of 18.0–24.7%, see Fig. 2a). This finding was dependent upon HLA type, with a median 5.6-fold difference in the number of neoepitopes with novel binding changes between the 25th and 75th percentile HLA alleles among diseases (see Fig. 2b). All diseases had the greatest number of neoepitopes associated with the allele HLA-B*15:03; however, there was no statistical association between HLA allele frequency in the general population and that allele’s corresponding number of novel binding change neoepitopes (Pearson’s product-moment correlation of 0.1; p = 0.1; see Fig. 2b).
Importantly, we note that the mutation of amino acid residues at MHC anchor positions (i.e. the second residue and last residue of each 9mer epitope) tends to result in more dramatic predicted binding affinity differences between the tumor and paired normal peptides compared to mutation of non-anchor residues (median of 2935.1 nM vs. 28.1 nM, respectively; p < 2.2 × 10− 16; see Additional file 1: Figure S5).
Tumor vs. paired normal peptide sequence similarity
While anchor residue mutations may influence differential peptide binding, anchor residues are anticipated to be more directly engaged with the MHC complex and thus less accessible for T-cell recognition . We therefore sought to investigate the differences in peptide sequence between tumor neoepitopes with mutations in T-cell exposed (i.e. non-anchor) residues and paired normal epitopes. The average protein sequence similarity score (see Methods) between paired epitopes with single nucleotide variants at non-anchor positions across all diseases was 83.5% (ranging from 60.0% to 98.1%, sd = 6.3%). When we assessed these similarity scores in conjunction with our novel binding criteria (see Methods), we observed that neoepitopes which displayed a putatively novel binding change tended to have lower similarity to their paired normal counterpart than those without such a binding affinity change (mean 81.5% vs. 83.7%, respectively; p < 2.2 × 10− 16). This level of significance held even when controlling for tumor neoepitope binding affinity (see Additional file 1: Table S7).
Tumor vs. closest human peptide sequence similarity
Tumor vs. closest microbial peptide sequence similarity
Next, we assessed neoepitope sequence homology with peptides from pathogenic and commensal microorganisms. Almost all neoepitopes were found to have at least one matching bacterial or viral peptide by blastp (87.6% and > 99.9%, respectively). Overall, tumor neoepitopes were more similar to bacterial peptides compared to viral peptides (mean percent peptide sequence similarity score of 91.4% (sd = 6.6%) and 76.7% (sd = 9.1%), respectively, (p < 2.2 × 10− 16)). Interestingly, in 56.9% of cases where a neoepitope had a bacterial blastp hit, the bacterial peptide was more similar to the neoepitope than either its normal counterpart or its most similar normal protein as determined by blastp; this was only true for 15.8% of the viral peptide matches for neoepitopes (see Fig. 1d for example). More strikingly, when considering protein sequence similarity scores across all residues, 59.6% of neoepitopes with bacterial blastp hits had higher similarity to these peptides than to either of their normal peptide matches; only 5.8% of viral epitopes showed this phenomenon. This was true despite the fact that neoepitopes had significantly more mismatches in sequence with bacterial peptides than they did with either their paired normal epitopes (mean 1.5 vs 1; p < 2.2 × 10− 16) or their closest matching peptides from blastp (mean 1.5 vs 1.4; p < 2.2 × 10− 16). However, in terms of total amino acid length, the bacterial peptide data base was 577.5 times larger than the human peptide database, and the viral peptide data base was only 2.0 times larger, so these phenomena may be in part reflective of these differences.
Additional file 1: Figure S6 shows the distribution of protein percent similarity scores for bacterial and viral hits for each TCGA disease site and for 1000 randomly simulated neoepitopes (see Methods) across non-anchor residues. Predicted TCGA neoepitopes were significantly more similar than random peptides to their closest matching bacterial peptide (mean 91.4% vs. 82.3%, p < 2.2 × 10− 16, Welch Two Sample t-test) and their closest matching viral peptide (mean 76.7% vs. 58.7%, p < 2.2 × 10− 16, Welch Two Sample t-test), indicating that this phenomenon of sequence similarity to microorganism peptides may be specific to cancer neoepitopes. We determined the top 10 most frequently occurring bacterial genera in cases where a bacterial peptide was a closer match to a neoepitope than either of its human peptide counterparts (see Fig. 5), which includes, of particular interest, frequently pathogenic genera such as Clostridium , Mycobacterium [72, 73], and Vibrio , and the frequently commensal genus Lactobacillus .
Features associated with immunogenicity
Finally, we applied our criteria to a cohort of neoepitopes with paired immune response data. Applying any single criterion to predict immune response to a neoepitope in a linear model did not lead to significant prediction in any case, except for the percent protein sequence similarity between a neoepitope and its closest viral peptide (p = 0.046; see Table 1, Figs. 6a-f). We also observed how well immune response was predicted by neoepitope binding affinity, paired normal epitope binding affinity, and the number of mismatches in amino acid sequence between the neoepitope and its paired normal epitope. Only the number of mismatches was alone able to predict neoepitope immunogenicity, favoring those neoepitopes with multiple amino acid changes (p = 0.03; see Table 1). Our putatively novel binding change criteria alone was able to predict true immunogenicity with an AUROC of 0.53 (see Additional file 1: Figure S7). A linear model incorporating 1) neoepitope binding affinity, 2) putatively novel binding change status of the neoepitope, 3) binding affinity difference between the neoepitope and both its paired normal epitope and 4) its closest BLAST peptide match, 5) number of amino acid sequence mismatches between the neoepitope and its paired normal epitope, and 6) percent protein sequence similarity between the neoepitope and its paired normal epitope, 7) its closest human peptide match, 8) its closest bacterial peptide match, and 9) its closest viral peptide match was able to significantly predict immune response to neoepitopes (p = 0.02). However, only three individual predictors contributed significantly to the model: neoepitope binding affinity (p = 0.003), percent protein sequence similarity of the neoepitope to its closest viral peptide match (p = 0.048), and binding affinity difference between the neoepitope and its closest human peptide match (p = 0.002). The contribution of the number of amino acid sequence mismatches between the neoepitope and its paired normal epitope approached significance (p = 0.075). A reduced, multiplicative version of our linear model incorporating only these four predictors was able to predict immune response to neoepitopes with greater significance (p = 3.3 × 10− 6; AUROC = 0.66; see Fig. 6g; mathematical representation available in supplementary materials). For comparison, we also trained SVM and Random Forest models to predict peptide-specific immune response; however, the simpler linear model remained the best predictor (see Additional file 1: Figure S8).
Significance of prioritization metrics in predicting immune response. Based on a linear model, each prioritization metric, along with tumor and paired normal epitope binding affinities and the number of sequence mismatches between neoepitopes and paired normal epitopes, were tested for the ability to predict immune response to a predicted neoepitope
Predictor of immune response
Significance (p value)
Neoepitope binding affinity
Paired normal epitope binding affinity
Difference in binding affinity between neoepitope and paired normal epitope
Difference in binding affinity between neoepitope and closest human protein
Number of mismatches between neoepitope and paired normal epitope
Percent protein sequence similarity between neoepitope and paired normal epitope
Percent protein sequence similarity between neoepitope and closest human protein
Percent protein sequence similarity between neoepitope and closest bacterial protein
Percent protein sequence similarity between neoepitope and closest viral protein
In this study, we have explored the frequency and distribution of neoepitopes among patients across diverse TCGA disease types using a broad set of HLA alleles. We have further described and evaluated multiple neoepitope prioritization criteria that can significantly refine patient neoepitope predictions to enrich for biologically meaningful candidate neoantigens. In particular, we proposed four metrics that emphasize neoepitope novelty: tumor vs. paired normal peptide binding affinity difference, tumor vs. paired normal peptide sequence similarity, tumor vs. closest human peptide sequence similarity, and tumor vs. closest microbial peptide sequence similarity. By applying these metrics to predicted neoepitopes from 572,170 combinations of patients and HLA alleles, we estimated the behavior of these metrics in the general cancer population. We have shown that tumor-normal MHC binding affinity differences and non-anchor peptide sequence similarity are independent metrics, and can be used to dramatically refine a list of neoepitopes for further consideration. Applying our criteria to a cohort of neoepitopes with paired immune response data reaffirmed the importance of neoepitope MHC binding affinity in eliciting an immune response. We also demonstrated the significance of novel MHC binding of neoepitopes relative to human proteins, and the degree of sequence similarity of neoepitopes to viral peptide sequences.
To our knowledge, this is the first study to analyze neoepitope predictions broadly across the population by investigating candidate neoantigens among a large cohort of patients within TCGA and across an extensive set of HLA alleles encompassing 99% of the variation in the population at each HLA locus. The approach detailed here also represents the first systematic comparison of neoepitopes to unrelated peptides, demonstrating that a neoepitope may be more similar to other human, commensal, or pathogenic peptides than its paired normal epitope. This approach also builds upon pVAC-Seq in several significant ways, and could in theory be applied as a post-processing step in any neoantigen prediction pipeline. Although the peptide immune response data we analyzed was limited in size and scope, it represents the largest such cohort published to date. We expect further refinement of neoepitope prioritization with additional data and emerging biological insights.
Our study also has several limitations which must be considered when interpreting these results or applying a similar approach prospectively. For simplicity’s sake, we did not consider expression levels or variant allele frequencies of the neoepitopes analyzed, which are important and well-established criteria for prioritizing predicted neoepitopes which are robustly present in the tumor of interest. We also did not enrich a priori the subset of microorganisms most closely associated with human health and disease, thus broadening the peptide search space to include likely uninformative sequences (e.g. non-human viruses) while excluding some potentially relevant species (e.g. yeasts). Additionally, as we did not have HLA typing information for the TCGA cohort, we were unable to explicitly address combinatorial overlap of epitope preference for HLA alleles on a per-patient basis. Lastly, this analysis only considers single nucleotide missense mutations.
In the future, we aim to include more complex variants such as small insertions and deletions, as these have the potential to produce highly novel, immunogenic neoantigens  and are currently omitted from consideration by all but a few neoepitope prediction tools (e.g. MuPeXI  and TSNAD ). Further, we believe that the incorporation of tumor neoepitope sequence similarity into studies of the microbiome in cancer patients could be important for better understanding patient response to immunotherapy treatment. Incorporating the criteria proposed here into the development of neoepitope vaccines may help refine the vaccine production process and potentially improve the success of cancer treatments.
In our exploration of neoepitopes across broad populations and sets of HLA alleles, we have evaluated multiple neoepitope prioritization criteria that emphasize peptide novelty, concluding that neoepitopes should be considered not only with respect to their paired normal epitope, but with respect to the entire human proteome, as well as bacterial and viral peptides, with potential implications for neoepitope immunogenicity and personalized vaccines for cancer treatment. We further conclude that the sequences of putative neoantigens are highly variable across individuals as a function of both cancer genetics and personalized HLA repertoire, while the overall behavior of filtration criteria reflects more predictable patterns.
The results shown here are based upon data generated by the TCGA Research Network: http://cancergenome.nih.gov/.
Funding for this project was generously provided by the Sunlin & Priscilla Chou Foundation.
Availability of data and materials
The sequence data used for our melanoma cohort as published by Hugo et al.  are available from the Sequence Read Archive  under the accession numbers SRA: SRP067938 and SRA: SRP090294. TCGA variant call sets are available from the TCGA data portal . The peptide and paired immune response data sets used in our cohort to identify features associated with immunogenicity is available within the published articles from which they originated and their associated supplementary materials [3, 11, 20, 62, 63, 64]. The datasets produced for our analyses and supporting the conclusions of this article are available upon request.
MAW – project design, data analysis, data interpretation, manuscript preparation and review, MP – data analysis, manuscript review, MPP – data analysis, manuscript review, AN – data analysis, manuscript review, AJS – data analysis, manuscript review, KE – data analysis, manuscript review, AM – project design, manuscript review, AN – data analysis, manuscript review, RFT – project conceptualization and design, data analysis and interpretation, manuscript preparation and review. All authors read and approved the final manuscript.
Ethics approval and consent to participate
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 20.Bjerregaard A, Nielsen M, Hadrup SR, Szallasi Z, Eklund AC. MuPeXI: prediction of neoepitopes from tumor sequencing data. Cancer Immunol Immunother. 2017; https://doi.org/10.1007/s00262-017-2001-3.
- 21.Bais P, Namburi S, Gatti DM, Zhang X, Chuang JH. CloudNeo: a cloud pipeline for identifying patient-specific tumor neoantigens. Bioinformatics. 2017; https://doi.org/10.1093/bioinformatics/btx375.
- 36.Cibulkis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature. Biotechnol. 2013;31:213–9.Google Scholar
- 37.Spangler R: gdc-scan. https://github.com/ohsu-comp-bio/gdc-scan (2017). Accessed 1 May 2016.
- 38.Memorial Sloan Kettering: vcf2maf. https://github.com/mskcc/vcf2maf (2017). Accessed 2 May 2017.
- 39.Broad Institute: Resource Bundle. ftp://firstname.lastname@example.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz (2016). Accessed 17 Mar 2017.
- 41.Hundal J, Kiwala S, Graubert A, Walker J, Miller C, Griffith M, et al.: Usage. http://pvac-seq.readthedocs.io/en/v4.0.8/run.html (2016). Accessed 2 May 2017.
- 42.Ensembl: VEP_plugins. https://github.com/Ensembl/VEP_plugins (2017). Accessed 15 Mar 2017.
- 43.Cunningham F, Moore B, Ruiz-Schultz M, Ritchie GR, Eilbeck K. Improving the sequence ontology terminology for genomic variant annotation. J Biomed Sci. 2015;6(1):32.Google Scholar
- 47.Python Software Foundation: random – Generate pseudo-random numbers. https://docs.python.org/2/library/random.html (2017). Accessed 7 Aug 2017.
- 48.Wood, M: neoepitope_novelty. https://github.com/ohsu-comp-bio/neoepitope_novelty (2017). Accessed 23 Aug 2017.
- 50.Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines. PMID: 29596782. https://doi.org/10.1016/j.cels.2018.03.002.
- 52.Tischler, G: biobambam2. https://github.com/gt1/biobambam2 (2017).
- 55.GENCODE: Release 19. ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz (2017). Accessed 19 Jun 2017.
- 59.Ensembl: Gene Annotation. ftp://ftp.ensembl.org/pub/release-88/fasta/homo_sapiens/pep/Homo_sapiens.GRCh38.pep.all.fa.gz (2017). Accessed 12 Apr 2017.
- 60.National Center for Biotechnology Information: RefSeq Release – Bacteria. ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/ (2017). Accessed 13 Jul 2017.
- 61.National Center for Biotechnology Information: RefSeq Release – Viral. ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/ (2017). Accessed 13 Jul 2017.
- 65.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.Google Scholar
- 67.Turajlic S, Litchfield K, Xu H, Rosenthal R, McGranahan N, Reading JL, et al. Insertion-and-deletion-derived tumor-specific neoantigens and the immunogenic phenotype: a pan-cancer-analysis. Lancet Oncol. 2017; https://doi.org/10.1016/S1470-2045(17)30516-8.
- 69.Surget S, Khoury MP, Bourdon J. Uncovering the role of p53 splice variants in human malignancy: a clinical perspective. Onco Targets Ther. 2014;7:57–68.Google Scholar
- 76.National Center for Biotechnology Information: SRA. https://www.ncbi.nlm.nih.gov/sra (2017).
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.