Urothelial bladder cancer (UBC) results in approximately 165,000 deaths worldwide annually [1]. In patients with metastatic UBC, median overall survival is approximately 9–15 months following first-line platinum-based chemotherapy and 7–9 months for patients relapsing after platinum-based treatment [2,3,4]. Cancer immunotherapy is a treatment modality that addresses this high medical need in UBC. Immune checkpoints, such as the programmed death-ligand 1 (PD-L1)/programmed death-1 (PD-1) pathway, block the development of active antitumor immune responses [5], and inhibition of this pathway has demonstrated promising response and survival rates in locally advanced or metastatic UBC [6,7,8,9].

Efforts to identify patients most likely to benefit from anti-PD-L1/PD-1 therapy suggest that expression of PD-L1 on tumor cells (TC) and/or tumor-infiltrating immune cells (IC) may correlate with efficacy of anti-PD-L1 therapy in UBC [7, 8, 10, 11], as may other potential biomarkers such as tumor mutational burden [12].

In the USA, the anti-PD-L1 antibody atezolizumab is approved for the treatment of patients with locally advanced or metastatic UBC who are ineligible for cisplatin-containing therapy and whose tumors have PD-L1-stained IC covering ≥ 5% of the tumor area, or who are ineligible for any platinum-containing therapy regardless of PD-L1 expression [13]. The anti-PD-1 antibody pembrolizumab is also approved in the USA for the treatment of patients with locally advanced or metastatic UBC who are ineligible for cisplatin-containing therapy and whose tumors express PD-L1 (combined positive score ≥ 10), or who are ineligible for any platinum-containing chemotherapy regardless of PD-L1 status [14]. Two other PD-L1 inhibitors, durvalumab and avelumab, and the PD-1 inhibitor nivolumab are approved for second-line treatment in the USA. In Europe, atezolizumab is approved for the treatment of patients with locally advanced or metastatic UBC after platinum-containing chemotherapy, or for cisplatin-ineligible patients whose tumors have PD-L1-stained IC covering ≥ 5% of the tumor area. Pembrolizumab is approved in Europe for the treatment of patients with advanced UBC recurrent after platinum-based therapy, or for those ineligible for cisplatin-containing regimens whose tumors express PD-L1 with a combined positive score ≥ 10, while the PD-1 inhibitor nivolumab is approved for second-line treatment. PD-L1 testing for cisplatin-ineligible patients considered for atezolizumab or pembrolizumab was required by the European Medicines Agency (EMA) and the US Food and Drug Administration (FDA) in June 2018, after early data from two first-line studies suggested decreased survival with single-agent checkpoint inhibitors compared with platinum-based chemotherapy in patients with low PD-L1 expression levels [15, 16].

In light of this requirement for PD-L1 testing in cisplatin-ineligible patients with locally advanced or metastasized UBC, accurate and reproducible measurement of PD-L1 expression of TC and IC is crucial. However, assays for PD-L1 expression were developed and validated independently in clinical studies of different checkpoint inhibitors. Several different assays for PD-L1 expression are available, all of which involve the scoring of immunohistochemically stained tumor sections by trained pathologists, but differences in antibodies used, cell types assessed, scoring systems, and cutoffs, as well as inter-observer variability, suggest that assay results may not be concordant [17]. This in turn may impact therapy decisions.

To address this problem, in this multicenter study, we investigated the technical comparability of four clinically relevant PD-L1 immunohistochemistry (IHC) assays in terms of concordance of the percentage of PD-L1-stained IC (per tumor area) and TC. Additionally, consistency of scoring for each assay between different trained readers was assessed.

The primary objective was to assess the overall technical comparability of the four assays in terms of percentage of PD-L1-stained IC, adjusted for reader effects. Secondary objectives included inter-reader agreement of PD-L1 IC staining for each assay, inter-assay agreement of PD-L1 IC staining for each reader, and comparability of PD-L1 TC staining overall, inter-assay and inter-reader.

Materials and methods

Staining for VENTANA SP142 and VENTANA SP263 (Roche Diagnostics, Mannheim, Germany) was performed at the Technical University of Munich, Germany, according to the manufacturer’s protocol, on a VENTANA BenchMark Ultra (Roche Diagnostics). Briefly, slides were deparaffinized and incubated with cell conditioning solution (Cell Conditioning 1 [CC1], Roche Diagnostics) at 100 °C for 40 min for VENTANA SP142 and at 95 °C for 64 min for VENTANA SP263. The incubation period with the primary antibody was 16 min for both antibodies. For VENTANA SP142, incubation with primary antibody was followed by incubation with the OptiView Detection and Amplification Kit (Roche Diagnostics). Staining for DAKO 22C3, DAKO 28-8, and pan-cytokeratin (Agilent Technologies, Waldbronn, Germany) was performed at Uniklinik RWTH Aachen, Germany, according to the manufacturer’s protocol. Formalin-fixed, paraffin-embedded tissue sections underwent a 3-in-1 target retrieval procedure using PT Link (Agilent Technologies) at low pH, followed by peroxidase blocking and incubation with the primary antibody, linker, visualization reagent, and DAB chromogen. For both PD-L1 antibodies, these steps followed the manufacturer’s instructions included in the respective kits (PD-L1 IHC 22C3 pharmDx and PD-L1 IHC 28-8 pharmDx, Agilent Technologies). Pan-cytokeratin staining was carried out in a similar way, using a broad-spectrum monoclonal mouse antibody (clone AE1/AE3) and the EnVision FLEX Kit (Agilent Technologies). Staining procedures were carried out using the Autostainer Link 48 (Agilent Technologies). After hematoxylin counterstain, dehydration, and coverslipping, slides were analyzed.

To select samples for the study cohort, sections from archived, formalin-fixed, paraffin-embedded tissue from patients with locally advanced UBC (n = 150) were chosen randomly, and whole slides were stained for PD-L1 using the VENTANA SP142 assay (Roche Diagnostics). Forty (26.7%) of the selected cases were transurethral resections of bladder tumors and 110 (73.3%) were from cystectomies. All cases were reviewed by two board-certified pathologists trained on scoring PD-L1 IC with VENTANA SP142 (Wilko Weichert and Kristina Schwamborn), and 30 cases were selected based on PD-L1 expression on IC in invasive cancer areas (< 1%, 1–5%, or > 5%; 10 cases per category, each with approximately 30% cystectomies and 70% transurethral resections) to resemble the distribution of PD-L1 IC expression in the atezolizumab IMvigor210 study (cohort 2) [8]. Sample characteristics are summarized in Table S1 in the supplementary material. Only classical urothelial carcinomas (and no histological subtypes) were included in this study. Representative examples of staining for IC are shown in Fig. S1 in the supplementary material. All serial sections for this study were cut at the Technical University of Munich, Germany, and distributed for further staining to the RWTH Aachen, Germany. To aid in defining the tumor area, serial sections from each case were also stained with a pan-cytokeratin antibody and with hematoxylin and eosin.

For every selected case, whole slides were stained with each assay as well as pan-cytokeratin/hematoxylin and eosin and were distributed to the five university pathology departments at the Technical University of Munich, RWTH Aachen, Heidelberg, Erlangen, and Dresden for assessment. Observers were blinded to the assay used but not to the case. At each site, a board-certified pathologist who had been trained in scoring PD-L1 IC in UBC using the VENTANA SP142 IHC assay (0.5-day digital classroom training) [18] scored each case/assay combination (30 cases × 4 assays = 120 slides). All readers were trained in the proper interpretation and scoring of IC with the VENTANA SP142 assay using a method previously outlined for non-small-cell lung cancer and UBC [19]. The training session was performed using the novel digital platform PathoTrainer (Pathomation Inc., Antwerp, Belgium), with 75 cases, including a 40-case proficiency exam that required a minimum passing score of 85% (the average proficiency score was 95%). Training was conducted across the dynamic range of PD-L1 positivity.

Additionally, training specifically included consensus on the following criteria. IC were identified by morphology, and TC by morphology and, if required, pan-cytokeratin staining. TC were counted as positive if they showed membranous staining (complete or incomplete) of any intensity. PD-L1 on TC was scored as the percentage of stained cells; the intensity of staining was not assessed. IC were defined as granulocytes, lymphocytes, and macrophages within the tumor, in the vicinity of the TC nest, or in the stroma between two adjacent TC nests. Staining of granulomas was included if they met the criteria mentioned above. Necrotic areas, granulomas, or lymphoid aggregates adjacent (but not directly attached) to, or distant from, the tumor, intravascular IC, and areas showing cauterization artifacts were excluded. IC were included if they displayed any intensity of membranous (VENTANA SP263, DAKO 22C3, and DAKO 28-8) or granular cytoplasmic staining (VENTANA SP142). PD-L1 on IC was scored as the percentage of invasive tumor area covered by IC showing PD-L1 staining at any intensity.

To compare the percentage of PD-L1 staining, an analysis of variance (ANOVA) was conducted using assay, reader and patients as effects. From this model, adjusted mean percentages were obtained for each assay, with 95% confidence intervals for means, and differences estimated and adjusted for multiple comparisons using Tukey’s range test. To allow consideration of the data in the context of the Blueprint study [17], data were visualized after the percentage of PD-L1 staining was averaged over the five readers (Fig. 1).
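As an illustration of the comparison described above, the following Python sketch computes per-assay mean percentages and their pairwise differences from a small, entirely hypothetical data set. It assumes a balanced design (every reader scores every case with every assay), in which the least-squares means of a main-effects ANOVA coincide with the simple marginal means. It does not implement Tukey's adjustment or confidence intervals, and the names `scores`, `assay_means`, and `pairwise_differences` are illustrative assumptions, not the study's analysis code.

```python
from itertools import combinations
from statistics import mean

ASSAYS = ["SP142", "SP263", "22C3", "28-8"]

# Hypothetical toy data (NOT study data):
# scores[case][assay] -> percentage of PD-L1-stained IC from three readers.
scores = {
    "case1": {"SP142": [1.0, 2.0, 3.0], "SP263": [2.0, 3.0, 4.0],
              "22C3": [2.0, 2.0, 2.0], "28-8": [3.0, 3.0, 3.0]},
    "case2": {"SP142": [10.0, 10.0, 10.0], "SP263": [12.0, 11.0, 13.0],
              "22C3": [10.0, 12.0, 14.0], "28-8": [11.0, 11.0, 11.0]},
}


def assay_means(data):
    """Mean percentage per assay, pooled over all cases and readers.

    In a balanced design this equals the adjusted (least-squares) mean
    of an additive model with assay, reader, and case effects.
    """
    return {
        assay: mean([s for case in data.values() for s in case[assay]])
        for assay in ASSAYS
    }


def pairwise_differences(means):
    """Difference in mean percentage for every unordered pair of assays."""
    return {(a, b): means[a] - means[b] for a, b in combinations(ASSAYS, 2)}


means = assay_means(scores)
diffs = pairwise_differences(means)
for pair, diff in diffs.items():
    print(pair, round(diff, 2))
```

In an unbalanced data set (for example, with missing slides), the marginal means would no longer equal the adjusted means, and a fitted linear model would be needed instead.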

Fig. 1 Average percentage of PD-L1-stained IC (a) and TC (b) using each assay. IC = tumor-infiltrating immune cells; PD-L1 = programmed death-ligand 1; TC = tumor cells

To investigate inter-reader and inter-assay concordance, intra-class correlations (ICCs) were calculated. The degree of concordance (Fleiss’ Kappa) and averaged percentage disagreement between assays were calculated for retrospectively selected cutoffs for PD-L1 positivity of > 1%, > 5%, and > 10%. The 25% cutoff was not included as too few samples were PD-L1 TC > 25% in the non-enriched screening and study cohort.
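To make the ICC concept concrete, the sketch below computes one common form, the two-way consistency ICC for single measurements (often written ICC(3,1)), from a cases × raters table. The text does not state which ICC form was used in the study, so this choice is an assumption for illustration only, and the function name `icc_consistency` and the toy ratings are hypothetical.

```python
def icc_consistency(ratings):
    """Two-way consistency ICC for single measurements, ICC(3,1).

    ratings: list of rows, one per case; each row holds one score per
    rater (reader or assay). Needs at least 2 cases and 2 raters.
    """
    n = len(ratings)       # number of cases
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


# Hypothetical toy data: the second rater always scores one point higher.
offset_ratings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(icc_consistency(offset_ratings))
```

In this toy example the two raters differ by a constant offset, yet the consistency ICC is 1, because this form of the ICC ignores systematic shifts between raters. An absolute-agreement ICC would penalize such a shift.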


Results

Percentages of PD-L1-stained IC or TC

Clinicopathological characteristics of the 30 selected cases are shown in Table S1 in the supplementary material. No consistent pattern of higher or lower staining for PD-L1-stained IC was seen for any particular assay, and only small differences between assays were observed (Fig. 1a). The adjusted mean percentage of IC staining for PD-L1 varied from 6.5 to 8.2% depending on the assay used (Table 1). There was broad agreement between readers for each assay, apart from reader 3, who tended to score PD-L1-stained IC slightly higher than the other readers (Fig. S2 in the supplementary material).

Table 1 Mean percentages of PD-L1-stained IC and TC across all samples using each assay, adjusted for sample effects

In contrast, there was more variation in the percentage of PD-L1-stained TC between assays, with the VENTANA SP142 assay yielding consistently lower percentages than the other three assays (Fig. 1b) and a lower adjusted mean percentage of stained cells (Table 1). There was also more variation between individual readers than for PD-L1-stained IC (Fig. S3 in the supplementary material), with reader 3 again scoring consistently higher than the other readers.

Pairwise comparison of assays

Pairwise comparison of adjusted means showed small differences between assays in PD-L1-stained IC but larger differences for PD-L1-stained TC, particularly between VENTANA SP142 and the other assays (Fig. 2). Differences in adjusted means ranged from − 0.3 to 1.6 for IC, and all were non-significant. For TC, staining differences between assays were larger, with wider confidence intervals than for IC (Fig. 2). Differences in adjusted means for TC ranged from − 10.5 to 2.7; the largest differences were between VENTANA SP142 and the other three assays (range − 10.5 to − 7.8) and were statistically significant. Differences between the three other assays were in the range − 1.9 to 2.7 for TC and were non-significant (Table S2 in the supplementary material).

Fig. 2 Difference in adjusted means of percentages of PD-L1-stained IC or TC for each assay. IC = tumor-infiltrating immune cells; PD-L1 = programmed death-ligand 1; TC = tumor cells

Inter-reader and inter-assay agreement

Inter-reader agreement for each assay was moderate to high for IC staining (ICC 0.532–0.729) and for TC staining (ICC 0.609–0.883) (Table 2). For each reader, inter-assay agreement was similarly moderate to high for IC staining (0.681–0.858) and for TC staining (0.778–0.885) (Table 3). This reflects the overall comparability of the assays, and it should be noted that the single outlier (SP142) that was identified in pairwise comparisons cannot be identified by the ICC analysis method.

Table 2 ICC values for inter-reader agreement for each assay
Table 3 ICC values for inter-assay agreement for each reader

Allocation to binary cutoffs for IC or TC

When IC results reported by each reader were allocated to cutoffs of 1%, 5%, or 10%, which have been used previously [6], average agreement between assays was high, with fewer than 15% of cases giving discordant results for any two assays (Fig. 3a). In contrast, when TC results were allocated to the same cutoffs (1%, 5%, and 10%), up to 25% of cases showed discordant results in comparisons involving VENTANA SP142 (Fig. 3b). When VENTANA SP142 was excluded, the average agreement between the other assays was high (> 88%).
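The allocation to binary cutoffs and the resulting percentage of discordant cases can be sketched as follows. This is a minimal illustration on hypothetical scores, assuming that a case counts as positive when its score is strictly greater than the cutoff; the function names and the data layout (`by_reader`) are assumptions, not the study's implementation.

```python
from statistics import mean


def percent_disagreement(scores_a, scores_b, cutoff):
    """Percentage of cases classified differently by two assays at one cutoff."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both assays must be scored on the same cases")
    discordant = sum(
        (a > cutoff) != (b > cutoff) for a, b in zip(scores_a, scores_b)
    )
    return 100.0 * discordant / len(scores_a)


def averaged_disagreement(by_reader, assay_a, assay_b, cutoff):
    """Disagreement between two assays, averaged across readers."""
    return mean([
        percent_disagreement(r[assay_a], r[assay_b], cutoff)
        for r in by_reader.values()
    ])


# Hypothetical toy data (NOT study data): four cases, two readers, two assays.
by_reader = {
    "reader1": {"SP142": [0, 2, 6, 12], "SP263": [0, 0.5, 7, 9]},
    "reader2": {"SP142": [0.5, 3, 4, 15], "SP263": [0.5, 2, 6, 11]},
}

for cutoff in (1, 5, 10):
    print(cutoff, averaged_disagreement(by_reader, "SP142", "SP263", cutoff))
```

Whether a cutoff is applied as "greater than" or "greater than or equal to" changes the classification of cases scored exactly at the threshold, so the convention should be fixed before comparing assays.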

Fig. 3 Percentage of disagreement between assays averaged across five readers when results were allocated to retrospective binary cutoffs for PD-L1-stained IC (a) or TC (b). IC = tumor-infiltrating immune cells; PD-L1 = programmed death-ligand 1; TC = tumor cells

For PD-L1-stained IC, inter-assay agreement appeared highest at the lowest cutoff point, with Kappa values ranging from 0.609 to 0.923 for > 1%, from 0.683 to 0.811 for > 5%, and from 0.440 to 0.763 for > 10% (Table S3 in the supplementary material). For inter-reader agreement, Kappa values ranged from 0.533 to 0.801 for > 1%, from 0.551 to 0.732 for > 5%, and from 0.343 to 0.706 for > 10% (Table S4 in the supplementary material). For PD-L1-stained TC, no cutoff appeared to give greater agreement than any other. Inter-assay agreement showed Kappa values ranging from 0.560 to 0.844 (Table S5 in the supplementary material) and inter-reader agreement ranging from 0.572 to 0.769 (Table S6 in the supplementary material).
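For readers unfamiliar with the statistic, the sketch below shows the standard Fleiss' Kappa calculation applied to binary positive/negative calls, where the "raters" may be either readers (for one assay) or assays (for one reader). The toy calls are hypothetical, and the function name `fleiss_kappa` is an illustrative assumption rather than the software used in the study.

```python
from collections import Counter


def fleiss_kappa(calls):
    """Fleiss' Kappa for a list of cases, each a list of category labels.

    Every case must be rated by the same number of raters (at least 2).
    The statistic is undefined (division by zero) if all calls fall
    into a single category.
    """
    n_cases = len(calls)
    n_raters = len(calls[0])
    counts = [Counter(row) for row in calls]
    categories = sorted({label for row in calls for label in row})

    # Observed agreement: mean proportion of agreeing rater pairs per case.
    per_case = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_case) / n_cases

    # Chance agreement from the overall share of each category.
    shares = [
        sum(c[cat] for c in counts) / (n_cases * n_raters)
        for cat in categories
    ]
    p_expected = sum(s * s for s in shares)

    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy data: PD-L1 positive (True) or negative (False) calls
# from three raters on four cases at a single cutoff.
toy_calls = [
    [True, True, True],
    [False, False, False],
    [True, True, False],
    [False, False, False],
]
print(round(fleiss_kappa(toy_calls), 3))
```

A value of 1 indicates complete agreement and 0 indicates agreement no better than chance.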


Discussion

Cancer immunotherapy, and specifically PD-L1/PD-1 inhibition, is an effective treatment option for difficult-to-treat, locally advanced, or metastatic cancers, including UBC.

Currently there are several different PD-L1 antibodies and assays in use, developed in conjunction with different checkpoint inhibitor studies, and the technical parameters vary. This limits comparison of the different trials with respect to outcomes in the PD-L1 biomarker-positive subgroups and use of the biomarker in clinical practice.

Regarding PD-L1 IC staining, several studies have shown lower concordance between readers or assays than for TC staining [17, 20, 21]. It was suggested that IC PD-L1 scoring is more difficult and thus may require further standardization and training before IC-based algorithms are used for patient selection [17, 20,21,22].

In this study, ahead of reading the slides stained with the different assays, all readers attended classroom training for scoring PD-L1-stained IC (per tumor area) with SP142 in UBC [18]. The consistency of results for IC staining suggests that the percentage of PD-L1-stained IC per tumor area can be evaluated reproducibly by trained readers.

For IC staining, we found little variation between assays, with small, non-significant differences and moderate to high ICC values for inter-assay agreement. For TC staining, on the other hand, the VENTANA SP142 IHC assay gave consistently lower percentages of PD-L1-stained cells compared with the other three assays, as has been shown previously in non-small-cell lung cancer [17, 20, 21] and UBC [23]. These differences in staining could be explained by the fact that this assay was specifically designed to stain IC and that, compared with the other three antibodies, some of its binding epitopes are absent in PD-L1 isoform 2 [24]. Excluding VENTANA SP142, adjusted mean differences for TC were small and non-significant.

Previously, it has been suggested that the four assays yield substantial to high correlations for PD-L1 IC positivity read by trained readers in UBC [23, 25,26,27]. These studies were based on scores from one, two, or four readers who had read core tissue microarrays. In the study with four readers, the scores were agreed by consensus before analysis [25]. Interestingly, some studies [26, 27], which were based on core tissue microarrays, have reported possibly lower IC sensitivity for SP142, which may reflect a sampling bias and intra-tumor heterogeneity. To account for tumor heterogeneity and subsequently more subjective interpretations [28, 29], we used whole slides for analysis. Also, to prevent potential difficulties in differentiating TC from IC, an additional pan-cytokeratin stain was included for each case. In addition, we based our assay comparisons on five independent readers, who were blinded to the PD-L1 assay used. For assay comparisons, the results were adjusted for reader effects. Hence, we show for the first time in a clinically relevant setting that the PD-L1 assays SP142, SP263, 22C3, and 28-8 stain similar percentages of PD-L1 IC, with no statistically significant differences between assays.

In regard to overall percentage agreement at defined cutoffs for PD-L1 IC or TC positivity, our study yielded substantial to high agreement values for assay pairs, similar to those published for larger UBC cohorts [23, 26]. Hence, we assume that the results from our cohort of 30 patients with UBC are representative of larger UBC patient sets. However, the current analysis is exploratory, with a small sample size, and there was no formal testing of equivalence. To formally confirm the analytical similarity of the assays, comparison of computerized evaluations on digitized slides would be desirable; future studies should address this.

As previously reported [26, 30], we also saw lower inter-reader agreement at lower cutoffs (> 1%) for TC scoring using VENTANA SP142, DAKO 28-8, and 22C3. This might be because, in the interpretation of PD-L1 staining, even faintly and non-circumferentially stained tumor cells are considered positive, whereas in HER2 testing in breast cancer this type of staining is considered negative. Regarding IC scoring, VENTANA SP142 and DAKO 22C3 also yielded lower inter-reader agreement at lower cutoffs (> 1%). On the other hand, we detected a decline in inter-reader agreement for IC scoring at higher cutoffs (> 10%) for VENTANA SP263 and DAKO 28-8.

In the current study, no correlation of staining with clinical outcomes could be attempted, as the samples were not taken from patients treated with anti-PD-L1/PD-1 therapy. While the cohort size was sufficient to detect the relatively large differences in TC staining between SP142 and the other assays, it is not known if other, more subtle, differences would become detectable in a larger cohort. But even if they exist, it is unclear whether such minor discrepancies would have any clinical impact. One limitation is that the experienced readers in this study may have recognized subtle characteristic staining features of the different antibodies (such as the more granular staining pattern of VENTANA SP142), so true blinding may not have been possible.

This is the first multicenter comparison of assay performance and inter-observer agreement for PD-L1 testing in UBC using all currently diagnostically relevant assays by readers trained on scoring PD-L1-stained IC on whole slides. The results from 30 patients suggest that, in advanced UBC, the four assays may be considered analytically similar for assessing the percentage of PD-L1-stained IC per tumor area. In addition, three of the assays (VENTANA SP263, DAKO 22C3, and DAKO 28-8) may be considered analytically similar for assessing the percentage of PD-L1-stained TC. Our data facilitate the clinical use of the biomarker PD-L1 by contributing to the understanding of the technical comparability of the assays and of the need for training in IC PD-L1 testing and scoring in UBC.