The feasibility of developing biomarkers from peripheral blood mononuclear cell RNAseq data in children with juvenile idiopathic arthritis using machine learning approaches
The response to treatment for juvenile idiopathic arthritis (JIA) can be staged using clinical features. However, objective laboratory biomarkers of remission are still lacking. In this study, we used machine learning to predict JIA activity from transcriptomes from peripheral blood mononuclear cells (PBMCs). We included samples from children with Native American ancestry to determine whether the model maintained validity in an ethnically heterogeneous population.
Our dataset consisted of 50 samples, 23 from children in remission and 27 from children with an active disease on therapy. Nine of these samples were from children with mixed European/Native American ancestry. We used 4 different machine learning methods to create predictive models in 2 populations: the whole dataset and then the samples from children with exclusively European ancestry.
In both populations, models were able to predict JIA status well, with training accuracies > 74% and testing accuracies > 78%. Performance was better in the whole dataset model. We note a high degree of overlap between genes identified in both populations. Using ingenuity pathway analysis, genes from the whole dataset associated with cell-to-cell signaling and interactions, cell morphology, organismal injury and abnormalities, and protein synthesis.
This study demonstrates it is feasible to use machine learning in conjunction with RNA sequencing of PBMCs to predict JIA stage. Thus, developing objective biomarkers from easy to obtain clinical samples remains an achievable goal.
KeywordsJuvenile idiopathic arthritis Peripheral blood mononuclear cells Transcriptome Machine learning
Active disease on therapy
Area under the curve
Clinical remission on medication
Support vector machine with a cubic kernel
Support vector machine with a Gaussian kernel
Hilbert-Schmidt independence criterion
Ingenuity pathway analysis
Juvenile idiopathic arthritis
Least absolute shrinkage and selection operator
Peripheral blood mononuclear cell
Principal component analysis
Receiver operating characteristic curve
Transcripts per million
One of the underappreciated advances in the field of pediatric rheumatology has been the recognition that therapeutic response during the course of treated juvenile idiopathic arthritis (JIA) can be staged based on specific clinical features . Using hybridization-based gene microarrays, our group has shown that these stages are associated with specific gene expression signatures in both peripheral blood mononuclear cells [2, 3] and neutrophils [3, 4]. These findings raised the possibility that gene expression patterns might be used to develop clinically useful biomarkers to guide therapeutic decision-making. For example, we have demonstrated the feasibility of using microarray-based gene expression data as a means of developing prognostic biomarkers that will determine a patient’s likelihood of achieving inactive disease within 6 months of initiating therapy . However, the development of expression-based biomarkers has been hampered by the considerable inter-patient variation that is observed in gene expression studies of children with JIA, problems that are compounded when complex cell types such as peripheral blood mononuclear cells (PBMCs) are used .
Over the past 10 years, significant advances have been made in machine learning approaches to complex datasets. Augmented computational power and new methods have facilitated successful classification even using heterogeneous populations, such as PBMCs. For instance, Showe et al. used a support vector machine model of 29 genes expressed in PBMCs to distinguish between non-small cell lung cancer and non-malignant lung disease with 86% accuracy . Serrano et al. also used support vector machines in combination with PBMC expression data to identify HIV-negative donors or HIV-positive cART-treated samples with 94% accuracy . Thus, machine learning has been successful in a wide spectrum of disease classification problems.
In the present study, we renewed our efforts to determine the feasibility of developing biomarkers for disease activity in JIA from PBMCs, which are easy and relatively inexpensive to obtain and prepare from clinical blood samples. We also sought to determine whether such biomarkers might be detected in an ethnically heterogeneous population that included Native American children, as the optimal biomarkers will be those that are useful across the age range and diverse ethnicities of the JIA population. Although patient-to-patient variability in gene expression continues to be a challenge, we demonstrate that machine learning approaches can overcome this and successfully classify patients by disease activity.
Patients and patient characteristics
This project was reviewed and approved by the IRBs of the University of Oklahoma Health Sciences Center, the University at Buffalo, and Seattle Children’s Hospital. All research was conducted according to the IRB-approved protocol. Written informed consent documents were executed for each subject, and, where appropriate, children also executed consent documents.
Samples for this study were obtained from children with polyarticular JIA as determined from standard criteria and included three children who were rheumatoid factor positive , as, once again, we are seeking biomarkers that are applicable to the broadest range of children with JIA. For the purposes of this study, we examined only children with an active disease on medication with methotrexate (MTX) plus a biologic agent (ADT group) and children who had achieved clinical remission on medication (CRM group) who were on the same medications. Active disease and remission statuses were assigned according to the criteria of Wallace et al. . Active disease was determined by the presence of physical signs of synovitis (warmth, synovial thickening) in at least one joint. Although the Wallace criteria assign active disease status to patients with laboratory findings that are not otherwise explained by an intercurrent infectious illness (e.g., elevated erythrocyte sedimentation rate, thrombocytosis), no patient in this study was assigned to the ADT group based on the isolated laboratory findings; all had synovitis in at least one joint. Children were assigned to the CRM group if they had no signs of active synovitis and morning stiffness lasting for < 10 min/day and no laboratory abnormalities other than those that might be attributed to a self-limited illness or medications. Furthermore, following the Wallace criteria, children in CRM had maintained this condition for six continuous months or more.
The first dataset included 12 samples from patients who were involved in a previous study . Of these, 4 were from patients with ADT, while 8 were from children who fit the criteria of CRM. RNA sequencing (RNAseq) data from those samples are available on the Gene Expression Omnibus (GEO, #GSE79970). These samples were all taken from children with non-Hispanic, European ancestry.
For this study, we also performed RNAseq on 38 additional samples from children with polyarticular JIA. This study included samples from 15 patients (4 boys and 11 girls) who had achieved clinical remission on medication (CRM). Of these 15 patients, 10 achieved CRM status on methotrexate (MTX) alone, while 5 achieved CRM on MTX plus a TNF inhibitor (2 on adalimumab, 3 on etanercept). The remaining 23 (6 boys and 17 girls) were classified as having an active disease on therapy (ADT). Among the patients with ADT, 18 were on MTX alone, while 5 were on TNF inhibitors (3 on etanercept and 2 on adalimumab). There were 7 individuals from whom we collected longitudinal data; therefore, these subjects have both ADT and CRM samples. Nine samples were from individuals with mixed Native American/European ancestry recruited from the pediatric rheumatology clinic at the University of Oklahoma. Only 1 of these patients achieved CRM within a year of diagnosis, which may be a reflection of the severity of disease in the Native population, which we have previously documented . In total, we analyzed an ethically heterogeneous dataset of 50 samples, 23 of which were from patients in CRM and the remaining 27 from patients in ADT.
Whole blood was drawn into 10-mL citrated CPT tubes (Becton Dickinson, Franklin Lakes, NJ). Cell separation procedures were started within 1 h from the time the specimens were drawn. Peripheral blood mononuclear cells were separated from granulocytes and red blood cells by density gradient centrifugation. Cells were then immediately placed in TRIzol® reagent (Invitrogen, Carlsbad, CA) and stored at − 80 °C.
RNA isolation and sequencing
RNA isolation and sequencing were carried out exactly as described in . In brief, total RNA was extracted using TRIzol™ reagent and further purified using the RNeasy MiniElute Cleanup kit (Qiagen, Valencia, CA). cDNA libraries were prepared for each sample using the Illumina TruSeq RNA sample preparation kit. Libraries were sequenced using 100-bp paired-end reads on the Illumina HiSeq 2500 platform.
RNA sequencing data analysis was carried out using Galaxy suite of tools. Sequenced paired-end reads were aligned to the human reference genome hg19 with Bowtie2 (version 22.214.171.124) using default settings. Counts were assigned to aligned reads using htseq-count (version 0.9.1). Counts were then converted to transcripts per million (TPM) to facilitate a comparison between the samples. As data was collected during 2 different time points, we considered the 12 samples previously analyzed to be batch 1 and the 38 new samples to be batch 2. We used ComBat under default settings in R to perform batch effect correction.
We performed principal component analysis (PCA) using raw counts of all 50 samples in R using prcomp function to visually assess the presence of outliers or batch effects in our dataset.
K-nearest neighbors: The K-nearest neighbor method using a Euclidean metric was employed. The number of neighbors, K, was set as 10. By calculating the distance to each training sample, the testing sample was classified as the class that was most common among its K-nearest neighbors.
Random forest: The random forest method was employed to build a prediction model. The number of trees was set at 1000. The random forest was built by contrasting a multitude of decision trees based on subsets of the training data generated by random sampling with replacement. The resulting model classifies testing samples by the majority vote of the decision trees.
Support vector machine: We trained prediction models using support vector machine with two different kernels: cubic and Gaussian. To separate the binary-labeled samples, the SVM transforms them into a multidimensional space using the kernel, followed by a hyper-plane that maximizes the margin separating samples of either class. The resulting model classifies testing samples by transforming them into the higher dimensional space with the corresponding kernel and making decisions based on their signed distance to the hyper-plane.
Models were firstly evaluated using tenfold cross-validation in the training cohort. Predicted disease stage was compared to clinical diagnosis in order to determine the accuracy, sensitivity, and specificity of each model, which were averaged across the tenfold cross-validation for each model. We also implemented a receiver operating characteristic (ROC) curve analysis to determine the area under the curve (AUC) for each model as an additional performance metric. Trained models were then evaluated in an independent testing cohort. We calculated the accuracy, sensitivity, specificity, and AUC for each model based on the model predictions of testing samples.
We executed this pipeline in two different populations. First, we developed models for JIA disease stage prediction in the whole dataset. Then, since clinical experience has demonstrated that patients with Native American ancestry have more difficulty reaching remission (possibly reflecting different genetic or epigenetic influences), we sought to determine whether including this diverse patient population affected model performance. Therefore, we developed models only using samples of European descent and compared their performance to models generated using all samples.
We used ingenuity pathway analysis (IPA) to investigate the biological pathways and networks associated with model genes using the whole dataset. Model genes with their log2-fold change of the ADT group relative to the CRM group and t test p values (both calculated using batch effect corrected TPMs of all 47 samples) were used as input. We considered disease, biological functions, and upstream regulators to be significant if the absolute value of their activation z-score is > 2. The z-score reflects the match between the observed gene expression and the predicted relationship direction. It is used to infer activation states of called functions and regulators. Networks were deemed significant if their score was greater than 15. Network score uses a hypergeometric distribution and calculated with right-tailed Fisher’s exact test in this equation: p-score = −log10(p value). This score reflects the approximate fit between the generated networks and eligible molecules from the input list.
For the newly sequenced samples, we achieved, on average, 73% of reads aligned concordantly exactly 1 time or more than 1 time using Bowtie2. Raw counts with a sum > 0 across all samples were used to visualize all samples by PCA. Upon visualizing the entire dataset (n = 50) using PCA, it became clear there were 3 outliers (603p-ADT, S17-CRM, and S17-ADT). These samples were removed prior to any model development, making our dataset 47 samples in total. We then used Combat in R to correct for batch effects present in this 47 sample dataset.
Whole dataset model
Analysis of model genes
Transcripts selected by HSIC LASSO for model training using all samples and only European samples
Whole dataset model
We note that multiple genes within the two models are associated with biological functions that are relevant to JIA. For example, CD97 is an early activation marker on T cells and interacts with CD55 to serve as a co-stimulator . CD97 also serves as an adhesion receptor on inflammatory cells and stimulates angiogenesis through binding integrin counter-receptors on endothelial cells . The CCAAT/enhancer-binding protein delta (CEBPD, C/EBPδ) is a transcription factor that modulates multiple biological processes, including inflammation . ZAP70 kinase is an essential protein for T cell signaling, and its absence leads to severe immunodeficiency . IFNAR2 is the high-affinity receptor for interferons α and β, important modulators of innate and adaptive immunity. We provide a complete list of the genes that emerged from both models in Additional file 2: Table S2.
Numerous studies have documented differences in the gene expression profiles in PBMC when children with JIA are compared to healthy children . The specific differences in the expression profiles from what is seen in healthy children vary from one JIA subtype to another . Thus, there has been considerable interest in exploiting these features of JIA both to develop clinically useful assays (e.g., for diagnosis and prognosis) and to better understand the biology of treatment response .
As the attainment of remission has now become a reachable goal in both pediatric rheumatology clinical trials and the “target” of “treat-to-target” therapeutic strategies , there is a pressing need for objective biomarkers that will confirm that remission has been achieved. We have shown that treatment response in JIA is a highly complex, non-linear process associated with considerable re-ordering of peripheral blood cell transcriptomes after treatment is initiated [24, 25]. The resulting state of “remission,” as reflected in the peripheral blood transcriptomes, is not a “normalization” of expression patterns, but, rather, a new and distinct state that differs little whether remission (CRM) was achieved using methotrexate alone or methotrexate combined with a TNF inhibitor . These findings suggested that peripheral PBMC expression signatures might be used to determine objectively that an individual patient had achieved remission, regardless of the medications used.
Our earliest attempts in developing expression-based biomarkers from PBMC expression data were hindered by both the heterogeneity of the cell population and the heterogeneity of the patients . However, since that earlier paper, numerous studies have demonstrated the utility of machine learning analytic approaches in developing biomarkers for disease “states,” even when heterogeneous cell and patient populations are studied [7, 8, 27, 28, 29]. For example, we have recently demonstrated that machine learning analysis of peripheral blood neutrophil expression data can accurately identify patients with intracranial aneurysms [30, 31]. We have therefore renewed our efforts to use PBMC expression data to develop biomarkers of active disease and remission in polyarticular JIA.
In the current study, we demonstrate that is it possible to predict the JIA disease stage, whether active disease or remission (CRM), using PBMC expression data and machine learning methods. Furthermore, we demonstrate that the models performed equally well whether we included only non-Hispanic, European-descended subjects or a larger cohort that included enrolled members of Native American tribes from Oklahoma. Indeed, the models performed better when the Native American patients were added, although this may simply have been due to the fact that adding these patients increased the number of subjects and thus the accuracy and power of the predictive models. We expect that we will further improve the predictive power by further increasing the sample size. Implementing other machine learning methods, such as neural networks , may further improve our ability to classify samples from a heterogeneous cell population.
Our results were corroborated by functional analysis of the genes that were used to discriminate active disease and remission. The discriminating transcripts included genes plausibly linked to the biology of therapeutic response (a process about which we still understand very little). Furthermore, the functional associations with MAP kinase signaling, intercellular communication, and cell adhesion are consistent with the findings from our previous gene expression studies [2, 3, 5].
We were unable to discriminate between children who achieved remission on MTX alone vs. those who achieved remission on MTX plus a biological agent. These findings are consistent with our earlier paper showing the CRM, as defined by the Wallace criteria, represents a distinct immunologic/transcriptional state that differs a little (at least in PBMC) whether that state was achieved with MTX or MTX plus a biological agent . These findings suggest that there are specific immune “set points” that must be reached in order for the remission state to be achieved. Our findings from neutrophils are also consistent with that hypothesis .
Despite these promising results, we are still faced with the challenge of using PBMC expression data from individual patients to assign “active disease” or “CRM” status, which will be required if such information is to be used to inform clinical care. Achieving this goal will require several additional steps, which will include  refining this model using a larger cohort of patients , validating the model on an independent cohort of patients, and  blindly assessing a third cohort to assign status in individual expression profiles. These are all achievable goals. Furthermore, it seems likely that the problem of using a bulk population of heterogeneous cells (PBMC) might be overcome using single-cell RNAseq methods. While this approach comes with its own analytic challenges , there is no reason, a priori, that machine learning approaches could not be adapted to these types of datasets to developed classification models. Furthermore, single-cell data may provide useful insights into the mechanisms leading to the attainment of remission that cannot be elucidated using bulk populations of heterogeneous cells.
Another step that will need to be taken before expression-based assessments of CRM come into clinical use will be the demonstration that this approach is superior than the method currently in use: application of the clinical and laboratory findings that make up the Wallace criteria (1). There is an ongoing concern in the field that physical examination alone is insensitive to detect subtle, smoldering synovitis, for example . In the present study, we limited ourselves to making a head-to-head comparison between expression profiling and the existing “gold standard,” i.e., the Wallace criteria. However, future studies will invariably require an assessment of the reliability of expression profiling vs. the Wallace criteria at, for example, predicting disease flares on or off medication.
PBMCs remain a promising source for the development of expression-based biomarkers, provided that the proper analytic tools are used to develop classification algorithms. The ease with which these cells can be obtained and the continued improvement and affordability of sequencing techniques make this a fruitful line of research despite the inherent challenges of cell and patient heterogeneity.
KEP and JNJ contributed to the study concept and design. KJ contributed to the wet lab procedures. KEP, JNJ, and KJ contributed to the data analysis and interpretation and drafted the manuscript. All authors critically revised and read and approved the final manuscript.
This work was supported by R01-AR-060604 (JNJ) and R01-AR-049762 from the National Institutes of Health. It was also supported by the National Center for Advancing Translational Sciences of the National Institutes of Health under award number UL1TR001412 to the University at Buffalo. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. The authors declare no support from commercial sources for this work.
Ethics approval and consent to participate
This project was reviewed and approved by the IRBs of the University of Oklahoma Health Sciences Center, the University at Buffalo, and Seattle Children’s Hospital. All research was conducted according to the IRB-approved protocol, and informed consent was obtained from all subjects.
Consent for publication
No specific patient information/images are presented in the manuscript. Consent documents for this study explicitly state that data will be published but that individual subjects will remain anonymous.
KEP, KJ, LL, YS, CAW, TH, and JNJ declare that they have no competing interests. HM is a principal investigator: three Brain Aneurysm Foundation grants (no applicable grant numbers), NIH Grants R01-NS-091075 and R01-NS-064592, and Grant support Canon Medical Systems. HM is a co-founder of Neurovascular Diagnostics, Inc.
- 2.Knowlton N, Jiang K, Frank MB, Aggarwal A, Wallace C, McKee R, et al. The meaning of clinical remission in polyarticular juvenile idiopathic arthritis: gene expression profiling in peripheral blood mononuclear cells identifies distinct disease states. Arthritis Rheumatism. 2009;60(3):892–900.CrossRefGoogle Scholar
- 7.Showe MK, Vachani A, Kossenkov AV, Yousef M, Nichols C, Nikonova EV, et al. Gene expression profiles in peripheral blood mononuclear cells can distinguish patients with non–small cell lung cancer from patients with nonmalignant lung disease. Cancer Research. 2009:0008–5472.CAN-09-1378.Google Scholar
- 14.Astuti W, Adiwijaya. Support vector machine and principal component analysis for microarray data classification. Journal of Physics: Conference Series 2018;971:012003.Google Scholar
- 18.Balamurugan K, Sterneck E. The many faces of C/EBPδ and their relevance for inflammation and cancer. Int J Biol Sci. 2013 2013; 9(9):[917-33 pp.]. Available from: http://europepmc.org/abstract/MED/24155666, http://europepmc.org/articles/PMC3805898?pdf=render, http://europepmc.org/articles/PMC3805898, https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/24155666/?tool=EBI, https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/24155666/pdf/?tool=EBI, https://doi.org/10.7150/ijbs.7224.CrossRefGoogle Scholar
- 20.Griffin TA, Barnes MG, Ilowite NT, Olson JC, Sherry DD, Gottlieb BS, et al. Gene expression signatures in polyarticular juvenile idiopathic arthritis demonstrate disease heterogeneity and offer a molecular classification of disease subsets. Arthritis Rheumatism. 2009;60(7):2113–23.CrossRefGoogle Scholar
- 32.Haykin S. Neural networks: a comprehensive foundation: Prentice Hall PTR; 1994. p. 768.Google Scholar
- 34.Miotto e Silva VB, SdAV M, RNV F, Natour J, Len CA, MTdSeLRA T. Patients with juvenile idiopathic arthritis in clinical remission with positive power Doppler signal in joint ultrasonography have an increased rate of clinical flare: a prospective study. Pediatric Rheumatol. 2017;15(1):80.CrossRefGoogle Scholar
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.