Analysis of AmpliSeq RNA-Sequencing Enrichment Panels

  • Marek S. Wiewiorka
  • Alicja Szabelska
  • Michal J. Okoniewski
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9124)

Abstract

This study presents a proof of concept for encoding genomic signatures with the AmpliSeq technology. Samples from patients with a disease and from healthy donors were processed using a custom-designed AmpliSeq RNA sequencing kit that includes 290 amplicons, and were sequenced on an IonTorrent machine. The read count data show sufficient coverage in most of the chosen amplicons, which results in good separability between the patients and the healthy donors. In addition, several amplicons allow useful genomic variants (SNPs) to be checked whenever the coverage level permits. The paper evaluates machine-learning classifiers that distinguish patients from healthy donors on the basis of the AmpliSeq panel data. The outcome confirms the potential utility of similar RNA amplicon kits in research and clinical practice for encoding gene expression signatures of diseases and their phenotypes.

Keywords

Genomics Transcriptomics Amplicon sequencing Classification Genomic signatures 

1 Introduction

Next-generation sequencing techniques, which have already become a driving force in molecular biology, have recently been introduced into various areas of medicine. To achieve focused insight at an affordable cost, it is essential to apply specialized kits for enrichment of particular sequences, e.g. exome kits [1, 2]. Such enrichment kits now exist for both DNA and RNA sequencing and can be ordered in a fully customized way, with the design process run through the specialized web interfaces of the technology providers. The other very popular technique for precise measurements in genomics and transcriptomics is RT-PCR. The primers can be freely designed or ordered in pre-defined panels specific to a given application, e.g. TaqMan [3]. The statistical analysis of such data is described in [4]. Amplicon enrichment kits for RNA sequencing combine the advantages of both approaches: enriched sequencing and RT-PCR. Amplicon sequencing has recently been done on sequencing platforms of all three generations [5]; still, combining many amplicons in one PCR run and preparing an RNA sequencing library from the amplified product is a novel technique. An example of such a technique is AmpliSeq, introduced by LifeTech in early 2013. As with other products of modern nanotechnology, the biological hardware often precedes the methodologies for in-depth analysis of data, simply because of the amount and variety of data that it produces. This paper presents a study of the technical applicability of AmpliSeq kits in the area of autoimmune disease, in particular to verify the gene expression signature [6, 7] that differentiates patients with various diseases from healthy donors.
It can also serve as a technical proof that encourages researchers from other areas of medical research to encode their gene expression signatures into amplicon sequencing panels, especially if the speed, precision and cost of analysis prove to be competitive.

2 Materials and Methods

2.1 Panel Design

The amplicon panel was designed with 289 amplicons covering 284 genes known from the medical literature to be specific for the disease. Twelve amplicons included a SNP in the coding region.

2.2 RNA Samples

Blood samples were collected from 8 patients and 8 healthy donors matched to the patients by age and gender. RNA extraction was done using the RNeasy Mini Kit (Qiagen, cat. no. 74104) with subsequent purification by precipitation and ethanol washes. Concentration and purity were measured with a NanoDrop 1000. Integrity of the RNA was measured with an Agilent 2200 TapeStation, resulting in RIN* values between 8.6 and 9.3. The sequencing libraries were prepared with the Ion AmpliSeq RNA Library kit (Life Technologies, cat. no. 4482335) using custom primers (Life Technologies), designed as above, and Ion Xpress Barcode Adapters (Life Technologies, cat. no. 4474518), according to the manufacturer’s protocol. Barcoded libraries were pooled in equimolar amounts, diluted to a concentration of 20 pM and used for subsequent template preparation with the Ion PGM Template OT2 200 Kit (Life Technologies, cat. no. 4480974), according to the manufacturer’s protocol.

2.3 Sequencing and Data Acquisition

The sequencing reactions were performed using a Personal Genome Machine (PGM) System with the Ion PGM Sequencing 200 Kit v2 (Life Technologies, cat. no. 4482006) and the Ion 318 Chip Kit (Life Technologies, cat. no. 4466617), according to the manufacturer’s instructions. The sequence data were first processed using the TorrentServer software (TorrentSuite v 3.6, LifeTech). Read data were extracted from the chip using the FastqCreator plugin v3.6.0-r57238.

2.4 Mapping of Reads and Variant Calling

The reads were mapped to the canonical transcript list (LifeTech, RefSeq based) using ion-alignment v 3.6,3-1, and the reads in amplicons were counted using the coverageAnalysis plugin v3.6.58977 with the BED file describing the amplicons. In addition, variant calling was done using the variantCaller plugin v3.6.59049. In an alternative analysis path, the reads were mapped with the tophat mapper (v. 2.0.8b) [9] to the current human genome reference (hg19) instead, and the counts table was then generated in R, in particular with the RSamtools and rnaSeqMap [10] libraries from Bioconductor. The encoded SNPs and variants were checked using the variantCaller plugin (v3.6.59049) of Torrent Server and, alternatively, with samtools mpileup (v0.1.18) [11]. As the distribution of counts does not permit a classic t-test, and the number of measured amplicons is too small to properly estimate a negative binomial distribution for RNA-sequencing tests [12], differential expression was measured by the log2 fold change between the two groups and by the non-parametric test from SAMseq [13]. The count data for the patients and healthy donors were classified with algorithms available in the MLInterfaces library of Bioconductor. The classifiers were trained on the count data, and also on variants of the data extended with the gender attribute and with information about the presence of SNPs. Leave-one-out cross-validation with five different classifiers was applied to count the correctly classified samples.
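The log2 fold change used above can be illustrated with a minimal Python sketch. It is not the authors' R code: the function name, the amplicon names, the counts and the pseudocount of 1 are all assumptions made for illustration, and the SAMseq test is not reproduced here.

```python
import math


def log2_fold_change(patient_counts, donor_counts, pseudocount=1.0):
    """Return log2 of the ratio of mean counts (patients over donors).

    The pseudocount is an assumption of this sketch; it keeps the ratio
    finite when an amplicon has zero reads in one of the groups.
    """
    mean_patients = sum(patient_counts) / len(patient_counts)
    mean_donors = sum(donor_counts) / len(donor_counts)
    return math.log2((mean_patients + pseudocount) / (mean_donors + pseudocount))


# Made-up read counts for two hypothetical amplicons, 4 samples per group.
counts = {
    "AMPL_A": {"patients": [120, 130, 125, 133], "donors": [30, 28, 35, 31]},
    "AMPL_B": {"patients": [14, 16, 15, 15], "donors": [15, 15, 17, 13]},
}

fold_changes = {
    name: log2_fold_change(groups["patients"], groups["donors"])
    for name, groups in counts.items()
}

for name, value in fold_changes.items():
    print(f"{name}: log2FC = {value:.2f}")
```

A positive value means higher mean counts in patients, a negative value higher mean counts in donors. In practice the counts would be normalized for library size before the ratio is taken.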
Fig. 1.

The normalized distribution of log2 of counts in the amplicons (A/left). The blue line is the distribution for all the amplicons and all samples as mapped with tophat to hg19, the green line is the least abundant sample, and the red line is the sample with the largest total number of reads. If the cutoff level for detection is 10, then ca. 10 % of the values are below the detection level. Globally, 235 amplicons in the panel have expression over this detection level in at least one of the 16 samples. The volcano plot (B/right) shows the relationship between the log2 fold change and the q-value from the SAMseq test for the amplicons in the panel when comparing the patients and healthy donors (Color figure online).

3 Results and Discussion

The sequencing of 16 samples with the 318 chip resulted in 2853777 total reads, out of which 2.851M (97 % aligned bases) could be mapped to the canonical transcript list. Detection levels. Out of 289 amplicons, 235 reached the detection level of 10 in at least one of the samples. The coverage, showing the fraction of amplicons within a useful range, is presented in Fig. 1A.
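The detection-level check described above can be sketched as follows. This is an illustrative Python example, not the analysis code of the study: the table layout, the function names and the toy counts are hypothetical, and treating a count equal to the cutoff as detected is an assumption.

```python
def detected_amplicons(count_table, cutoff=10):
    """Return the amplicons that reach the cutoff in at least one sample."""
    return [name for name, counts in count_table.items() if max(counts) >= cutoff]


def fraction_below(count_table, cutoff=10):
    """Return the fraction of all count values that fall below the cutoff."""
    values = [count for counts in count_table.values() for count in counts]
    return sum(1 for count in values if count < cutoff) / len(values)


# Toy table: amplicon name -> read counts in four hypothetical samples.
toy_table = {
    "AMPL_A": [0, 3, 12, 7],
    "AMPL_B": [2, 1, 0, 4],
    "AMPL_C": [150, 98, 201, 175],
}

detected = detected_amplicons(toy_table)
below = fraction_below(toy_table)

print("detected amplicons:", detected)
print(f"fraction of values below cutoff: {below:.2f}")
```

Applied to the full counts table, the first function would yield the number of usable amplicons and the second the share of values under the detection level.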

3.1 Fold Change and Differential Expression

Fig. 1B presents the distribution of fold changes together with the SAMseq results.

The combined results of the detection level check and the differential expression analysis indicate that the majority of amplicons were designed in a way that is useful for differentiating between patients and healthy donors. Detection of SNPs is possible only when the depth of coverage of the reads is sufficient. In the case of RNA it depends on gene expression and is therefore never guaranteed. Nonetheless, some of the SNPs were detected and reported in VCF files in a systematic way. The utility of the gene expression signature is confirmed by the machine learning approach. The initial clustering (Fig. 2A and B, on subsets of the 77 amplicons with the highest absolute log fold change and the 75 with the highest variance) shows that most of the patients and most of the healthy donors cluster together. Gender seems to have some influence on the clustering. There is one outlier: a healthy donor who turned out to come from a different geographical zone than the others and thus could have an immune system trained in a completely different way than the other patients and donors, who all come from Europe. The results of classification, summarized in Fig. 3, show that most of the samples are correctly classified. Adding the gender information increases the predictive power of classification. In a similar way, adding the attribute describing the SNPs found increases the number of correctly classified samples. All the results of the analyses described above support the claim that gene signatures can be efficiently encoded in an AmpliSeq panel. The read counts representing the expression levels of genes may be used together as predictors, and also in combination with clinical parameters (e.g. gender, age) and with genotyping results, which for some genes can be obtained from the same RNA panel.
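To make the evaluation scheme concrete, the following Python sketch runs leave-one-out cross-validation on a toy data set. It is only an illustration under stated assumptions: the study used classifiers from the MLInterfaces library in R, whereas a simple nearest-centroid rule stands in for them here, and the feature values (log2 counts of two hypothetical amplicons plus a 0/1 gender attribute) are made up.

```python
import math


def nearest_centroid_predict(train_x, train_y, sample):
    """Predict the label whose class centroid is closest to the sample."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(column) / len(rows) for column in zip(*rows)]

    def distance(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

    # sorted() makes the choice deterministic if two centroids tie.
    return min(sorted(centroids), key=lambda label: distance(centroids[label], sample))


def leave_one_out_correct(features, labels):
    """Count the samples classified correctly when each is held out once."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct


# Made-up data: [log2 count amplicon 1, log2 count amplicon 2, gender (0/1)].
features = [
    [7.0, 3.1, 0], [6.8, 2.9, 1], [7.2, 3.3, 0], [6.9, 3.0, 1],  # disease
    [4.1, 5.0, 0], [4.3, 5.2, 1], [3.9, 4.8, 0], [4.0, 5.1, 1],  # healthy
]
labels = ["disease"] * 4 + ["healthy"] * 4

n_correct = leave_one_out_correct(features, labels)
print(f"{n_correct} of {len(labels)} samples correctly classified")
```

Dropping or adding columns such as the gender attribute or a SNP indicator and re-running the loop mirrors the comparison of attribute sets described above.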

The results described above show that such an approach is feasible and may yield medically useful results. There is still room for improvement, especially in the custom design of the panel. In particular, tuning the selection of amplicons can be used to distinguish between disease phenotypes in cases that can be diagnosed from peripheral blood samples, as we have shown in the case of the disease.

The approach turned out to be sufficiently precise and customizable that it may soon become competitive with the already classic RT-PCR panels measuring RNA expression. In addition, the possibility of variant calling can give some extra genotyping hints that can be used for the classification of samples and, in consequence, to support medical diagnostics.
Fig. 2.

The heatmaps of the amplicon counts that best differentiate patients and healthy donors by log2 fold change (A/left) and of the amplicon counts most variable across all samples (B/right). The heatmap.2 function settings are default, with the color scale set to red-green and the marginal 15 % of values shown in full color saturation. The heatmaps show that, even without particular tuning of a classifier, the samples are grouped mainly by disease status (Healthy/Disease) and gender (man/woman) (Color figure online).

Fig. 3.

The number of samples correctly classified by various classifier functions from MLInterfaces. The classification can be based upon the gene expression signature limited to 100 amplicons, optionally extended with the gender attribute, the SNP attributes, or both.

4 Software Availability

The software is available as the Bioconductor R package ampliQueso:

http://www.bioconductor.org/packages/release/bioc/html/ampliQueso.html

Notes

Acknowledgements

We are grateful to Kelli Bramlett, Jeoffrey Schageman, and Daniel Williams from LifeTech for discussion on AmpliSeq technology, to Andreas Tobler for coordinating the collaboration and to Marzanna Künzli-Gontarczyk, Daria Bochenek and Josias Brito Frazao for the help in the sequencing library prep and discussion on the lab aspects of the study. This work was supported by the grants Sciex.ch (nr. 11.182 to AS and MO, and nr 12.289 to MW and MO).

References

  1. Clark, M.J., Chen, R., Lam, H.Y.K., Karczewski, K.J., Chen, R., Euskirchen, G., et al.: Performance comparison of exome DNA sequencing technologies. Nat. Biotechnol. 29(10), 908–914 (2011). doi: 10.1038/nbt.1975
  2. Sulonen, A.-M., Ellonen, P., Almusa, H., Lepistoe, M., Eldfors, S., Hannula, S., et al.: Comparison of solution-based exome capture methods for next generation sequencing. Genome Biol. 12(9), R94 (2011). doi: 10.1186/gb-2011-12-9-r94
  3. Rachwal, P.A., Rose, H.L., Cox, V., Lukaszewski, R.A., Murch, A.L., Weller, S.A.: The potential of TaqMan array cards for detection of multiple biological agents by real-time PCR. PloS One 7(4), e35971 (2012). doi: 10.1371/journal.pone.0035971
  4. Yuan, J., Reed, A., Chen, F., Stewart, C.N.: Statistical analysis of real-time PCR data. BMC Bioinform. 7(1), 85–85 (2005). doi: 10.1186/1471-2105-7-85
  5. Okoniewski, M.J., Meienberg, J., Patrignani, A., Szabelska, A., Matyas, G., Schlapbach, R.: Precise breakpoint localization of large genomic deletions using PacBio and Illumina next-generation sequencers. BioTechniques 54(2), 98–100 (2013). doi: 10.2144/000113992
  6. Veer, L.V., Dai, H., Van De Vijver, M.J., He, Y.D.: Gene expression profiling predicts clinical outcome of breast cancer. Nature 415(6871), 530–536 (2002)
  7. Van De Vijver, M.J., He, Y.D., Veer, L.V., et al.: A gene-expression signature as a predictor of survival in breast cancer. New Engl. J. Med. 347, 1999–2009 (2002). doi: 10.1056/NEJMoa021967
  8. Zhou, T., Zhang, W., Sweiss, N.J., Chen, E.S., Moller, D.R., Knox, K.S., et al.: Peripheral blood gene expression as a novel genomic biomarker in complicated sarcoidosis. PloS One 7(9), e44818 (2012). doi: 10.1371/journal.pone.0044818
  9. Trapnell, C., Pachter, L., Salzberg, S.L.: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (Oxford, England) 25(9), 1105–1111 (2009). doi: 10.1093/bioinformatics/btp120
  10. Leśniewska, A., Okoniewski, M.J.: rnaSeqMap: a bioconductor package for RNA sequencing data exploration. BMC Bioinform. 12, 200 (2011). doi: 10.1186/1471-2105-12-200
  11. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., et al.: The sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England) 25(16), 2078–2079 (2009). doi: 10.1093/bioinformatics/btp352
  12. Anders, S., McCarthy, D.J., Chen, Y., Okoniewski, M., Smyth, G.K., Huber, W., Robinson, M.D.: Count-based differential expression analysis of RNA sequencing data using R and bioconductor. Nat. Protoc. 8, 1765–1786 (2013). http://arxiv.org/abs/1302.3685
  13. Li, J., Tibshirani, R.: Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-seq data. Stat. Methods Med. Res. 22(5), 519–536 (2011)

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Marek S. Wiewiorka (1)
  • Alicja Szabelska (2)
  • Michal J. Okoniewski (3)
  1. Institute of Computer Science, Warsaw University of Technology, Warsaw, Poland
  2. Department of Mathematical and Statistical Methods, Poznan University of Life Sciences, Poznan, Poland
  3. Scientific IT Services, Swiss Federal Institute of Technology (ETH Zurich), Zurich, Switzerland
