Determining the quality and complexity of next-generation sequencing data without a reference genome
We describe an open-source kPAL package that facilitates an alignment-free assessment of the quality and comparability of sequencing datasets by analyzing k-mer frequencies. We show that kPAL can detect technical artefacts such as high duplication rates, library chimeras, contamination and differences in library preparation protocols. kPAL also successfully captures the complexity and diversity of microbiomes and provides a powerful means to study changes in microbial communities. Together, these features make kPAL an attractive and broadly applicable tool to determine the quality and comparability of sequence libraries even in the absence of a reference sequence. kPAL is freely available at https://github.com/LUMC/kPAL.
Keywords: Whole Genome Sequencing; Problematic Sample; Whole Exome Sequencing; Whole Genome Sequencing Data; Whole Exome Sequencing Data
kPAL: k-mer Profile Analysis Library
PCA: Principal component analysis
PCR: Polymerase chain reaction
SGA: String Graph Assembler
WES: Whole exome sequencing
WGS: Whole genome sequencing
During the past decade, DNA sequencing technologies have undergone notable improvements with great impacts on molecular diagnostics and biomedical and biological research. Today, next-generation sequencing (NGS) technologies can provide insights into sequence and structural variations by achieving unprecedented genome and transcriptome coverage. Despite molecular and computational advances, the rapid developments in library preparation, sequencing chemistry and experimental settings are of concern as they can diversify the complexity and quality of sequencing data. To address data quality, most strategies rely on basic statistics of the raw data, such as the quality scores associated with base calling, the total number of reads and average GC content. Technical artefacts are usually only spotted after mapping of reads to the reference genome. However, such approaches are prone to alignment biases and the loss of potentially valuable information due to biased and incomplete reference genome sequences. These biases are considerably more problematic in studies of microbiomes as the species diversity can be immense, whereas the evaluation of data complexity and quality is limited to the analysis of species for which a reference genome sequence is available.
Analyzing the k-mer (DNA words of length k) frequency spectrum of the sequencing data provides a unique perspective on the complexity of the sequenced genomes, with more complex ones showing a greater diversity in unique sequences and repeated structures. Over- and under-represented k-mers have been associated with the presence of functional or structural elements (such as repetitive, mobile or regulatory elements), negative selection, or the hypermutability of CpGs. Notably, the prevalence of functional elements and those caused by neutrally evolving DNA (including duplications, insertions, deletions and point mutations) is reflected in the modality (number of peaks) of the k-mer frequency spectrum. The modality of the human genome also depends on function, as all coding regions, including the 5′ untranslated regions (UTRs), exhibit a unimodal k-mer spectrum, while the introns, 3′ UTRs and other intergenic regions have a multimodal distribution.
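The notion of a k-mer frequency spectrum can be made concrete with a minimal Python sketch. It is an illustration only and not part of kPAL; the function names count_kmers and frequency_spectrum are hypothetical, and windows containing symbols other than A, C, G and T are simply skipped, which is an assumption of this example.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-mer made only of A/C/G/T in an iterable of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip windows with N or other symbols
                counts[kmer] += 1
    return counts


def frequency_spectrum(counts, k):
    """Map an occurrence count n to the number of distinct k-mers seen n times.

    The entry for n = 0 is the number of k-mers that were never observed.
    """
    spectrum = Counter(counts.values())
    spectrum[0] = 4 ** k - len(counts)
    return dict(sorted(spectrum.items()))


reads = ["ACGTACGT", "TTTTACGA"]
profile = count_kmers(reads, k=2)
spectrum = frequency_spectrum(profile, k=2)
print(spectrum)
```

Plotting the spectrum (occurrence count against number of distinct k-mers) is what reveals the number of peaks discussed above.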
In recent years, k-mers have been used in a wide range of applications from the identification of regulatory elements to correction of sequencing errors, genome assembly, phylogeny analysis and the search for homologous regions. It has also been shown that the characterization and comparative analysis of the k-mer spectrum can provide an unbiased view of genome size and structure, and can also expose sequencing errors. However, to our knowledge, most tools fail to accommodate differences in library size and neither reliably expose problematic samples nor provide information on potential sources of variation in series of sequencing data. Here, we present a method, k-mer Profile Analysis Library (kPAL), for assessing the quality and complexity of sequencing data without requiring any prior information about the reference sequence or the genetic makeup of the sample. The proposed method uses the distance between k-mer frequencies to measure the level of dissimilarity within or between k-mer profiles. Since most distance measures are susceptible to differences in library size, we have implemented a series of functions that ensure a more reliable assessment of the level of dissimilarity between k-mer profiles. Based on the same principle, kPAL can identify problematic samples, as their similarity to the other samples drops even though the genomes of the sequenced samples do not differ significantly. In this work, we apply kPAL to four types of NGS data: 665 RNA sequencing (RNA-Seq) samples, 49 whole genome sequencing (WGS) samples, 43 whole exome sequencing (WES) samples, and a series of microbiomes. We report the sources of technical and biological variation present in each set of NGS data, highlight a series of artefacts that were missed by standard NGS quality control (QC) tools, and demonstrate how the complexity of microbiomes is reflected in their k-mer profiles.
Results and discussion
Principles of kPAL
To identify which k provides the best specificity for a mixed sample of bacteria, the k-mer profiles from three modelled metagenomes consisting of 30 bacterial genomes from the Firmicutes and Proteobacteria phyla (in 100:0, 50:50 and 0:100 ratios from each phylum) were compared to ten randomly shuffled sequences (without changing the overall nucleotide composition). The optimal value for k is the one that best separates metagenomes from randomly permuted sets. The overall distance between k-mer profiles of the metagenomes and the corresponding randomly permuted sets starts to level off once k exceeds 10 (Additional file 1: Figure S1). A low amount of variation in distance between the k-mer profiles of metagenomes and their permuted sets indicates that the distance measure is generally robust and only changes according to k. Interestingly, the optimal separation coincides with the k for which the complete unimodal spectrum of frequencies (from those that are too rare to those that are highly recurrent) is observed (Additional file 1: Figure S2A,B,C).
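The randomly shuffled control sequences mentioned above can be sketched as follows. This is a minimal illustration, assuming that shuffling means a uniform random permutation of the bases (which leaves the overall nucleotide composition unchanged); the function name shuffled_controls and the toy sequence are hypothetical.

```python
import random
from collections import Counter


def shuffled_controls(sequence, n_controls=10, seed=0):
    """Return n_controls random permutations of sequence.

    Permuting the bases destroys k-mer structure for k > 1 but keeps
    the counts of A, C, G and T exactly as in the input.
    """
    rng = random.Random(seed)
    controls = []
    for _ in range(n_controls):
        bases = list(sequence)
        rng.shuffle(bases)
        controls.append("".join(bases))
    return controls


genome = "ACGTTGCAAACCGGTT" * 50  # toy stand-in for a real genome
controls = shuffled_controls(genome)
print(Counter(genome) == Counter(controls[0]))
```

Comparing the k-mer profile of the original sequence with those of the controls for increasing k then shows at which k the real sequence becomes distinguishable from a composition-matched random one.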
The human reference genome has a high complexity (described in Additional file 1: Notes), based on the multimodality of its k-mer profiles for k ranging from 9 to 15 (Additional file 1: Figure S3A). In humans, k = 11 is the smallest value for which unique k-mers and nullomers (absent k-mers) are observed, while genomic spectra for k ≥ 13 start to lose their multimodality as they become too unique. Thus, k = 12 was used to give a relatively balanced number of nullomers, and unique and frequent k-mers. This allows for the identification of potential artefacts (mainly reflected by rare k-mers) as well as biological and contextual variations. Interestingly, the level of complexity varies between different types of genomic information (WGS, WES and RNA-Seq; see Additional file 1: Figure S3B). In contrast to genomic sequences, the coding part of the human genome exhibits a unimodal profile, as shown before. The minor differences between the k-mer profiles of the exome and the transcriptome reference sequences are due to the number of shared coding regions between different transcript variants of the same gene. The transcriptome reference sequences generally exhibit higher counts for observed k-mers and lower numbers of nullomers introduced by exon–exon junctions. Moreover, the k-mer spectrum derived from sequencing data is in concordance with that of the reference (Additional file 1: Figure S3C). The minor deviations from the unimodality of the exome and transcriptome data are mainly due to the capture performance (off-target reads introduce low-count k-mers that represent intronic and intergenic regions) and differences in the abundance of expressed mRNA.
In addition to the complexity of the genomic information, the sequencing depth contributes to the modality and the resolution of the k-mer spectrum derived from individual datasets. In RNA-Seq, we observed that the number of 12-nullomers correlates with the total number of reads per dataset (R = −0.80; see Additional file 1: Figure S4A,B). The variation in the total read counts per sample is partly due to study design, as sequencing was performed in seven different laboratories. Thus, the total number of 12-nullomers also varies between samples from different laboratories (Additional file 1: Figure S4C). It is crucial to account for bias introduced by poor and variable coverage, as it may obscure the identification of factors that determine the complexity of the k-mer spectrum. One obvious solution would be to opt for lower k sizes (e.g., k = 9) at the expense of specificity. However, we propose the dynamic smoothing function, which is resilient towards coverage bias and does not sacrifice the specificity of the k-mer spectrum by choosing a smaller k (Additional file 1: Notes). This function only shrinks the k-mer profile locally when the counts do not pass predefined conditions (i.e., they fall below an acceptable threshold for k-mer frequencies). In the next section, we show how kPAL can be used to assess the quality of different types of sequencing data without relying on the availability of a well-characterized reference genome.
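The idea of shrinking a profile only locally can be illustrated with a small Python sketch. It is not kPAL's implementation (the actual procedure is described in Additional file 1: Notes); the rule used here is an assumption made for illustration: a block of k-mers sharing a prefix is split into its four sub-blocks only if each sub-block has a pooled count above the threshold in both profiles, and is otherwise pooled into a single value. The name dynamic_smooth and the list-based profile layout are likewise hypothetical.

```python
def dynamic_smooth(left, right, threshold=0):
    """Pool poorly covered blocks of two k-mer profiles in the same way.

    left and right are count lists of length 4**k, indexed by k-mer in
    lexicographic order, so k-mers sharing a prefix are contiguous.
    """
    left, right = list(left), list(right)

    def pool(profile, start, length):
        # Store the block total in the first position and zero the rest.
        total = sum(profile[start:start + length])
        profile[start:start + length] = [total] + [0] * (length - 1)

    def smooth(start, length):
        if length == 1:
            return
        quarter = length // 4
        starts = range(start, start + length, quarter)
        totals = [
            sum(profile[s:s + quarter])
            for profile in (left, right)
            for s in starts
        ]
        if min(totals) <= threshold:
            # At least one sub-block lacks support: keep the shorter prefix.
            pool(left, start, length)
            pool(right, start, length)
            return
        for s in starts:
            smooth(s, quarter)

    smooth(0, len(left))
    return left, right


# Toy 2-mer profiles (16 entries); the block of k-mers starting with C is sparse.
sparse = [5, 5, 5, 5, 0, 1, 0, 0, 3, 3, 3, 3, 2, 2, 2, 2]
dense = [4] * 16
smoothed_sparse, smoothed_dense = dynamic_smooth(sparse, dense)
print(smoothed_sparse)
print(smoothed_dense)
```

In this toy case only the four k-mers starting with C are pooled, in both profiles, so the two profiles stay comparable position by position while the well-covered blocks keep their full resolution.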
Evaluating data quality without a reference
Comparative analysis of kPAL performance
We benchmarked the performance of kPAL in the identification of problematic samples by comparing the QC analysis of kPAL on a subset of WGS, WES and RNA-Seq samples with results from the Preqc function of the recently developed k-mer based String Graph Assembler (SGA). SGA can estimate genome size, insert size distribution, repeat content and heterozygosity of a sequenced genome as well as the error rate and its potential consequence in de novo assembly. Unlike kPAL, SGA does not perform a pairwise comparison between k-mer profiles obtained from multiple datasets. Thus, we compared SGA’s performance to that of kPAL based on the identification of known problematic samples, using SGA’s estimated genome size, fragment size distribution and the overall error rate. A further evaluation of SGA on the selected datasets is presented in Additional file 1: Figures S14–S17.
In WGS data from the first sample (FG1), SGA confirmed the bimodal insert size distribution of libraries that were prepared based on the first protocol (Additional file 1: Figure S15). Moreover, sequencing data from the two library preparation protocols could be separated based on the position of the first occurring sequencing errors (Additional file 1: Figure S14A). This is in concordance with kPAL results and the presence of a higher level of library chimeras that led to the introduction of artificial and rare k-mers.
The selected WES data consists of two samples with failed capture (WE01_F1L1_NIM and WE02_F1L1_NIM), one sample with multiple problems (WE10_F1L3_NIM), and four samples with acceptable quality that were prepared using Agilent or Nimblegen capture kits (WE13_F2L2_AGI, WE14_F2L1_AGI, WE36_F4L1_NIM and WE37_F4L1_NIM). SGA identified the problematic sample WE10_F1L1_NIM, which suffers from an extremely high duplication rate and a very low number of on-target reads (Additional file 1: Figure S14B). The estimated genome size or duplication rate did not further assist in identifying problematic samples and the position of the first sequencing error seems to be obscured by the low coverage of off-target reads that may resemble erroneous sequences. Together, identification of problematic samples by SGA is less reliable for WES data than whole genome shotgun sequences.
For RNA-Seq data, we selected two samples that passed all quality measures (HG00096.1 and HG00108.7) and four failed samples with different underlying problems (HG00329.5: high duplication; NA12546.1: high rRNA; NA18858.1: poor alignment and NA18861.4: high genomic DNA contamination). SGA’s genome size estimation is designed for WGS data and, therefore, applying SGA on RNA-Seq data should provide an estimate of the expressed part of the genome. Genomic DNA contamination artificially increases the expressed part of the genome and allowed SGA to identify NA18861.4 as a problematic sample (Additional file 1: Figure S14C). SGA could not reliably identify HG00329.5 as a sample with an exceptionally high duplication rate (Additional file 1: Figure S14C). Unlike kPAL, the SGA analysis could not identify the other problematic RNA-Seq samples.
Detecting data complexity
Next, we explored the capability of kPAL in resolving the composition of a more complex series of simulated metagenomes. Without considering the phylogeny, 30 bacterial genomes were selected from both the Firmicutes and Proteobacteria phyla and used to construct 31 datasets where the first set comprises 30 genomes from the Firmicutes phylum. The sequence content of each set was subsequently shifted to the Proteobacteria phylum by single genome substitutions (Additional file 1: Table S2). Thus, the 31st dataset consists of 30 genomes from only the Proteobacteria phylum. After performing the pairwise distance comparison on 10-mer profiles, datasets were plotted based on their distance to each phylum (Figure 6B). Notably, the order of the datasets concords with the number of genomes from each phylum. Although the modelled metagenomes do not reflect the true relative abundance of these bacteria, they allow us to assess whether kPAL can resolve the level of similarity between a series of modelled metagenomes. Distances between k-mer profiles generated on the 16S rDNA also confirm the relative similarity of datasets with a slightly smoother transition. This is mainly due to the limited amount of genomic information that is available in 16S rDNA and different rate of evolution compared to the entire genome.
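The construction of the 31 datasets can be sketched in a few lines of Python. The order in which genomes are substituted is given in Additional file 1: Table S2 and is not reproduced here, so this sketch simply assumes that genomes are replaced in list order; the function name substitution_series and the placeholder genome labels are hypothetical.

```python
def substitution_series(first, second):
    """Return len(first) + 1 datasets shifting from first to second.

    Dataset n contains the first n members of second and the remaining
    members of first, so consecutive datasets differ by one genome.
    """
    if len(first) != len(second):
        raise ValueError("both phyla must contribute the same number of genomes")
    return [second[:n] + first[n:] for n in range(len(first) + 1)]


firmicutes = [f"F{i:02d}" for i in range(30)]      # placeholder genome labels
proteobacteria = [f"P{i:02d}" for i in range(30)]  # placeholder genome labels
datasets = substitution_series(firmicutes, proteobacteria)
print(len(datasets))
```

Each dataset always holds 30 genomes, which keeps the total amount of sequence roughly comparable across the series.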
We used the previously published data of Caporaso et al. to evaluate further the performance of kPAL in resolving microbiomes. The gut and right-palm microbiomes of a male individual and a female individual were sequenced over a period of 6 months. For this analysis, we only included samples that were collected on the same day from both individuals (122 gut microbiomes and 128 right-palm microbiomes). Furthermore, we excluded 14 samples that were classified as mislabeled using a random forest classifier, as described by Caporaso et al. Pairwise distances were calculated for samples from each body part using kPAL (using 10-mer profiles) and UniFrac, which relies on the characterization of operational taxonomic units and inferred phylogeny. UniFrac parameters were set to those specified in the original paper. The agreement between the expected clusters (based on the origin of samples) and those obtained from the distance matrices was estimated using the weighted kappa index (Kw). PCA of the k-mer distance matrices from gut (Figure 6C) and right-palm (Figure 6D) microbiomes revealed that samples from each individual could be separated using the kPAL approach (Kw = 0.95 and 0.82, respectively). In addition, PC2 and PC3 indicate that temporal changes in the microbiomes of each individual influence the relative distances between datasets. We also noticed that datasets from the first 12 days of right-palm microbiomes from the male individual cluster with female samples. This may be caused by contamination or sample swapping. Gut microbiomes could also be resolved using UniFrac (Figure 6E), with Kw = 0.94. Concordant with the kPAL results, PC2 and PC3 jointly order samples based on the sampling day. However, UniFrac failed to differentiate right-palm microbiomes based on their origin (Kw = 0.47) with no apparent pattern corresponding to the day on which samples were collected (Figure 6F).
The continued decrease in sequencing costs and technological development have overtaken our ability to assess the quality of data and the complexity of sequencing libraries robustly. For instance, many QC steps that are essential for accurate downstream analysis of NGS data are often neglected in the absence of a reliable reference genome. In addition, NGS data are always subjected to some degree of technical and run-to-run variation, which can hamper the interpretation of the genetic makeup of the sequenced sample. As shown here, variations introduced during library preparation can have a significant influence on the complexity and quality of the sequencing data.
So far, k-mer profiles have been used in a wide range of applications, such as the identification of regulatory elements, error correction of sequencing reads, identification of point mutations, whole genome assembly, searches for homologous regions and phylogenetic analysis. A number of k-mer analysis tools are capable of efficiently generating k-mer profiles (such as Jellyfish and khmer), and the recent work of Simpson proposes a novel method to estimate the repeat content, genome size, heterozygosity of the sequenced genome, insert size distribution and estimated level of erroneous reads in sequencing data using a k-mer approach. Although SGA provides valuable information on the genetic makeup and quality of sequencing data, it cannot reliably identify outliers from a series of NGS data or provide information on potential sources of variation. Thus, in the absence of a well-characterized reference sequence, there is an urgent need for tools that can characterize potential biases such as sample swapping, library chimeras, high duplication rates and potential contamination. In this work, we introduce a new strategy for determining the quality and complexity of a variety of different NGS datasets without any prior information about the reference sequence. The kPAL package consists of a variety of tools to generate k-mer frequencies and enables pairwise comparisons. kPAL measures the level of similarity between multiple NGS datasets, based on the genomic information that is shared between them. We show that kPAL outperforms pre-alignment QC tools (such as FastQC) in reliably exposing samples that suffer from poor capture performance, contamination, enrichment of library chimeras or other types of artefact. Even though the last step in assessing data quality by FastQC involves the analysis of overrepresented 5-mers, FastQC fails to identify problematic samples due to the low k-mer size and the way k-mer profiles are processed.
In contrast, tools that rely on aligned reads (such as RNA-SeQC and the Picard toolkit) can expose the majority of these technical artefacts, though some of them still require a thorough and rigorous assessment to be identified. The Preqc feature of SGA performs well on WGS data and can precisely estimate insert size distribution and expose erroneous reads. However, the performance of SGA on other types of NGS data, such as WES and RNA-Seq, is less reliable since it was originally developed for pre-processing, error correction and de novo assembly of whole genome sequences. The lack of a pairwise comparison and of any accommodation for differences in library size limits the application of SGA in quality assessment and in measuring the level of dissimilarity between k-mer profiles of sequenced samples. The unique feature of kPAL is its ability to account for biases introduced by differences in sequencing depth between samples in order to expose outliers and problematic samples; like SGA, it does not rely on prior information. Potential applications of this strategy are to determine the quality of sequencing data, estimate the sequencing depth required for de novo assembly projects and identify sequencing reads that represent the uncharacterized regions of the genome of a given species.
Most microbiome studies have focused on phylogenetically informative markers such as 16S rDNA to reveal the relative composition and diversity of the metagenome in question. Despite the efficiency of such approaches, amplicon-based studies lack the ability to provide a genome-wide characterization of microbiomes. Moreover, sequencing errors and the presence of library chimeras can hamper the analysis of microbiomes using conventional tools, as only a handful of reads may be produced from any given fragment. This results in unreliable operational taxonomic units, which are often used in microbiome studies. The advantage of our approach is that it can potentially discriminate between different species of a common phylum by relying on sequence content beyond the resolution of 16S rDNA sequences. We show that the similarity of microbiomes based on their composition and diversity can be revealed using kPAL, which relies on the sequencing data alone. In contrast, although UniFrac could reliably resolve rather stable gut microbiomes, it struggled with resolving highly diverse and dynamic microbiomes, such as those obtained from skin (i.e., the palm). We show that kPAL is sensitive to temporal changes in microbiomes and can potentially be used for a wide range of applications, such as forensic DNA fingerprinting. It is important to note that further developments are required for reliable assessment of temporal changes in a microbial community using the kPAL approach. Although kPAL does not provide a biological reason for the sources of variation within and between datasets, it opens the way to a more accurate and unbiased determination of the quality and complexity of genomic sequences.
Materials and methods
kPAL is a Python-based toolkit and programming library that provides various tools, many of which are used in this study. kPAL is an open-source package and can be downloaded [36-38]. kPAL can also be installed (including all prerequisites) through the command line using: pip install kPAL. Detailed documentation and tutorials are available [39]. For a detailed description of the kPAL methodology, refer to Additional file 1: Notes. The performance of kPAL, in terms of speed and memory usage, for generating and pairwise comparison of k-mer profiles is provided in Additional file 1: Figure S18.
Creating k-mer profiles
The k-mer profiles were generated using the index function built into kPAL. For all analyses k was set to 12 except when otherwise stated. To accommodate the analysis of both sequencing reads and genome reference sequences, we have chosen to use the FASTA format as an input to kPAL. However, we provide a command-line tool to convert FASTQ files to the appropriate format [40]. For paired-end data, the profiles for both reads were merged into a single k-mer profile using the kPAL merge function. For more information on performance, runtime and memory usage, see Additional file 1: Notes.
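Merging the profiles of the two reads of a pair can be pictured as adding the counts of each k-mer. The sketch below assumes exactly that (element-wise addition) and does not call kPAL; the function name merge_profiles and the toy counts are hypothetical.

```python
from collections import Counter


def merge_profiles(profile_a, profile_b):
    """Merge two k-mer profiles by adding the counts of each k-mer."""
    merged = Counter(profile_a)
    merged.update(profile_b)  # Counter.update adds counts rather than replacing them
    return merged


read1_profile = {"AC": 2, "GT": 1}
read2_profile = {"AC": 1, "TT": 4}
pair_profile = merge_profiles(read1_profile, read2_profile)
print(pair_profile)
```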
Measuring pairwise distances
For further information about the procedure, refer to Additional file 1: Notes.
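Why library size has to be taken into account before profiles are compared can be shown with a small sketch. The scaling rule (rescale the smaller profile so that both totals match) and the Euclidean distance used here are assumptions chosen for illustration; the functions and distance measure actually used by kPAL are those described in Additional file 1: Notes. The names scale_to_match and euclidean are hypothetical.

```python
import math


def scale_to_match(left, right):
    """Rescale the profile with the smaller total so that both totals are equal."""
    total_left, total_right = sum(left), sum(right)
    if total_left == 0 or total_right == 0:
        raise ValueError("cannot scale an empty profile")
    if total_left < total_right:
        factor = total_right / total_left
        return [count * factor for count in left], list(right)
    factor = total_left / total_right
    return list(left), [count * factor for count in right]


def euclidean(left, right):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(left, right)))


# Same k-mer composition, sequenced at a tenfold different depth.
shallow = [1, 2, 3, 4]
deep = [10, 20, 30, 40]
raw_distance = euclidean(shallow, deep)
scaled_distance = euclidean(*scale_to_match(shallow, deep))
print(raw_distance, scaled_distance)
```

Without scaling, the two profiles appear far apart purely because of depth; after scaling, the distance reflects composition only.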
Calculating the k-mer balance
For all samples in this study, the balance between the frequencies of k-mers and their reverse complements was determined using the showbalance function in kPAL (see Additional file 1: Notes). For all paired-end datasets, k-mer profiles were first merged and then assessed for their balance.
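The balance between k-mers and their reverse complements can be illustrated as follows. The score computed here (summed absolute count differences over k-mer and reverse-complement pairs, divided by the total count) is a hypothetical measure chosen for this sketch and is not necessarily the one computed by showbalance; the function names are likewise assumptions.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of an A/C/G/T string."""
    return kmer.translate(COMPLEMENT)[::-1]


def strand_balance(counts):
    """Return 0.0 for a perfectly balanced profile, larger values otherwise."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty profile")
    seen = set()
    difference = 0
    for kmer in counts:
        partner = reverse_complement(kmer)
        pair = min(kmer, partner)  # one canonical key per pair
        if pair in seen:
            continue
        seen.add(pair)
        # .get avoids inserting missing k-mers while iterating
        difference += abs(counts.get(kmer, 0) - counts.get(partner, 0))
    return difference / total


balanced = Counter({"AC": 3, "GT": 3, "AT": 2})
skewed = Counter({"AC": 3, "GT": 1})
print(strand_balance(balanced), strand_balance(skewed))
```

For double-stranded DNA sequenced without strand preference, a k-mer and its reverse complement are expected at similar frequencies, so a large score hints at a strand bias in the library.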
The distance matrices produced by the pairwise comparison of all samples were used to perform a hierarchical clustering and PCA in R and MATLAB, respectively. The mRNA analysis pipeline, QC and exon quantification procedure are described elsewhere. For the microbiomes, the hierarchical clustering was done using the distance matrices provided by the k-mer profile or UniFrac analyses. Subsequently, the accuracy of the clustering arrangement was assessed based on the silhouette and weighted kappa measures.
Library preparation and sequencing
For WGS datasets, two separate library preparation protocols were used. The gDNA libraries for whole genome sequencing were prepared using the reagents from a TruSeq DNA Sample Prep Kit according to the manufacturer’s instructions (TruSeq DNA Sample Preparation Guide, revision C; Illumina Inc., San Diego, CA) with minor modifications. After the ligation, the first protocol uses a gel-free method for samples instead of the gel step that was used for the second protocol. Furthermore, the number of PCR cycles in the PCR enrichment step differs between the two protocols (five and ten cycles, respectively). A High Sensitivity DNA chip (Agilent Technologies 2100; Santa Clara, CA) was used for quantification and samples were subsequently sequenced on an Illumina HiSeq 2000 sequencer at the same laboratory.
Libraries for the WES samples were prepared using the Agilent SureSelect Kit (Agilent Technologies, Santa Clara, CA), Nimblegen Capture Kit V2 or Nimblegen Capture Kit V3 (Roche NimbleGen Inc., Madison, WI), according to the manufacturer’s instructions. A High Sensitivity DNA chip (Agilent Technologies 2100) was used for the quantification and the samples were subsequently sequenced on an Illumina HiSeq 2000 sequencer at the same laboratory.
FastQC was run for all samples prior to analysis to assess the quality of the data. However, none of the sequencing data was removed from the analysis as they all passed the FastQC quality measures. Reads were trimmed for low quality bases (Q < 20) using sickle [44] and cleaned up for adapters.
Alignment to the human reference genome was performed for WGS and WES using Stampy, BWA and Bowtie 2 with default parameters. For the WES samples, the number of on-target reads was calculated using BEDTools intersect, the BAM files and a BED track consisting of all targets according to the manufacturer’s guidelines. Reads with no overlapping base were considered as off target. Basic alignment statistics (such as alignment rate, the fraction of properly paired reads, etc.) were extracted using SAMtools flagstat. For WGS samples, the insert sizes were estimated using the Picard toolkit [50]. The number of base pairs that were soft clipped during the alignment was extracted from the SAM files using a custom script.
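The on-target rule stated above (a read is off target when it shares no base with any target) can be expressed as a simple interval test. The sketch below is an illustration of that rule only and is not how BEDTools is implemented; it assumes 0-based, half-open coordinates as in the BED format, and the function names and toy coordinates are hypothetical.

```python
def overlaps(read, target):
    """True if two (chrom, start, end) intervals share at least one base.

    Coordinates are 0-based and half-open, as in the BED format.
    """
    return read[0] == target[0] and read[1] < target[2] and target[1] < read[2]


def on_target_fraction(reads, targets):
    """Fraction of reads overlapping at least one target by one base or more."""
    if not reads:
        raise ValueError("no reads given")
    on_target = sum(any(overlaps(read, target) for target in targets) for read in reads)
    return on_target / len(reads)


targets = [("chr1", 100, 200), ("chr2", 50, 80)]
reads = [
    ("chr1", 90, 101),   # overlaps the first target by one base
    ("chr1", 200, 250),  # starts exactly where the target ends: off target
    ("chr2", 60, 70),    # fully inside the second target
    ("chr3", 0, 100),    # chromosome without targets
]
fraction = on_target_fraction(reads, targets)
print(fraction)
```

The linear scan over all targets is adequate for a toy example; dedicated tools use indexed or sorted intervals to handle real datasets.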
QC and exploration of data properties were performed using the Preqc module of the SGA software. All analyses were performed according to SGA guidelines.
For the WGS and WES data, the FASTQ and BAM files have been deposited at the European Genome-phenome Archive [51], which is hosted by the European Bioinformatics Institute, under the accession number [EGA:S00001000600]. In addition, all k-mer profiles are available under the same accession.
For the RNA-Seq data, the k-mer profiles can be found online [52]. The FASTQ files and BAM alignments as well as different types of quantification are available in ArrayExpress under accessions E-GEUV-1 (mRNA) and E-GEUV-2 (small RNA) for QC-passed samples and E-GEUV-3 for all sequenced samples [53-55].
Microbiomes were obtained from the ‘Moving Pictures of the Human Microbiome’ project [MG-RAST:4457768.3-4459735.3].
We thank Dr Jelle Goeman and Dr Erik W. van Zwet for their help, advice and input. This work was partially supported by the European Community’s Seventh Framework Program (FP7/2007-2013) GEUVADIS (grant 261123), the Center for Medical Systems Biology and the Center for Genome Diagnostics in the Netherlands.
- 3. Costello M, Pugh TJ, Fennell TJ, Stewart C, Lichtenstein L, Meldrim JC, Fostel JL, Friedrich DC, Perrin D, Dionne D, Kim S, Gabriel SB, Lander ES, Fisher S, Getz G: Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation. Nucleic Acids Res. 2013, 41: e67. 10.1093/nar/gks1443.
- 5. Kidd JM, Sampas N, Antonacci F, Graves T, Fulton R, Hayden HS, Alkan C, Malig M, Ventura M, Giannuzzi G, Kallicki J, Anderson P, Tsalenko A, Yamada NA, Tsang P, Kaul R, Wilson RK, Bruhn L, Eichler EE: Characterization of missing human genome sequences and copy-number polymorphic insertions. Nat Methods. 2010, 7: 365-371. 10.1038/nmeth.1451.
- 23. Lappalainen T, Sammeth M, Friedlander MR, 't Hoen PAC, Monlong J, Rivas MA, Gonzalez-Porta M, Kurbatova N, Griebel T, Ferreira PG, Barann M, Wieland T, Greger L, van Iterson M, Almlöf J, Ribeca P, Pulyakhina I, Esser D, Giger T, Tikhonov A: Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013, 501: 506-511. 10.1038/nature12531.
- 24. 't Hoen PA, Friedlander MR, Almlof J, Sammeth M, Pulyakhina I, Anvar SY, Laros JF, Buermans HP, Karlberg O, Brannvall M, GEUVADIS Consortium, den Dunnen JT, van Ommen GJ, Gut IG, Guigó R, Estivill X, Syvänen AC, Dermitzakis ET, Lappalainen T: Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nat Biotechnol. 2013, 31: 1015-1022. 10.1038/nbt.2702.
- 26. FastQC: a quality control tool for high-throughput sequence data [http://www.bioinformatics.babraham.ac.uk/projects/fastqc/]
- 30. Nordstrom KJ, Albani MC, James GV, Gutjahr C, Hartwig B, Turck F, Paszkowski U, Coupland G, Schneeberger K: Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers. Nat Biotechnol. 2013, 31: 325-330. 10.1038/nbt.2515.
- 33. Brown CT, Crusoe MR, Edvenson G, Fish J, Howe A, McDonald E, Nahum J, Nanlohy K, Ortiz-Zuazaga H, Pell J, Simpson J, Scott C, Srinivasan RR, Zhang Q, Brown CT: The khmer software package: enabling efficient sequence analysis. Figshare. 2014, 14: 26-
- 36. k-mer Profile Analysis Library at GitHub repository [https://github.com/LUMC/kPAL]
- 37. k-mer Profile Analysis Library at LUMC repository [http://www.lgtc.nl/kPAL]
- 38. k-mer Profile Analysis Library at official Python repository for open-source packages [https://pypi.python.org/pypi/kPAL]
- 39. Online documentation for k-mer Profile Analysis Library [http://kPAL.readthedocs.org]
- 40. FASTA/FASTQ processing and manipulation toolkit at official Python repository for open-source packages [http://pypi.python.org/pypi/fastools]
- 44. Sickle: a windowed adaptive trimming tool for FASTQ files using quality [https://github.com/najoshi/sickle]
- 50. Picard: a set of tools for working with next-generation sequencing data in the BAM format [http://picard.sourceforge.net]
- 51. European Genome-phenome Archive [http://www.ebi.ac.uk/ega/]
- 52. k-mer profiles for RNA-Seq data [http://www.ebi.ac.uk/arrayexpress/experiments/E-GEUV-3/files/profiles/?ref=E-GEUV-3]
- 53. ArrayExpress Accession E-GEUV-1 (mRNA) [http://www.ebi.ac.uk/arrayexpress/experiments/E-GEUV-1/]
- 54. ArrayExpress Accession E-GEUV-2 (small RNA) [http://www.ebi.ac.uk/arrayexpress/experiments/E-GEUV-2/]
- 55. ArrayExpress Accession E-GEUV-3 [http://www.ebi.ac.uk/arrayexpress/experiments/E-GEUV-3/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.