SeqDoC: rapid SNP and mutation detection by direct comparison of DNA sequence chromatograms
- 8.2k Downloads
This paper describes SeqDoC, a simple, web-based tool to carry out direct comparison of ABI sequence chromatograms. This allows the rapid identification of single nucleotide polymorphisms (SNPs) and point mutations without the need to install or learn more complicated analysis software.
SeqDoC produces a subtracted trace showing differences between a reference and test chromatogram, and is optimised to emphasise those characteristic of single base changes. It automatically aligns sequences, and produces straightforward graphical output. The use of direct comparison of the sequence chromatograms means that artefacts introduced by automatic base-calling software are avoided. Homozygous and heterozygous substitutions and insertion/deletion events are all readily identified. SeqDoC successfully highlights nucleotide changes missed by the Staden package 'tracediff' program.
SeqDoC is ideal for small-scale SNP identification, for identification of changes in random mutagenesis screens, and for verification of PCR amplification fidelity. Differences are highlighted, not interpreted, allowing the investigator to make the ultimate decision on the nature of the change.
KeywordsTest Sequence Difference Profile Sequence Trace Single Base Change Sequence Chromatogram
The ability to identify single nucleotide changes in DNA is a fundamental requirement in many fields of biological research. The identification and characterisation of naturally-occurring single nucleotide polymorphisms (SNPs) underlies a vast body of work on genetically-linked disorders, diagnosis and risk prediction [1, 2, 3, 4] as well as being important in genomic mapping and population genetics [5, 6, 7, 8]. Identification of point mutations is of equal importance to many researchers, for roles as diverse as identifying specific alterations caused by random mutagenesis screens [9, 10] to validation of the fidelity of sequences amplified by PCR.
For labs studying SNPs or point mutations, identification of these can be a time-consuming and error-prone process, particularly if novel changes are being investigated. In some cases, software such as the Staden package [11, 12] or Sequencher  may provide a suitable solution. However these are sophisticated and multifunctional programs, and can prove overly complex for simple sequence comparisons. Consequently, many small-scale projects may rely solely on manual analysis, for example simply carrying out a direct text comparison of the processed sequence to a known reference.
This manual approach is affected by variations in sequence quality and incorrect base calling, and may also miss heterozygous bases if, for example, the wild-type peak is higher that the additional peak. To address these issues and to provide a simple and efficient way to accurately identify sequence changes, we have developed a web-based application which compares DNA sequence chromatograms directly. SeqDoC (Sequence Difference of Chromatograms) is freely accessible, very easy to use, and highlights differences characteristic of single base changes, including heterozygous SNPs and insertions and deletions.
Read in chromatograms
Two ABI sequence chromatograms, one a reference and the other the test sequence, are the only user-supplied data. These are uploaded through a web form and the sequence traces and other relevant data extracted using the Perl ABI.pm module . The sequence traces for each channel (i.e. A, C, G and T) are stored as individual arrays within the chromatogram object. Blank sequence at the beginning and end of each chromatogram (resulting for example from sequencing of a PCR product, when the trace continues past the end of the template) are removed by deleting the terminal values from the traces where all channel values are less than 50. In tests, we established that a filter value of as low as five resulted in the removal most blank terminal sequence, while a value of as high as 500 still retained virtually all genuine sequence; we therefore used 50 as an appropriate intermediate value.
Comparison of the test to the reference sequence is performed by subtraction of trace values, so it is necessary to normalize the trace values so that a sequence run with strong signal can be meaningfully compared to one with weaker signal. Normalisation is performed for each channel individually, and scales each data point so that the local mean value for that channel is 100 (local being defined as 500 data points prior to the point being scaled, the point itself and 499 points after). The mean value of those local points for the channel is calculated and divided by 100 to give a scaling factor, and the point being normalized is then scaled by being divided by this factor.
Special exceptions are made for the initial 500 and final 499 values of the trace, where there are either not 500 values before, or not 499 after, the point being normalized. For these two cases, the mean is based on as many points as are present between the point and the end of the sequence on one side, while still using 500 on the other side.
Due to variations arising both from the sample and the sequencing matrix, the start position of the sequence traces and the base separation within the traces may differ between the test and the reference sample. The software compensates for this by automatically calculating the optimal start alignment combined with continual fine adjustment throughout the length of the sequences to maintain alignment of the test and reference sample.
To identify start point offsets, the software tests a range of initial alignments, from -200 to +200 data points, corresponding to approximately +/- 20 bases of sequence. The offset which results in the best alignment (the smallest total value of the absolute differences between the test and reference traces for all four channels) is used for subsequent analysis. This is implemented by the addition or deletion of 'spacer' data points at the beginning of the test sequence.
Ongoing fine adjustment of the sequences is achieved by the addition or removal of individual data points from the test sequence as required. The sequences are sampled every five data points, and difference scores calculated for the subsequent 30 data points. If that difference score is reduced by the insertion or deletion of a single data point, then the test traces are adjusted accordingly (by either duplication or removal of the data point at the test position).
Following normalisation and alignment of the sequences, a 'difference profile' is calculated. The trace values of the test sequence are subtracted from the equivalent values for the reference sequence for all four channels, and the resulting values are passed through an algorithm which highlights changes characteristic of base substitutions, while reducing random noise. This is achieved by squaring the difference value and multiplying the result by the square root of the sum of the differences of other channels which change in the opposite direction.
The overall outcome of this process is to enhance the signal given by large differences with at least one channel changing in the opposite direction, which is characteristic of base substitutions, while minimising the signal from small unidirectional changes (typical of signal noise). Difference profiles are calculated for all four channels and the results overlaid in the final output.
Generate trace and difference images
User output is provided in the form of three aligned images: aligned sequence chromatograms for the reference and test sequences, and a similarly aligned difference profile. These outputs are based on the normalized, aligned traces generated during earlier stages of the analysis. The difference trace is typically primarily flat, with a large bidirectional peak at the point of any base changes between the sequence. Other patterns are possible, depending on specific features of the test and reference sequences, and are discussed in more detail below. The three images are generated by the Perl GD::Graph module, and are returned to the user as a web page. Identification of the site of base substitutions is simply a matter of examining the difference trace for the bidirectional peaks mentioned above; by examining the aligned test and reference sequences at the point of these peaks, single base changes between the two sequences can be rapidly and simply identified.
The Staden 1.5.3 Windows installer was downloaded from SourceForge and installed on a PC running Windows XP Pro. Tracediff comparison was performed through Pregap4 using the following modules in order: General Configuration, Initialise Experiment Files, Reference Traces & Sequences, Trace Difference. All user-definable parameters were left at their default values (except that 'Write trace differences out to disk' was selected). We used Gap4 to both align and view the initial and tracediff-generated traces as well as to carry out trace subtraction directly. The reference and test sequences (and difference trace where appropriate) were assembled into a contig, which was opened in the Trace display window. A subtracted trace was generated using the Diff button.
Results and discussion
All scaling factors, cutoff filters and algorithms described in the methods section were chosen after testing of multiple settings as giving the clearest identification of single base changes and best retention of genuine data while minimising the signal resulting from noise. The process was initially optimised using sequences covering known polymorphisms in different regions of the human melanocortin 1 receptor gene . In all cases the polymorphisms were clearly visible in the difference trace. The software has since been successfully used to test for single base changes in several hundred sequence chromatograms.
The automatic start alignment process means that it should not be necessary to use the same primer to sequence the reference and test sequences, providing that the two different primers produce sequences which start within approximately 20 nucleotides of each other. Alternatively, it would be possible to manually edit the raw chromatograms to bring the start points of the two sequences into approximate alignment. We do not believe that the comparison will work for samples sequenced in opposite directions, using the reverse complement of one or the other. Sequencing chemistry is such that peak heights are typically affected by the local sequence composition [16, 17], and while this is consistent for samples sequenced in the same direction from different primers, it is not true for those sequenced in opposite direction.
Because of the normalisation and noise reduction algorithms built into the software, it is relatively resilient to poor quality sequence. Typically, if the sequence quality is sufficient for an investigator to unambiguously identify the base call, it is good enough for automatic identification of sequence differences. Most problems with sequence quality only occur at the end of the sequence run, although unincorporated dye terminators may cause 'dye blobs' at the beginning of the sequence, which can partially mask base changes occurring at the same site. Examples of the output produced in these cases, along with full instructions on the use of SeqDoC, are provided on our website at http://research.imb.uq.edu.au/seqdoc/.
The Staden software package  is an established, well-supported and widely used sequence analysis package, and has functions (such as 'tracediff') for direct comparison of chromatogram traces analogous to those provided by SeqDoC. It can also display trace subtractions through the Gap4 program. We have therefore evaluated the performance of SeqDoC using Staden as a benchmark. Although the principle advantage of SeqDoC over Staden is that it is specifically designed and optimized for the single purpose of chromatogram comparison, and therefore provides a much simpler user interface, we also believe that it offers advantages in normalisation, alignment and, particularly, sensitivity. On the other hand, Staden can allow much higher throughput, since multiple sequences can be analysed at once; data can also be saved and analysed in detail with multiple additional functions.
In summary, SeqDoC proves a lightweight but effective substitute to Staden for sequence trace comparisons. While Staden is a more appropriate choice for applications where high throughput is the main priority, SeqDoC provides a better solution when sensitivity, specificity or simplicity are more important considerations.
SeqDoC is a very easy to use web-based application which rapidly highlights differences between ABI sequence chromatograms, including substitution and insertion/deletion events. It uses chromatograms directly, rather than extracted text-based sequence data, so eliminating errors introduced by base calling software and allowing identification of heterozygous substitutions which might otherwise be missed. No user intervention or adjustment is required for processing, with all normalisation, alignment and noise reduction being carried out automatically; on the other hand the ultimate decision on the specific change identified remains with the investigator. SeqDoC is free and requires no training to use, and is ideally suited for use by researchers carrying out small scale SNP analysis or mutagenesis experiments. It can also be used to rapidly screen PCR-amplified products for introduced mutations.
Availability and requirements
Program name: SeqDoC
Project home page: http://research.imb.uq.edu.au/seqdoc/
Operating system(s): Platform independent
Programming language: Perl CGI
Other requirements: Requires Perl CGI, GD::Graph and ABI modules
License: None for web access, GNU GPL for source code
Any restrictions to use by non-academics: No restrictions
This work was supported by the Australian Research Council Special Research Centre for Functional and Applied Genomics. We thank Dr R.A. Sturm for his initial conceptual suggestions, W. Chen for providing test data and Dr S.M. Grimmond for reviewing this article.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.