1 Introduction

In the original serial analysis of gene expression (SAGE) protocol, ditags were generated by ligating two blunt-ended 14-bp tags (1). To avoid bias due to cloning and PCR artifacts, duplicate ditags were discarded during the extraction process, a procedure that was retained when the LongSAGE protocol was introduced (2). By deriving the basic probability expressions for SAGE and LongSAGE, it is possible to verify whether discarding duplicate ditags has a significant effect on the resulting tag counts. Because the original SAGE protocol generates blunt-ended tags, the probability of two tags combining to form a ditag is independent of their sequence: \(P(AB) = P(A) \cdot P(B)\), where the probability of each tag is simply its frequency. For two tags with equal probabilities of 0.02, the ditag probability would be 0.0004. If 25,000 ditags were sampled to form a 50,000-tag library, the tag counts would each be 1000 and the expected number of duplicate ditags would be 10, i.e., only 1 % of the tag counts. Discarding duplicate ditags, in this case, had little effect on the overall tag counts, and was thus justified by the risk that some may stem from experimental artifacts. In LongSAGE, however, the MmeI enzyme generates a 21- to 22-bp long tag, which has a 2-bp overhang at the 3\(^\prime\) end. This overhang ensures that only tags complementary to each other at the 3\(^\prime\) end can ligate together. This changes the basic probability of each ditag AB being chosen to \(P(AB) = P(A) \cdot P(B) \cdot 16\), i.e., the probability of B is now dependent on A if A is chosen first and a uniform distribution of compatible overhangs is assumed. Going back to the first example of a typical 50,000-tag SAGE study, two tags of tag count 1000 with a compatible overhang would, on average, give rise to 160 duplicate ditags, a nonnegligible 16 % fraction of the total tag count. In reality, the distribution of the 3\(^\prime\) overhangs is not uniform.
Figure 1 shows the distribution of overhang dinucleotides for a human pancreas SAGE library (Pa1b), which varies considerably from the uniform 1/16 distribution (\(=\)0.0625) (3). Using these data, up to a threefold difference in tag counts is observed, depending on inclusion or exclusion of duplicate ditags.
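The arithmetic behind the two cases can be checked with a short calculation. The following is a minimal Python sketch, not part of the longsage_bias.pl script; the function name `expected_duplicates` and the `overhang_factor` parameter are introduced here for illustration only, and a uniform distribution of overhangs is assumed for the LongSAGE case.

```python
def expected_duplicates(p_a, p_b, n_ditags, overhang_factor=1.0):
    """Expected number of A-B ditags among n_ditags sampled ditags.

    overhang_factor is 1 for blunt-ended SAGE tags and 16 for LongSAGE
    tags under the (idealized) assumption of uniform 2-bp overhangs.
    """
    return n_ditags * p_a * p_b * overhang_factor


n_tags = 50_000          # library size in tags
n_ditags = n_tags // 2   # two tags per ditag
p = 0.02                 # frequency of each of the two tags
tag_count = p * n_tags   # count of each tag in the library

sage = expected_duplicates(p, p, n_ditags)                          # blunt ends
longsage = expected_duplicates(p, p, n_ditags, overhang_factor=16)  # 2-bp overhang

print(f"SAGE:     {sage:.0f} ditags ({sage / tag_count:.0%} of tag count)")
print(f"LongSAGE: {longsage:.0f} ditags ({longsage / tag_count:.0%} of tag count)")
```

Because real overhang frequencies deviate from 1/16, the factor of 16 should be read as an average; the per-overhang calculation is given later in this section.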

Fig. 1.

Tags observed with and without inclusion of duplicate ditags from the LongSAGE study of pancreatic acinar cells (3) show a linear relationship of tag counts.

A plot of the relationship between tag counts generated by discarding or including duplicate ditags reveals no general nonrandom bias from including duplicate ditags (see Fig. 2). On the contrary, the inclusion of duplicates increases tag counts in proportion to abundance, as should be expected from the probability equations.

Fig. 2.

The distribution of observed compatible overhangs from ditags of lengths 40 bp and 42 bp.

A major complication of the analysis is the variability of the MmeI enzyme in the size of the restricted DNA fragments. The enzyme digests either 20 or 19 nucleotides downstream of its recognition sequence, generating two different overlaps, and consequently gives rise to 40-, 41-, or 42-bp ditags in different ratios (Table 1). Therefore, the ditag probability must be calculated for each individual fragment. Taking these considerations into account, the ditag probability becomes \(P(A^{\prime}B^{\prime}) = \frac{T_{\rm A^{\prime}}}{T_{\rm total}} \cdot \frac{T_{\rm B^{\prime}}}{T_{\rm PPT}}\), where \(T_{\rm A^{\prime}}\) is the total tag count of A\(^\prime\), \(T_{\rm B^{\prime}}\) is the total tag count of B\(^\prime\), \(T_{\rm total}\) is the total library tag count, and \(T_{\rm PPT}\) is the total number of possible partner tags for the overhang between A\(^\prime\) and B\(^\prime\), with A\(^\prime\) and B\(^\prime\) each being one of the two possible length representations of the tags. The expected occurrence of each ditag in a library of \(D_{\rm total}\) ditags becomes \(D_{\rm AB} = D_{\rm total} \cdot \frac{T_{\rm A}}{T_{\rm total}} \cdot \frac{T_{\rm B}}{T_{\rm PPT}}\).
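The expected-count expression can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code of longsage_bias.pl: the function name `expected_ditag_count` and all numbers below are hypothetical. With uniform overhangs, the number of possible partner tags is taken to be one sixteenth of the library, which recovers the factor of 16 used in the simplified calculation.

```python
def expected_ditag_count(d_total, t_a, t_b, t_total, t_ppt):
    """D_AB = D_total * (T_A / T_total) * (T_B / T_PPT).

    d_total: number of ditags in the library
    t_a, t_b: tag counts of the two tags
    t_total: total library tag count
    t_ppt:   total possible partner tags for the overhang joining A and B
    """
    if t_total <= 0 or t_ppt <= 0:
        raise ValueError("t_total and t_ppt must be positive")
    return d_total * (t_a / t_total) * (t_b / t_ppt)


# Hypothetical library: 25,000 ditags, 50,000 tags, two tags counted 1000 times.
uniform_ppt = 50_000 / 16  # idealized case: every overhang equally frequent
d_uniform = expected_ditag_count(25_000, 1000, 1000, 50_000, uniform_ppt)

# Hypothetical non-uniform case: the overhang is over-represented, so more
# partner tags compete for A and the expected A-B count drops.
d_common = expected_ditag_count(25_000, 1000, 1000, 50_000, 6_000)

print(f"uniform overhangs: {d_uniform:.1f}")
print(f"common overhang:   {d_common:.1f}")
```

The design point is that \(T_{\rm PPT}\) replaces the constant 16: a rare overhang has few possible partners and therefore a higher expected count for each compatible pair, and a common overhang has the opposite effect.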

Table 1 The Distribution of Ditag Lengths in a LongSAGE Study of Pancreas Acinar Cells (3)

The last expression provides the theoretical basis for verifying the observed duplicate ditag counts by correlating them with the predicted ditag counts when duplicate ditags are included in the tag extraction. Tags with large deviations from the predicted counts can be verified manually and possible contaminants removed from further analysis. For SAGE libraries generated from amplified mRNA, this may be a useful quality check of the amplification linearity (3,4). The approach is implemented in the longsage_bias algorithm, a Perl program described in the Methods section of this chapter. The algorithm is intended for correlating duplicate ditag counts and identifying suspicious ditags that merit closer inspection, not for correcting tag counts. An example is shown in Fig. 3, where the arrows indicate identified contaminants. Removal of these tags increases the Pearson product-moment correlation from 0.61 to 0.95.
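The effect of a single contaminant on the correlation can be illustrated with a small Python sketch. The data below are invented for illustration (only the pair predicted 8, observed 86 echoes the contaminant example given in the Methods section) and will not reproduce the 0.61 and 0.95 values reported for the pancreas library; the helper `pearson` is a plain implementation of the product-moment formula, not a function of longsage_bias.pl.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical predicted and observed duplicate ditag counts.
# The last pair plays the role of a contaminant.
predicted = [2, 5, 8, 12, 20, 35, 8]
observed = [3, 4, 9, 11, 22, 33, 86]

r_all = pearson(predicted, observed)
r_clean = pearson(predicted[:-1], observed[:-1])  # contaminant removed

print(f"r with contaminant:    {r_all:.2f}")
print(f"r without contaminant: {r_clean:.2f}")
```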

Fig. 3.

Observed vs predicted ditags from the LongSAGE study of pancreatic acinar cells (3). Outliers identified as contaminants are indicated by arrows. Removal of these increases the correlation coefficient from 0.61 to 0.95.

2 Materials

  1.

    The analysis of duplicate ditags can be performed using the Perl script longsage_bias.pl. This script can be obtained from www.bio.aau.dk/en/biotechnology/software_applications or by email request to je@bio.aau.dk. Perl can be obtained from www.activestate.com for Windows and is normally installed by default on Unix machines. The module Getopt::Std is needed for the script to run and can be obtained either from ActiveState using the PPM engine or by searching CPAN.

  2.

    The input for this script is a directory of phd files generated by the Phred base caller (6).

3 Methods

To run the analysis, type longsage_bias.pl Input_Directory. This creates a new folder containing all files generated by the analysis. The results folder is named after the input directory and the date of the run.

The output from the analysis run consists of three files: a predict file, a ditag file, and a tag file. In addition, the analysis prints several statistics, referred to here as the log output. The log output can be redirected to a file using the \(>\) redirection operator.

  1.

    The predict file is a tab-separated file (see Note 1) listing all tags derived from duplicated ditags together with several variables, the most important being the counts of the two associated tags found in duplicated ditags and the predicted count for each duplicated ditag (see Note 2). By plotting the observed ditag count against the predicted ditag count for each ditag pair, it is easy to spot any deviations from the linear relationship. In general, the less abundant a tag is, the more variation is seen as a consequence of sampling. As a rule of thumb, tag pairs exhibiting more than fivefold deviations from their predicted ditag count merit further investigation. For example, in a SAGE library from pancreas mRNA, an unknown ditag predicted to be found 8 times was found 86 times, a more than 10-fold increase in abundance. Further analysis by BLAST revealed this ditag to consist of two tags derived from the Escherichia coli \(\beta\)-lactamase gene, thus a likely result of contamination (see Note 3).

  2.

    The log-file output contains several library-wide statistics from the ditag extraction and subsequent tag extraction (Table 2). The first section lists possible monotags, i.e., cases where a single anchoring enzyme site (CATG for NlaIII) was found at the beginning of the ditag sequence file, but no closing site was found within the chosen ditag size limit. The second section contains a summary of the ditag extraction: the total number of ditags found, the number of possible monotags, and the number of correctly formatted ditags rejected because of sequence quality. The third section is a list of ditags formed by the same monotag. The fourth section lists the total number of tags extracted from the ditag list, a distribution of the 16 possible dinucleotides participating in the overlap of the tags as generated by the MmeI enzyme (determined from ditags of length 40 and 42 bp), the total number of dinucleotide overlaps, and finally, a distribution of ditag lengths (Table 1). The ditag file and the tag file are tabulated outputs of DNA tag sequences and corresponding tag counts. These may be used for further analysis, e.g., mapping of the tags.

    Table 2 Output From the Predict File of the longsage_bias.pl Script
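The fivefold rule of thumb for the predict file can be applied programmatically. The Python sketch below is an illustration only: the column names `tag_a`, `tag_b`, `observed`, and `predicted` are hypothetical and do not necessarily match the actual layout of the predict file, so they would need to be adapted to the real header. The example rows are invented, apart from the pair predicted 8 and observed 86 taken from the text.

```python
import csv
import io


def flag_outliers(predict_tsv, fold=5.0):
    """Return ditag rows deviating more than `fold`-fold from the prediction.

    predict_tsv: open text file (or file-like object) in tab-separated format
    with a header line containing the hypothetical columns
    tag_a, tag_b, observed, predicted.
    """
    flagged = []
    reader = csv.DictReader(predict_tsv, delimiter="\t")
    for row in reader:
        observed = float(row["observed"])
        predicted = float(row["predicted"])
        if observed <= 0 or predicted <= 0:
            continue  # ratio undefined; such rows need manual inspection
        ratio = observed / predicted
        # Flag deviations in either direction (over- or under-represented).
        if ratio > fold or ratio < 1.0 / fold:
            flagged.append((row["tag_a"], row["tag_b"], observed, predicted, ratio))
    return flagged


# Invented example data standing in for a predict file.
example = io.StringIO(
    "tag_a\ttag_b\tobserved\tpredicted\n"
    "TAG_1\tTAG_2\t40\t35\n"
    "TAG_3\tTAG_4\t86\t8\n"
    "TAG_5\tTAG_6\t3\t5\n"
)

outliers = flag_outliers(example)
for tag_a, tag_b, observed, predicted, ratio in outliers:
    print(f"{tag_a}-{tag_b}: observed {observed:.0f}, "
          f"predicted {predicted:.0f}, ratio {ratio:.1f}")
```

Flagged pairs are candidates for manual inspection, for example by BLAST, rather than automatic removal, since low-abundance tags show large sampling variation.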

4 Notes

  1.

    Tab-separated files are easily imported into any spreadsheet program, such as Excel, for further analysis.

  2.

    For each ditag A-B, there are two entries: one for A-B and one for B-A. The values may differ slightly as a result of limitations in the data sets, but converge for larger tag counts.

  3.

    To visualize as many outliers as possible, it is advantageous to plot the data using logarithmic axes.