Statistical evaluation of methods for quantifying gene expression by autoradiography in histological sections
In situ hybridisation (ISH) combined with autoradiography is a standard method of measuring the amount of gene expression in histological sections, but the methods used to quantify gene expression in the resulting digital images vary greatly between studies and can potentially give conflicting results.
The present study examines commonly used methods for analysing ISH images and demonstrates that these methods are not optimal. Image segmentation based on thresholding can be subject to floor-effects and lead to biased results. In addition, including the area of the structure or region of interest in the calculation of gene expression can lead to a large loss of precision and can also introduce bias. Finally, converting grey level pixel intensities to optical densities or units of radioactivity is unnecessary for most applications and can lead to data with poor statistical properties. A modification of an existing method for selecting the structure or region of interest is introduced which performs better than alternative methods in terms of bias and precision.
Based on these results, suggestions are made to reduce bias, increase precision, and ultimately provide more meaningful results of gene expression data.
Keywords: Grey Level; Dentate Gyrus; Mineralocorticoid Receptor; Thresholding Method; Line Method
In situ hybridisation has been used as a standard method for quantifying gene expression in histological sections for nearly forty years. Oligonucleotide or RNA probes are labelled either with radioactive atoms such as 35S or 32P, or with non-radioactive molecules such as biotin. The location of the bound probe in the tissue is then usually visualised by autoradiography or immunohistochemistry, and captured as a digital image for quantification. Images of autoradiographic films are analysed in a semiquantitative manner: the darkness of the film is proportional to the amount of gene expression, but there is no way to map the darkness to the number of transcripts. This paper focuses on the analysis of autoradiographic films from in situ hybridisations, but the results should generalise to other autoradiographic methods such as 2-fluoro-deoxyglucose.
The methods used to quantify the amount of gene expression in the resulting images vary widely between studies and laboratories, and could potentially give conflicting results. There are two main steps in the analysis where variations in methods arise. The first is image segmentation, the process of determining what is to be included in the analysis (foreground) and what is not (background). During segmentation, setting a threshold based on pixel intensity is a common method, but the choice of cut-off value varies from 2 to 3.5 standard deviations above the mean background level [2, 3]; alternatively, the threshold may be manually adjusted. More sophisticated thresholding methods using Bayesian classification have also been developed, and there are some forty algorithms to automatically threshold images, although only a few are actually used in the neuroscience literature. Alternative methods for segmenting the image include outlining the structure by hand [7, 8, 9, 10, 11] or using a magic wand tool, typically in combination with subtracting background levels. To help identify the boundaries of the structure when outlining by hand, an image of the film and another of the stained tissue can be superimposed. Another method involves using a 'template' or standard sized selection window which is constant across all sections and animals, and which is placed over the area of highest intensity in the structure of interest. For example, Van Hoomissen and colleagues placed an 8 × 10 pixel oval over the locus coeruleus. In the same study, analysis of hippocampal subregions also involved placing ten 8 × 8 pixel ovals in random locations and taking an average. An unusual method by Gartside et al. involved placing lines of 100 pixels in length perpendicular to the dentate gyrus (DG), CA1, and CA3 subregions of the hippocampus, which sampled only a small part of the structures of interest and included many pixels in the analysis that were not part of the hippocampal subregion of interest.
The second step where variations in methods occur is in the processing of grey level (GL) values obtained from the digitised images. Once the foreground has been selected and measured, the resulting grey level values (typically 8-bit values from 0–255) are often converted into a variety of other units and expressed in various ways. This includes using one of several equations (see below) to convert the results to an optical density (OD). Sometimes GL values are multiplied by the area or the number of pixels of the structure or region of interest to give an 'integrated' value [17, 18], sometimes they are divided by the area and expressed as intensity/mm², sometimes the area is measured and reported as a separate variable, and sometimes the area is not included at all. In addition, many studies take into account the nonlinear nature of the film's response to radioactivity by using a 14C standard, and convert the GL values into units of radioactivity [7, 20, 21, 22, 23]. Many studies provide insufficient information to determine how the quantification was carried out, and analysis of the same dataset with various permutations of the above methods could potentially give different results. There is a need for standardisation of methods across studies, and a stronger theoretical underpinning for the methods used.
What is the best way to quantify gene expression in histological sections when analysing autoradiographic films? While there is likely no single method that will be superior in all cases, this paper analyses a set of images with a variety of methods in order to determine which method of segmentation produces the best results, defined in terms of greatest precision (as more precise estimates lead to greater statistical power) and least potential for bias. In addition, a simulation study was used to determine the effects of various transformations of GL values. Other factors, such as the time needed to carry out the procedure, are also considered. A modification of an existing method of segmentation is introduced which performs better than current commonly used methods.
Seventeen male Sprague-Dawley rats (Harlan, Oxon, UK) were housed individually and were eight weeks old at the start of the experiment. Ambient temperature was maintained at 21°C and humidity at 55% with ad libitum access to food and water. Animals were kept on a reversed 12-hour light/dark cycle (lights off at 10:00 AM). Animal experiments conformed to the UK Animals (Scientific Procedures) Act 1986, and procedures were carried out under appropriate Home Office (UK) project and personal licences.
In situ hybridisation
Brains were sectioned at 20 μm with a cryostat at -20°C, and every sixth section was placed onto polylysine-coated slides (Sigma, Dorset, UK). Three sections per animal were used, with the first section beginning at approximately -2.80 mm from the bregma. Sections were allowed to air dry at room temperature and were then fixed with 4% paraformaldehyde for 5 min, washed in phosphate-buffered saline (PBS) and then dehydrated in 70% and 95% ethanol for 5 min before finally storing in fresh 95% ethanol. ISH was carried out under RNase-free conditions and the mineralocorticoid receptor (MR) probe had the following sequence: 5' TTC GGA ATA GCA CCG GAA ACG CAG CTG ACG TTG ACA ATC T 3'. The probe was end-labelled with 35S and incubated at 37°C for one hour. The labelled probe was purified by centrifuging at 3000 rpm for two minutes through a G-50 sephadex micro-column (Amersham, UK).
Appropriate volumes of the labelled probe were added to hybridising buffer and the probes were evaluated for incorporation of the radiolabel by scintillation counting. Probes were hybridised overnight at 44°C and unbound probe was washed with saline sodium citrate (SSC; Sigma, UK) twice for 30 min at 55°C followed by 2 min washes with SSC, distilled water, 50%, 70% and 95% ethanol. Sections were allowed to dry at room temperature before exposure to the film.
14C-labelled standards of known radioactivity (range 30–862 nCi/g; Amersham) were placed in the X-ray cassette along with the brain sections and exposed to Kodak BioMax MR autoradiographic film (Amersham Biosciences) for six days. The film was developed with a Fuji Medical Film Processor (FPM-100A; Fuji Photo Film UK, London, UK).
The film was placed on a light box (Universal Electronics Industries, Hong Kong) and images recorded with a CCD camera (C3077; Hamamatsu) and Scion imaging software (Scion Corporation, Maryland, US). Images were 768 × 512 pixels and saved as 8-bit greyscale TIFF files. Figures were prepared with the GNU Image Manipulation Program (version 2.4.6).
Image analysis and quantification
Images were analysed with the NIH ImageJ software (version 1.37; http://rsb.info.nih.gov/ij/). The expression of MR in the dentate gyrus of the hippocampus was quantified by determining the grey levels of the pixels using four different methods. The left and right sides of the dentate gyrus were quantified separately on three sections, and the background level on each section was measured and subtracted.
The DG was outlined by hand using the polygon tool and the mean GL and the area were recorded (Fig. 1C).
A thresholding approach was used which included only those parts of the dentate gyrus that were three standard deviations above the mean background GL (Fig. 1D–F). After setting the threshold, the DG was outlined and the background cleared in order to isolate only the DG. The outline was only approximate as there was no need to distinguish between background and dentate gyrus precisely. The average GL and area were then calculated for the thresholded region. Only pixels in groups of 50 or greater were included, which eliminated stray pixels that were above the threshold but not part of the DG.
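The thresholding step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ImageJ procedure used in the study: the image is taken to be a list of rows of grey levels in which higher values mean more signal, the function names are hypothetical, and pixel groups are defined by 4-connectivity (ImageJ's own grouping rule may differ).

```python
from statistics import mean, stdev


def threshold_from_background(background_pixels, n_sd=3.0):
    """Cut-off = mean background GL plus n_sd standard deviations."""
    return mean(background_pixels) + n_sd * stdev(background_pixels)


def connected_groups(mask):
    """Yield lists of (row, col) for 4-connected groups of True pixels."""
    n_rows, n_cols = len(mask), len(mask[0])
    seen = [[False] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            if not mask[r][c] or seen[r][c]:
                continue
            group, stack = [], [(r, c)]
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                group.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < n_rows and 0 <= nx < n_cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            yield group


def thresholded_mean_and_area(image, background_pixels, n_sd=3.0, min_group=50):
    """Mean GL and pixel count of above-threshold groups of at least min_group pixels."""
    cutoff = threshold_from_background(background_pixels, n_sd)
    mask = [[gl > cutoff for gl in row] for row in image]
    kept = [p for g in connected_groups(mask) if len(g) >= min_group for p in g]
    if not kept:
        return None, 0  # nothing exceeded the threshold in a large enough group
    return mean(image[y][x] for y, x in kept), len(kept)
```

By construction, the mean returned by `thresholded_mean_and_area` can never fall below the cut-off, which is the source of the floor-effect discussed later in the paper.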
The DG was also selected using a mixture modelling approach (available as a plugin for ImageJ). This is an automated thresholding method which fits two Gaussian distributions to the pixel intensity histogram of the whole image and sets the threshold at the intersection of the distributions. This is not a commonly used method for ISH analysis but was included for comparison because it is an automated procedure, which has the advantage of being fast, objective, and reproducible, and the results corresponded well to subjective visual estimates of the anatomical boundaries of the dentate gyrus.
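The idea of fitting two Gaussians and thresholding at their intersection can be illustrated with a small expectation-maximisation (EM) sketch. This is not the ImageJ plugin's implementation; the starting values, the number of iterations, and the grid search for the crossing point are all assumptions chosen for simplicity, and the simulated pixel values are hypothetical.

```python
import math
import random


def normal_pdf(x, mu, sd):
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(values, n_iter=100):
    """EM fit of a two-component 1-D Gaussian mixture: returns (weights, means, sds)."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    mu = [lo + 0.25 * span, lo + 0.75 * span]  # assumed starting values
    sd = [span / 4.0, span / 4.0]
    w = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: probability that each pixel belongs to each component
        resp = []
        for v in values:
            p = [w[k] * normal_pdf(v, mu[k], sd[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p] if total > 0 else [0.5, 0.5])
        # M-step: update weight, mean and SD of each component
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-9:
                continue
            w[k] = nk / len(values)
            mu[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var = sum(r[k] * (v - mu[k]) ** 2 for r, v in zip(resp, values)) / nk
            sd[k] = max(math.sqrt(var), 0.5)  # floor keeps the SD from collapsing
    return w, mu, sd


def intersection_threshold(w, mu, sd, step=0.1):
    """Scan from the lower mean towards the upper mean for the crossing point."""
    lo_k, hi_k = (0, 1) if mu[0] <= mu[1] else (1, 0)
    x = mu[lo_k]
    while x < mu[hi_k]:
        upper = w[hi_k] * normal_pdf(x, mu[hi_k], sd[hi_k])
        lower = w[lo_k] * normal_pdf(x, mu[lo_k], sd[lo_k])
        if upper >= lower:
            return x
        x += step
    return mu[hi_k]


# Hypothetical image: 400 background pixels and 100 foreground pixels
rng = random.Random(0)
pixels = [rng.gauss(30, 5) for _ in range(400)] + [rng.gauss(120, 10) for _ in range(100)]
weights, means, sds = fit_two_gaussians(pixels)
print(round(intersection_threshold(weights, means, sds), 1))
```

Because the threshold depends on the whole-image histogram, it adapts to each image, but it remains a global threshold in the sense used in this paper.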
The remaining methods are simply the results obtained from methods 2–4, with the mean GL values multiplied by the area of the DG (determined by outlining or thresholding); these are referred to as the integrated grey level (IGL) values.
Converting grey levels to optical densities
Data from the first method (line method) were then converted into three different units. The first two equations converted the GL values into optical densities (Eq. 2 and Eq. 4), and the third method used a 14C standard to convert GL values into units of radioactivity (Eq. 5).
where I0 is the intensity of the incident light (on the sample), I1 is the intensity of the transmitted light (through the sample), and ℓ is the distance that the light travels through the sample. Equation 1 is used to calculate the optical density of solutions or gases, but is modified for the analysis of gene expression by ISH; for example, the thickness of the film (or the sections) is not taken into account and therefore the length term (ℓ) is removed. In addition, the value of the incident light is typically not measured, but the closest thing would be to measure the developed film in a location without any brain sections on it (this could be thought of as the 'blank'), although this seems to be rarely done. Furthermore, the transmitted light is not measured directly, but captured by a digital camera as a greyscale image.
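As a rough illustration of the modification described above, the following sketch computes an optical density with the length term dropped and a 'blank' region of the film standing in for the incident light. It is not necessarily the form of Eq. 2 or Eq. 4; it assumes that the grey level is proportional to transmitted light (brighter pixels have higher values), and the function name is hypothetical.

```python
import math


def optical_density(gl, blank_gl):
    """OD = log10(I0 / I1), with blank_gl standing in for I0 and gl for I1."""
    if gl <= 0 or blank_gl <= 0:
        raise ValueError("grey levels must be positive to take a logarithm")
    return math.log10(blank_gl / gl)


# A region transmitting one tenth of the light of the blank has an OD of 1
print(optical_density(25, 250))
```

A pixel with a grey level of zero has no defined OD under this formula, which is one practical reason to check the range of the data before converting.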
It should be noted that some studies first convert GL values to ODs and then into units of radioactivity, so they are not mutually exclusive options.
The model can be written as $y_{ijk} = \mu + R_i + S_{ij} + \varepsilon_{ijk}$, where $y_{ijk}$ is the measurement from the $k$th side (left or right) of the $j$th section of the $i$th rat, $\mu$ is the grand mean, $R_i$ is the difference of the $i$th rat from the grand mean, $S_{ij}$ is the difference of the $j$th section from the average for that rat, and $\varepsilon_{ijk}$ are the residuals. Associated with each level are the variance components: $\sigma^2_R$ is the variability of rats about the grand mean, $\sigma^2_S$ is the variability of sections nested within rats, and $\sigma^2_\varepsilon$ is the residual term, which is the variability of the left and right side within sections (plus measurement error). The total variability ($\sigma^2_{\mathrm{total}} = \sigma^2_R + \sigma^2_S + \sigma^2_\varepsilon$) is simply the sum of the three variability values, and therefore the percentage of variability at each level can be calculated. The raw data are provided [Additional File 1] along with the R code [Additional File 2].
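The partitioning of variability into rat, section, and residual components can be sketched as follows. The analysis in the paper was carried out in R (Additional File 2); the Python version below instead uses the classical method-of-moments (ANOVA) estimators for a balanced nested design, which is a swapped-in technique and may give slightly different numbers from a mixed-model fit. Negative estimates are truncated at zero, and all names are hypothetical.

```python
from statistics import mean


def variance_components(data):
    """data[i][j] is the list of measurements (e.g. left, right) for section j of rat i.

    Assumes a balanced design: every rat has the same number of sections and
    every section the same number of measurements (at least two of each).
    """
    a, b, n = len(data), len(data[0]), len(data[0][0])
    section_means = [[mean(sec) for sec in rat] for rat in data]
    rat_means = [mean(sm) for sm in section_means]
    grand = mean(rat_means)

    # Mean squares for rats, sections within rats, and residual
    ms_rat = b * n * sum((rm - grand) ** 2 for rm in rat_means) / (a - 1)
    ms_sec = n * sum(
        (sm - rm) ** 2 for sms, rm in zip(section_means, rat_means) for sm in sms
    ) / (a * (b - 1))
    ms_err = sum(
        (y - sm) ** 2
        for rat, sms in zip(data, section_means)
        for sec, sm in zip(rat, sms)
        for y in sec
    ) / (a * b * (n - 1))

    # Solve the expected-mean-square equations for the variance components
    comps = {
        "rat": max((ms_rat - ms_sec) / (b * n), 0.0),
        "section": max((ms_sec - ms_err) / n, 0.0),
        "residual": ms_err,
    }
    total = sum(comps.values())
    percent = {k: (100.0 * v / total if total > 0 else 0.0) for k, v in comps.items()}
    return comps, percent
```

For example, `variance_components(data)` called on a list of rats, each holding three sections of two measurements, returns the three components and the percentage of the total attributable to each level.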
Precision of different segmentation methods
Integrated values have the potential for bias
Including the area of the structure or region of interest as an integrated GL value has the potential to bias the results because the integrated value is a product of both the GL (i.e. amount of gene expression) and the size of the structure. Similar levels of gene expression between groups or conditions could be mistakenly judged to differ if the size of the underlying structure differs between groups; differences in the IGL would reflect differences in the area, and not in gene expression. This is a well-known problem in the stereological literature and the reason why modern stereological methods do not include the area of the structure or region of interest when determining the total number of objects (e.g. cells, synapses, etc. [35, 36, 37]). Based solely on a reported integrated GL, there is no way to know whether any significant differences are due to differential gene expression or simply differences in the size of the structure. Therefore, at best, including the area in an integrated value simply increases the variability of the data; at worst, it can bias the results, and therefore should be avoided.
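A small numerical example may make both points concrete. All values below are hypothetical and chosen only for illustration: the two groups have identical mean grey levels but different areas, and a third set of animals has varying areas within a single group.

```python
from statistics import mean, stdev


def integrated_gl(mean_gl, area):
    """Integrated grey level: mean GL multiplied by the area (in pixels)."""
    return mean_gl * area


def cv(values):
    """Coefficient of variation: standard deviation divided by the mean."""
    return stdev(values) / mean(values)


# Bias: same gene expression (GL), but the structure is 20% larger in one group
control = [(58.0, 1000), (60.0, 1000), (62.0, 1000)]  # (mean GL, area)
treated = [(58.0, 1200), (60.0, 1200), (62.0, 1200)]
igl_control = [integrated_gl(gl, area) for gl, area in control]
igl_treated = [integrated_gl(gl, area) for gl, area in treated]
print(mean(igl_treated) / mean(igl_control))  # ratio driven entirely by area

# Precision: variability in the measured area is carried into the IGL
gl_values = [58.0, 60.0, 62.0]
areas = [1100, 900, 1000]
igl_values = [integrated_gl(gl, area) for gl, area in zip(gl_values, areas)]
print(cv(gl_values), cv(igl_values))
```

In this example the IGL differs between groups by the same 20% as the area even though the grey levels are identical, and the coefficient of variation of the IGL is larger than that of the grey levels alone.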
Potential floor-effect with thresholding methods
Converting grey levels to other units
The second B-series data has the same variability as the A-series but the difference in means has been halved, such that the distributions now overlap. This also means that the total range of the data has been reduced, and this likely represents a more common arrangement of data that would be obtained from real experiments. Similar to the previous data, converting to ROD (B2) does little to alter the subjective interpretation of the results upon viewing the graph, and there is little or no change in the statistical properties of the data (B4). Converting to units of radioactivity (B3) appears to have increased the difference in the means of the two groups, but this is offset by increased variability, which leads to a decrease in statistical power (smaller t-value), and is again indicated by a shift to lower t-statistics (B4). Finally, data in panel C1 has the same difference between means as B1, but increased variability (and increased range of data). Once again, converting to ROD (C2) changes the statistical properties little, and converting to units of radioactivity has increased the variance in group B and created an outlier. Again, such data might be suitable for log-transformation to normalise the variances, or a decision has to be made whether to remove the offending point. When both groups have similar means and variances, neither transformation affects the t-statistic (not shown).
This analysis tells us, first, that if the data are in a narrow range (e.g. B-series) then the transformation to ROD is linear and the transformation to units of radioactivity fairly linear, with a small decrease in power due to increased variability. Second, if the data cover a wide range of values (either because the means of the groups are far apart or because of high variability), then the transformation to units of radioactivity will create outliers, skewed distributions, or both, thus creating problems for subsequent statistical analysis. The conclusion is that for the majority of studies it is better to analyse the GL values directly rather than convert them to other units. This also has the advantage of fewer calculations (less chance for computational errors), it is faster, and the data can be related to more intuitively; for example, possible values range from 0 to 255, and 80 is twice as high as 40. This does not mean that gene expression is twice as high, however, because autoradiography is a semiquantitative technique and GL values are not directly related to the number of mRNA transcripts. It is more difficult to intuitively compare two values on a multiplicative scale that have been nonlinearly transformed.
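The effect of linear and nonlinear transformations on a two-group comparison can be checked with a short sketch. The data are hypothetical, and the exponential curve is only a stand-in for a convex calibration curve; it is not the 14C calibration used in the paper.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent groups."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def linear_transform(gl, slope=0.004, intercept=0.1):
    """Any linear rescaling, analogous to converting Celsius to Fahrenheit."""
    return intercept + slope * gl


def convex_transform(gl, rate=0.03):
    """Hypothetical convex curve standing in for a nonlinear calibration."""
    return math.exp(rate * gl)


group_a = [40, 44, 48, 52, 56]
group_b = [70, 76, 82, 88, 94]

t_raw = welch_t(group_a, group_b)
t_linear = welch_t([linear_transform(x) for x in group_a],
                   [linear_transform(x) for x in group_b])
t_convex = welch_t([convex_transform(x) for x in group_a],
                   [convex_transform(x) for x in group_b])
print(t_raw, t_linear, t_convex)
```

With these values the linear transformation leaves the t-statistic unchanged, whereas the convex transformation stretches the higher group more than the lower one, inflating its variance and shrinking the magnitude of the t-statistic.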
A variety of methods have been used in published studies for image segmentation (manual outlining, thresholding, magic wand, use of templates, etc.) to determine what is part of the structure or region of interest and what is not. Similarly, once grey levels are measured, a variety of methods have been employed to convert them into optical densities or units of radioactivity. Based on the above results, the line method was the best way to select the structure or region of interest as it was not subject to floor-effects, had a low coefficient of variation, and low within-sample variability. This method is a modification of the outline method: the outline method requires that the actual boundaries of the structure be determined, whereas the line method samples only from the interior of the structure. Since the DG is a long narrow structure, this was best done with a line down the centre. Larger structures can follow the same principle by 'outlining' the structure but staying well inside the border so that only the interior is sampled. A drawback of trying to outline the border of the structure is that it is not always clear exactly where it lies, especially when gene expression is low relative to background levels. A second drawback is that the need for hand-eye-mouse coordination can introduce some additional variability, although this was relatively mild with the present data as both the line and outline method had similar coefficients of variation and within-sample variability. The line method is also relatively fast since structures do not have to be carefully delineated, and the only calculation involves background subtraction. Thresholding methods are common, but as was shown here, can be subject to a floor-effect, limiting their usefulness in many cases. These results may not apply to quantifying gene expression in the neocortex, where, due to its laminar structure, it is common to use transects to determine gene expression across the different layers.
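The principle of sampling only the interior of a long, narrow structure can be sketched as follows. This is a hypothetical illustration rather than the ImageJ tool used in the study: the line is given as a list of (row, column) vertices chosen by the user, pixels are visited by simple linear interpolation between vertices, and the background grey level is assumed to have been measured separately.

```python
from statistics import mean


def pixels_on_polyline(vertices):
    """Integer (row, col) pixels visited along straight segments joining the vertices."""
    path = []
    for (r0, c0), (r1, c1) in zip(vertices, vertices[1:]):
        steps = max(abs(r1 - r0), abs(c1 - c0), 1)
        for i in range(steps + 1):
            p = (round(r0 + (r1 - r0) * i / steps),
                 round(c0 + (c1 - c0) * i / steps))
            if not path or path[-1] != p:  # skip the vertex shared by two segments
                path.append(p)
    return path


def line_mean_gl(image, vertices, background_gl):
    """Mean grey level along the line, with the background level subtracted."""
    values = [image[r][c] for r, c in pixels_on_polyline(vertices)]
    return mean(values) - background_gl


# Hypothetical 5 x 10 image with a bright band along row 2
image = [[20] * 10 for _ in range(5)]
image[2] = [90] * 10
print(line_mean_gl(image, [(2, 1), (2, 5), (2, 8)], background_gl=20))
```

Because no boundary has to be located, the result does not depend on where the edge of the structure is judged to lie, only on the line staying inside it.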
Once the structure is selected, only the GL value and not the integrated GL value should be used for further analysis. Using integrated GL values will rarely be preferable for analysis of films because changes in the size of the underlying structure, and not changes in gene expression, may be affecting the results. Changes in area could then be misinterpreted as changes in gene expression. Moreover, even if areas are similar between groups or conditions, including the area increases the variability of the data, thereby decreasing precision and statistical power. The area should be reported separately (as in reference ), if at all, and the area of the actual structure obtained from the histological sections should then be measured and reported as well (alternatively, the volume of the relevant structure could also be reported). This is easily done as slides can be counter-stained with Cresyl violet after exposing the film, and the area determined. Integrated grey levels are appropriate for analysis of gels (e.g. Western blot, dot blot) because the protein, DNA, or RNA are not bound within cells and within structures as they are in vivo. Both the darkness and the area are then needed to quantify the amount of substance present, as the mean grey level of the dot or band will decrease as the substance is spread out over a larger area; this does not apply to histological sections.
Finally, statistical analysis should be performed on the untransformed GL values (averaging across sections to give one value per animal). There appears to be little advantage to transforming GL values into either optical densities or units of radioactivity. Some may argue that the relationship between the amount of radioactivity and the response of the film is nonlinear, that the imaging system's response to levels of darkness is also nonlinear, and that these need to be adjusted somehow, e.g. by converting to ROD and then using 14C standards. However, one must bear in mind that autoradiography is only semiquantitative, which means, for example, that while the difference in GL values between 40 and 35, and between 50 and 45, is five in both cases, it does not necessarily represent an equivalent change in the number of mRNA transcripts. Adjusting for such small non-linearities suggests a much higher quality of data than is actually obtained with ISH and autoradiography. Furthermore, the grey level values will likely be in a narrow range in most studies, which means these transformations are approximately linear and therefore pointless (as in converting from Celsius to Fahrenheit). Alternatively, if the data have a wide range, then such transformations can result in a combination of (1) increased overall variability, (2) heterogeneous variances between groups, and (3) outliers, resulting in a reduction in statistical power, and necessitating log transformations or non-parametric tests, which either reverse (log transformation) or ignore (rank-based statistical tests) the effect of such transformations. Grey levels obtained directly from the imaging system are suitable values to use for analysis and no further calculations are required. Other advantages include less time to carry out the analysis, less chance for computational errors, and values that are easy to interpret (i.e. they range from 0–255 for an 8-bit image) and easy to compare between studies; it is often a mystery what the values on the y-axis of graphs represent in many studies.
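The recommended workflow (subtract the background on each section, then average to one value per animal) is simple enough to sketch directly. The record layout and the numbers below are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def per_animal_means(records):
    """Collapse background-subtracted grey levels to one value per animal.

    records: iterable of (animal_id, section, side, raw_gl, background_gl),
    where background_gl is the background level measured on that section.
    """
    by_animal = defaultdict(list)
    for animal, _section, _side, raw_gl, background_gl in records:
        by_animal[animal].append(raw_gl - background_gl)
    return {animal: mean(values) for animal, values in by_animal.items()}


records = [
    ("rat1", 1, "L", 50, 10), ("rat1", 1, "R", 52, 10),
    ("rat2", 1, "L", 70, 12), ("rat2", 1, "R", 74, 12),
]
print(per_animal_means(records))
```

The resulting per-animal values can be passed directly to the statistical test, so that the animal, and not the section or the side, is the unit of analysis.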
There is one instance when using a 14C standard is useful: when multiple films are required because not all samples can fit on a single film. There will likely be systematic differences between films, and the 14C standard serves as a common reference that allows direct comparisons of samples from multiple films. There are, however, alternatives such as converting the results within each film to z-scores, which standardise the data within a film to have a mean of zero and a standard deviation of one. The z-scores can then be analysed in the normal way. This requires that brains from different experimental conditions are (approximately) balanced across the films; this means not having all the controls on one film and all the treated animals on another film. This should already be standard practice and so does not introduce any additional procedures or constraints, and it has the advantage of (1) using only a linear transformation and (2) not requiring anything else to be estimated and incorporated into an equation, which will almost certainly introduce more noise. The present study used only one film and so it was not possible to assess the relative merits of using 14C standards versus z-scores.
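Standardising within each film is a small calculation, sketched below with hypothetical film and sample identifiers. It assumes at least two samples per film and that experimental groups are balanced across films, as noted above.

```python
from collections import defaultdict
from statistics import mean, stdev


def zscore_within_film(samples):
    """samples: list of (film_id, sample_id, gl). Returns {sample_id: z-score}."""
    by_film = defaultdict(list)
    for film, _sample, gl in samples:
        by_film[film].append(gl)
    # Mean and standard deviation of each film, computed from its own samples
    stats = {film: (mean(v), stdev(v)) for film, v in by_film.items()}
    return {
        sample: (gl - stats[film][0]) / stats[film][1]
        for film, sample, gl in samples
    }


samples = [
    ("film1", "a1", 40), ("film1", "a2", 50), ("film1", "a3", 60),
    ("film2", "b1", 70), ("film2", "b2", 80), ("film2", "b3", 90),
]
print(zscore_within_film(samples))
```

In this example the second film is darker overall, but after standardisation the corresponding samples on the two films receive the same z-scores, so the systematic difference between films is removed.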
While this study only examined gene expression in one brain region, it is likely that most of the results apply to other regions and structures as well, although the extent to which such concerns as bias and reduced precision play a role outside of the hippocampus will have to be empirically determined. The data and R code are therefore provided so that readers can reproduce the results of this paper and use them as a template for the analysis of their own data.
Based on the above results, three recommendations are proposed. First, do not use integrated values because they are a function of both the mean grey level (i.e. gene expression) and area, making the results difficult to interpret; bias can be introduced if the area in one group differs from that in another group (even though GL values are the same). In addition, integrated values have reduced precision because the variability in the estimation of the area is included in the final value. Areas can be reported separately if required, although this arguably provides little in the way of new information and it is preferable to estimate area on tissue sections directly rather than on autoradiographic films. Second, manual selection of the interior of the structure or region of interest results in data with low variability (low CV), avoids ambiguities in determining the edge of the structure, and is a relatively quick method requiring few calculations (only background subtraction). However, the standard method of outlining the structure by hand proved to have suitable properties with this dataset as well. Given the possibility of floor-effects with thresholding methods (at least with the global methods examined in this paper), they should be avoided. Third, statistical analysis should be performed on the GL values without transforming them to optical densities or standardising them against 14C standards (unless multiple films are used). The dynamic range of images in most studies will be fairly narrow and therefore these transformations are pointless. If the range of the data is large, standardising the values against 14C standards can have negative consequences for the distributions (skewness, outliers, and heterogeneous variances).
In summary, several suggestions have been made which should be employed in the analysis of gene expression on autoradiographic images to reduce bias, increase precision, and ultimately provide more meaningful results.
This work was supported by a Wellcome Trust grant to Prof. J. Herbert and Dr. S.B. Pinnock. I would also like to thank members of the open source community who developed and contributed to the software used in this study.
- 2. Ginsberg MD, Zhao W, Singer JT, Alonso OF, Loor-Estades Y, Dietrich WD, Globus MY, Busto R: Computer-assisted image-averaging strategies for the topographic analysis of in situ hybridization autoradiographs. J Neurosci Methods. 1996, 68 (2): 225-233. 10.1016/0165-0270(96)00084-2.
- 4. Simonetti AW, Elezi VA, Farion R, Malandain G, Segebarth C, Remy C, Barbier EL: A low temperature embedding and section registration strategy for 3D image reconstruction of the rat brain from autoradiographic sections. J Neurosci Methods. 2006, 158 (2): 242-250. 10.1016/j.jneumeth.2006.06.004.
- 9. Zhang H, Torregrossa MM, Jutkiewicz EM, Shi YG, Rice KC, Woods JH, Watson SJ, Ko MC: Endogenous opioids upregulate brain-derived neurotrophic factor mRNA through delta- and micro-opioid receptors independent of antidepressant-like effects. Eur J Neurosci. 2006, 23 (4): 984-994. 10.1111/j.1460-9568.2006.04621.x.
- 15. Van Hoomissen JD, Holmes PV, Zellner AS, Poudevigne A, Dishman RK: Effects of beta-adrenoreceptor blockade during chronic exercise on contextual fear conditioning and mRNA for galanin and brain-derived neurotrophic factor. Behav Neurosci. 2004, 118 (6): 1378-1390. 10.1037/0735-7044.118.6.1378.
- 17. Rieux C, Carney R, Lupi D, Dkhissi-Benyahya O, Jansen K, Chounlamountri N, Foster RG, Cooper HM: Analysis of immunohistochemical label of Fos protein in the suprachiasmatic nucleus: comparison of different methods of quantification. J Biol Rhythms. 2002, 17 (2): 121-136. 10.1177/074873002129002410.
- 24. Paxinos G, Watson C: The Rat Brain in Stereotaxic Coordinates. 1986, London: Academic Press, 2nd edition.
- 25. Gonzalez RC, Woods RE: Digital Image Processing. 2002, Upper Saddle River, NJ: Prentice Hall, 2nd edition.
- 26. Vizi S, Palfi A, Hatvani L, Gulya K: Methods for quantification of in situ hybridization signals obtained by film autoradiography and phosphorimaging applied for estimation of regional levels of calmodulin mRNA classes in the rat brain. Brain Res Brain Res Protoc. 2001, 8: 32-44. 10.1016/S1385-299X(01)00082-4.
- 28. Ihaka R, Gentleman R: R: a language for data analysis and graphics. J Comput Graph Stat. 1996, 5: 299-314. 10.2307/1390807.
- 29. R Development Core Team: R: A Language and Environment for Statistical Computing. 2007, R Foundation for Statistical Computing, Vienna, Austria.
- 31. Cox DR, Solomon PJ: Components of Variance. 2003, Boca Raton, FL: Chapman & Hall/CRC.
- 32. Faraway JJ: Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. 2006, Boca Raton, FL: Chapman & Hall/CRC Press.
- 35. Mouton PR: Principles and Practices of Unbiased Stereology. 2002, Baltimore: Johns Hopkins University Press.
- 36. Baddeley A, Vedel Jensen EB: Stereology for Statisticians. 2005, London: Chapman & Hall/CRC.
- 38. Cleveland WS: Visualizing Data. 1993, New Jersey: Hobart Press.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.