Introduction

Population forecasts entail a significant amount of uncertainty, especially for long-range horizons and for places with small or rapidly changing populations. Almost 40 years ago, Keyfitz (1972) argued that demographers should warn forecast users about this uncertainty. Warnings have typically been given by presenting a range of forecasts (e.g., Hollmann et al. 2000), but in recent years demographers have developed statistical prediction intervals that provide an explicit probability statement regarding the level of error expected to accompany a population forecast (e.g., Alho and Spencer 1997; de Beer 2000; Lee 1999; Lutz and Goldstein 2004).

Most of the research on probabilistic population forecasting has focused on large geographic areas. National-level analyses have been performed for Australia (Wilson and Bell 2004), Austria (Lutz and Scherbov 1998), Finland (Alho 2002), the Netherlands (de Beer and Alders 1999), Norway (Keilman et al. 2002), Poland (Matysiak and Nowok 2007), Sweden (Cohen 1986), and the United States (Lee and Tuljapurkar 1994). Less research has been performed at the subnational level, but Miller (2002) produced a series of probabilistic forecasts for California; Rees and Turton (1998) investigated model input uncertainty for 71 regions in 12 countries in the European Union; Tayman et al. (2007) evaluated ARIMA models for four states in the United States; Wilson and Bell (2007) developed probabilistic forecasts for Queensland and the rest of Australia; and Gullickson and Moen (2001) used a probabilistic model for forecasting hospital admissions for two regions in Minnesota. Very little research has been done for small areas such as cities or counties.

We believe research on probabilistic forecasting methods at the subnational level is essential because state and local population forecasts are so widely used for planning, budgeting, and analytical purposes. Examples include planning for future water consumption in Texas (Texas Water Development Board 1997), choosing locations for new fire stations in San Diego (Tayman et al. 1994), evaluating the demand for hospital services in a southern metropolitan area (Thomas 1994), developing conservation plans for a river basin in Arizona and Mexico (Steinitz et al. 2003), and projecting future enrollments for public school districts in Indiana (McKibben 1996). Effective decision making for projects such as these cannot be accomplished without a clear understanding of the likely level of accuracy of the underlying population forecasts.

Probabilistic prediction intervals can be based on models that incorporate the stochastic nature of the forecasting process (e.g., Alho and Spencer 1990; Cohen 1986; Lutz et al. 1999; Pflaumer 1992), on empirical analyses of past forecast errors (e.g., Bongaarts and Bulatao 2000; Keilman 1997; Keyfitz 1981; Smith and Sincich 1988; Stoto 1983; Tayman et al. 1998), or on a combination of the two (e.g., de Beer 1997). In this study, we focus on prediction intervals based on empirical analyses of past forecast errors.

The usefulness of the empirical approach rests heavily on the assumption that the distribution of population forecast errors remains relatively stable over time. Not all forecasters accept the validity of this assumption (e.g., Alho and Spencer 1997), but few have tested it empirically. Perhaps the most comprehensive evaluation to date was conducted by Smith and Sincich (1988), who examined state-level population forecasts using data from 1900 to 1980. Following a methodology developed by Williams and Goodman (1971), they evaluated forecast errors for 10- and 20-year horizons and found that the means and variances of absolute forecast errors remained relatively stable over time, especially after 1920, and that the variances of algebraic forecast errors remained moderately stable over time but their means did not. They concluded that the study of past forecast errors can be useful for predicting the level of precision of current population forecasts, but not for predicting their tendency to be too high or too low.

In this article, we follow the approach described by Smith and Sincich (1988) to investigate the construction and performance of empirical prediction intervals. Using 100 years of population data and a large sample of counties in the United States, we applied trend extrapolation techniques to develop a set of county population forecasts; calculated forecast errors by comparing forecasts to decennial census counts; used the distribution of past forecast errors to construct empirical prediction intervals; and investigated whether those intervals provided accurate predictions of subsequent forecast errors. We found that those intervals provided reasonably accurate predictions of the precision of subsequent forecasts, but we did not test predictions of bias because the evidence showed that past forecast errors provide little guidance regarding the tendency for subsequent forecasts to be too low or too high. We believe this and other studies of empirical prediction intervals will help forecast users measure and evaluate the uncertainty inherent in population forecasts and plan more effectively for the future.

Data and Forecasting Techniques

We used decennial census data from 1900 to 2000 to construct and analyze population forecasts for counties (or county equivalents) in the United States. We restricted our analysis to the 2,482 counties for which no significant boundary changes occurred between 1900 and 2000; this group accounted for 79% of all current counties. Forecast errors for this subset were very similar to errors for a larger group of 2,978 counties whose boundaries did not change significantly after 1930 (not shown here). We used the smaller group with constant boundaries since 1900 because it permitted the analysis of a larger number of launch years and forecast horizons.

We use the following terminology to describe population forecasts:

  1. Base year: the year of the earliest population size used to make a forecast.

  2. Launch year: the year of the latest population size used to make a forecast.

  3. Target year: the year for which population size is forecasted.

  4. Base period: the interval between the base year and launch year.

  5. Forecast horizon: the interval between the launch year and target year.

For example, if data from 1900 and 1920 were used to forecast population in 1930, then 1900 would be the base year, 1920 the launch year, 1930 the target year, 1900–1920 the base period, and 1920–1930 the forecast horizon.

We made forecasts of total population for each county using seven trend extrapolation techniques (see the Appendix for a description of these techniques). These forecasts were calculated using a 20-year base period, the base period producing the most accurate forecasts for counties in this data set (not shown here). The forecasts had launch years ranging from 1920 to 1990 and forecast horizons ranging from 10 to 30 years. The 21 combinations of launch year and forecast horizon—and their associated target years—are shown in Table 1.
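For readers unfamiliar with these methods, the sketch below illustrates how two common extrapolation forms (linear and exponential) generate a forecast from a 20-year base period. The article's seven techniques are defined in the Appendix; the formulations shown here are standard textbook versions offered for illustration only, and the function names and example figures are hypothetical.

```python
import math

def linear_extrapolation(pop_base, pop_launch, base_years, horizon_years):
    """Extend the average annual absolute change over the forecast horizon."""
    annual_change = (pop_launch - pop_base) / base_years
    return pop_launch + annual_change * horizon_years

def exponential_extrapolation(pop_base, pop_launch, base_years, horizon_years):
    """Extend the average annual growth rate over the forecast horizon."""
    annual_rate = math.log(pop_launch / pop_base) / base_years
    return pop_launch * math.exp(annual_rate * horizon_years)

# Example: base year 1900, launch year 1920 (20-year base period), target year 1930.
forecast_1930 = linear_extrapolation(pop_base=10_000, pop_launch=12_000,
                                     base_years=20, horizon_years=10)
# 12,000 + (2,000 / 20) * 10 = 13,000
```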

Table 1 Target years by launch year and forecast horizon

Trend extrapolation techniques have a number of advantages compared to more complex forecasting methods. They require less base data, can be employed at lower cost, and can more easily be applied retrospectively to produce forecasts that are comparable over time. These characteristics are particularly important when making forecasts for a large number of geographic areas and historical time periods. Though simple in design, trend extrapolation techniques have been found to produce forecasts of total population that are at least as accurate as those produced by more complex methods (e.g., Ascher 1978; Isserman 1977; Long 1995; Murdock et al. 1984; Rayer 2008; Smith and Sincich 1992; Stoto 1983). Similar evidence has been reported in other fields as well (e.g., Makridakis 1986; Makridakis and Hibon 1979; Mahmoud 1984). Given these characteristics, trend extrapolation techniques provide a useful vehicle for assessing the stability of population forecast errors over time and for testing the validity of empirical prediction intervals.

We calculated the average of the forecasts from these seven techniques for each county (AV7) and the average after the highest and lowest forecasts were excluded (AV5). The latter measure reduces the impact of outliers and is often called a trimmed mean; we found it produced slightly smaller forecast errors than AV7. A number of studies have documented the benefits of combining forecasts, both in demography and other fields (e.g., Armstrong 2001; Makridakis et al. 1998; Rayer 2008; Smith et al. 2001; Webby and O’Connor 1996). We focus primarily on the results for AV5 in this article; many of the results for AV7 and for individual techniques were similar to those reported here for AV5. To investigate how different functional forms affect the performance of empirical prediction intervals, we conclude the analysis with an evaluation of forecasts based on individual trend extrapolation techniques.
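As an illustration, the averaging step might be implemented along the following lines; the function name and the example forecast values are hypothetical.

```python
def composite_forecasts(forecasts):
    """Compute AV7 (simple mean) and AV5 (trimmed mean) from seven forecasts.

    `forecasts` holds the seven individual trend extrapolation forecasts for
    one county, launch year, and forecast horizon.
    """
    av7 = sum(forecasts) / len(forecasts)
    trimmed = sorted(forecasts)[1:-1]          # drop the highest and lowest forecast
    av5 = sum(trimmed) / len(trimmed)
    return av7, av5

# Hypothetical forecasts from the seven techniques for one county:
av7, av5 = composite_forecasts([9_500, 10_200, 10_400, 10_800, 11_000, 11_500, 14_000])
```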

Analysis of Forecast Errors

Forecasts for each county were made for each of the 21 launch year/forecast horizon combinations shown in Table 1 and were compared to census counts for each target year. The resulting differences are called forecast errors, although they may have been caused partly by errors in the census counts themselves. All errors are reported as percentages, calculated by dividing the difference between the forecast and the census count by the census count and multiplying by 100. We refer to errors that ignore the direction of error as absolute percent errors (APEs) and to errors that account for the direction of error as algebraic percent errors (ALPEs). The former reflect the precision of population forecasts; the latter reflect their bias.

Several summary measures were used to provide a general description of forecast accuracy. For precision, we start with the most commonly used measure in population forecasting, the mean absolute percent error (MAPE), which shows how close the forecasts were to population counts regardless of whether they were too high or too low. We also report the 90th percentile error (90th PE, calculated as the APE that was larger than exactly 90% of all APEs). The 90th PE provides a different perspective on precision by reducing the impact of outliers. For bias, we report the mean algebraic percent error (MALPE), which shows the tendency for forecasts to be too high or too low. Finally, we report the standard deviation, which measures the spread of APEs and ALPEs around the mean. These and similar measures have been used frequently to evaluate the accuracy of population forecasts (e.g., Isserman 1977; Pflaumer 1992; Rayer 2007; Smith and Sincich 1988; Tayman et al. 1998).
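These error measures can be expressed compactly. The sketch below assumes arrays of forecasts and census counts for one target year and forecast horizon, and treats the 90th PE as the empirical 90th percentile of the APEs; the function name and example values are hypothetical.

```python
import numpy as np

def error_summary(forecasts, census_counts):
    """Summary error measures for one target year / forecast horizon combination."""
    alpe = (forecasts - census_counts) / census_counts * 100   # algebraic percent errors
    ape = np.abs(alpe)                                         # absolute percent errors
    return {
        "MAPE": ape.mean(),                   # precision
        "90th PE": np.percentile(ape, 90),    # precision, less affected by outliers
        "MALPE": alpe.mean(),                 # bias
        "SD of APEs": ape.std(ddof=1),        # spread of APEs around the mean
        "SD of ALPEs": alpe.std(ddof=1),      # spread of ALPEs around the mean
    }

# Example with hypothetical data for a handful of counties:
forecasts = np.array([11_000.0, 9_800.0, 15_200.0, 7_300.0])
counts = np.array([10_000.0, 10_100.0, 14_000.0, 8_000.0])
summary = error_summary(forecasts, counts)
```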

Table 2 shows county forecast errors for AV5 by target year and forecast horizon. Three patterns stand out regarding measures of precision. First, errors increased approximately linearly with the length of the forecast horizon: for each 10-year increase in the forecast horizon, MAPEs rose by about 10 percentage points and 90th PEs rose by about 20 percentage points.

Table 2 Selected measures of forecast error distributions for U.S. counties

Second, differences in errors across target years within each forecast horizon were relatively small. For 10-year horizons, for example, MAPEs between target years 1930 and 1980 varied only from 9.6 to 13.2, standard deviations varied from 9.7 to 13.9, and 90th PEs varied from 21.0 to 29.1. Forecast precision improved for target years 1990 and 2000, with all three measures showing lower values than in earlier years. The smaller errors for 1990 and 2000 can largely be explained by the greater average population size and more moderate growth rates of counties toward the end of the 20th century. Averaged across all target years, the MAPE for 10-year horizons was 10.2, with a standard deviation of only 2.3. A similar degree of temporal stability can be seen for 20- and 30-year horizons.

Third, ratios of standard deviations to MAPEs were similar for all target years and forecast horizons, fluctuating in a narrow range around 1.0. For the average based on all target years within a particular horizon, this ratio was 0.980 for the 10-year horizon, 0.931 for the 20-year horizon, and 0.939 for the 30-year horizon. Even for target years in which MAPEs were considerably different than for other target years (e.g., 1990 and 2000 for 10-year horizons and 2000 for 20-year horizons), the ratios did not diverge substantially from 1.0. This implies that the distribution of errors around the mean remained fairly constant over time.

Unlike MAPEs, MALPEs varied considerably from one target year to another. There was a tendency for MALPEs to be positive for earlier target years and negative for later target years, but this relationship was fairly weak (except for 30-year forecast horizons). Standard deviations for ALPEs were much larger than their means and were consistently larger than standard deviations for APEs. Standard deviations for both APEs and ALPEs were generally smaller in later target years than in earlier target years, especially in 1990 and 2000 for 10-year horizons and 2000 for 20-year horizons. That is, the spread around the mean declined over time for both APEs and ALPEs. Averaged over all target years, MALPEs were very small for all three forecast horizons, suggesting that there was little systematic bias in the population forecasts.

We constructed histograms for each target year and forecast horizon to provide a more detailed look at the distribution of errors. Results for algebraic percent errors for AV5 for 20-year horizons are shown in Fig. 1. The distributions were roughly normal for each target year, although earlier years had more outliers than later years. The spread of errors around the mean declined slowly between 1940 and 1990 but declined substantially between 1990 and 2000. The center of the distribution fluctuated considerably over time, sometimes falling above zero and sometimes falling below zero. Histograms for the other forecast horizons showed similar patterns, although distributions were generally more compact for 10-year horizons and less compact for 30-year horizons (not shown here).

Fig. 1 Distribution of algebraic percent errors by target year, 20-year horizon. Note: Number of counties on vertical axis, algebraic percent errors on horizontal axis

Figure 2 shows the corresponding results for absolute percent errors. The distributions were asymmetric and truncated at zero. Errors became slightly more concentrated near zero between 1940 and 1990 but became substantially more concentrated in 2000, when there were considerably more small errors and fewer large errors than in previous years. Distributions for 10- and 30-year horizons were similar to those shown here, but were concentrated closer to zero for the former than the latter (not shown here).

Fig. 2 Distribution of absolute percent errors by target year, 20-year horizon. Note: Number of counties on vertical axis, absolute percent errors on horizontal axis

The results summarized in Table 2 and Figs. 1 and 2 show that many of the forecasts analyzed in this study had large errors, especially for long forecast horizons. The prevalence of large errors may be disappointing news to forecast users, but it is an accurate reflection of reality and highlights the importance of developing measures of uncertainty to accompany small-area population forecasts. This is the topic we turn to next.

Empirical Prediction Intervals

Smith and Sincich (1988) conducted formal statistical tests of the stability of MAPEs, MALPEs, and their standard deviations over time. We replicated those tests using our data set and found only weak evidence of stability (not shown here). We believe our results differed from those reported by Smith and Sincich because counties exhibit much more variability than states with respect to population size and growth rates; both of these factors influence forecast errors. Perhaps more important, Smith and Sincich’s data set had only 50 units of analysis whereas our data set has almost 2,500. In a large data set, even small differences in means and variances (e.g., one percentage point) may be statistically significant.

The data shown in Table 2 and Figs. 1 and 2 provide evidence of a fairly high degree of stability over time in MAPEs and the distribution of APEs. This suggests that—despite the lack of stability implied by formal statistical tests—past forecast errors may help us predict the level of precision of current forecasts. However, the data showed no stability in MALPEs and the center of the distribution of ALPEs, suggesting that past forecast errors are not likely to help us predict the tendency for current forecasts to be too high or too low. A number of previous studies have drawn similar conclusions (e.g., Isserman 1977; Kale et al. 1981; Smith and Sincich 1988; Tayman et al. 1998). We therefore focus on APEs in our efforts to develop and evaluate empirical prediction intervals.

Smith and Sincich (1988) used information on the distribution of past APEs to predict the distribution of subsequent APEs. A major advantage of this approach is that it can accommodate any type of error distribution, including the asymmetric and truncated distributions characteristic of APEs. It also permits an assessment of the prediction intervals themselves; that is, we can compare the actual number of errors falling within the intervals with the predicted number. Following Smith and Sincich, we use 90th PEs to construct empirical prediction intervals.

Overall Results

We began by ranking APEs for AV5 for each of the 21 sets of forecasts and selecting the 90th PE, as shown in Table 2. Then, we used the 90th PE from target year t-n as the forecast of the 90th PE in target year t, where n is the length of the forecast horizon. For example, if 1950 was the target year for a 10-year forecast based on launch year 1940, the 90th PE for 1950 would be used to predict the 90th PE for 1960 for a 10-year forecast based on launch year 1950. If error distributions remain relatively stable over time, 90th PEs from past distributions will provide reasonably accurate predictions of future 90th PEs. To assess the validity of this hypothesis, we compared the predicted with the actual 90th PE for each target year and computed the percentage of APEs that fell within the predicted values.
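A minimal sketch of this validation step is shown below, assuming a dictionary of APE arrays keyed by target year; the 90th PE observed n years earlier serves as the predicted bound, and coverage is the share of current APEs falling below it. Setting the percentile argument to 75 reproduces the alternative cut-off examined later. The function name and example data are hypothetical.

```python
import numpy as np

def interval_coverage(apes_by_target_year, horizon, percentile=90):
    """For each target year t, predict the percentile error from year t - horizon
    and report the share of APEs in year t that fall below that prediction."""
    coverage = {}
    for target_year, apes in apes_by_target_year.items():
        prior_year = target_year - horizon
        if prior_year not in apes_by_target_year:
            continue  # no earlier error distribution available for this horizon
        predicted_pe = np.percentile(apes_by_target_year[prior_year], percentile)
        coverage[target_year] = 100 * np.mean(np.asarray(apes) < predicted_pe)
    return coverage

# Hypothetical APEs for 10-year forecasts with target years 1950 and 1960:
apes = {1950: [4.0, 9.0, 15.0, 31.0], 1960: [3.0, 8.0, 12.0, 40.0]}
coverage = interval_coverage(apes, horizon=10)   # {1960: 75.0}; values near 90 are ideal
```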

Table 3 shows the percentage of APEs in each target year that was less than the predicted 90th PE. The numbers can be interpreted as follows: a value of 90 reflects a perfect prediction; values below 90 indicate that the 90th PE for target year t was greater than the 90th PE for target year t-n (i.e., fewer APEs than expected fell within the predicted range); and values above 90 indicate the opposite. In addition to the values for each target year, Table 3 shows these percentages averaged across all target years within each horizon, along with the standard deviations associated with the averages.

Table 3 Percentage of APEs less than the predicted 90th percentile error

Table 3 reflects a high degree of stability for averages covering all the target years within a given forecast horizon: approximately 91% of APEs fell within the predicted 90th PE for all three horizons. There was some variability when comparing results for individual target years within each horizon, but for the most part the values did not stray far from 90. For all three horizons, standard deviations were nearly identical and were small relative to their means, further demonstrating temporal stability. In this analysis, then, empirically-based prediction intervals performed well: in most instances, intervals based on the distribution of past forecast errors encompassed approximately 90% of subsequent forecast errors.

It is possible that using data from several historical time periods would provide better results than using data from a single time period. To test this possibility, we evaluated the percentage of APEs that fell below a predicted 90th PE calculated as the average of the 90th PEs from the two previous target years (not shown here). This adjustment had little impact on the results, generally producing values that deviated slightly more from 90 than those shown here. In this sample, then, data from a single time period were sufficient for constructing empirical prediction intervals.

In order to investigate the impact of the choice of cut-off points for prediction intervals, we replicated the analysis using 75th percentile errors (75th PEs) instead of 90th percentile errors (not shown here). For the averages covering all the target years within a given forecast horizon, between 75% and 77% of APEs fell within the predicted 75th PE for all three horizons, reflecting a fair amount of temporal stability. However, there was more variability from one target year to another than for 90th PEs, which was caused by the greater concentration of APEs around the 75th PE than the 90th PE. Consequently, small differences in the size of the predicted percentile error led to a larger difference in the percentage of APEs falling within the predicted value. This can be seen in Figs. 1 and 2: the further the distance from the center of the error distribution, the lower the concentration of APEs around a particular percentile error.

Results by Population Size and Growth Rate

A number of studies have found that forecast precision improves with increases in population size and declines with increases in the absolute value of the growth rate (e.g., Keyfitz 1981; Murdock et al. 1984; Smith and Sincich 1992; Stoto 1983; White 1954). Others have found bias to be unrelated to population size but positively related to the growth rate (e.g., Isserman 1977; Rayer 2008; Smith 1987; Smith and Sincich 1988; Tayman 1996). To investigate the effects of these variables on the performance of prediction intervals, we extended our analysis to counties grouped by population size and growth rate.
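The grouped analysis can be sketched as follows, assuming a data frame with one row per county containing the launch-year population and the APE; grouping by base-period growth rate proceeds analogously. The class boundaries and example values are placeholders rather than the article's actual cut points, which are reflected in Tables 4 and 6.

```python
import numpy as np
import pandas as pd

# Hypothetical county-level records: launch-year population and absolute percent error.
df = pd.DataFrame({
    "launch_pop": [2_500, 8_000, 12_000, 40_000, 90_000, 300_000],
    "ape":        [28.0,  14.5,  11.0,   9.0,    6.5,    4.0],
})

def pe90_by_group(data, column, bins, labels):
    """90th percentile error within groups defined by the chosen column."""
    groups = pd.cut(data[column], bins=bins, labels=labels)
    return data.groupby(groups, observed=True)["ape"].quantile(0.90)

# Placeholder size classes for illustration only:
size_pe90 = pe90_by_group(df, "launch_pop",
                          bins=[0, 5_000, 15_000, 50_000, np.inf],
                          labels=["<5k", "5k-15k", "15k-50k", "50k+"])
```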

Table 4 shows 90th PEs for AV5 for counties grouped by population size in the launch year. Confirming the results of previous studies, errors generally declined as population size increased for each target year and length of forecast horizon, with the largest declines typically occurring in the move from the smallest to the next-smallest size category. Furthermore, standard deviations generally declined relative to their means as population size increased, reflecting less decade-to-decade variation in errors for large counties than small counties.

Table 4 90th Percentile errors by population size

How do differences in population size affect the performance of empirical prediction intervals? Table 5 shows the percentage of APEs for AV5 that was less than the predicted 90th PE by population size in the launch year. In general, differences by population size were fairly small and followed no consistent pattern. For some combinations of target year and length of horizon, the percentages rose with population size; for others, they fell; and for some, they followed no clear pattern. The standard deviations were small and did not vary much among the four size categories or by length of forecast horizon. Although 90th PEs themselves varied considerably with differences in population size, it appears that differences in population size had no consistent impact on the predictability of 90th PEs.

Table 5 Percentage of APEs less than the predicted 90th percentile error by population size

Table 6 shows 90th PEs for AV5 for counties grouped by the average per decade rate of population growth during the base period, for each combination of target year and forecast horizon. Errors generally displayed a U-shaped pattern, with higher values for counties with large negative growth rates, smaller values for counties with moderate growth rates, and higher values for counties with large positive growth rates. These patterns are also consistent with those reported in previous studies. Standard deviations followed a U-shaped pattern for 10- and 20-year forecast horizons, but rose steadily with the growth rate for the 30-year horizon.

Table 6 90th Percentile errors by population growth rate

Table 7 shows the percentage of APEs for AV5 that was less than the predicted 90th PE by the average per decade rate of population growth during the base period. In contrast to differences in population size, differences in growth rates had a fairly consistent impact on the performance of prediction intervals: there was a tendency for the percentage of APEs that was less than the predicted 90th PE to increase with the growth rate. Values were generally smallest for counties in the lowest growth category and rose with increases in the growth rate. Furthermore—as indicated by standard deviations that declined as growth rates increased for all three lengths of forecast horizon—values for individual target years varied most for counties with rapidly declining populations and varied least for counties with rapidly growing populations. That is, there was more consistency in the results across target years for rapidly growing populations than for rapidly declining populations. Future research may lead to techniques for modifying prediction intervals by accounting for differences in population growth rates.

Table 7 Percentage of APEs less than the predicted 90th percentile error by population growth rate

Results for Individual Techniques

Our analysis thus far has focused on AV5, the average of forecasts from the seven individual techniques after the highest and lowest were omitted. Would similar results be found for the individual techniques themselves? To answer this question, we replicated Table 2 for each individual technique (not shown here). We found many similarities in summary error statistics but several differences as well. For example, the exponential technique generally had larger MAPEs and 90th PEs than other techniques, especially for long forecast horizons, and displayed a strong upward bias while most of the other techniques displayed a downward bias.

In spite of these differences, prediction intervals based on the distribution of past errors performed well for most individual techniques (see Table 8). For 10-year horizons, the overall percentage of errors falling within the predicted 90th PE for individual techniques ranged only from 89.7 to 91.6; for 20-year horizons, from 89.5 to 91.6; and for 30-year horizons, from 84.2 to 91.3 (89.4 to 91.3 for all but the shift-share technique). There was more variability for individual target years, but in many instances the percentages were fairly close to 90. It appears that differences in the functional form of the forecasting technique had little impact on the performance of empirical prediction intervals.

Table 8 Percentage of APEs less than the predicted 90th percentile error by forecasting technique

Summary and Conclusions

Under formal definitions, probability statements regarding the accuracy of population forecasts cannot be made because the distribution of future errors is unknown (and unknowable) at the time forecasts are made. However, if current forecasting methods are about as accurate as those used in the past, and if the degree of uncertainty will be about the same in the future as it was in the past, then we can assume that future forecast errors will be drawn from the same distribution as past forecast errors (Keyfitz 1981). If this is true, prediction intervals based on the distribution of past forecast errors will provide reasonably accurate measures of the uncertainty surrounding subsequent population forecasts. This is the issue we address in the present study.

We calculated population forecast errors for 2,482 counties in the United States throughout the 20th century and evaluated the performance of empirical prediction intervals based on the distribution of past forecast errors. We found that:

  1. MAPEs and 90th PEs remained fairly constant over time, but declined over the last few decades of the century.

  2. The standard deviations for APEs and ALPEs also remained fairly constant over time, but declined over the last few decades of the century.

  3. MALPEs did not remain at all constant over time.

  4. MAPEs and 90th PEs increased with the length of the forecast horizon, often in a nearly linear manner.

  5. In most instances, the 90th PE from one time period provided a reasonably accurate forecast of the percentage of APEs falling within the predicted 90% interval in the following time period, even for long forecast horizons.

  6. Differences in population size had little impact on the percentage of APEs falling within the predicted 90% interval, but differences in population growth rates had a fairly consistent impact.

Of particular interest is the finding that 90th PEs from previous error distributions provided reasonably accurate predictions of subsequent 90th PEs. Given the tremendous shifts in population trends that occurred during the 20th century, this is a notable finding. Drawing on these results and those reported in previous studies (e.g., Keyfitz 1981; Smith and Sincich 1988; Stoto 1983; Tayman et al. 1998), we conclude that empirical prediction intervals based on the distribution of past forecast errors can provide useful information regarding the likely level of precision of current population forecasts.

We did not construct empirical prediction intervals for measures of bias. As shown in Table 2 and Fig. 1, MALPEs and the center of the distribution of ALPEs did not remain stable over time. Although bias has been found to be related to differences in population growth rates, it appears that past forecast errors do not provide useful information regarding the tendency for a particular set of forecasts to be too high or too low (e.g., Isserman 1977; Kale et al. 1981; Smith and Sincich 1988; Tayman et al. 1998).

The prediction intervals analyzed in this study were based on forecasts derived from seven simple trend extrapolation techniques. Would similar results be found for forecasts based on cohort-component models and other complex forecasting methods? The available evidence suggests this to be the case. A number of studies have found that average forecast errors for alternative methods tend to be similar when those methods are applied to the same geographic areas and time periods (e.g., Ascher 1978; Isserman 1977; Kale et al. 1981; Long 1995; Morgenroth 2002; Pflaumer 1992; Rayer 2008; Smith and Sincich 1992; Smith and Tayman 2003). These studies examined actual, published forecasts as well as simulated forecasts based on extrapolation techniques. Furthermore, the present study found that empirical prediction intervals worked well for almost every individual technique, including techniques based on very different functional forms. Additional research is needed, but it is likely that empirically-based prediction intervals can be usefully employed in conjunction with many different population forecasting methods.

Probabilistic prediction intervals can be based on models that incorporate the stochastic nature of the forecasting process, on empirical analyses of past forecast errors, or on a combination of the two. Which of these approaches is most useful for counties and other small areas? Model-based prediction intervals require a substantial amount of base data and are subject to errors in specifying the model, errors in estimating the model’s parameters, and structural changes that invalidate the model’s parameter estimates over time (Lee 1992). In addition, many different models can be specified, each providing a different set of prediction intervals (e.g., Cohen 1986; Keilman et al. 2002; Sanderson 1995; Tayman et al. 2007). These limitations make model-based intervals more difficult to apply than empirically-based intervals and raise questions about their reliability. Although prediction intervals based on stochastic models may be useful for national and perhaps for state population forecasts, we believe an empirical approach will generally be more useful for counties and other small areas.

Empirically-based prediction intervals have their own limitations, of course. We found that more than 90% of APEs fell inside the 90% prediction intervals in some target years and less than 90% in other target years. Intervals based on 75th PEs did not perform quite as well as intervals based on 90th PEs. Perhaps more important, the empirical approach does not provide reliable forecasts of the likely direction of error for a particular set of population forecasts.

Further research on empirical prediction intervals is clearly needed. Can formal criteria be established for evaluating the stability of error distributions over time? How much historical data are needed to develop the most stable intervals? Can techniques be developed for adjusting prediction intervals to account for differences in population size, growth rate, geographic region, and perhaps other factors as well? How do differences in the choice of cut-off points (e.g., 90th vs. 75th percentile) affect the accuracy of forecast error predictions? Can information on the distribution of errors for one geographic region be used to develop prediction intervals for another region? Would the performance of prediction intervals for other functional units (i.e., areas bound together by economic and political ties, such as metropolitan areas or planning districts) differ from that shown here for counties? Can techniques be developed for predicting the tendency of forecasts to be too high or too low? Answers to these and similar questions promise to increase our understanding of population forecast errors and improve the performance of empirically-based prediction intervals.