Abstract
In this chapter the basic aspects of sample surveys for finite populations are presented compactly. The concepts of population, sample, sampling design, survey data, estimation of finite population parameters of interest, and the consequent errors and their control are explained, and detailed illustrations are provided. The theory addressing general issues is explained first. Then the need to modify it to cover sensitive issues, and how to do so, is explained. It is shown how, in a general situation, one may handle indirectly procured observations to estimate parameters of interest and also derive estimated measures of accuracy. Sophisticated theoretical details are presented only in brief. Finally, we emphasize that any probability sampling design may be employed for the purpose of estimating parameters related to stigmatizing characteristics.
2.1 Introduction
A labeled finite population of identifiable individuals, each bearing a real value (possibly only zero or one), is to be surveyed through sampling: the values for the sampled individuals are ascertained in order to estimate certain parameters. Specifically, we take up the case of individual human beings, some of whom bear sensitive, even stigmatizing, features. A way is needed to gather the individual values of the sampled people successfully, so as to estimate the proportion of a community bearing such a sensitive feature, or the total value of a stigmatizing characteristic borne collectively by the members of a community.
As its title announces, this monograph is avowedly one on survey sampling. We are therefore obliged to tell our readers certain basic aspects of sample surveys.
By \(U = \left (1,\ldots,i,\ldots,N\right )\) we shall mean a survey population. It consists of a known finite number \(N\) of individuals, each uniquely identified by a label \(i\) running from 1 through \(N\). Each unit \(i\) bears a value \(y_{i}\) of a real variable \(y\), \(i = 1,\ldots,N\). Thus the vector \(\underline{Y } = \left (y_{1},\ldots,y_{i},\ldots,y_{N}\right )\) is defined on \(U\). Then
\[ Y =\sum _{i=1}^{N}y_{i},\qquad \bar{Y } = \frac{Y } {N},\qquad S^{2} = \frac{1} {N - 1}\sum _{i=1}^{N}\left (y_{i} -\bar{ Y }\right )^{2} \]
are some of the parameters of common interest related to \(\underline{Y }\). The quantities \(Y\) and \(\bar{Y }\), called, respectively, the total and the mean of \(y\) related to \(U\), are usually required to be estimated. The quantity \(S^{2}\) is the population variance. For estimation, a sample \(s\) of elements of \(U\) is chosen with a pre-assigned probability \(p(s)\) according to a suitable probability design, or design in brief, such that \(0 \leq p(s) \leq 1\) for every sample \(s\) that may be selected from \(U\). The number of units in \(s\), say \(n\), is called the size of \(s\), just as \(N\) is the size of the population \(U\). On choosing a sample, the values of \(y\) for the units in \(s\), that is, \(y_{i}\) for \(i \in s\), are to be ascertained by an actual sample survey. On gathering the survey data
\[ d = \left (s,y_{i}\mid i \in s\right ) \]
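an estimator for \(Y\) may be employed as \(t = t(d)\), usually with the properties
\[ \left \vert E_{p}(t) - Y \right \vert = \left \vert \sum _{s}p(s)t(d) - Y \right \vert \]
and
\[ M_{p}(t) = E_{p}\left (t - Y \right )^{2} =\sum _{s}p(s)\left (t(d) - Y \right )^{2} \]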
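being suitably small. The quantity \(M_{p}(t) = E_{p}\left (t - Y \right )^{2}\) is called the mean square error, and \(E_{p}\) denotes the expectation operator with respect to the sampling design \(p\). Such a \(t\) is called an estimator for \(Y\) with error \((t - Y )\) whose mean square error \(M_{p}(t)\) is kept suitably under control. If \(K\) is a pre-assigned positive number, noting that we may write
\[ M_{p}(t) =\sum _{1}p(s)\left (t(d) - Y \right )^{2} +\sum _{2}p(s)\left (t(d) - Y \right )^{2} \geq K^{2}\,\mathrm{Prob}\left [\left \vert t(d) - Y \right \vert \geq K\right ], \]
where \(\sum _{1}\) is the sum over those samples for which \(\left \vert t(d) - Y \right \vert \geq K\) and \(\sum _{2}\) that over the complementary set of possible samples, it follows that
\[ \mathrm{Prob}\left [\left \vert t(d) - Y \right \vert \leq K\right ] \geq 1 -\frac{M_{p}(t)} {K^{2}}. \]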
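The quantity
\[ B_{p}(t) = E_{p}(t) - Y \]
is called the bias of \(t\) in estimating \(Y\) and
\[ \sigma _{p}(t) = \sqrt{V _{p } (t)} \]
is the standard error of \(t\), where
\[ V _{p}(t) = E_{p}\left (t - E_{p}(t)\right )^{2} \]
is the variance of \(t\). Now
\[ M_{p}(t) = V _{p}(t) + B_{p}^{2}(t). \]
Consequently, by taking \(K =\lambda \sigma _{p}(t)\), with \(\lambda > 0\), it follows that
\[ \mathrm{Prob}\left [\left \vert t - Y \right \vert \leq \lambda \sigma _{p}(t)\right ] \geq 1 -\frac{1} {\lambda ^{2}}\left (1 + \frac{B_{p}^{2}(t)} {V _{p}(t)}\right ). \]
If \(t\) is chosen to be unbiased for \(Y\), then
\[ \mathrm{Prob}\left [t -\lambda \sigma _{p}(t) \leq Y \leq t +\lambda \sigma _{p}(t)\right ] \geq 1 -\frac{1} {\lambda ^{2}}. \]
The interval
\[ \left (t -\lambda \sigma _{p}(t),\;t +\lambda \sigma _{p}(t)\right ) \]
is called a confidence interval (CI) for \(Y\) with a confidence coefficient (CC) of at least
\[ 1 -\frac{1} {\lambda ^{2}}. \]
So, it is desirable to employ an unbiased estimator \(t\) for \(Y\), with \(B_{p}(t) = 0\), and a small variance, because in that case one may derive a confidence interval based on \(t(d)\) with a small width equal to \(2\lambda \sigma _{p}(t)\). For example, a choice of \(\lambda = 3\) will yield
\[ \left (t - 3\sigma _{p}(t),\;t + 3\sigma _{p}(t)\right ) \]
as a confidence interval with a confidence coefficient of at least 8 ∕ 9. Neyman (1934) gave us this gift for survey sampling, extending Chebyshev’s inequality in probability theory.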
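The Chebyshev-type interval can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `chebyshev_interval` and the numbers used are hypothetical, and the standard error is taken as known, whereas in practice it would itself be estimated.

```python
def chebyshev_interval(t, sigma, lam=3.0):
    """Interval (t - lam*sigma, t + lam*sigma) for an unbiased estimate t.

    Returns the two end points and the guaranteed lower bound 1 - 1/lam**2
    on the confidence coefficient (assumes t is unbiased, so bias is zero).
    """
    if sigma < 0 or lam <= 0:
        raise ValueError("sigma must be non-negative and lam positive")
    half_width = lam * sigma
    return t - half_width, t + half_width, 1.0 - 1.0 / lam ** 2


# Hypothetical estimate and standard error, chosen only for illustration.
lower, upper, min_cc = chebyshev_interval(t=120.0, sigma=5.0, lam=3.0)
```

With λ = 3 the bound on the confidence coefficient is 8 ∕ 9, whatever the shape of the design-based distribution of the estimator.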
Unfortunately, however, as Basu (1971) has shown, unless a census, i.e., a complete enumeration of the survey population, is undertaken, no sampling design admits an unbiased estimator for a finite population total whose variance is minimum uniformly for every \(\underline{Y } = (y_{1},\ldots,y_{i},\ldots,y_{N})\), each coordinate of \(\underline{Y }\) being allowed to take any real value. Following Godambe (1955), let us restrict attention to estimators \(t = t(d)\) of the form
\[ t_{b} =\sum _{i\in s}y_{i}b_{\mathit{si}} \]
such that the \(b_{\mathit{si}}\) are numbers free of \(\underline{Y }\) subject to the restriction
\[ \sum _{s\ni i}p(s)b_{\mathit{si}} = 1\quad \forall \,i \in U,\qquad (2.1) \]
where \(\sum _{s\ni i}\) means summation over all samples \(s\) which contain the unit labeled \(i\). The condition given by (2.1) is necessary as well as sufficient to render \(t_{b}\) unbiased for \(Y\). Unfortunately again, Godambe (1955) has shown that in the class of all such estimators for \(Y\) of the form \(t_{b}\) as above, called the class of Homogeneous Linear Unbiased Estimators (HLUE), none exists with a Uniformly Minimum Variance (UMV), so long as it is based on a “general” class of designs \(p\). Godambe (1955) did not specify what he meant by his “general” class. But Hege (1965) and Hanurav (1966) independently showed that a class of designs, called the Uniclustered Class of Designs (UCD), exists for which the above negative result does not hold. A design \(p\) belonging to a UCD is a design such that for any two samples \(s_{1}\) and \(s_{2}\) with \(p(s_{1}) > 0\) and \(p(s_{2}) > 0\) either
1. \(s_{1} \cap s_{2}\) is empty, i.e., \(s_{2}\) does not intersect \(s_{1}\)
or
2. \(s_{1} \sim s_{2}\), i.e., every unit of \(s_{1}\) is in \(s_{2}\) and vice versa.
Thus, Godambe’s (1955) celebrated nonexistence result above is valid only for non-unicluster designs (NUCD).
Hege (1965), Hanurav (1966), and quite elegantly Lanke (1975) have shown that a UCD admits a UMV estimator for \(Y\) in the HLUE class. Importantly, this UMV estimator in the class HLUE for \(Y\) is of the form
\[ t_{\mathit{HT}} =\sum _{i\in s}\frac{y_{i}} {\pi _{i}}, \]
where \(\pi _{i} =\sum _{s\ni i}p(s)\). The estimator \(t_{\mathit{HT}}\) is due to Horvitz and Thompson (1952). The quantity \(\pi _{i}\) is called the “inclusion-probability” of unit \(i\), that is, the probability that unit \(i\) of \(U\) is included in a sample chosen according to a design \(p\). There is no problem in taking \(\pi _{i}\) into the denominator of \(t_{\mathit{HT}}\), because a well-known fact in survey sampling theory is that a necessary and sufficient condition for the existence of an unbiased estimator for \(Y\) is that
\[ \pi _{i} > 0\quad \forall \,i \in U. \]
See Chaudhuri (2010) for details.
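The unbiasedness of the Horvitz and Thompson (1952) estimator can be checked numerically by enumerating a small design. The following Python sketch is an illustration only: the population values, the sample space (all pairs from four units), and the unequal design probabilities are hypothetical choices, not taken from the text.

```python
from itertools import combinations


def inclusion_probabilities(design, N):
    """pi_i = sum of p(s) over all samples s containing unit i (units 0..N-1)."""
    pi = [0.0] * N
    for s, p in design.items():
        for i in s:
            pi[i] += p
    return pi


def horvitz_thompson(sample, y, pi):
    """t_HT = sum over sampled units of y_i / pi_i."""
    return sum(y[i] / pi[i] for i in sample)


# Hypothetical population and an unequal-probability design on all pairs.
y = [3.0, 7.0, 1.0, 9.0]
N = len(y)
samples = list(combinations(range(N), 2))
weights = [1, 2, 3, 4, 5, 6]  # arbitrary positive weights, one per sample
design = {s: w / sum(weights) for s, w in zip(samples, weights)}

pi = inclusion_probabilities(design, N)
# Design expectation of t_HT: sum over samples of p(s) * t_HT(s).
expected_t = sum(p * horvitz_thompson(s, y, pi) for s, p in design.items())
```

Because every unit has a positive inclusion probability in this design, `expected_t` agrees with the population total `sum(y)` up to floating-point error.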
Most of the estimators for \(Y\) used in practice are in the HLUE class. If we add to \(t_{b}\) a term \(b_{s}\) such that \(b_{s}\) is free of \(\underline{Y }\) and \(E_{p}\left (b_{s}\right ) = 0\), then we get
\[ t_{L} = b_{s} +\sum _{i\in s}y_{i}b_{\mathit{si}}, \]
a nonhomogeneous Linear Unbiased Estimator for \(Y\), belonging to the LUE class. No estimator for \(Y\) outside this LUE class is ever put into practice, except that the unbiasedness conditions
\[ E_{p}\left (t_{L}\right ) = Y,\qquad E_{p}\left (t_{b}\right ) = Y \]
are often relaxed. The estimators \(t_{L}\), \(t_{b}\) for \(Y\) are admitted provided the quantities \(B_{p}\left (t_{L}\right )\) and \(B_{p}\left (t_{b}\right )\) are numerically small enough, at least for large \(n\). The HLUEs for \(Y\), and in particular the Horvitz and Thompson (1952) estimator, are the most popular. Together with them, Hajek’s (1971) ratio estimator
\[ t_{H} = X\,\frac{\sum _{i\in s}y_{i}/\pi _{i}} {\sum _{i\in s}x_{i}/\pi _{i}}, \]
where the \(x_{i}\) are known (positive) values of a variable \(x\), highly and positively correlated with \(y\), and where \(X =\sum _{ i=1}^{N}x_{i}\), mostly exhausts the popular estimators for \(Y\). Although its denominator is a random variable, because \(s\) is random, this ratio estimator is still in Godambe’s class of linear homogeneous estimators for \(Y\): \(t_{H}\) is linear in the \(y_{i}\)’s and contains no term free of \(\underline{Y }\), so it cannot be nonhomogeneous.
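A minimal Python sketch of the ratio estimator may help. It reads the estimator in the form X·(Σ y_i ∕ π_i) ∕ (Σ x_i ∕ π_i) over the sampled units; the function name `ratio_estimator` and all data are hypothetical. The example takes y exactly proportional to x, the case in which the estimator reproduces the total Y from any sample.

```python
def ratio_estimator(sample, y, x, pi):
    """X * (sum of y_i/pi_i) / (sum of x_i/pi_i), sums over the sampled units."""
    X = sum(x)  # known population total of the auxiliary variable
    numerator = sum(y[i] / pi[i] for i in sample)
    denominator = sum(x[i] / pi[i] for i in sample)
    return X * numerator / denominator


# Hypothetical data: y is exactly twice x, inclusion probabilities all 1/2.
x = [2.0, 5.0, 1.0, 4.0]
y = [2.0 * xi for xi in x]
pi = [0.5, 0.5, 0.5, 0.5]

estimate = ratio_estimator((0, 3), y, x, pi)  # sample consists of units 0 and 3
```

Here `estimate` equals `sum(y)` even though only two units are observed, which illustrates why a strong positive correlation between x and y makes the estimator attractive.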
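We consider it important to observe:
\[ V _{p}\left (t_{b}\right ) =\sum _{i=1}^{N}y_{i}^{2}C_{i} +\sum _{i\neq j}y_{i}y_{j}C_{\mathit{ij}}, \]
where
\[ C_{i} =\sum _{s\ni i}p(s)b_{\mathit{si}}^{2} - 1 \]
and
\[ C_{\mathit{ij}} =\sum _{s\ni i,j}p(s)b_{\mathit{si}}b_{\mathit{sj}} - 1. \]
The quantity
\[ v_{p}\left (t_{b}\right ) =\sum _{i\in s}y_{i}^{2}C_{\mathit{si}} +\sum _{i\neq j\in s}y_{i}y_{j}C_{\mathit{sij}} \]
is an unbiased estimator for \(V _{p}\left (t_{b}\right )\) on choosing the \(C_{\mathit{si}}\)’s and \(C_{\mathit{sij}}\)’s as free of \(\underline{Y }\) subject to the conditions
\[ \sum _{s\ni i}p(s)C_{\mathit{si}} = C_{i} \]
and
\[ \sum _{s\ni i,j}p(s)C_{\mathit{sij}} = C_{\mathit{ij}}. \]
More generally, on choosing
\[ \underline{w} = \left (w_{1},\ldots,w_{i},\ldots,w_{N}\right ),\quad w_{i}\neq 0, \]
as a vector of pre-assigned constants, it follows that
\[ V _{p}\left (t_{b}\right ) = -\sum _{i<j}d_{\mathit{ij}}w_{i}w_{j}\left (\frac{y_{i}} {w_{i}} -\frac{y_{j}} {w_{j}}\right )^{2} +\sum _{i=1}^{N}\frac{y_{i}^{2}} {w_{i}} \beta _{i}, \]
where
\[ d_{\mathit{ij}} = E_{p}\left [\left (b_{\mathit{si}}I_{\mathit{si}} - 1\right )\left (b_{\mathit{sj}}I_{\mathit{sj}} - 1\right )\right ],\quad I_{\mathit{si}} = 1\ \text{if}\ i \in s\ \text{and}\ 0\ \text{otherwise}, \]
and
\[ \beta _{i} =\sum _{j=1}^{N}d_{\mathit{ij}}w_{j}. \]
Writing
\[ \pi _{\mathit{ij}} =\sum _{s\ni i,j}p(s) \]
as the probability that \(i\) and \(j\) (\(i\neq j\)) are both included in a sample chosen according to a design \(p\), and in addition supposing \(\pi _{\mathit{ij}} > 0\ \forall \,i\neq j\), and of course \(\pi _{i} > 0\ \forall \,i \in U\) to get \(t_{b}\) unbiased for \(Y\), one may get an unbiased estimator for \(V _{p}\left (t_{b}\right )\) as
\[ v_{p}\left (t_{b}\right ) = -\sum _{i<j\in s}d_{\mathit{sij}}w_{i}w_{j}\left (\frac{y_{i}} {w_{i}} -\frac{y_{j}} {w_{j}}\right )^{2} +\sum _{i\in s}\frac{y_{i}^{2}} {w_{i}} \frac{\beta _{i}} {\pi _{i}} \]
on taking the \(d_{\mathit{sij}}\)’s as free of \(\underline{Y }\) subject to
\[ \sum _{s\ni i,j}p(s)d_{\mathit{sij}} = d_{\mathit{ij}}, \]
e.g., as \(d_{\mathit{sij}} = \left (d_{\mathit{ij}}/\pi _{\mathit{ij}}\right )\) \(\forall \,i\neq j\) in \(s\) and in \(U\). In particular,
\[ V _{p}\left (t_{\mathit{HT}}\right ) =\sum _{i=1}^{N}y_{i}^{2}\,\frac{1 -\pi _{i}} {\pi _{i}} +\sum _{i\neq j}y_{i}y_{j}\,\frac{\pi _{\mathit{ij}} -\pi _{i}\pi _{j}} {\pi _{i}\pi _{j}} \]
and
\[ v_{p}\left (t_{\mathit{HT}}\right ) =\sum _{i\in s}y_{i}^{2}\,\frac{1 -\pi _{i}} {\pi _{i}^{2}} +\sum _{i\neq j\in s}y_{i}y_{j}\,\frac{\pi _{\mathit{ij}} -\pi _{i}\pi _{j}} {\pi _{i}\pi _{j}\pi _{\mathit{ij}}}. \]
Also, alternatively,
\[ V _{p}\left (t_{\mathit{HT}}\right ) =\sum _{i<j}\left (\pi _{i}\pi _{j} -\pi _{\mathit{ij}}\right )\left (\frac{y_{i}} {\pi _{i}} -\frac{y_{j}} {\pi _{j}}\right )^{2} +\sum _{i=1}^{N}\frac{y_{i}^{2}} {\pi _{i}} \beta _{i}, \]
taking \(w_{i} =\pi _{i}\) and writing
\[ \beta _{i} = 1 + \frac{1} {\pi _{i}}\sum _{j\neq i}\pi _{\mathit{ij}} -\sum _{j=1}^{N}\pi _{j}, \]
and an unbiased estimator for \(V _{p}\left (t_{\mathit{HT}}\right )\) is
\[ v_{p}\left (t_{\mathit{HT}}\right ) =\sum _{i<j\in s}\frac{\pi _{i}\pi _{j} -\pi _{\mathit{ij}}} {\pi _{\mathit{ij}}} \left (\frac{y_{i}} {\pi _{i}} -\frac{y_{j}} {\pi _{j}}\right )^{2} +\sum _{i\in s}\frac{y_{i}^{2}} {\pi _{i}^{2}} \beta _{i}. \]
It should be noted that if every sample \(s\) with \(p(s) > 0\) has a constant number of distinct units in it, then \(\beta _{i} = 0\). Consequently \(V _{p}\left (t_{\mathit{HT}}\right )\) and \(v_{p}\left (t_{\mathit{HT}}\right )\) take the familiar forms due to Yates and Grundy (1953), namely,
\[ V _{p}\left (t_{\mathit{HT}}\right ) =\sum _{i<j}\left (\pi _{i}\pi _{j} -\pi _{\mathit{ij}}\right )\left (\frac{y_{i}} {\pi _{i}} -\frac{y_{j}} {\pi _{j}}\right )^{2} \]
and
\[ v_{p}\left (t_{\mathit{HT}}\right ) =\sum _{i<j\in s}\frac{\pi _{i}\pi _{j} -\pi _{\mathit{ij}}} {\pi _{\mathit{ij}}} \left (\frac{y_{i}} {\pi _{i}} -\frac{y_{j}} {\pi _{j}}\right )^{2}, \]
respectively.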
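The Yates and Grundy (1953) variance estimator for a fixed-size design can be checked by brute-force enumeration. The Python sketch below is an illustration under assumptions: it uses simple random sampling without replacement of n = 2 units from a hypothetical population of N = 4, for which π_i = n ∕ N and π_ij = n(n − 1) ∕ (N(N − 1)), and it takes the estimator in its usual pairwise form Σ_{i<j∈s} ((π_i π_j − π_ij) ∕ π_ij)(y_i ∕ π_i − y_j ∕ π_j)².

```python
from itertools import combinations

# Hypothetical population; SRSWOR of n = 2 units out of N = 4.
y = [3.0, 7.0, 1.0, 9.0]
N, n = len(y), 2
samples = list(combinations(range(N), n))
p = 1.0 / len(samples)               # every sample equally likely
pi = n / N                           # first-order inclusion probability
pi_ij = n * (n - 1) / (N * (N - 1))  # second-order inclusion probability


def t_ht(s):
    """Horvitz-Thompson estimate of the total from sample s."""
    return sum(y[i] / pi for i in s)


def v_yg(s):
    """Yates-Grundy variance estimate from sample s (fixed-size design)."""
    return sum((pi * pi - pi_ij) / pi_ij * (y[i] / pi - y[j] / pi) ** 2
               for i, j in combinations(s, 2))


Y = sum(y)
true_variance = sum(p * (t_ht(s) - Y) ** 2 for s in samples)
expected_v = sum(p * v_yg(s) for s in samples)
```

Averaged over all six possible samples, `expected_v` coincides with `true_variance`, which is the design-unbiasedness property in this small case.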
Remark 2.1.
As in Chaudhuri’s (2011) textbook, the spirit in which this monograph has been written asks the reader to be comfortable with randomized response data (and data produced by the other indirect questioning techniques illustrated in this monograph) procured from sampled persons in specifically prescribed ways, no matter how those persons are selected. Inferences may then be drawn and suitably assessed however the samples are chosen: with equal or unequal selection probabilities, with or without replacement. In the vast majority of publications on data gathered by indirect questioning techniques, samples are chosen by Simple Random Sampling with Replacement (SRSWR). In such a case, the simple arithmetic mean of suitably transformed individual observations is employed to estimate \(\bar{Y }\) unbiasedly, and variances and variance estimators are derived along standard SRSWR lines. Only in very few cases, when \(y\) is just a binary variable taking the values 0 and 1 to represent, respectively, an innocuous and a stigmatizing feature borne by a person, are samples chosen other than by SRSWR.
Chaudhuri’s (2011) textbook is claimed to offer a quick appreciation of our approach set forth here.
2.2 Estimating Parameters
We shall consider only two possibilities:
(I) Every \(y_{i}\) is either 1, implying that the person labeled \(i\) bears a stigmatizing characteristic or attribute, say, \(A\), or 0, implying that the \(i\)-th person’s attribute is \(A^{c}\), the complement of \(A\), meaning that the person does not have the stigmatizing characteristic; our problem is to estimate \(\bar{Y }\), to be denoted by \(\theta\), the proportion in a community \(U\) bearing \(A\).
(II) Every \(y_{i}\) is a real number and our aim is to estimate the population total \(Y\).
From our treatment in Sect. 2.1 an outline of a possible procedure readily follows. As a measure of accuracy in estimation we recommend evaluating the coefficient of variation (CV) as
\[ \mathrm{CV} = 100\,\frac{\sqrt{v_{p } (t)}} {t} \]
for a choice of \(t\) as illustrated.
Since \(t_{H}\) is not unbiased for \(Y\), we need to consider its Mean Square Error (MSE)
\[ M_{p}\left (t_{H}\right ) = E_{p}\left (t_{H} - Y \right )^{2}, \]
approximately taken as, on assuming \(n\) quite large,
\[ M\left (t_{H}\right ) = V _{p}\left (\sum _{i\in s}\frac{y_{i} - Rx_{i}} {\pi _{i}} \right ), \]
where \(R = (Y/X)\), the right-hand side being the formula for \(V _{p}\left (t_{\mathit{HT}}\right ) = V _{p}\left (\sum _{i\in s}\left (y_{i}/\pi _{i}\right )\right )\) evaluated with \(y_{i}\) replaced by \(d_{i} = y_{i} - Rx_{i}\), \(i \in U\). By \(t(y)\), \(t(x)\) we mean an estimator for \(Y\) using \(y_{i}\), \(i \in s\), and the same one with \(x_{i}\), \(i \in s\), in place of \(y_{i}\). Then a plausible estimator for \(M\left (t_{H}\right )\) is taken as
\[ m\left (t_{H}\right ) = v_{p}\left (\sum _{i\in s}\frac{y_{i} -\hat{ R}x_{i}} {\pi _{i}} \right ), \]
writing \(\hat{R} = t_{\mathit{HT}}(y)/t_{\mathit{HT}}(x)\) and, in the formula for \(v_{p}\left (t_{\mathit{HT}}\right )\), taking \(y_{i} -\hat{ R}x_{i}\) throughout for \(i \in s\) in lieu of \(y_{i}\), \(i \in s\). Chaudhuri (2010) may be consulted for clarification. Needless to mention, the coefficient of variation for \(t_{H}\) will be taken as
\[ \mathrm{CV} = 100\,\frac{\sqrt{m\left (t_{H } \right )}} {t_{H}} \]
as an approximation, assuming \(n\) large enough.
All this is applicable only if \(y_{i}\) is available for every \(i\) in \(s\). But in the situations of interest to us, it is not. An investigator, out of sheer delicacy, hesitates to ask a respondent to disclose his/her value on a sensitive variable. Even if the investigator dares to ask, the respondent is likely to refuse or to hide the truth by giving an untruthful or misleading response.
In the subsequent chapters we shall narrate diverse procedures to tackle this situation. The gist, at this stage, is that by a suitable “random” mechanism we may gather observations \(r_{i}\), “independently” across \(i\) in \(s\), such that
\[ E_{R}\left (r_{i}\right ) = y_{i}\quad \forall \,i \in U, \]
where \(E_{R}\) denotes a generic expectation operator with respect to the random mechanism employed.
In situation (I), when \(y_{i}\) takes on only the value zero or one, so that \(y_{i}^{2} = y_{i}\),
\[ V _{R}\left (r_{i}\right ) = E_{R}\left (r_{i}^{2}\right ) - y_{i}^{2} = E_{R}\left (r_{i}^{2}\right ) - y_{i} = E_{R}\left [r_{i}\left (r_{i} - 1\right )\right ] = V _{i}, \]
say, writing \(V _{R}\) as the generic variance operator with respect to the random mechanism mentioned above. Then \(v_{i} = r_{i}\left (r_{i} - 1\right )\) provides an unbiased estimator for \(V _{i}\).
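A concrete check of the two properties E_R(r_i) = y_i and E_R[r_i(r_i − 1)] = V_R(r_i) can be sketched in Python. The random mechanism used here is an assumption introduced purely for illustration, since the chapter defers the actual devices to later chapters: a Warner-type device in which the respondent reports the true y_i with probability p and its complement 1 − y_i with probability 1 − p (p ≠ 1/2), with r_i = (response − (1 − p)) ∕ (2p − 1). The function name `warner_moments` and the value p = 0.7 are hypothetical.

```python
def warner_moments(y_i, p):
    """Exact moments of r_i under an assumed Warner-type device.

    The response is 1 with probability p*y_i + (1-p)*(1-y_i), else 0, and
    r_i = (response - (1 - p)) / (2p - 1). Returns E_R(r_i), V_R(r_i) and
    E_R[r_i (r_i - 1)], computed by enumerating the two possible responses.
    """
    if p == 0.5:
        raise ValueError("p must differ from 1/2")
    prob_one = p * y_i + (1 - p) * (1 - y_i)
    outcomes = [(1, prob_one), (0, 1 - prob_one)]

    def r(response):
        return (response - (1 - p)) / (2 * p - 1)

    mean = sum(q * r(resp) for resp, q in outcomes)
    variance = sum(q * (r(resp) - mean) ** 2 for resp, q in outcomes)
    mean_v = sum(q * r(resp) * (r(resp) - 1) for resp, q in outcomes)
    return mean, variance, mean_v


results = {y_i: warner_moments(y_i, p=0.7) for y_i in (0, 1)}
```

For both y_i = 0 and y_i = 1 the first returned value equals y_i and the last two agree, both being p(1 − p) ∕ (2p − 1)² for this assumed device. An individual v_i may well be negative; only its expectation equals the variance.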
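In situation (II) we shall be able to show, for our illustrated applications of the “random” mechanisms, that for real \(y_{i}\) an unbiased estimator \(r_{i}\) will be derivable independently across \(i\) in \(s\), so that \(E_{R}\left (r_{i}\right ) = y_{i}\), \(\forall \,i \in U\); moreover, generically, \(V _{R}\left (r_{i}\right )\) will be shown to be of the form
\[ V _{R}\left (r_{i}\right ) =\alpha _{i}y_{i}^{2} +\beta _{i}y_{i} + \Psi _{i} \]
with \(\alpha _{i},\beta _{i},\Psi _{i}\) as known constants and consequently, provided \(1 +\alpha _{i}\neq 0\),
\[ v_{i} = \frac{\alpha _{i}r_{i}^{2} +\beta _{i}r_{i} + \Psi _{i}} {1 +\alpha _{i}} \]
is an unbiased estimator for \(V _{R}\left (r_{i}\right )\). Then, it will follow that given \(t_{b} = t_{b}(d)\) with \(E_{p}\left (t_{b}\right ) = Y\), so that \(\sum _{s\ni i}p(s)b_{\mathit{si}} = 1\),
\[ V _{p}\left (t_{b}\right ) =\sum _{i=1}^{N}y_{i}^{2}C_{i} +\sum _{i\neq j}y_{i}y_{j}C_{\mathit{ij}} \]
and
\[ v_{p}\left (t_{b}\right ) =\sum _{i\in s}y_{i}^{2}C_{\mathit{si}} +\sum _{i\neq j\in s}y_{i}y_{j}C_{\mathit{sij}}, \]
one might employ, writing \(\underline{R} = (r_{1},\ldots,r_{i},\ldots,r_{N})\),
\[ e_{b} = \left.t_{b}\right \vert _{\underline{Y }=\underline{R}} =\sum _{i\in s}r_{i}b_{\mathit{si}}. \]
Then
\[ v_{1} = \left.v_{p}\left (t_{b}\right )\right \vert _{\underline{Y }=\underline{R}} +\sum _{i\in s}b_{\mathit{si}}v_{i} \]
and
\[ v_{2} = \left.v_{p}\left (t_{b}\right )\right \vert _{\underline{Y }=\underline{R}} +\sum _{i\in s}\left (b_{\mathit{si}}^{2} - C_{\mathit{si}}\right )v_{i} \]
will both satisfy
\[ E\left (v_{k}\right ) = V \left (e_{b}\right ),\quad k = 1,2, \]
where
\[ E = E_{p}E_{R} = E_{R}E_{p} \]
and
\[ V \left (e_{b}\right ) = E\left (e_{b} - Y \right )^{2} = E_{p}V _{R}\left (e_{b}\right ) + V _{p}E_{R}\left (e_{b}\right ) = E_{R}V _{p}\left (e_{b}\right ) + V _{R}E_{p}\left (e_{b}\right ). \]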
Thus, it is clear how in a general situation one may handle indirectly procured observations to estimate θ and Y and also derive estimated measures of accuracy in estimation, in terms of coefficients of variation (CV). As a rule of thumb, we could consider an estimate as excellent if for an estimated CV, it is true that CV ≤ 10%, as satisfactory if 10% < CV ≤ 20%, as acceptable if 20% < CV ≤ 30% and as unacceptable if CV > 30%.
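The rule of thumb translates directly into a small helper. The Python sketch below is illustrative only: the function names are hypothetical, the CV is computed as 100 times the square root of the variance estimate divided by the absolute value of the estimate, and a negative variance estimate (which unbiased variance estimators can occasionally produce) is rejected rather than silently handled.

```python
import math


def coefficient_of_variation(estimate, variance_estimate):
    """Estimated CV in percent: 100 * sqrt(variance estimate) / |estimate|."""
    if variance_estimate < 0:
        raise ValueError("negative variance estimate; CV is undefined")
    if estimate == 0:
        raise ValueError("estimate is zero; CV is undefined")
    return 100.0 * math.sqrt(variance_estimate) / abs(estimate)


def classify_cv(cv_percent):
    """Apply the 10% / 20% / 30% rule of thumb to an estimated CV."""
    if cv_percent <= 10:
        return "excellent"
    if cv_percent <= 20:
        return "satisfactory"
    if cv_percent <= 30:
        return "acceptable"
    return "unacceptable"


# Hypothetical estimate of a total and its estimated variance.
cv = coefficient_of_variation(estimate=200.0, variance_estimate=100.0)
label = classify_cv(cv)
```

For the hypothetical figures above the estimated CV is 5%, which the rule rates as excellent.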
2.3 How to Sample?
Any probability sampling scheme may be employed, depending upon the available resources. Chaudhuri’s (2010) text may be used as a companion for the sampling schemes themselves, and Chaudhuri’s (2011) monograph for guidelines on how to use them in estimation.
2.4 How to Gather Sensitive Data?
In a given situation, an investigator may consider an issue sensitive enough for an indirect questioning technique to seem necessary. The requisite procedures are narrated in subsequent chapters. If the investigator is in doubt whether to go for direct or indirect questioning, appropriate procedures for that case are also given in later chapters. The corresponding estimation procedures are set forth in detail in the appropriate places.
References
Basu, D. (1971). An essay on the logical foundations of survey sampling, Part 1. In V.P. Godambe, D.A. Sprott (Eds.), Foundations of statistical inference (pp. 203–242). Toronto: Holt, Rinehart & Winston.
Chaudhuri, A. (2010). Essentials of survey sampling. New Delhi: Prentice Hall of India.
Chaudhuri, A. (2011). Randomized response and indirect questioning techniques in surveys. Boca Raton: Chapman & Hall, CRC Press, Taylor & Francis Group.
Godambe, V.P. (1955). A unified theory of sampling from finite populations. Journal of the Royal Statistical Society: Series B, 17, 269–278.
Hajek, J. (1971). Comment on a paper by Basu. In V.P. Godambe, D.A. Sprott (Eds.), Foundations of statistical inference (pp. 203–242). Toronto: Holt, Rinehart & Winston.
Hanurav, T.V. (1966). Some aspects of unified sampling theory. Sankhya, Series A, 28, 175–204.
Hege, V.S. (1965). Sampling designs which admit uniformly minimum unbiased estimators. Calcutta Statistical Association Bulletin, 14, 160–162.
Horvitz, D.G., & Thompson, D.J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47, 663–685.
Lanke, J. (1975). Some contributions to the theory of survey sampling. Ph. D. Thesis, University of Lund, Lund, Sweden.
Neyman, J. (1934). On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society: Series B, 97, 199–214.
Yates, F., & Grundy, P.M. (1953). Selection without replacement from within strata with probability proportional to size. Journal of the Royal Statistical Society: Series B, 15, 253–261.
© 2013 Springer-Verlag Berlin Heidelberg
Chaudhuri, A., Christofides, T.C. (2013). Specification of Qualitative and Quantitative Parameters Demanding Estimation. In: Indirect Questioning in Sample Surveys. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36276-7_2
DOI: https://doi.org/10.1007/978-3-642-36276-7_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36275-0
Online ISBN: 978-3-642-36276-7
eBook Packages: Mathematics and Statistics (R0)