2.1 Estimation of population means

In survey sampling, basic inferential procedures are developed for the estimation of finite population means. For the study variable $y$, the population mean under stratification is given by

$\begin{aligned} (2.1) & μ_{y} = \frac{1}{N} \sum_{h = 1}^{H} \sum_{i_{h} = 1}^{N_{h}} y_{i_{h}} . \end{aligned}$

Estimating a disease prevalence is a special case of estimating a population proportion with a binary study variable $y$. The design-unbiased Horvitz-Thompson estimator of $μ_{y}$ is given by

$\begin{aligned} (2.2) & {\hat{μ}}_{y H T} = \frac{1}{N} \sum_{h = 1}^{H} \sum_{i_{h} \in S_{h}} w_{i_{h}} y_{i_{h}} . \end{aligned}$

The stratum design weight $w_{i_{h}} = π_{i_{h}}^{- 1}$, also called the inflation weight, is often interpreted as the number of units in the population represented by the unit $i_{h}$ in the sample (Lohr 2010; Wu and Thompson 2020). The Horvitz-Thompson estimator for the population total $T_{y} = \sum_{h = 1}^{H} \sum_{i_{h} = 1}^{N_{h}} y_{i_{h}}$ is given by ${\hat{T}}_{y H T} = \sum_{h = 1}^{H} \sum_{i_{h} \in S_{h}} w_{i_{h}} y_{i_{h}}$, which is also called the expansion estimator.

The population size $N$ is sometimes unknown to data users. An unbiased estimator of $N$ is given by $\hat{N} = \sum_{h = 1}^{H} \sum_{i_{h} \in S_{h}} w_{i_{h}}$. The resulting estimator of $μ_{y}$ is the so-called Hájek estimator, given by ${\hat{μ}}_{y H} = {\hat{T}}_{y H T} / \hat{N}$.
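The point estimators above can be sketched in a few lines of Python. This is only an illustration: the toy data, the stratum labels, the assumed known size `N_known`, and the function names are all hypothetical and do not come from any survey discussed here.

```python
# Hypothetical toy sample: for each stratum, a list of (y, w) pairs, where
# y is a binary study variable and w = 1 / pi is the design weight.
strata = {
    "h1": [(1, 10.0), (0, 10.0), (1, 20.0)],
    "h2": [(0, 5.0), (1, 5.0)],
}


def ht_total(strata):
    """Expansion (Horvitz-Thompson) estimator of the total T_y."""
    return sum(w * y for units in strata.values() for y, w in units)


def n_hat(strata):
    """Estimator of the population size N: the sum of the design weights."""
    return sum(w for units in strata.values() for _, w in units)


def ht_mean(strata, N):
    """Horvitz-Thompson estimator of the mean, as in (2.2); needs a known N."""
    return ht_total(strata) / N


def hajek_mean(strata):
    """Hajek estimator: estimated total divided by estimated size."""
    return ht_total(strata) / n_hat(strata)


N_known = 56  # assumed known population size, for illustration only
print(ht_total(strata))          # estimated total
print(n_hat(strata))             # estimated population size
print(ht_mean(strata, N_known))  # uses the known N
print(hajek_mean(strata))        # uses the estimated N instead
```

The two mean estimators differ only in the denominator, so they coincide whenever the weights happen to sum exactly to $N$.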

The theoretical design-based variance of the Horvitz-Thompson estimator given in (2.2) involves both the first-order and the second-order sample inclusion probabilities $π_{i_{h}}$ and $π_{i_{h} i_{h}^{'}}$. Under stratified sampling, the stratum samples $S_{h}$, $h = 1, \dots, H$, are independent. The general theoretical variance formula is given by

$\begin{aligned} \mathrm{Var} ({\hat{μ}}_{y H T}) & = N^{- 2} \sum_{h = 1}^{H} \mathrm{Var} \Big(\sum_{i_{h} \in S_{h}} \frac{y_{i_{h}}}{π_{i_{h}}}\Big) \\ & = N^{- 2} \sum_{h = 1}^{H} \Big[\sum_{i_{h} = 1}^{N_{h}} \sum_{i_{h}^{'} = 1}^{N_{h}} (π_{i_{h} i_{h}^{'}} - π_{i_{h}} π_{i_{h}^{'}}) \frac{y_{i_{h}}}{π_{i_{h}}} \frac{y_{i_{h}^{'}}}{π_{i_{h}^{'}}}\Big] , \end{aligned}$

where $π_{i_{h} i_{h}^{'}} = π_{i_{h}}$ when $i_{h}^{'} = i_{h}$. The conventional unbiased variance estimator for the Horvitz-Thompson estimator is given by

$\begin{aligned} (2.3) & \widehat{\mathrm{Var}} ({\hat{μ}}_{y H T}) = N^{- 2} \sum_{h = 1}^{H} \Big[\sum_{i_{h} \in S_{h}} \sum_{i_{h}^{'} \in S_{h}} \frac{π_{i_{h} i_{h}^{'}} - π_{i_{h}} π_{i_{h}^{'}}}{π_{i_{h} i_{h}^{'}}} \frac{y_{i_{h}}}{π_{i_{h}}} \frac{y_{i_{h}^{'}}}{π_{i_{h}^{'}}}\Big] . \end{aligned}$
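To make the role of the joint inclusion probabilities concrete, the following sketch evaluates the double-sum variance estimator in (2.3) for the one design where they are simple to write down: simple random sampling without replacement within each stratum, where $π_{i_{h}} = n_{h}/N_{h}$ and the joint probability of two distinct units is $n_{h}(n_{h}-1)/\{N_{h}(N_{h}-1)\}$. The data, stratum sizes, and function names are hypothetical. The comparison against the textbook stratified formula is a sanity check, not part of the original text.

```python
from statistics import variance


def srs_inclusion(n_h, N_h):
    """First- and second-order inclusion probabilities under SRSWOR."""
    pi = n_h / N_h
    pi_joint = n_h * (n_h - 1) / (N_h * (N_h - 1))
    return pi, pi_joint


def ht_variance_estimate(samples, N_sizes):
    """Double-sum estimator (2.3) for the mean under stratified SRSWOR.

    samples: one list of sampled y values per stratum (each n_h >= 2)
    N_sizes: the stratum population sizes N_h
    """
    N = sum(N_sizes)
    total = 0.0
    for ys, N_h in zip(samples, N_sizes):
        pi, pi_joint = srs_inclusion(len(ys), N_h)
        for i, y_i in enumerate(ys):
            for j, y_j in enumerate(ys):
                # On the diagonal the joint probability is just pi.
                p2 = pi if i == j else pi_joint
                total += (p2 - pi * pi) / p2 * (y_i / pi) * (y_j / pi)
    return total / N**2


def textbook_stratified_srs(samples, N_sizes):
    """Sum over strata of (N_h / N)^2 * (1 - f_h) * s_h^2 / n_h."""
    N = sum(N_sizes)
    return sum(
        (N_h / N) ** 2 * (1 - len(ys) / N_h) * variance(ys) / len(ys)
        for ys, N_h in zip(samples, N_sizes)
    )


# Hypothetical binary outcomes from two strata of sizes 40 and 30.
samples = [[1, 0, 1, 1], [0, 0, 1]]
N_sizes = [40, 30]
v_ht = ht_variance_estimate(samples, N_sizes)
v_closed = textbook_stratified_srs(samples, N_sizes)
print(v_ht, v_closed)  # the two values agree up to rounding error
```

For unequal probability designs no such closed form for the joint probabilities is generally available.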

In practice, complex survey datasets, such as the CLSA datasets used in this paper, usually do not provide the joint inclusion probabilities $π_{i_{h} i_{h}^{'}}$, which are required for computing the variance estimator given in (2.3). Most statistical software packages for survey data analysis use approximate variance estimators to bypass this difficulty.

When sampled units are drawn with replacement, with selection probabilities $z_{i_{h}}$, $i_{h} = 1, \dots, N_{h}$, for each selection, the Hansen-Hurwitz estimator (Hansen and Hurwitz 1943) of $μ_{y}$ has the same algebraic form as the Horvitz-Thompson estimator if we let $π_{i_{h}} = n_{h} z_{i_{h}}$. The unbiased variance estimator for the Hansen-Hurwitz estimator is given by

$\begin{aligned} (2.4) & {\hat{V}}_{0} = N^{- 2} \sum_{h = 1}^{H} \frac{1}{n_{h} (n_{h} - 1)} \sum_{i_{h} \in S_{h}} {\Big(n_{h} w_{i_{h}} y_{i_{h}} - \sum_{j_{h} \in S_{h}} w_{j_{h}} y_{j_{h}}\Big)}^{2} . \end{aligned}$

The variance estimator ${\hat{V}}_{0}$ does not involve second-order inclusion probabilities and provides a good approximation to the variance estimator given in (2.3) if the sampling fractions $f_{h} = n_{h} / N_{h}$ are small for the original without-replacement survey design. When the sampling fractions are not small, an ad hoc adjustment to (2.4) is to apply the finite population correction factor $1 - f_{h}$ within each stratum. The resulting variance estimator is given by

$\begin{aligned} (2.5) & \hat{V} = N^{- 2} \sum_{h = 1}^{H} \frac{1 - f_{h}}{n_{h} (n_{h} - 1)} \sum_{i_{h} \in S_{h}} {\Big(n_{h} w_{i_{h}} y_{i_{h}} - \sum_{j_{h} \in S_{h}} w_{j_{h}} y_{j_{h}}\Big)}^{2} . \end{aligned}$
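Because (2.4) and (2.5) need only the weights, the study variable, and the stratum membership, they are straightforward to compute. The sketch below is a minimal, unoptimized illustration; the function name `approx_variance`, the toy data, and the sampling fractions are hypothetical.

```python
def approx_variance(strata, N, fpc=None):
    """With-replacement style variance estimator for the estimated mean.

    strata: one list of (y, w) pairs per stratum, each with n_h >= 2
    N:      population size
    fpc:    optional list of sampling fractions f_h; if None this is
            V_0 in (2.4), otherwise V in (2.5)
    """
    total = 0.0
    for h, units in enumerate(strata):
        n_h = len(units)
        # Stratum-level expansion total: sum of w * y over the sample.
        t_h = sum(w * y for y, w in units)
        # Squared deviations of n_h * w * y from the stratum total.
        ss = sum((n_h * w * y - t_h) ** 2 for y, w in units)
        f_h = 0.0 if fpc is None else fpc[h]
        total += (1.0 - f_h) * ss / (n_h * (n_h - 1))
    return total / N**2


# Hypothetical equal-weight sample from two strata (N = 40 + 30 = 70).
strata = [
    [(1, 10.0), (0, 10.0), (1, 10.0), (1, 10.0)],
    [(0, 10.0), (0, 10.0), (1, 10.0)],
]
v0 = approx_variance(strata, N=70)                 # (2.4)
v = approx_variance(strata, N=70, fpc=[0.1, 0.1])  # (2.5)
print(v0, v)
```

With a common sampling fraction of 0.1 in both strata, the corrected estimate is 0.9 times the uncorrected one, which shows how little the correction matters when sampling fractions are small.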

The variance estimator $\hat{V}$ given in (2.5) is exactly design-unbiased for stratified simple random sampling. For general stratified unequal probability sampling, the performance of $\hat{V}$ varies depending on the original survey design. There exist other approximate variance formulas that do not involve second-order inclusion probabilities and perform better for certain designs; see, for instance, Haziza, Mecatti, and Rao (2008) for further details. The variance estimator $\hat{V}$ is the default option for most survey packages, including R, SAS, SPSS and Stata.
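The claim about stratified simple random sampling can be checked directly. Under that design $w_{i_{h}} = N_{h} / n_{h}$ for every sampled unit in stratum $h$. Writing ${\bar{y}}_{h}$ and $s_{h}^{2}$ for the sample mean and sample variance of $y$ in stratum $h$ (notation introduced here only for this illustration), the term inside the square in (2.5) becomes $N_{h} (y_{i_{h}} - {\bar{y}}_{h})$, so that

$\begin{aligned} \hat{V} & = N^{- 2} \sum_{h = 1}^{H} \frac{1 - f_{h}}{n_{h} (n_{h} - 1)} \sum_{i_{h} \in S_{h}} N_{h}^{2} {(y_{i_{h}} - {\bar{y}}_{h})}^{2} \\ & = \sum_{h = 1}^{H} {\Big(\frac{N_{h}}{N}\Big)}^{2} (1 - f_{h}) \frac{s_{h}^{2}}{n_{h}} , \end{aligned}$

which is the standard unbiased variance estimator of the stratified sample mean.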

Stratified multi-stage sampling can use approximate variance estimators similar to $\hat{V}$ if sampling fractions for the first-stage clusters are small within each stratum. For cases where $N$ is unknown and the Hájek estimator ${\hat{μ}}_{y H}$ is used, the variance estimator $\hat{V}$ given in (2.5) needs to be modified by replacing $N$ with $\hat{N}$ and replacing the study variable $y_{i_{h}}$ with the residual variable $e_{i_{h}} = y_{i_{h}} - {\hat{μ}}_{y H}$. Further details can be found in Wu and Thompson (2020).
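The residual-based modification for the Hájek estimator can be sketched as follows. This is a minimal illustration of the substitution described above ($N$ replaced by $\hat{N}$, $y_{i_{h}}$ replaced by $e_{i_{h}}$); the function name and the toy data are hypothetical, and the optional `fpc` argument assumes the sampling fractions $f_{h}$ are available.

```python
def hajek_with_variance(strata, fpc=None):
    """Hajek estimate of the mean and its residual-based variance estimate.

    strata: one list of (y, w) pairs per stratum, each with n_h >= 2
    fpc:    optional list of sampling fractions f_h (treated as 0 if None)
    """
    n_hat = sum(w for units in strata for _, w in units)
    t_hat = sum(w * y for units in strata for y, w in units)
    mu = t_hat / n_hat

    total = 0.0
    for h, units in enumerate(strata):
        n_h = len(units)
        # Weighted residuals w * e, with e = y - mu, take the place of w * y.
        we = [w * (y - mu) for y, w in units]
        e_tot = sum(we)
        ss = sum((n_h * x - e_tot) ** 2 for x in we)
        f_h = 0.0 if fpc is None else fpc[h]
        total += (1.0 - f_h) * ss / (n_h * (n_h - 1))
    # N is replaced by its estimate, the sum of the design weights.
    return mu, total / n_hat**2


# Hypothetical unequal-weight sample from two strata.
strata = [
    [(1, 10.0), (0, 10.0), (1, 20.0)],
    [(0, 5.0), (1, 5.0)],
]
mu_hajek, var_hajek = hajek_with_variance(strata)
print(mu_hajek, var_hajek)
```

The weighted residuals sum to zero over the whole sample but not within each stratum, which is why the stratum-level total `e_tot` is still subtracted.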

References

Hansen, Morris H., and William N. Hurwitz. 1943. “On the Theory of Sampling from Finite Populations.” The Annals of Mathematical Statistics 14 (4): 333–62. https://doi.org/10.1214/aoms/1177731356.

Haziza, D., F. Mecatti, and J. N. K. Rao. 2008. “Evaluation of Some Approximate Variance Estimators Under the Rao-Sampford Unequal Probability Sampling Design.” Metron 66 (1): 91–108. https://econpapers.repec.org/article/mtnancoec/080105.htm.

Lohr, Sharon L. 2010. Sampling: Design and Analysis. 2nd ed. Boston, MA: Brooks/Cole, Cengage Learning.

Wu, Changbao, and Mary E. Thompson. 2020. Sampling Theory and Practice. Cham, Switzerland: Springer. https://doi.org/10.1007/978-3-030-44246-0.