2.1 Estimation of population means
In survey sampling, basic inferential procedures are developed for the estimation of finite population means. For the study variable $y$, the population mean under stratification is given by
$$\mu_y = \frac{1}{N}\sum_{h=1}^{H}\sum_{i_h=1}^{N_h} y_{i_h}. \tag{2.1}$$
Estimation of a disease prevalence is a special case of estimating a population proportion with a binary study variable $y$. The design-unbiased Horvitz-Thompson estimator of $\mu_y$ is given by
$$\hat{\mu}_{y\mathrm{HT}} = \frac{1}{N}\sum_{h=1}^{H}\sum_{i_h\in S_h} w_{i_h}\, y_{i_h}. \tag{2.2}$$
The stratum design weight $w_{i_h}=\pi_{i_h}^{-1}$ is often interpreted as the number of units in the population represented by unit $i_h$ in the sample (Lohr 2010; Wu and Thompson 2020). The Horvitz-Thompson estimator of the population total $T_y=\sum_{h=1}^{H}\sum_{i_h=1}^{N_h} y_{i_h}$ is given by $\hat{T}_{y\mathrm{HT}}=\sum_{h=1}^{H}\sum_{i_h\in S_h} w_{i_h} y_{i_h}$, which is also called the expansion estimator. For the same reason, the design weight $w_{i_h}$ is also called the inflation weight. The population size $N$ is sometimes unknown to data users. An unbiased estimator of $N$ is given by $\hat{N}=\sum_{h=1}^{H}\sum_{i_h\in S_h} w_{i_h}$. The resulting estimator of $\mu_y$ is the so-called Hájek estimator, given by $\hat{\mu}_{y\mathrm{H}}=\hat{T}_{y\mathrm{HT}}/\hat{N}$.
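As a minimal sketch of the point estimators described above, the following Python functions compute the expansion estimator of the total, the Horvitz-Thompson mean, and the Hájek mean from design weights. The function names and the small dataset are hypothetical and serve only as an illustration; the population size `N = 80` is an assumed value chosen to differ from the sum of the weights.

```python
def ht_total(weights, y):
    """Expansion (Horvitz-Thompson) estimator of the population total."""
    return sum(w * v for w, v in zip(weights, y))


def ht_mean(weights, y, N):
    """Horvitz-Thompson estimator of the mean; requires a known N."""
    return ht_total(weights, y) / N


def hajek_mean(weights, y):
    """Hajek estimator: the estimated total divided by the estimated N."""
    n_hat = sum(weights)  # N-hat, the sum of the design weights
    return ht_total(weights, y) / n_hat


# Hypothetical sample of five units pooled over strata; y is binary,
# so the estimated mean is an estimated prevalence.
weights = [10, 10, 10, 20, 20]
y = [1, 0, 1, 0, 1]

total_hat = ht_total(weights, y)      # 10 + 10 + 20 = 40
mean_ht = ht_mean(weights, y, N=80)   # 40 / 80
mean_hajek = hajek_mean(weights, y)   # 40 / 70
```

The two mean estimators coincide only when the weights sum exactly to $N$; here they differ because the assumed $N$ is 80 while the weights sum to 70.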
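The theoretical design-based variance of the Horvitz-Thompson estimator given in (2.2) involves both the first-order and the second-order sample inclusion probabilities $\pi_{i_h}$ and $\pi_{i_h i'_h}$, with $\pi_{i_h i_h}=\pi_{i_h}$. Under stratified sampling, the stratum samples $S_h$, $h=1,\cdots,H$, are independent. The general theoretical variance formula is given by
$$\mathrm{Var}(\hat{\mu}_{y\mathrm{HT}}) = N^{-2}\sum_{h=1}^{H}\mathrm{Var}\Big(\sum_{i_h\in S_h}\frac{y_{i_h}}{\pi_{i_h}}\Big) = N^{-2}\sum_{h=1}^{H}\Big[\sum_{i_h=1}^{N_h}\sum_{i'_h=1}^{N_h}\big(\pi_{i_h i'_h}-\pi_{i_h}\pi_{i'_h}\big)\frac{y_{i_h}}{\pi_{i_h}}\frac{y_{i'_h}}{\pi_{i'_h}}\Big].$$
The conventional unbiased variance estimator for the Horvitz-Thompson estimator is given by
$$\widehat{\mathrm{Var}}(\hat{\mu}_{y\mathrm{HT}}) = N^{-2}\sum_{h=1}^{H}\Big[\sum_{i_h\in S_h}\sum_{i'_h\in S_h}\frac{\pi_{i_h i'_h}-\pi_{i_h}\pi_{i'_h}}{\pi_{i_h i'_h}}\frac{y_{i_h}}{\pi_{i_h}}\frac{y_{i'_h}}{\pi_{i'_h}}\Big]. \tag{2.3}$$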
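To make the role of the joint inclusion probabilities concrete, the sketch below evaluates the variance estimator for a single stratum drawn by simple random sampling without replacement, where the joint probabilities are known in closed form. The function name, the data, and the choice of a single stratum with $N=10$ and $n=3$ are assumptions made for illustration only.

```python
def ht_variance_estimate(y, pi, pi_joint, N):
    """Variance estimator of the HT mean for one stratum.

    pi[i] is the first-order inclusion probability of sampled unit i and
    pi_joint[i][j] the joint inclusion probability of sampled units i and j,
    with pi_joint[i][i] equal to pi[i].
    """
    n = len(y)
    total = 0.0
    for i in range(n):
        for j in range(n):
            coef = (pi_joint[i][j] - pi[i] * pi[j]) / pi_joint[i][j]
            total += coef * (y[i] / pi[i]) * (y[j] / pi[j])
    return total / N**2


# Hypothetical single stratum: simple random sample of n = 3 from N = 10.
N, n = 10, 3
y = [1, 0, 1]
pi = [n / N] * n
pi_ij = n * (n - 1) / (N * (N - 1))  # joint probability for distinct units
pi_joint = [[pi[i] if i == j else pi_ij for j in range(n)] for i in range(n)]

v_ht = ht_variance_estimate(y, pi, pi_joint, N)

# Textbook formula for simple random sampling: (1 - n/N) * s^2 / n,
# where s^2 is the sample variance of y.
y_bar = sum(y) / n
s2 = sum((v - y_bar) ** 2 for v in y) / (n - 1)
v_srs = (1 - n / N) * s2 / n
```

In this special case the two quantities `v_ht` and `v_srs` agree, which illustrates that the general estimator is only computable when the full matrix `pi_joint` is available.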
In practice, complex survey datasets, such as the CLSA datasets used in this paper, usually do not provide the joint inclusion probabilities $\pi_{i_h i'_h}$ that are required for computing the variance estimator given in (2.3). Most statistical software packages for survey data analysis use approximate variance estimators to bypass this difficulty.
When sampled units are drawn with replacement, with selection probabilities $z_{i_h}$, $i_h=1,\cdots,N_h$, for each selection, the Hansen-Hurwitz estimator (Hansen and Hurwitz 1943) of $\mu_y$ has the same algebraic form as the Horvitz-Thompson estimator if we let $\pi_{i_h}=n_h z_{i_h}$. The unbiased variance estimator for the Hansen-Hurwitz estimator is given by
$$\hat{V}_0 = N^{-2}\sum_{h=1}^{H}\frac{1}{n_h(n_h-1)}\sum_{i_h\in S_h}\Big(n_h w_{i_h} y_{i_h}-\sum_{j_h\in S_h} w_{j_h} y_{j_h}\Big)^2. \tag{2.4}$$
The variance estimator $\hat{V}_0$ does not involve second-order inclusion probabilities and provides a good approximation to the variance estimator given in (2.3) if the sampling fractions $f_h=n_h/N_h$ are small for the original without-replacement survey design. When the sampling fractions are not small, an ad hoc adjustment to (2.4) is to apply the finite population correction factor $1-f_h$ within each stratum. The resulting variance estimator is given by
$$\hat{V} = N^{-2}\sum_{h=1}^{H}\frac{1-f_h}{n_h(n_h-1)}\sum_{i_h\in S_h}\Big(n_h w_{i_h} y_{i_h}-\sum_{j_h\in S_h} w_{j_h} y_{j_h}\Big)^2. \tag{2.5}$$
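The two approximate variance estimators above need only the design weights, the stratum membership, and (for the corrected version) the stratum sizes. The following Python sketch computes both; the function name, the dictionary layout, and the two-stratum dataset are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def stratified_variance(strata, N, fpc=True):
    """With-replacement variance approximation for the estimated mean.

    strata: list of dicts with keys 'w' (design weights), 'y' (responses)
    and 'N_h' (stratum population size, used only when fpc is True).
    With fpc=False no finite population correction is applied.
    """
    total = 0.0
    for s in strata:
        w, y = s["w"], s["y"]
        n_h = len(y)
        t_h = sum(wi * yi for wi, yi in zip(w, y))  # estimated stratum total
        ss = sum((n_h * wi * yi - t_h) ** 2 for wi, yi in zip(w, y))
        f_h = n_h / s["N_h"] if fpc else 0.0
        total += (1 - f_h) * ss / (n_h * (n_h - 1))
    return total / N**2


# Hypothetical data: two strata with N_1 = 30, N_2 = 40, so N = 70.
strata = [
    {"w": [10, 10, 10], "y": [1, 0, 1], "N_h": 30},
    {"w": [20, 20], "y": [0, 1], "N_h": 40},
]

v0 = stratified_variance(strata, N=70, fpc=False)  # (100 + 400) / 70**2
v = stratified_variance(strata, N=70, fpc=True)    # (90 + 380) / 70**2
```

With sampling fractions of 0.10 and 0.05, the correction reduces the estimate from about 0.102 to about 0.096, which shows how little the two versions differ when the sampling fractions are small.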
The variance estimator $\hat{V}$ given in (2.5) is exactly design-unbiased for stratified simple random sampling. For general stratified unequal probability sampling, the performance of $\hat{V}$ varies depending on the original survey design. There exist other approximate variance formulas that do not involve second-order inclusion probabilities and perform better for certain designs. See, for instance, Haziza, Mecatti, and Rao (2008) for further details. The variance estimator $\hat{V}$ is the default option for most survey packages, including those in R, SAS, SPSS and Stata.
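The reduction of the estimator in (2.5) under stratified simple random sampling can be worked out in two steps. The symbols $\bar{y}_h$ (stratum sample mean), $s_h^2$ (stratum sample variance) and $W_h=N_h/N$ (stratum weight) are notation introduced here for illustration and do not appear in the original text. With $\pi_{i_h}=n_h/N_h$, every unit in stratum $h$ has weight $w_{i_h}=N_h/n_h$, so that:

```latex
\begin{align*}
n_h w_{i_h} y_{i_h} - \sum_{j_h \in S_h} w_{j_h} y_{j_h}
  &= N_h y_{i_h} - N_h \bar{y}_h
   = N_h \,(y_{i_h} - \bar{y}_h), \\
\hat{V}
  &= N^{-2} \sum_{h=1}^{H} \frac{1-f_h}{n_h(n_h-1)}\, N_h^2
     \sum_{i_h \in S_h} (y_{i_h} - \bar{y}_h)^2
   = \sum_{h=1}^{H} W_h^2 \,(1-f_h)\, \frac{s_h^2}{n_h}.
\end{align*}
```

The last expression is the standard variance estimator of the stratified sample mean, which is consistent with the exact unbiasedness stated above.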
Stratified multi-stage sampling can use approximate variance estimators similar to $\hat{V}$ if the sampling fractions for the first-stage clusters are small within each stratum. For cases where $N$ is unknown and the Hájek estimator $\hat{\mu}_{y\mathrm{H}}$ is used, the variance estimator $\hat{V}$ given in (2.5) needs to be modified, with $N$ replaced by $\hat{N}$ and the study variable $y_{i_h}$ replaced by the residual variable $e_{i_h}=y_{i_h}-\hat{\mu}_{y\mathrm{H}}$. Further details can be found in Wu and Thompson (2020).
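The modification for the Hájek estimator can be sketched as follows: estimate $N$ by the sum of the weights, form the residuals, and apply the same stratum-level formula to the residuals. The function name, data layout, and dataset below are hypothetical; the second stratum uses unequal weights so that the residual substitution has a visible effect.

```python
def hajek_mean_and_variance(strata):
    """Hajek estimate of the mean and its approximate variance estimate.

    strata: list of dicts with keys 'w' (design weights), 'y' (responses)
    and 'N_h' (stratum population size, used for the correction factor).
    """
    n_hat = sum(sum(s["w"]) for s in strata)  # estimated population size
    t_hat = sum(wi * yi for s in strata for wi, yi in zip(s["w"], s["y"]))
    mu_h = t_hat / n_hat

    total = 0.0
    for s in strata:
        w = s["w"]
        e = [yi - mu_h for yi in s["y"]]  # residuals replace y
        n_h = len(e)
        t_e = sum(wi * ei for wi, ei in zip(w, e))
        ss = sum((n_h * wi * ei - t_e) ** 2 for wi, ei in zip(w, e))
        f_h = n_h / s["N_h"]
        total += (1 - f_h) * ss / (n_h * (n_h - 1))
    return mu_h, total / n_hat**2


# Hypothetical data: the weights sum to 70, which serves as N-hat.
strata = [
    {"w": [10, 10, 10], "y": [1, 0, 1], "N_h": 30},
    {"w": [15, 25], "y": [0, 1], "N_h": 40},
]

mu_hajek, v_hajek = hajek_mean_and_variance(strata)
```

Using residuals accounts for the fact that the Hájek estimator is a ratio of two estimated totals, so variability in the estimated population size is reflected in the variance estimate.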