2 Preliminaries of survey sampling
The purpose of statistical analyses in survey sampling is to make inferences about the target finite population from the information available in a survey sample. To obtain the sample, we select a subset of units from the population through a known sampling scheme and measure variables of interest for all selected units. The following picture illustrates the relationship between the population and the sample and the connections through sampling and inference.
Most real-world survey samples are selected by a without-replacement sampling method with unequal inclusion probabilities. The general estimation theory is built on the Horvitz-Thompson estimator (Horvitz and Thompson 1952), which uses the first-order inclusion probabilities. Variance estimation typically requires second-order inclusion probabilities, which are often unavailable in practice for a given survey dataset. Two other major features of survey designs, namely stratification and clustering, require additional details for point and variance estimation. Variance formulas under with-replacement sampling methods, based on the Hansen-Hurwitz estimator (Hansen and Hurwitz 1943), have much simpler forms and are easy to implement. Even if the original survey sample is selected without replacement, it is common practice in survey data analysis to treat the units as if they were selected with replacement for the purpose of variance estimation. The resulting variance estimators are valid if the sampling fractions of the original survey sample are small, and they become more conservative otherwise; see Lohr (2010) and Wu and Thompson (2020) for further details.
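To make the contrast concrete, the following minimal sketch computes a Horvitz-Thompson total and a with-replacement variance estimate for a single unstratified sample. It assumes the usual working convention that the \(n\) sampled units are treated as \(n\) independent draws with draw probabilities \(p_i = \pi_i / n\). The function names `ht_total` and `wr_variance` and the numerical values are hypothetical and chosen only for illustration.

```python
def ht_total(y, pi):
    """Horvitz-Thompson estimate of the population total: sum of y_i / pi_i."""
    return sum(y_i / pi_i for y_i, pi_i in zip(y, pi))


def wr_variance(y, pi):
    """With-replacement (Hansen-Hurwitz type) variance estimate for the total.

    Assumption: the n sampled units are treated as n independent draws
    with draw probabilities p_i = pi_i / n, so that z_i = y_i / p_i.
    """
    n = len(y)
    if n < 2:
        raise ValueError("at least two sampled units are needed")
    z = [n * y_i / pi_i for y_i, pi_i in zip(y, pi)]
    t_hat = sum(z) / n  # coincides with the Horvitz-Thompson total
    return sum((z_i - t_hat) ** 2 for z_i in z) / (n * (n - 1))


# Illustrative values only (not taken from any real survey).
y = [10.0, 20.0, 30.0]
pi = [0.5, 0.25, 0.5]
total = ht_total(y, pi)
variance = wr_variance(y, pi)
```

Note that `wr_variance` needs only the first-order inclusion probabilities, which is why this approximation is convenient when second-order inclusion probabilities are not released with the data.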
In what follows, we present theoretical details for stratified single-stage sampling, which is the sampling design used by CLSA (Canadian Longitudinal Study on Aging 2020; Raina et al. 2019). Under a stratified sampling design, the population is (sometimes naturally) divided into non-overlapping subpopulations called “strata”. Within each stratum, individual units are selected based on a probability sampling design.
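As an illustration of the idea, the sketch below draws a stratified sample from a toy sampling frame. It assumes simple random sampling without replacement within each stratum, which is only one possible within-stratum design and is not claimed to be the CLSA procedure. The function `stratified_srs`, the stratum labels, and the sample sizes are all hypothetical.

```python
import random


def stratified_srs(frame, n_by_stratum, seed=None):
    """Select units independently within each stratum.

    frame: dict mapping a stratum label to the list of unit ids in it.
    n_by_stratum: dict mapping a stratum label to its sample size n_h.
    Returns a dict mapping each stratum to a list of
    (unit id, inclusion probability, design weight) tuples.
    Assumption: simple random sampling without replacement in each stratum.
    """
    rng = random.Random(seed)
    sample = {}
    for h, units in frame.items():
        n_h = n_by_stratum[h]
        N_h = len(units)
        chosen = rng.sample(units, n_h)  # without replacement
        pi_h = n_h / N_h                 # same for every unit in stratum h
        sample[h] = [(u, pi_h, 1.0 / pi_h) for u in chosen]
    return sample


# Toy frame with two non-overlapping strata (hypothetical labels).
frame = {"urban": list(range(100)), "rural": list(range(100, 140))}
sample = stratified_srs(frame, {"urban": 10, "rural": 8}, seed=1)
```

Under this assumed design, the design weights within a stratum sum to the stratum size, which is a convenient sanity check on any set of stratum weights.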
The following notation is used for stratified sampling.
Notation | Meaning |
---|---|
\(H\) | the number of strata |
\(\mathcal{S}_h\) | the set of sampled units from stratum \(h\), \(h=1,\cdots,H\) |
\(\mathcal{S}\) | the pooled sample, \(\mathcal{S}=\bigcup\limits_{h=1}^{H} \mathcal{S}_h\) |
\(N_h\) | the total number of units in stratum \(h\), \(h=1,\cdots,H\) |
\(N\) | the total number of units in the population, \(N= \sum_{h=1}^H N_h\) |
\(n_h\) | the sample size for stratum \(h\), \(h=1,\cdots,H\) |
\(n\) | the overall sample size, \(n = \sum_{h=1}^H n_h\) |
\(\pi_{i_h}\) | the first-order inclusion probability of unit \(i\) in stratum \(h\) |
\(\pi_{i_h i^{\prime}_h}\) | the joint (second-order) inclusion probability of units \(i\) and \(i^\prime\) in stratum \(h\) |
\(w_{i_h}\) | the stratum design weight, \(w_{i_h} = \pi_{i_h}^{-1}\) |
\(y_{i_h}\) | the value of the study variable \(y\) for unit \(i\) in stratum \(h\) |
\(\pmb{x}_{i_h}\) | the value of covariates \(\pmb{x}\) for unit \(i\) in stratum \(h\) |
The survey dataset can be represented by \(\{(y_{i_h},\pmb{x}_{i_h},w_{i_h}), i_h\in\mathcal{S}_h, h=1,\cdots,H\}\).
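One possible way to hold such a dataset in code is as a list of records carrying the stratum label together with \((y_{i_h},\pmb{x}_{i_h},w_{i_h})\). The sketch below groups the records by stratum and computes the stratum sample size \(n_h\), the sum of design weights (an estimate of \(N_h\)), and the weighted total of \(y\). The record layout, the function name `stratum_summaries`, and the values are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (stratum label h, y, covariates x, design weight w).
data = [
    ("A", 1.0, (0.2,), 10.0),
    ("A", 3.0, (0.5,), 10.0),
    ("B", 2.0, (0.1,), 5.0),
    ("B", 4.0, (0.9,), 5.0),
    ("B", 6.0, (0.4,), 5.0),
]


def stratum_summaries(records):
    """Return n_h, the sum of weights, and the weighted y-total per stratum."""
    n = defaultdict(int)
    size_hat = defaultdict(float)
    total_hat = defaultdict(float)
    for h, y, _x, w in records:
        n[h] += 1
        size_hat[h] += w        # sum of w estimates the stratum size N_h
        total_hat[h] += w * y   # sum of w * y estimates the stratum total
    return {
        h: {"n_h": n[h], "N_hat_h": size_hat[h], "T_hat_h": total_hat[h]}
        for h in n
    }


summaries = stratum_summaries(data)
# Population-level estimates are sums over the strata.
T_hat = sum(s["T_hat_h"] for s in summaries.values())
N_hat = sum(s["N_hat_h"] for s in summaries.values())
```

The covariates are carried along unused here; they are kept in the record so that the layout mirrors the full representation of the dataset.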