2.2 Regression analysis
Linear regression analysis and logistic regression analysis are commonly conducted by researchers in health sciences. Survey weighted regression analysis focuses on finite population regression coefficients and also provides valid results for the model parameters under the assumed regression model. For simplicity of notation, we assume that the covariates \(\pmb{x}\) contain \(1\) as the first component, so that the regression model has an intercept. The finite population regression coefficients \(\pmb{\beta}_N\) are the solution to the so-called census estimating equations, \[\begin{equation} U_N(\pmb{\beta}) = \sum_{h=1}^H \sum_{i_h = 1}^{N_h} \pmb{x}_{i_h} \big\{y_{i_h} - \mu(\pmb{x}_{i_h},\pmb{\beta})\big\} = {\bf 0} \,, \label{UN} \end{equation}\] where \(\mu(\pmb{x}_{i_h},\pmb{\beta}) = E(y_{i_h} \mid \pmb{x}_{i_h})\) is the mean function under the assumed regression model. For linear regression analysis, we have \(\mu(\pmb{x},\pmb{\beta}) = \pmb{x}'\pmb{\beta}\); for logistic regression analysis where \(y\) is a binary variable, we have \[ \mu(\pmb{x},\pmb{\beta}) = E(y\mid \pmb{x}) = P(y=1 \mid \pmb{x}) = \frac{\exp(\pmb{x}'\pmb{\beta})}{1+\exp(\pmb{x}'\pmb{\beta})} \,. \]
The survey weighted estimator of \(\pmb{\beta}_N\), denoted as \(\hat{\pmb{\beta}}_N\), is the solution to the survey weighted estimating equations, \[\begin{equation} U_n(\pmb{\beta}) = \sum_{h=1}^H \sum_{i_h \in \mathcal{S}_h} w_{i_h} \pmb{x}_{i_h} \big\{y_{i_h} - \mu(\pmb{x}_{i_h},\pmb{\beta})\big\} = {\bf 0} \,. \tag{2.6} \end{equation}\] Under the linear regression model, the estimator \(\hat{\pmb{\beta}}_N\) has a closed-form expression. Under the logistic regression model, the solution \(\hat{\pmb{\beta}}_N\) must be found by an iterative computational procedure. The variance estimator for \(\hat{\pmb{\beta}}_N\) is derived based on the theory of estimating equations and has the well-known sandwich form (Binder 1983), \[\begin{equation} \widehat{\mbox{Var}}(\hat{\pmb{\beta}}_N) = \Big\{ H_n(\hat{\pmb{\beta}}_N)\Big\}^{-1} \hat{V}\Big\{U_n(\hat{\pmb{\beta}}_N) \Big\}\Big\{ H_n'(\hat{\pmb{\beta}}_N)\Big\}^{-1} \;, \tag{2.7} \end{equation}\] where \(H_n(\pmb{\beta}) = \partial U_n(\pmb{\beta}) / \partial \pmb{\beta}\) and \(\hat{V}\{U_n(\hat{\pmb{\beta}}_N)\}\) is the estimated variance-covariance matrix of the Horvitz-Thompson estimator \(U_n(\pmb{\beta}) = \sum_{h=1}^H \sum_{i_h \in \mathcal{S}_h} w_{i_h} {\bf g}_{i_h}\), with \({\bf g}_{i_h} = \pmb{x}_{i_h}\{y_{i_h} - \mu(\pmb{x}_{i_h},\pmb{\beta})\}\) and \(\pmb{\beta}\) replaced by \(\hat{\pmb{\beta}}_N\) when the estimate is computed. The variance estimator \(\hat{V}\) given in (2.5) is again the default option in most survey software packages for regression analysis. With the vector form of \({\bf g}_{i_h}\), the estimator of the variance-covariance matrix is given by \[\begin{align} \hat{V}\Big\{U_n(\hat{\pmb{\beta}}_N)\Big\} = \sum^H_{h=1} \frac{1-f_h}{n_h(n_h-1)} \sum_{i_h \in\mathcal{S}_h} \Big(n_h w_{i_h}{\bf g}_{i_h} - \sum_{j_h \in\mathcal{S}_h} w_{j_h}{\bf g}_{j_h} \Big)\Big(n_h w_{i_h}{\bf g}_{i_h} - \sum_{j_h \in\mathcal{S}_h} w_{j_h}{\bf g}_{j_h} \Big)' \; . \tag{2.8} \end{align}\] Chapter 7 of Wu and Thompson (2020) contains detailed discussions on regression analysis using survey data.
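The computations in (2.6) to (2.8) can be sketched in a few lines of NumPy. The following is a minimal illustration rather than the implementation used by any survey software package. It assumes a single-stage stratified design with known sampling fractions \(f_h\), and the function names (`mean_and_slope`, `solve_weighted_ee`, `sandwich_variance`) and the simulated two-stratum data are hypothetical, introduced here only for illustration. Newton-Raphson is used as one possible iterative procedure for the logistic model; for the linear model, the first Newton step from zero already gives the closed-form solution.

```python
import numpy as np

def mean_and_slope(X, beta, family):
    """Return mu(x, beta) and its derivative with respect to eta = x'beta."""
    eta = X @ beta
    if family == "linear":
        return eta, np.ones_like(eta)
    mu = 1.0 / (1.0 + np.exp(-eta))            # logistic mean function
    return mu, mu * (1.0 - mu)

def solve_weighted_ee(X, y, w, family, max_iter=50, tol=1e-10):
    """Newton-Raphson solution of the survey weighted estimating equations (2.6)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu, slope = mean_and_slope(X, beta, family)
        U = X.T @ (w * (y - mu))               # U_n(beta)
        H = -(X.T * (w * slope)) @ X           # H_n(beta) = dU_n / dbeta
        step = np.linalg.solve(H, U)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def sandwich_variance(X, y, w, strata, f, beta, family):
    """Sandwich variance (2.7) with the stratified middle term (2.8)."""
    mu, slope = mean_and_slope(X, beta, family)
    H = -(X.T * (w * slope)) @ X
    z = X * (w * (y - mu))[:, None]            # row i holds w_i * g_i
    V = np.zeros((X.shape[1], X.shape[1]))
    for h in np.unique(strata):
        z_h = z[strata == h]
        n_h = z_h.shape[0]
        d = n_h * z_h - z_h.sum(axis=0)        # n_h w_i g_i minus the stratum total
        V += (1.0 - f[h]) / (n_h * (n_h - 1)) * (d.T @ d)
    H_inv = np.linalg.inv(H)
    return H_inv @ V @ H_inv.T

# Synthetic two-stratum sample (assumed values, for illustration only).
rng = np.random.default_rng(2024)
N_h = np.array([1000, 2000])                   # stratum population sizes
n_h = np.array([50, 60])                       # stratum sample sizes
strata = np.repeat([0, 1], n_h)                # stratum label for each sampled unit
w = (N_h / n_h)[strata]                        # design weights N_h / n_h
f = n_h / N_h                                  # sampling fractions f_h
x = rng.normal(size=strata.size) + 0.5 * strata
X = np.column_stack([np.ones_like(x), x])      # first column is the intercept
y_lin = 1.0 + 2.0 * x + rng.normal(size=x.size)
p = 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * x)))
y_bin = (rng.uniform(size=x.size) < p).astype(float)

beta_lin = solve_weighted_ee(X, y_lin, w, "linear")
var_lin = sandwich_variance(X, y_lin, w, strata, f, beta_lin, "linear")
beta_log = solve_weighted_ee(X, y_bin, w, "logistic")
var_log = sandwich_variance(X, y_bin, w, strata, f, beta_log, "logistic")

print("linear  :", beta_lin, np.sqrt(np.diag(var_lin)))
print("logistic:", beta_log, np.sqrt(np.diag(var_log)))
```

The printed vectors are the point estimates followed by their estimated standard errors. In practice, an established survey package would normally be preferred, since it also handles multi-stage designs and other details that this sketch ignores.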
In survey sampling, estimation of regression coefficients or other model parameters is often referred to as the analytic use of survey data. It is apparent from the estimating equation system given in (2.6) and the sandwich variance estimator specified by (2.7) and (2.8) that rescaling the design weights \(w_{i_h}\) by a positive constant changes neither the point estimator \(\hat{\pmb{\beta}}_N\) nor the variance estimator. Survey agencies sometimes provide so-called analytic weights as part of the survey datasets. These weights are rescaled from the original design weights so that their sum equals the sample size.
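The invariance to rescaling can be verified term by term. As a brief sketch, let \(c > 0\) be the rescaling constant and let a tilde mark quantities computed from the rescaled weights \(c\,w_{i_h}\) (notation introduced here for illustration only). The sampling fractions \(f_h\) are treated as design quantities that do not depend on the weights. Then \[ \tilde{U}_n(\pmb{\beta}) = c\, U_n(\pmb{\beta}) \,, \qquad \tilde{H}_n(\pmb{\beta}) = c\, H_n(\pmb{\beta}) \,, \qquad \tilde{V}\big\{\tilde{U}_n(\pmb{\beta})\big\} = c^2\, \hat{V}\big\{U_n(\pmb{\beta})\big\} \,, \] so that \(\tilde{U}_n(\pmb{\beta}) = {\bf 0}\) has the same solution \(\hat{\pmb{\beta}}_N\) as (2.6), and the sandwich form gives \[ \Big\{ \tilde{H}_n(\hat{\pmb{\beta}}_N)\Big\}^{-1} \tilde{V}\Big\{\tilde{U}_n(\hat{\pmb{\beta}}_N) \Big\}\Big\{ \tilde{H}_n'(\hat{\pmb{\beta}}_N)\Big\}^{-1} = c^{-1}\, c^{2}\, c^{-1}\, \Big\{ H_n(\hat{\pmb{\beta}}_N)\Big\}^{-1} \hat{V}\Big\{U_n(\hat{\pmb{\beta}}_N) \Big\}\Big\{ H_n'(\hat{\pmb{\beta}}_N)\Big\}^{-1} = \widehat{\mbox{Var}}(\hat{\pmb{\beta}}_N) \,. \] The factor \(c^2\) in the middle term arises because each of the two vectors in the outer product in (2.8) is multiplied by \(c\).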