7.8 What if we ignore domain analysis?

If we do not specify the subpopulation (domain) within the full survey design and instead treat the subset of the dataset as a separate survey, the resulting standard errors (SE) will be incorrect. We compare the two sets of standard errors in \(\texttt{R}\). Here, we give an extreme example to emphasize the effect of not specifying the subpopulation. Suppose we are interested in the \(\texttt{ENV_AFRDWLK_MCQ}\) variable and want to compare it between people with BMI below 19 and people with BMI of at least 19. There are 38 respondents with BMI < 19 and 762 respondents with BMI \(\geq\) 19. The code for this comparison is as follows.

R

# It is correct to specify the subpopulations
svytotal(~ENV_AFRDWLK_MCQ, design = subset(CLSA.design, BMI < 19))
svytotal(~ENV_AFRDWLK_MCQ, design = subset(CLSA.design, BMI >= 19))
# It is not appropriate to divide the dataset first and then
#    re-declare the survey design on each piece
CLSAData.low.BMI  <- CLSAData[which(CLSAData$BMI < 19), ]
CLSAData.high.BMI <- CLSAData[which(CLSAData$BMI >= 19), ]
CLSA.design.low.BMI <- svydesign(ids = ~entity_id, strata = ~StraVar,
  weights = ~WGHTS_INFLATION_TRM, data = CLSAData.low.BMI, nest = TRUE)

CLSA.design.high.BMI <- svydesign(ids = ~entity_id, strata = ~StraVar,
  weights = ~WGHTS_INFLATION_TRM, data = CLSAData.high.BMI, nest = TRUE)

svytotal(~ENV_AFRDWLK_MCQ, design = CLSA.design.low.BMI)
svytotal(~ENV_AFRDWLK_MCQ, design = CLSA.design.high.BMI)

Result comparison

| Population Estimates | Total (Specified) | SE (Specified) | Total (Not Specified) | SE (Not Specified) |
|---|---|---|---|---|
| BMI \(\geq\) 19 | | | | |
| Strongly Agree | 51736.0433 | 7428.5580 | 51736.0433 | 7418.5693 |
| Agree | 62622.6338 | 8405.6710 | 62622.6338 | 8389.5494 |
| Disagree | 147101.0999 | 13765.9214 | 147101.0999 | 13679.5034 |
| Strongly Disagree | 261353.6022 | 20620.7902 | 261353.6022 | 20453.7439 |
| BMI < 19 | | | | |
| Strongly Agree | 929.8891 | 708.2762 | 929.8891 | 659.1306 |
| Agree | 3196.3136 | 1507.3081 | 3196.3136 | 1441.5175 |
| Disagree | 2509.4450 | 1374.4214 | 2509.4450 | 1282.9361 |
| Strongly Disagree | 19742.0720 | 7747.7715 | 19742.0720 | 6942.1285 |

Readers can see that the estimated totals are identical under both approaches, while the standard errors differ. The difference is small for the large group (BMI \(\geq\) 19) but substantial for the small group (BMI < 19): for Strongly Disagree, the SE is 7747.7715 when the subpopulation is specified and 6942.1285 when it is not. The standard errors obtained by specifying the subpopulation are the appropriate ones, and in this example they are higher in every row. This is because domain analysis treats the sample sizes of the subpopulations as random, so there is more uncertainty in the population estimates.
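To see where the extra uncertainty comes from, it may help to write the domain total as a total over the full sample. The following is only a sketch in our own notation (the symbols \(\delta_i\), \(z_i\), \(n_h\) and \(n_{h,D}\) are assumptions introduced here, not taken from the CLSA documentation). It assumes a stratified design in which each respondent is a sampling unit, and uses the with-replacement variance approximation without a finite population correction. Let \(y_i\) indicate whether respondent \(i\) chose a given response category, let \(\delta_i = 1\) if respondent \(i\) belongs to the domain \(D\) (for example, BMI < 19) and \(0\) otherwise, and let \(s_h\) be the sample of size \(n_h\) in stratum \(h\).

\[
\hat{T}_D = \sum_{h} \sum_{i \in s_h} w_i z_i, \qquad z_i = \delta_i y_i
\]

\[
\hat{V}_{\text{domain}}(\hat{T}_D) = \sum_{h} \frac{n_h}{n_h - 1} \sum_{i \in s_h} \left( w_i z_i - \frac{1}{n_h} \sum_{j \in s_h} w_j z_j \right)^2
\]

If the dataset is divided first, only the \(n_{h,D}\) domain respondents in stratum \(h\) (the set \(s_{h,D}\)) remain, and \(n_{h,D}\) is treated as if it had been fixed by the design:

\[
\hat{V}_{\text{naive}}(\hat{T}_D) = \sum_{h} \frac{n_{h,D}}{n_{h,D} - 1} \sum_{i \in s_{h,D}} \left( w_i y_i - \frac{1}{n_{h,D}} \sum_{j \in s_{h,D}} w_j y_j \right)^2
\]

Both approaches give the same \(\hat{T}_D\), because \(z_i = 0\) outside the domain. The variance estimates differ because the first formula keeps the zeros of the respondents outside the domain, so the variability in how many sampled respondents fall into the domain contributes to the variance, while the second formula discards it.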

Note:

The subpopulation totals in the table above are much smaller than the CLSA study population. This is because the dataset used for illustration is only a subset of a synthetic CLSA dataset. The actual dataset would give much larger subpopulation totals.
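Readers without access to the CLSA data may want to reproduce the same effect on public data. The sketch below is an illustration under our own assumptions, not part of the CLSA analysis: it uses the \(\texttt{apistrat}\) sample shipped with the \(\texttt{survey}\) package, takes schools with \(\texttt{comp.imp == "Yes"}\) as an arbitrary domain, and estimates total enrollment. The object names \(\texttt{dstrat}\), \(\texttt{domain.est}\), \(\texttt{api.sub}\), \(\texttt{naive.design}\) and \(\texttt{naive.est}\) are hypothetical.

```r
library(survey)
data(api)  # loads apistrat, a stratified sample of California schools

# Full design: strata are school types (stype), pw are the sampling weights
dstrat <- svydesign(ids = ~1, strata = ~stype, weights = ~pw, data = apistrat)

# Domain analysis: subset the design object, not the data
domain.est <- svytotal(~enroll, design = subset(dstrat, comp.imp == "Yes"))

# Not appropriate: subset the data first, then re-declare the design
api.sub <- apistrat[which(apistrat$comp.imp == "Yes"), ]
naive.design <- svydesign(ids = ~1, strata = ~stype, weights = ~pw,
                          data = api.sub)
naive.est <- svytotal(~enroll, design = naive.design)

# The two totals agree; the two standard errors generally do not
c(total.domain = as.numeric(coef(domain.est)),
  total.naive  = as.numeric(coef(naive.est)))
c(SE.domain = as.numeric(SE(domain.est)),
  SE.naive  = as.numeric(SE(naive.est)))
```

As in the CLSA example, only the standard errors are affected by how the subpopulation is declared; the size of the gap depends on the data and on how small the domain is.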