7.8 What if we ignore domain analysis

If we do not specify the subpopulation within the survey design and instead treat a subset of the dataset as a separate survey, the resulting standard errors (SE) will be incorrect. We compare the SEs from the two approaches in R. Here, we give an extreme example to emphasize the effect of not specifying the subpopulation. Suppose we are interested in the ENV_AFRDWLK_MCQ variable and want to compare it between people with BMI below 19 and people with BMI of at least 19. There are 38 respondents with BMI < 19 and 762 respondents with BMI $\geq$ 19. The code for this comparison is as follows.

# Correct: specify the subpopulation within the full survey design
svytotal(~ENV_AFRDWLK_MCQ, design = subset(CLSA.design, BMI < 19))
svytotal(~ENV_AFRDWLK_MCQ, design = subset(CLSA.design, BMI >= 19))

# Not appropriate: divide the dataset and re-declare the survey design
CLSAData.low.BMI  <- CLSAData[which(CLSAData$BMI < 19), ]
CLSAData.high.BMI <- CLSAData[which(CLSAData$BMI >= 19), ]

CLSA.design.low.BMI <- svydesign(ids = ~entity_id, strata = ~StraVar,
                                 weights = ~WGHTS_INFLATION_TRM,
                                 data = CLSAData.low.BMI, nest = TRUE)

CLSA.design.high.BMI <- svydesign(ids = ~entity_id, strata = ~StraVar,
                                  weights = ~WGHTS_INFLATION_TRM,
                                  data = CLSAData.high.BMI, nest = TRUE)

svytotal(~ENV_AFRDWLK_MCQ, design = CLSA.design.low.BMI)
svytotal(~ENV_AFRDWLK_MCQ, design = CLSA.design.high.BMI)

Result comparison

	Subpopulation specified		Subpopulation not specified
Population Estimates	Total	SE	Total	SE
BMI $\geq$ 19
Strongly Agree	51736.0433	7428.5580	51736.0433	7418.5693
Agree	62622.6338	8405.6710	62622.6338	8389.5494
Disagree	147101.0999	13765.9214	147101.0999	13679.5034
Strongly Disagree	261353.6022	20620.7902	261353.6022	20453.7439
BMI < 19
Strongly Agree	929.8891	708.2762	929.8891	659.1306
Agree	3196.3136	1507.3081	3196.3136	1441.5175
Disagree	2509.4450	1374.4214	2509.4450	1282.9361
Strongly Disagree	19742.0720	7747.7715	19742.0720	6942.1285
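
To make the size of the discrepancy easier to read, the short base R sketch below computes, for each row, the percentage by which the SE with the subpopulation specified exceeds the SE without it. The SE values are copied from the table above; the object and column names (se_table, pct_larger) are only illustrative.

```r
# Standard errors copied from the result comparison table above.
se_table <- data.frame(
  group    = rep(c("BMI >= 19", "BMI < 19"), each = 4),
  response = rep(c("Strongly Agree", "Agree", "Disagree", "Strongly Disagree"),
                 times = 2),
  se_specified     = c(7428.5580, 8405.6710, 13765.9214, 20620.7902,
                       708.2762, 1507.3081, 1374.4214, 7747.7715),
  se_not_specified = c(7418.5693, 8389.5494, 13679.5034, 20453.7439,
                       659.1306, 1441.5175, 1282.9361, 6942.1285)
)

# Percentage by which the SE with the subpopulation specified exceeds
# the SE obtained after re-declaring the design on the split dataset.
se_table$pct_larger <- 100 * (se_table$se_specified /
                              se_table$se_not_specified - 1)

print(se_table[, c("group", "response", "pct_larger")], digits = 3)
```

The percentages are much larger in the small BMI < 19 group than in the BMI $\geq$ 19 group, which is why this split was chosen as an extreme example.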

The estimated totals are the same under both approaches, but the standard errors differ. When the subpopulation is specified, the standard errors are larger: by less than 1% in the BMI $\geq$ 19 group, and by roughly 5% to 12% in the much smaller BMI < 19 group. These larger standard errors are the appropriate ones. In domain analysis, the statistical package retains the full survey design and treats the subpopulation sample sizes as random rather than fixed, so there is more uncertainty in the population estimates. Re-declaring the design on a subset of the data ignores this source of variability and understates the standard errors.
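
The same behaviour can be reproduced without CLSA data. The sketch below is a minimal illustration on a hypothetical toy dataset, assuming the survey package is installed; the data and all names (toy, wt, in_domain, z) are invented for illustration and are not part of the CLSA dataset. It also shows one way to see why the domain SE differs: the domain total is the full-sample total of a variable that is set to zero outside the domain, so its variance is computed over all sampled units in each stratum.

```r
library(survey)

# Hypothetical toy data (NOT CLSA data): 4 strata of 100 respondents each,
# with every 10th respondent belonging to the domain of interest.
set.seed(2024)
n_total <- 400
toy <- data.frame(
  id        = seq_len(n_total),
  stratum   = rep(c("A", "B", "C", "D"), each = n_total / 4),
  wt        = rep(c(50, 80, 120, 200), each = n_total / 4),
  y         = rbinom(n_total, size = 1, prob = 0.4),
  in_domain = rep(c(TRUE, rep(FALSE, 9)), times = n_total / 10)
)

# Zero-extended outcome: equals y inside the domain and 0 outside it.
toy$z <- toy$y * toy$in_domain

full_design <- svydesign(ids = ~id, strata = ~stratum,
                         weights = ~wt, data = toy)

# (1) Domain analysis: subset the design object, not the data.
total_domain <- svytotal(~y, design = subset(full_design, in_domain))

# (2) Total of the zero-extended variable on the full design.
total_zero_ext <- svytotal(~z, design = full_design)

# (3) Not appropriate: subset the data and re-declare the design.
separate_design <- svydesign(ids = ~id, strata = ~stratum,
                             weights = ~wt, data = toy[toy$in_domain, ])
total_separate <- svytotal(~y, design = separate_design)

comparison <- data.frame(
  approach = c("domain (subset of design)",
               "zero-extended variable",
               "re-declared design"),
  total = c(as.numeric(coef(total_domain)),
            as.numeric(coef(total_zero_ext)),
            as.numeric(coef(total_separate))),
  SE    = c(as.numeric(SE(total_domain)),
            as.numeric(SE(total_zero_ext)),
            as.numeric(SE(total_separate)))
)
print(comparison)
```

In this sketch the first two rows should agree in both the total and the SE, while the third row shares the same total but has an SE computed from the domain members alone, as if the number of domain members in each stratum had been fixed by design.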

Note:

The sum of the subpopulation estimates in the table above is much smaller than the CLSA study population. This is because the dataset used for illustration is only a subset of a synthetic CLSA dataset. The actual dataset should give much larger subpopulation totals.