7.8 What if we ignore domain analysis
If we do not specify the subpopulation within the survey design and instead treat a subset of the dataset as a separate survey, the resulting standard errors (SE) will be incorrect. We compare the SEs from the two approaches in \(\texttt{R}\), using an extreme example to emphasize the effect of not specifying the subpopulation. Suppose we are interested in the \(\texttt{ENV_AFRDWLK_MCQ}\) variable and want to compare it between people with BMI below 19 and people with BMI of at least 19. There are 38 respondents with BMI < 19 and 762 respondents with BMI \(\geq\) 19. The code for this comparison is as follows.
R
# Correct: declare the subpopulation (domain) within the full survey design
svytotal(~ENV_AFRDWLK_MCQ, design = subset(CLSA.design, BMI < 19))
svytotal(~ENV_AFRDWLK_MCQ, design = subset(CLSA.design, BMI >= 19))
# Not appropriate: split the dataset first and then
# re-declare a separate survey design for each part
CLSAData.low.BMI <- CLSAData[which(CLSAData$BMI < 19), ]
CLSAData.high.BMI <- CLSAData[which(CLSAData$BMI >= 19), ]
CLSA.design.low.BMI <- svydesign(ids = ~entity_id, strata = ~StraVar,
  weights = ~WGHTS_INFLATION_TRM, data = CLSAData.low.BMI, nest = TRUE)
CLSA.design.high.BMI <- svydesign(ids = ~entity_id, strata = ~StraVar,
  weights = ~WGHTS_INFLATION_TRM, data = CLSAData.high.BMI, nest = TRUE)
# Totals and SEs from the separately declared designs
svytotal(~ENV_AFRDWLK_MCQ, design = CLSA.design.low.BMI)
svytotal(~ENV_AFRDWLK_MCQ, design = CLSA.design.high.BMI)
Result comparison
Population estimates | Total (subpopulation specified) | SE (subpopulation specified) | Total (dataset split) | SE (dataset split) |
---|---|---|---|---|
BMI \(\geq\) 19 | | | | |
Strongly Agree | 51736.0433 | 7428.5580 | 51736.0433 | 7418.5693 |
Agree | 62622.6338 | 8405.6710 | 62622.6338 | 8389.5494 |
Disagree | 147101.0999 | 13765.9214 | 147101.0999 | 13679.5034 |
Strongly Disagree | 261353.6022 | 20620.7902 | 261353.6022 | 20453.7439 |
BMI < 19 | | | | |
Strongly Agree | 929.8891 | 708.2762 | 929.8891 | 659.1306 |
Agree | 3196.3136 | 1507.3081 | 3196.3136 | 1441.5175 |
Disagree | 2509.4450 | 1374.4214 | 2509.4450 | 1282.9361 |
Strongly Disagree | 19742.0720 | 7747.7715 | 19742.0720 | 6942.1285 |
The estimated totals are identical under the two approaches, but the standard errors differ. The difference is small for the large group (BMI \(\geq\) 19) and more pronounced for the small group (BMI < 19), where, for example, the SE for Strongly Disagree is 7747.7715 when the subpopulation is specified and 6942.1285 when the dataset is split. When we specify the subpopulation, the standard errors reported by the program are the appropriate ones and are generally slightly higher. This is because domain analysis treats the sample sizes of the subpopulations as random rather than fixed, so there is more uncertainty in the population estimates.
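One way to make this argument concrete is sketched below. It assumes the usual with-replacement variance approximation for a stratified design without a finite population correction, which is an assumption about how the SE is computed rather than something stated in this section. The notation is introduced here for illustration only: \(h\) indexes strata, \(n_h\) is the number of sampled units in stratum \(h\), \(w_{hi}\) and \(y_{hi}\) are the weight and response of unit \(i\), and \(\delta_{hi}\) equals 1 if the unit belongs to the subpopulation \(d\) and 0 otherwise.

\[
\hat{Y}_d = \sum_h \sum_{i=1}^{n_h} z_{hi}, \qquad
z_{hi} = w_{hi}\,\delta_{hi}\,y_{hi}, \qquad
\widehat{V}(\hat{Y}_d) = \sum_h \frac{n_h}{n_h - 1} \sum_{i=1}^{n_h} \left(z_{hi} - \bar{z}_h\right)^2,
\]

where \(\bar{z}_h\) is the mean of the \(z_{hi}\) in stratum \(h\). Under this sketch, units outside the subpopulation stay in the calculation with \(z_{hi} = 0\), so the variation in how many sampled units fall in the subpopulation contributes to the variance. Splitting the dataset drops those units and replaces \(n_h\) with the number of subpopulation units in the stratum, which leaves \(\hat{Y}_d\) unchanged but alters the variance estimate.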
Note:
The sum of the subpopulation estimates in the table above is much smaller than the CLSA study population. This is because the dataset used for illustration is only a subset of a synthetic CLSA dataset. The actual dataset should give much larger subpopulation totals.
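The effect of splitting the dataset can also be reproduced at a small scale without the CLSA data. The following base R sketch uses a made-up stratified sample: the data frame `toy`, the helper `total_se`, and all values are hypothetical. The helper hand-codes a with-replacement variance approximation for a stratified design instead of calling the survey package, so it should be read as an illustration of the mechanism under that assumption, not as a reproduction of what `svytotal` reports.

```r
# Hypothetical toy sample: 2 strata, 10 sampled units each (values are made up)
toy <- data.frame(
  stratum = rep(c("A", "B"), each = 10),
  weight  = rep(c(10, 20), each = 10),
  # Small domain: 3 units in stratum A and 2 units in stratum B
  in_dom  = rep(c(TRUE, FALSE, TRUE, FALSE), times = c(3, 7, 2, 8)),
  y       = rep(c(1, 0), times = 10)   # 1 = has the characteristic
)

# SE of a weighted total under the assumed with-replacement approximation:
# sum over strata of n_h / (n_h - 1) * sum((z - mean(z))^2)
total_se <- function(z, stratum) {
  v_h <- tapply(z, stratum, function(zh) {
    n_h <- length(zh)
    n_h / (n_h - 1) * sum((zh - mean(zh))^2)
  })
  sqrt(sum(v_h))
}

# Domain analysis: keep every sampled unit, zero out units outside the domain
z_domain     <- toy$weight * toy$y * toy$in_dom
total_domain <- sum(z_domain)
se_domain    <- total_se(z_domain, toy$stratum)

# Splitting the dataset: drop units outside the domain before computing the SE
sub         <- toy[toy$in_dom, ]
z_split     <- sub$weight * sub$y
total_split <- sum(z_split)
se_split    <- total_se(z_split, sub$stratum)

print(c(total_domain = total_domain, total_split = total_split))
print(c(se_domain = se_domain, se_split = se_split))
```

In this toy sample the two totals coincide while the SE from the split dataset is smaller, which mirrors the pattern in the table above. Note that the split version fails outright if a stratum contains only one domain unit, because the factor \(n_h - 1\) becomes zero.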