5.4 Estimation of population quantiles

Let Fy(t)=N1h=1Hi=1NhI(yhit) be the finite population distribution function, where I() is the indicator function. The 100pth population quantile with p(0,1) is defined as

Q(p)=Fy1(p)=inf{tFy(t)p}.

Suppose we want to estimate the population quantiles of the self-reported weight and height (HWT_WGHT_KG_TRM, HWT_DHT_M_TRM). The population median corresponds to the 50% quantile. The following codes can be used.

R

Quant.Est <- svyquantile( ~ HWT_DHT_M_TRM + HWT_WGHT_KG_TRM, 
    quantile = c(0.025, 0.05, 0.1, 0.5, 0.9, 0.95, 0.975), 
    alpha = 0.05, interval.type = "Wald", design = CLSA.design,
    ties = c("rounded"), ci = TRUE, se = TRUE );
Quant.Est; SE(Quant.Est); 

SAS

PROC SURVEYMEANS data = CLSAData 
QUANTILE = (0.025 0.05 0.1 0.5 0.9 0.95 0.975) NONSYMCL;    
VAR    HWT_DHT_M_TRM HWT_WGHT_KG_TRM ;
STRATA GEOSTRAT_TRM ;  
WEIGHT WGHTS_INFLATION_TRM;                                       
RUN;  

SPSS and Stata

There is no formal procedure available to produce quantile estimates and their standard errors in the SPSS and Stata packages.

Result comparison

HWT_DHT_M_TRM
HWT_WGHT_KG_TRM
Quantile R SAS R SAS
Estimate 0.025 1.5364 1.5364 48.0964 48.0964
0.050 1.5457 1.5457 51.7715 51.7715
0.100 1.5835 1.5835 57.3716 57.3716
0.500 1.6814 1.6814 77.1140 77.1140
0.900 1.7643 1.7643 97.5720 97.5720
0.950 1.7902 1.7902 104.7509 104.7509
0.975 1.8115 1.8115 112.9293 112.9293
SE 0.025 0.0066 0.0067 1.3059 2.6834
0.050 0.0075 0.0051 1.8915 1.9615
0.100 0.0117 0.0117 1.7457 1.7725
0.500 0.0064 0.0064 0.8445 0.8438
0.900 0.0064 0.0060 1.9202 1.8917
0.950 0.0070 0.0070 3.1488 2.9399
0.975 0.0100 0.0097 5.5598 5.5487

For the standard error (SE) estimation, both R and SAS first construct 95% confidence intervals (CIs) by Woodruff’s method (Woodruff 1952), and then compute the standard errors from the division of the CI lengths by tdf,0.025, the 97.50th percentile of the t distribution with degrees of freedom, df, which is determined by the survey data and the survey design. For CLSA, the degrees of freedom is the number of observations minus the number of strata. If the sample size is relatively large, we can simply replace tdf,0.025 by z0.025, the 97.50th percentile from the standard normal distribution. The difference in standard errors is due to different implementation of Woodruff interval. Let y(1)y(2)y(n) denote the sample order statistics for the variable Y. The 100pth population quantile estimate is computed as
Q^(p)={y(1)if pF^y(y(1))y(k)+pF^y(y(k))F^y(y(k+1))F^y(y(k))(y(k+1)y(k))if F^y(y(k))<pF^y(y(k+1)), where F^y(t)=N^1h=1HiShwhiI(yhit) is the estimated cumulative distribution for Y and N^=h=1HiShwhi. The variance of the estimated distribution function F^y(t) at t=Q(p) can be estimated as V^(F^y(Q^(p)))=N^2h=1Hnhnh1iSh(ehie¯h)2, where ehi=whiI(yhiQ^(p))F^y(Q^(p)), e¯h=nh1iShehi. The CI for 100pth quantile can be obtained as (Q^(p^L),Q^(p^U)). In R, p^L and p^U are implemented as (p^L,p^U)=(ptdf,α/2V^(F^y(Q^(p))),p+tdf,α/2V^(F^y(Q^(p)))), while in SAS, p^L and p^U are implemented as (p^L,p^U)=(F^y(Q^(p))tdf,α/2V^(F^y(Q^(p))),F^y(Q^(p))+tdf,α/2V^(F^y(Q^(p)))), which explains the differences in the SE estimates.

One can observe that standard errors for the extreme quantiles are usually larger while the errors are smaller for quantiles around the median. This is because the data are sparser around the extreme quantiles and the sampling distributions of extreme quantile estimators are often skewed.

Reference

Woodruff, Ralph S. 1952. Quantile Variance Estimators in Complex Surveys.” Journal of the American Statistical Association 47 (260): 635. https://doi.org/10.2307/2280781.