5.4 Estimation of population quantiles

Let $F_{y} (t) = N^{- 1} \sum_{h = 1}^{H} \sum_{i = 1}^{N_{h}} I (y_{h i} \leq t)$ be the finite population distribution function, where $I (\cdot)$ is the indicator function. The $100 p$ th population quantile with $p \in (0, 1)$ is defined as

$Q (p) = F_{y}^{- 1} (p) = \inf \{ t \mid F_{y} (t) \geq p \} .$
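As a minimal sketch of this definition, the base R code below evaluates $F_{y}$ and $Q (p)$ on a small made-up population. The values in `y_pop` and the helper names `F_y` and `Q` are hypothetical and are not taken from the CLSA data.

```r
# Toy finite population (hypothetical values, not CLSA data)
y_pop <- c(52, 61, 61, 68, 70, 75, 80, 88, 95, 110)

# Finite population distribution function: F_y(t) = N^{-1} * sum I(y <= t)
F_y <- function(t) mean(y_pop <= t)

# Q(p) = inf{t : F_y(t) >= p}; for a step function the infimum is attained
# at one of the observed values, so it is enough to scan the sorted values
Q <- function(p) {
  t_sorted <- sort(unique(y_pop))
  cdf <- vapply(t_sorted, F_y, numeric(1))
  t_sorted[which(cdf >= p)[1]]
}

F_y(70)  # 0.5
Q(0.5)   # 70
Q(0.9)   # 95
```

For an unweighted population this agrees with `quantile(y_pop, p, type = 1)`, the inverse of the empirical distribution function.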

Suppose we want to estimate the population quantiles of the self-reported weight and height (HWT_WGHT_KG_TRM, HWT_DHT_M_TRM). The population median corresponds to the 50% quantile, $Q (0.5)$. The following code can be used.

R

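library(survey)
# ties= is part of the svyquantile() interface in survey versions before 4.1
# (kept as oldsvyquantile() in later versions).
Quant.Est <- svyquantile(~ HWT_DHT_M_TRM + HWT_WGHT_KG_TRM,
    quantiles = c(0.025, 0.05, 0.1, 0.5, 0.9, 0.95, 0.975),
    alpha = 0.05, interval.type = "Wald", design = CLSA.design,
    ties = "rounded", ci = TRUE, se = TRUE)
Quant.Est      # quantile estimates and confidence intervals
SE(Quant.Est)  # standard errors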

SAS

/* NONSYMCL requests nonsymmetric confidence limits for the quantiles */
PROC SURVEYMEANS DATA = CLSAData
     QUANTILE = (0.025 0.05 0.1 0.5 0.9 0.95 0.975) NONSYMCL;
  VAR    HWT_DHT_M_TRM HWT_WGHT_KG_TRM;
  STRATA GEOSTRAT_TRM;
  WEIGHT WGHTS_INFLATION_TRM;
RUN;

SPSS and Stata

There is no formal procedure in SPSS or Stata for producing quantile estimates and their standard errors.

Result comparison

Statistic	Quantile	HWT_DHT_M_TRM (m), R	HWT_DHT_M_TRM (m), SAS	HWT_WGHT_KG_TRM (kg), R	HWT_WGHT_KG_TRM (kg), SAS
Estimate	0.025	1.5364	1.5364	48.0964	48.0964
Estimate	0.050	1.5457	1.5457	51.7715	51.7715
Estimate	0.100	1.5835	1.5835	57.3716	57.3716
Estimate	0.500	1.6814	1.6814	77.1140	77.1140
Estimate	0.900	1.7643	1.7643	97.5720	97.5720
Estimate	0.950	1.7902	1.7902	104.7509	104.7509
Estimate	0.975	1.8115	1.8115	112.9293	112.9293
SE	0.025	0.0066	0.0067	1.3059	2.6834
SE	0.050	0.0075	0.0051	1.8915	1.9615
SE	0.100	0.0117	0.0117	1.7457	1.7725
SE	0.500	0.0064	0.0064	0.8445	0.8438
SE	0.900	0.0064	0.0060	1.9202	1.8917
SE	0.950	0.0070	0.0070	3.1488	2.9399
SE	0.975	0.0100	0.0097	5.5598	5.5487

For the standard error (SE) estimation, both R and SAS first construct 95% confidence intervals (CIs) by Woodruff’s method (Woodruff 1952) and then obtain the standard error by dividing the CI length by $2 t_{d f, 0.025}$, where $t_{d f, 0.025}$ is the 97.5th percentile of the $t$ distribution with $d f$ degrees of freedom; $d f$ is determined by the survey data and the survey design. For CLSA, the degrees of freedom equal the number of observations minus the number of strata. If the sample size is relatively large, $t_{d f, 0.025}$ can simply be replaced by $z_{0.025}$, the 97.5th percentile of the standard normal distribution. The quantile estimates from the two packages agree; the differences in the standard errors are due to different implementations of the Woodruff interval.

Let $y_{(1)} \leq y_{(2)} \leq \dots \leq y_{(n)}$ denote the sample order statistics for the variable $Y$. The $100 p$th population quantile estimate is computed as

$\hat{Q} (p) = \begin{cases} y_{(1)} & \text{if } p \leq {\hat{F}}_{y} (y_{(1)}), \\ y_{(k)} + \frac{p - {\hat{F}}_{y} (y_{(k)})}{{\hat{F}}_{y} (y_{(k + 1)}) - {\hat{F}}_{y} (y_{(k)})} (y_{(k + 1)} - y_{(k)}) & \text{if } {\hat{F}}_{y} (y_{(k)}) < p \leq {\hat{F}}_{y} (y_{(k + 1)}), \end{cases}$

where ${\hat{F}}_{y} (t) = {\hat{N}}^{- 1} \sum_{h = 1}^{H} \sum_{i \in S_{h}} w_{h i} I (y_{h i} \leq t)$ is the estimated cumulative distribution function for $Y$ and $\hat{N} = \sum_{h = 1}^{H} \sum_{i \in S_{h}} w_{h i}$. The variance of the estimated distribution function ${\hat{F}}_{y} (t)$ at $t = Q (p)$ can be estimated as

$\hat{V} ({\hat{F}}_{y} (\hat{Q} (p))) = {\hat{N}}^{- 2} \sum_{h = 1}^{H} \frac{n_{h}}{n_{h} - 1} \sum_{i \in S_{h}} (e_{h i} - {\bar{e}}_{h \cdot})^{2} ,$

where $e_{h i} = w_{h i} \{ I (y_{h i} \leq \hat{Q} (p)) - {\hat{F}}_{y} (\hat{Q} (p)) \}$ is the linearized residual for the ratio estimator ${\hat{F}}_{y}$ and ${\bar{e}}_{h \cdot} = n_{h}^{- 1} \sum_{i \in S_{h}} e_{h i}$.

The CI for the $100 p$th quantile is then obtained as $(\hat{Q} ({\hat{p}}_{L}), \hat{Q} ({\hat{p}}_{U}))$. In R, ${\hat{p}}_{L}$ and ${\hat{p}}_{U}$ are implemented as

$({\hat{p}}_{L}, {\hat{p}}_{U}) = (p - t_{d f, \alpha / 2} \sqrt{\hat{V} ({\hat{F}}_{y} (\hat{Q} (p)))}, \; p + t_{d f, \alpha / 2} \sqrt{\hat{V} ({\hat{F}}_{y} (\hat{Q} (p)))}),$

while in SAS they are implemented as

$({\hat{p}}_{L}, {\hat{p}}_{U}) = ({\hat{F}}_{y} (\hat{Q} (p)) - t_{d f, \alpha / 2} \sqrt{\hat{V} ({\hat{F}}_{y} (\hat{Q} (p)))}, \; {\hat{F}}_{y} (\hat{Q} (p)) + t_{d f, \alpha / 2} \sqrt{\hat{V} ({\hat{F}}_{y} (\hat{Q} (p)))}).$

Because ${\hat{F}}_{y}$ is a step function while $\hat{Q} (p)$ is interpolated, ${\hat{F}}_{y} (\hat{Q} (p))$ need not equal $p$, so the two intervals are centred at slightly different probabilities. This explains the differences in the SE estimates.
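The steps above can be sketched in base R on a toy stratified sample. This is only an illustration under stated assumptions: the data, the weights and the helper names (`F_hat`, `Q_hat`, `V_F`, `woodruff`) are hypothetical, the variance uses $e_{h i} = w_{h i} \{ I (y_{h i} \leq \hat{Q} (p)) - {\hat{F}}_{y} (\hat{Q} (p)) \}$, and the code is not the internal implementation of `svyquantile` or `PROC SURVEYMEANS`.

```r
# Toy stratified sample (hypothetical values and weights, not CLSA data)
set.seed(1)
dat <- data.frame(
  stratum = rep(1:3, times = c(40, 60, 50)),
  y       = c(rnorm(40, 70, 12), rnorm(60, 78, 14), rnorm(50, 82, 15)),
  w       = rep(c(120, 90, 150), times = c(40, 60, 50))
)

# Weighted distribution function: F_hat(t) = sum(w * I(y <= t)) / sum(w)
F_hat <- function(t, d = dat) sum(d$w * (d$y <= t)) / sum(d$w)

# Interpolated quantile estimate Q_hat(p), following the piecewise definition;
# probabilities outside the observed range are mapped to the extreme values
Q_hat <- function(p, d = dat) {
  o   <- order(d$y)
  ys  <- d$y[o]
  cdf <- cumsum(d$w[o]) / sum(d$w)
  if (p <= cdf[1]) return(ys[1])
  if (p >= cdf[length(cdf)]) return(ys[length(ys)])
  k <- max(which(cdf < p))  # F_hat(y_(k)) < p <= F_hat(y_(k+1))
  ys[k] + (p - cdf[k]) / (cdf[k + 1] - cdf[k]) * (ys[k + 1] - ys[k])
}

# Stratified variance estimate of F_hat at t from the residuals e_hi
V_F <- function(t, d = dat) {
  e <- d$w * ((d$y <= t) - F_hat(t, d))
  v_h <- tapply(e, d$stratum, function(e_h) {
    n_h <- length(e_h)
    n_h / (n_h - 1) * sum((e_h - mean(e_h))^2)
  })
  sum(v_h) / sum(d$w)^2
}

# Woodruff interval and SE, centred at p (as described for R) or at
# F_hat(Q_hat(p)) (as described for SAS)
woodruff <- function(p, alpha = 0.05, d = dat) {
  df     <- nrow(d) - length(unique(d$stratum))  # observations minus strata
  tval   <- qt(1 - alpha / 2, df)
  q      <- Q_hat(p, d)
  half   <- tval * sqrt(V_F(q, d))
  centre <- c(R_style = p, SAS_style = F_hat(q, d))
  sapply(centre, function(ctr) {
    ci <- c(Q_hat(ctr - half, d), Q_hat(ctr + half, d))
    c(lower = ci[1], upper = ci[2], se = (ci[2] - ci[1]) / (2 * tval))
  })
}

woodruff(0.5)
```

The result is a 3 x 2 matrix with rows `lower`, `upper` and `se` and one column per centring rule, so the two conventions can be compared side by side for any `p`.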

The standard errors are generally larger for the extreme quantiles and smaller for quantiles near the median. This is because the data are sparser in the tails, and the sampling distributions of extreme quantile estimators are often skewed.
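A rough way to see the effect of sparsity is the following approximation. It assumes simple random sampling and a smooth population density $f$ near $Q (p)$, simplifications that the stratified CLSA design does not satisfy exactly. Since the Woodruff interval inverts ${\hat{F}}_{y}$, the width on the probability scale is divided by the slope of the distribution function:

$SE (\hat{Q} (p)) \approx \frac{\sqrt{\hat{V} ({\hat{F}}_{y} (\hat{Q} (p)))}}{f (Q (p))} \approx \frac{1}{f (Q (p))} \sqrt{\frac{p (1 - p)}{n}} .$

The numerator $p (1 - p)$ is largest at the median, but in the tails the density $f (Q (p))$ is small and typically dominates. For a standard normal population, for example, the approximation gives about $1.25 / \sqrt{n}$ at $p = 0.5$ and about $2.67 / \sqrt{n}$ at $p = 0.975$.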

Reference

Woodruff, Ralph S. 1952. “Confidence Intervals for Medians and Other Position Measures.” Journal of the American Statistical Association 47 (260): 635-646. https://doi.org/10.2307/2280781.