5.4 Estimation of population quantiles

Let \(F_y(t) = N^{-1}\sum_{h=1}^H\sum_{i=1}^{N_h}I(y_{hi}\le t)\) be the finite population distribution function, where \(I(\cdot)\) is the indicator function. The \(100p\)th population quantile with \(p\in (0, 1)\) is defined as

\[ Q(p) = F^{-1}_y(p) = \inf\{t \mid F_y(t) \ge p\}\,. \]

Suppose we want to estimate the population quantiles of the self-reported weight and height (\(\texttt{HWT_WGHT_KG_TRM}\), \(\texttt{HWT_DHT_M_TRM}\)). The population median corresponds to the \(50\%\) quantile. The following codes can be used.

R

Quant.Est <- svyquantile( ~ HWT_DHT_M_TRM + HWT_WGHT_KG_TRM, 
    quantile = c(0.025, 0.05, 0.1, 0.5, 0.9, 0.95, 0.975), 
    alpha = 0.05, interval.type = "Wald", design = CLSA.design,
    ties = c("rounded"), ci = TRUE, se = TRUE );
Quant.Est; SE(Quant.Est); 

SAS

PROC SURVEYMEANS data = CLSAData 
QUANTILE = (0.025 0.05 0.1 0.5 0.9 0.95 0.975) NONSYMCL;    
VAR    HWT_DHT_M_TRM HWT_WGHT_KG_TRM ;
STRATA GEOSTRAT_TRM ;  
WEIGHT WGHTS_INFLATION_TRM;                                       
RUN;  

SPSS and Stata

There is no formal procedure available to produce quantile estimates and their standard errors in the \(\texttt{SPSS}\) and \(\texttt{Stata}\) packages.

Result comparison

HWT_DHT_M_TRM
HWT_WGHT_KG_TRM
Quantile R SAS R SAS
Estimate 0.025 1.5364 1.5364 48.0964 48.0964
0.050 1.5457 1.5457 51.7715 51.7715
0.100 1.5835 1.5835 57.3716 57.3716
0.500 1.6814 1.6814 77.1140 77.1140
0.900 1.7643 1.7643 97.5720 97.5720
0.950 1.7902 1.7902 104.7509 104.7509
0.975 1.8115 1.8115 112.9293 112.9293
SE 0.025 0.0066 0.0067 1.3059 2.6834
0.050 0.0075 0.0051 1.8915 1.9615
0.100 0.0117 0.0117 1.7457 1.7725
0.500 0.0064 0.0064 0.8445 0.8438
0.900 0.0064 0.0060 1.9202 1.8917
0.950 0.0070 0.0070 3.1488 2.9399
0.975 0.0100 0.0097 5.5598 5.5487

For the standard error (SE) estimation, both \(\texttt{R}\) and \(\texttt{SAS}\) first construct 95% confidence intervals (CIs) by Woodruff’s method (Woodruff 1952), and then compute the standard errors from the division of the CI lengths by \(t_{df, 0.025}\), the \(97.50\)th percentile of the \(t\) distribution with degrees of freedom, \(df\), which is determined by the survey data and the survey design. For CLSA, the degrees of freedom is the number of observations minus the number of strata. If the sample size is relatively large, we can simply replace \(t_{df, 0.025}\) by \(z_{0.025}\), the \(97.50\)th percentile from the standard normal distribution. The difference in standard errors is due to different implementation of Woodruff interval. Let \(y_{(1)}\le y_{(2)} \le \cdots \le y_{(n)}\) denote the sample order statistics for the variable \(Y\). The \(100 p\)th population quantile estimate is computed as
\[\begin{align*} \hat Q(p) =\left\{ \begin{array}{ll} y_{(1)} & \mbox{if } \;\; p \le \hat F_y(y_{(1)})\\ y_{(k)}+\frac{p-\hat F_y(y_{(k)})}{\hat F_y(y_{(k+1)}) -\hat F_y(y_{(k)})} (y_{(k+1)}-y_{(k)}) & \mbox{if } \;\; \hat F_y(y_{(k)}) <p \le \hat F_y(y_{(k+1)})\\ %%%y_{(n)} & \mbox{if } p = 1 \\ \end{array} \right . , \end{align*}\] where \(\hat F_y (t)= \hat{N}^{-1} \sum^H_{h=1}\sum_{i\in \mathcal{S}_h} w_{hi}I(y_{hi}\leq t)\) is the estimated cumulative distribution for \(Y\) and \(\hat{N} = \sum^H_{h=1}\sum_{i\in \mathcal{S}_h} w_{hi}\). The variance of the estimated distribution function \(\hat F_y(t)\) at \(t=Q(p)\) can be estimated as \(\hat V( \hat F_y(\hat Q(p)) )= \hat{N}^{-2} \sum_{h=1}^{H} \frac{n_h}{n_h-1} \sum_{i \in \mathcal{S}_h}(e_{hi}-\bar e_{h\cdot} )^2\), where \(e_{hi}= w_{hi}I(y_{hi}\leq \hat Q(p))- \hat F_y(\hat Q(p))\), \(\bar e_{h\cdot} = n_h^{-1}\sum_{i \in \mathcal{S}_h} e_{hi}\). The CI for \(100p\)th quantile can be obtained as \((\hat Q(\hat p_L), \hat Q(\hat p_U))\). In R, \(\hat p_L\) and \(\hat p_U\) are implemented as \[ (\hat p_L, \hat p_U) = \left(p - t_{df, \alpha/2}\sqrt{\hat V( \hat F_y(\hat Q(p)) )}, \;\;\; p + t_{df, \alpha/2}\sqrt{\hat V( \hat F_y(\hat Q(p)) )} \right), \] while in \(\texttt{SAS}\), \(\hat p_L\) and \(\hat p_U\) are implemented as \[ (\hat p_L, \hat p_U) = \left( \hat F_y(\hat Q(p)) - t_{df, \alpha/2}\sqrt{\hat V( \hat F_y(\hat Q(p)) )}, \;\;\; \hat F_y(\hat Q(p)) + t_{df, \alpha/2}\sqrt{\hat V( \hat F_y(\hat Q(p)) )} \right), \] which explains the differences in the SE estimates.

One can observe that standard errors for the extreme quantiles are usually larger while the errors are smaller for quantiles around the median. This is because the data are sparser around the extreme quantiles and the sampling distributions of extreme quantile estimators are often skewed.

Reference

Woodruff, Ralph S. 1952. Quantile Variance Estimators in Complex Surveys.” Journal of the American Statistical Association 47 (260): 635. https://doi.org/10.2307/2280781.