5.4 Estimation of population quantiles
Let \(F_y(t) = N^{-1}\sum_{h=1}^H\sum_{i=1}^{N_h}I(y_{hi}\le t)\) be the finite population distribution function, where \(I(\cdot)\) is the indicator function. The \(100p\)th population quantile with \(p\in (0, 1)\) is defined as
\[ Q(p) = F^{-1}_y(p) = \inf\{t \mid F_y(t) \ge p\}\,. \]
Suppose we want to estimate the population quantiles of the self-reported weight and height (\(\texttt{HWT_WGHT_KG_TRM}\), \(\texttt{HWT_DHT_M_TRM}\)). The population median corresponds to the \(50\%\) quantile. The following codes can be used.
R
SAS
PROC SURVEYMEANS data = CLSAData
QUANTILE = (0.025 0.05 0.1 0.5 0.9 0.95 0.975) NONSYMCL;
VAR HWT_DHT_M_TRM HWT_WGHT_KG_TRM ;
STRATA GEOSTRAT_TRM ;
WEIGHT WGHTS_INFLATION_TRM;
RUN;
SPSS and Stata
There is no formal procedure available to produce quantile estimates and their standard errors in the \(\texttt{SPSS}\) and \(\texttt{Stata}\) packages.
Result comparison
Quantile | R | SAS | R | SAS | |
---|---|---|---|---|---|
Estimate | 0.025 | 1.5364 | 1.5364 | 48.0964 | 48.0964 |
0.050 | 1.5457 | 1.5457 | 51.7715 | 51.7715 | |
0.100 | 1.5835 | 1.5835 | 57.3716 | 57.3716 | |
0.500 | 1.6814 | 1.6814 | 77.1140 | 77.1140 | |
0.900 | 1.7643 | 1.7643 | 97.5720 | 97.5720 | |
0.950 | 1.7902 | 1.7902 | 104.7509 | 104.7509 | |
0.975 | 1.8115 | 1.8115 | 112.9293 | 112.9293 | |
SE | 0.025 | 0.0066 | 0.0067 | 1.3059 | 2.6834 |
0.050 | 0.0075 | 0.0051 | 1.8915 | 1.9615 | |
0.100 | 0.0117 | 0.0117 | 1.7457 | 1.7725 | |
0.500 | 0.0064 | 0.0064 | 0.8445 | 0.8438 | |
0.900 | 0.0064 | 0.0060 | 1.9202 | 1.8917 | |
0.950 | 0.0070 | 0.0070 | 3.1488 | 2.9399 | |
0.975 | 0.0100 | 0.0097 | 5.5598 | 5.5487 |
For the standard error (SE) estimation, both \(\texttt{R}\) and \(\texttt{SAS}\) first construct 95% confidence intervals (CIs) by Woodruff’s method (Woodruff 1952), and then compute the standard errors from the division of the CI lengths by \(t_{df, 0.025}\), the \(97.50\)th percentile of the \(t\) distribution with degrees of freedom, \(df\), which is determined by the survey data and the survey design. For CLSA, the degrees of freedom is the number of observations minus the number of strata. If the sample size is relatively large, we can simply replace \(t_{df, 0.025}\) by \(z_{0.025}\), the \(97.50\)th percentile from the standard normal distribution. The difference in standard errors is due to different implementation of Woodruff interval. Let \(y_{(1)}\le y_{(2)} \le \cdots \le y_{(n)}\) denote the sample order statistics for the variable \(Y\). The \(100 p\)th population quantile estimate is computed as
\[\begin{align*}
\hat Q(p) =\left\{
\begin{array}{ll}
y_{(1)} & \mbox{if } \;\; p \le \hat F_y(y_{(1)})\\
y_{(k)}+\frac{p-\hat F_y(y_{(k)})}{\hat F_y(y_{(k+1)}) -\hat F_y(y_{(k)})} (y_{(k+1)}-y_{(k)}) & \mbox{if } \;\; \hat F_y(y_{(k)}) <p \le \hat F_y(y_{(k+1)})\\
%%%y_{(n)} & \mbox{if } p = 1 \\
\end{array}
\right . ,
\end{align*}\]
where \(\hat F_y (t)= \hat{N}^{-1} \sum^H_{h=1}\sum_{i\in \mathcal{S}_h} w_{hi}I(y_{hi}\leq t)\) is the estimated cumulative distribution for \(Y\) and \(\hat{N} = \sum^H_{h=1}\sum_{i\in \mathcal{S}_h} w_{hi}\). The variance of the estimated distribution function \(\hat F_y(t)\) at \(t=Q(p)\) can be estimated as \(\hat V( \hat F_y(\hat Q(p)) )= \hat{N}^{-2} \sum_{h=1}^{H} \frac{n_h}{n_h-1} \sum_{i \in \mathcal{S}_h}(e_{hi}-\bar e_{h\cdot} )^2\), where \(e_{hi}= w_{hi}I(y_{hi}\leq \hat Q(p))- \hat F_y(\hat Q(p))\), \(\bar e_{h\cdot} = n_h^{-1}\sum_{i \in \mathcal{S}_h} e_{hi}\). The CI for \(100p\)th quantile can be obtained as \((\hat Q(\hat p_L), \hat Q(\hat p_U))\).
In R, \(\hat p_L\) and \(\hat p_U\) are implemented as
\[
(\hat p_L, \hat p_U) = \left(p - t_{df, \alpha/2}\sqrt{\hat V( \hat F_y(\hat Q(p)) )}, \;\;\; p + t_{df, \alpha/2}\sqrt{\hat V( \hat F_y(\hat Q(p)) )} \right),
\]
while in \(\texttt{SAS}\), \(\hat p_L\) and \(\hat p_U\) are implemented as
\[
(\hat p_L, \hat p_U) = \left( \hat F_y(\hat Q(p)) - t_{df, \alpha/2}\sqrt{\hat V( \hat F_y(\hat Q(p)) )}, \;\;\; \hat F_y(\hat Q(p)) + t_{df, \alpha/2}\sqrt{\hat V( \hat F_y(\hat Q(p)) )} \right),
\]
which explains the differences in the SE estimates.
One can observe that standard errors for the extreme quantiles are usually larger while the errors are smaller for quantiles around the median. This is because the data are sparser around the extreme quantiles and the sampling distributions of extreme quantile estimators are often skewed.