1 Introduction

Health-care researchers use survey data to estimate the prevalence of particular diseases in a population, to establish association and causation among important health factors and measures, and to inform the creation and implementation of new health policies. For example, the Canadian Longitudinal Study on Aging (CLSA) aims to facilitate high-impact research on healthy aging, to investigate health evidence, and to propose policies for aging Canadians (Raina et al. 2019).

As recognized in the literature (Damico 2009; Lumley 2010) and by researchers in many scientific fields, R is a powerful open-source statistical software environment with strong capabilities for data manipulation and for presenting results in many forms. In recent years, RStudio, an integrated development environment for R, has made it easier to analyze data and visualize results in R. Nevertheless, R has not yet been widely adopted as the preferred statistical platform in public health and epidemiology. This is partly because publicly available technical documents lack systematically presented R code and examples, and partly because of concerns about how R's results compare with those from established commercial software packages. Unfamiliarity with the similarities and differences between R and these commercial packages makes researchers reluctant to switch to R as their primary statistical software.

The main objective of this paper is to describe the key steps required to analyze complex survey data using R, SAS, SPSS and Stata, with particular attention to dataset preparation, data importation and detailed statistical analyses of the CLSA datasets, and to produce the finite population estimates commonly reported by health researchers. The secondary objective is to provide R examples for these analyses and to promote R as a preferred platform for researchers in related fields.

We compare estimates and standard errors from various statistical procedures implemented in selected proprietary statistical packages, namely SAS, SPSS and Stata, with those from the R package survey (Lumley 2019), using data from the Canadian Longitudinal Study on Aging (CLSA). We compute standard errors by Taylor series linearization, using the variance estimator of the Hansen-Hurwitz estimator (Lumley 2004) adjusted by the finite population correction factor; this is the default option in all the survey packages we compared.
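
For orientation, a common textbook form of this variance estimator for an estimated total $\hat{T}$ is sketched below. It assumes a stratified design in which primary sampling units (PSUs) are treated as sampled with replacement within strata. The notation is ours and is an assumption, not the notation of any package's documentation: $h$ indexes strata, $i$ indexes PSUs, $k$ indexes respondents, $w_{hik}$ is the sampling weight, $y_{hik}$ the response, and $n_h$ and $N_h$ the numbers of sampled and population PSUs in stratum $h$.

$$
\hat{V}(\hat{T}) = \sum_{h=1}^{H} (1 - f_h)\,\frac{n_h}{n_h - 1} \sum_{i=1}^{n_h} \left(z_{hi} - \bar{z}_h\right)^2,
\qquad
z_{hi} = \sum_{k} w_{hik}\, y_{hik},
\quad
\bar{z}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} z_{hi},
\quad
f_h = \frac{n_h}{N_h}.
$$

Here $\hat{T} = \sum_{h}\sum_{i} z_{hi}$, and $(1 - f_h)$ is the finite population correction. For a nonlinear statistic, Taylor series linearization applies the same formula with $y_{hik}$ replaced by a linearized variable. For a ratio $\hat{R} = \hat{T}_y / \hat{T}_x$, for example, the linearized variable is $u_{hik} = (y_{hik} - \hat{R}\,x_{hik}) / \hat{T}_x$.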

References

Damico, Anthony. 2009. “Transitioning to R: Replicating SAS, Stata, and SUDAAN Analysis Techniques in Health Policy Data.” R Journal 1 (2): 37–44. https://doi.org/10.32614/rj-2009-018.

Lumley, Thomas. 2004. “Analysis of Complex Survey Samples.” Journal of Statistical Software 9: 1–19. https://doi.org/10.18637/jss.v009.i08.

———. 2010. Complex Surveys. Hoboken, NJ, USA: John Wiley & Sons, Ltd. https://doi.org/10.1002/9780470580066.

———. 2019. survey: Analysis of Complex Survey Samples. https://cran.r-project.org/package=survey.

Raina, Parminder, Christina Wolfson, Susan Kirkland, Lauren Griffith, Cynthia Balion, Benoît Cossette, Isabelle Dionne, et al. 2019. “Cohort Profile: The Canadian Longitudinal Study on Aging (CLSA).” International Journal of Epidemiology, 1–12. https://doi.org/10.1093/ije/dyz173.