A Principled Approach to Characterize and Analyze Partially Observed Confounder Data from Electronic Health Records.


  • OBJECTIVE: Partially observed confounder data pose challenges to the statistical analysis of electronic health records (EHR), and systematic assessments of the potentially underlying missingness mechanisms are lacking. We aimed to provide a principled approach to empirically characterize missing data processes and to investigate the performance of analytic methods.
    METHODS: Three empirical sub-cohorts of diabetic SGLT2 or DPP4-inhibitor initiators with complete information on HbA1c, BMI and smoking as confounders of interest (COI) formed the basis of data simulation under a plasmode framework. A true null treatment effect, including the COI in the outcome generation model, and four missingness mechanisms for the COI were simulated: missing completely at random (MCAR), missing at random (MAR), and two missing not at random (MNAR) mechanisms, in which missingness depended on an unmeasured confounder and on the value of the COI itself, respectively. We evaluated the ability of three groups of diagnostics to differentiate between mechanisms: 1) differences in characteristics between patients with and without an observed COI (using averaged standardized mean differences [ASMD]), 2) predictive ability of the missingness indicator based on observed covariates, and 3) association of the missingness indicator with the outcome. We then compared analytic methods, including "complete case" analysis, inverse probability weighting, and single and multiple imputation, in their ability to recover true treatment effects.
    RESULTS: The diagnostics successfully identified characteristic patterns of the simulated missingness mechanisms. For MAR, but not MCAR, the patient characteristics showed substantial differences (median ASMD 0.20 vs 0.05) and, consequently, discrimination of the prediction models for missingness was also higher (0.59 vs 0.50). For MNAR, but not MAR or MCAR, missingness was significantly associated with the outcome even in models adjusting for other observed covariates. Comparing analytic methods, multiple imputation using a random forest algorithm resulted in the lowest root-mean-squared-error.
    CONCLUSION: Principled diagnostics provided reliable insights into missingness mechanisms. When assumptions allow, multiple imputation with nonparametric models could help reduce bias.
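
The three groups of diagnostics described in the abstract can be illustrated with a small toy simulation. The sketch below is not the authors' plasmode simulation or their software: the variable names (`x`, `u`, `c`, `y`), all coefficients, the continuous outcome, the single observed covariate, and the helper functions are assumptions made purely for illustration. It generates one observed covariate, one unmeasured confounder, and one confounder of interest, imposes the four missingness mechanisms, and computes a standardized mean difference, a rank-based AUC for predicting missingness, and the covariate-adjusted association of the missingness indicator with the outcome.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean(v):
    return sum(v) / len(v)


def var(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)


def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)


def abs_smd(x, miss):
    """Diagnostic 1: absolute standardized mean difference of x between
    patients with a missing vs an observed confounder of interest."""
    g1 = [a for a, m in zip(x, miss) if m]
    g0 = [a for a, m in zip(x, miss) if not m]
    pooled_sd = math.sqrt((var(g1) + var(g0)) / 2.0)
    return abs(mean(g1) - mean(g0)) / pooled_sd


def auc(score, label):
    """Diagnostic 2: rank-based (Mann-Whitney) AUC of a score for a 0/1 label.
    Assumes no ties in score, which holds for continuous draws."""
    order = sorted(range(len(score)), key=lambda i: score[i])
    rank_sum = sum(r + 1 for r, i in enumerate(order) if label[i])
    n1 = sum(label)
    n0 = len(label) - n1
    return (rank_sum - n1 * (n1 + 1) / 2.0) / (n1 * n0)


def adjusted_coef(y, m, x):
    """Diagnostic 3: OLS coefficient of m in the model y ~ m + x."""
    smm, sxx, smx = var(m), var(x), cov(m, x)
    smy, sxy = cov(m, y), cov(x, y)
    return (smy * sxx - smx * sxy) / (smm * sxx - smx ** 2)


def run_diagnostics(n=20000, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]   # observed covariate
    u = [rng.gauss(0, 1) for _ in range(n)]   # unmeasured confounder
    # Confounder of interest (COI); its values are what would go missing.
    c = [0.5 * a + 0.5 * b + rng.gauss(0, 1) for a, b in zip(x, u)]
    # Outcome depends on the COI and covariates; no treatment term (null effect).
    y = [0.5 * ci + 0.5 * ui + 0.3 * xi + rng.gauss(0, 1)
         for ci, ui, xi in zip(c, u, x)]

    # Variable that drives missingness under each (assumed) mechanism.
    drivers = {
        "MCAR": [0.0] * n,        # constant probability
        "MAR": x,                 # depends on an observed covariate
        "MNAR_unmeasured": u,     # depends on an unmeasured confounder
        "MNAR_value": c,          # depends on the COI value itself
    }
    results = {}
    for name, d in drivers.items():
        miss = [int(rng.random() < sigmoid(-1.0 + di)) for di in d]
        results[name] = {
            "asmd": abs_smd(x, miss),
            "auc": auc(x, miss),
            "outcome_coef": adjusted_coef(y, miss, x),
        }
    return results


if __name__ == "__main__":
    for name, r in run_diagnostics().items():
        print(name, {k: round(v, 3) for k, v in r.items()})
```

In this toy setting the first two diagnostics separate MAR from MCAR, while the adjusted association with the outcome is what distinguishes the MNAR mechanisms from MAR and MCAR. The magnitudes depend entirely on the assumed coefficients and should not be compared with the values reported in the abstract.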

published proceedings

  • Clin Epidemiol

author list (cited authors)

  • Weberpals, J., Raman, S. R., Shaw, P. A., Lee, H., Russo, M., Hammill, B. G., ... Desai, R. J.

complete list of authors

  • Weberpals, Janick||Raman, Sudha R||Shaw, Pamela A||Lee, Hana||Russo, Massimiliano||Hammill, Bradley G||Toh, Sengwee||Connolly, John G||Dandreo, Kimberly J||Tian, Fang||Liu, Wei||Li, Jie||Hernández-Muñoz, José J||Glynn, Robert J||Desai, Rishi J

publication date

  • January 2024