Simultaneous Causal Inference and Probabilistic Record Linkage in Observational Studies with Covariates Spread Over Two Files
Institutional Repository Document


  • We consider observational studies with data spread over two files. One file includes the treatment, outcome, and some covariates measured on a set of individuals, and the other file includes additional covariates measured on a partially intersecting set of individuals. In the absence of direct identifiers, researchers typically estimate causal effects in two stages: they construct a linked database with probabilistic record linkage, then apply causal estimators to the linked data. This approach does not take advantage of relationships among the variables to improve the linkage quality, nor does it propagate uncertainty from imperfect linkages to the causal inferences. We address these shortcomings via a Bayesian joint modeling framework for simultaneous causal inference and probabilistic record linkage. The Markov chain Monte Carlo sampler generates multiple plausible linked data files as byproducts. We use these datasets for multiple imputation inferences with two causal estimators, one regression-adjusted and the other unadjusted, based on propensity score overlap weights. Using simulations and data from the Italian Survey on Household Income and Wealth, we show that the joint model with both estimators can improve the accuracy of estimated treatment effects compared to analogous two-stage procedures.
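
The multiple imputation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each plausible linked data file has already been reduced to an outcome vector, a 0/1 treatment vector, and estimated propensity scores, and it uses the standard unadjusted overlap-weighted difference in means together with Rubin's combining rules. The function names `overlap_weighted_effect` and `rubin_combine` are hypothetical, chosen for illustration only.

```python
from statistics import mean, variance


def overlap_weighted_effect(y, z, e):
    """Unadjusted overlap-weighted difference in means (illustrative).

    y: outcomes, z: 0/1 treatment indicators, e: estimated propensity scores.
    Treated units get weight 1 - e, control units get weight e.
    """
    num_t = sum((1 - ei) * yi for yi, zi, ei in zip(y, z, e) if zi == 1)
    den_t = sum((1 - ei) for zi, ei in zip(z, e) if zi == 1)
    num_c = sum(ei * yi for yi, zi, ei in zip(y, z, e) if zi == 0)
    den_c = sum(ei for zi, ei in zip(z, e) if zi == 0)
    return num_t / den_t - num_c / den_c


def rubin_combine(estimates, variances):
    """Combine per-dataset estimates with Rubin's rules.

    estimates: point estimate from each linked data file (at least two).
    variances: estimated within-dataset variance of each point estimate.
    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    q_bar = mean(estimates)            # pooled point estimate
    u_bar = mean(variances)            # average within-dataset variance
    b = variance(estimates)            # between-dataset variance (m - 1 denominator)
    total = u_bar + (1 + 1 / m) * b    # total variance
    return q_bar, total
```

In this sketch, the between-dataset variance term is where linkage uncertainty enters: estimates that differ across the plausible linked files inflate the total variance of the pooled treatment effect.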

author list (cited authors)

  • Guha, S., & Reiter, J. P.

complete list of authors

  • Guha, Sharmistha; Reiter, Jerome P.

publication date

  • November 2021