Privacy-by-Design: Understanding Data Access Models for Secondary Data.
Today there is a constant flow of data into, out of, and between ever-larger and ever-more complex databases about people. Together, these digital traces capture our social genome, the footprints of our society. The burgeoning field of population informatics is the systematic study of populations via secondary analysis of such massive data collections (termed "big data") about people. In particular, health informatics analyzes electronic health records to improve health outcomes for a population. Privacy protection in such secondary data analysis research is complex and requires a holistic approach that combines technology, statistics, policy, and a cultural shift toward information accountability through transparency rather than secrecy. We review the state of the art in privacy protection technology and policy frameworks from widely different fields and synthesize the findings to present a comprehensive system of privacy protection in population informatics research using the privacy-by-design approach. Based on common activities in the workflow, we describe the pros and cons of four data access models (restricted access, controlled access, monitored access, and open access) that minimize risk and maximize usability of data. We then evaluate the system by analyzing the risk and usability of data through a realistic example. We conclude that, deployed together, the four data access models can provide a comprehensive system for privacy protection, balancing the risk and usability of secondary data in population informatics research.