Development of Hybrid Data Mining-Epidemiologic Models to Assess Air Pollution-Associated Impacts on Mortality
The advent of more numerous and larger sources of data has created new, potentially revolutionary opportunities to combine and analyze these data to further our understanding of how environmental exposures, behaviors, and neighborhood characteristics affect the health of communities. The analysis of these so-called big datasets is complicated, however, by their high variety, where numerous, diverse covariates from multiple sources are incorporated into a single dataset for combined analysis, and by their high volume, with datasets including a very large number of observations and of variables per observation.
In air pollution epidemiology, the analysis of big data is already underway, as researchers begin to join together repositories of environmental data, electronic medical records, insurance claims, and neighborhood and behavioral characteristics. Despite this, analysis of big data in air pollution epidemiology is currently limited. In survival analyses of air pollution and health data, for example, analyses are typically performed on a joint dataset in "long" form that combines several data sources: a person-indexed, a pollution-indexed, and a location-indexed dataset. Naively joining these data by row creates a person-location-pollution dataset with an enormous number of columns and especially rows (often in the billions) and a great deal of redundant information, one that is too big to be read, processed, and analyzed using conventional epidemiological and computing approaches. As a result of these computing limitations, the few big data studies of air pollution health effects that have been conducted have been unable to fully capitalize on their wealth of information. These studies have reduced the size of their datasets by including data for only select metropolitan areas or by aggregating data into larger bins. These measures lose important information and, in so doing, limit the ability of the studies to test hypotheses, control for confounding, and identify key modifiable exposures.
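As a toy illustration of the row growth described above, the following Python sketch joins three small tables by row and compares the result with keeping the tables separate. It is not the pipeline used in this work: the table names, columns (`pct_urban`, `pm25`), ZIP codes, and values are all made up for illustration.

```python
# Toy illustration only: table names, columns, and values are made up.
ZIPS = ["00001", "00002"]   # hypothetical ZIP codes
MONTHS = range(1, 13)       # one year of monthly exposure records

# Person-indexed table: one row per person.
persons = [
    {"person_id": i, "zip": ZIPS[i % len(ZIPS)], "age": 65 + i % 30}
    for i in range(200)
]

# Location-indexed table: one row per ZIP code.
locations = {z: {"pct_urban": 50.0} for z in ZIPS}

# Pollution-indexed table: one row per ZIP code and month.
pollution = [
    {"zip": z, "month": m, "pm25": 10.0}
    for z in ZIPS
    for m in MONTHS
]


def naive_long_join(persons, locations, pollution):
    """Row-wise join: one output row per person per exposure record."""
    by_zip = {}
    for rec in pollution:
        by_zip.setdefault(rec["zip"], []).append(rec)
    return [
        {**p, **locations[p["zip"]], **rec}
        for p in persons
        for rec in by_zip.get(p["zip"], [])
    ]


long_rows = naive_long_join(persons, locations, pollution)
normalized_rows = len(persons) + len(locations) + len(pollution)

print(len(long_rows))    # rows in the joined "long" table
print(normalized_rows)   # rows when the three tables are kept separate
```

In this toy case, every person's age and every ZIP code's characteristics are repeated on each monthly row of the joined table, which is the kind of redundancy that grows with the number of people and time periods.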
Our work addresses these limitations through the development of new hybrid big data approaches that combine database management, computing, and visualization with epidemiological methods. We have used these new approaches to analyze our big dataset of over 30 billion observations, which includes information on air pollution exposures, mortality, and behavioral and neighborhood characteristics for over 60 million Medicare beneficiaries living in every ZIP code in the conterminous US between 2000 and 2014. In my talk, I will discuss our hybrid approach and its application to our analysis of air pollution impacts on mortality.
Bio: Helen H. Suh, Sc.D., is a Professor in the Department of Civil and Environmental Engineering at Tufts University. She is also Affiliate Faculty in the Department of Health Sciences at Northeastern University and a Senior Fellow at NORC at the University of Chicago. Dr. Suh is an expert in air pollution epidemiology and exposure science. Over the past 20-plus years, she has directed and collaborated on numerous air pollution health effect studies, including those funded by the US National Institutes of Health, the US Environmental Protection Agency, the US Department of Energy, and the Electric Power Research Institute. As part of these studies, she has collaborated with experts in epidemiology, engineering, computer science, statistics, sociology, and medicine. Dr. Suh's research focuses on three general areas within air pollution health effects: (1) assessing the impact of air pollution, lifestyle, and neighborhoods on human health, (2) examining multi-pollutant impacts on human health, and (3) developing novel methods for epidemiological research, including GIS-based spatio-temporal modeling tools and hybrid data mining-epidemiological approaches. Findings from her research are published in leading peer-reviewed scientific journals and have been presented at numerous scientific and professional conferences and forums. Dr. Suh has previously served as a member of the charter EPA Clean Air Scientific Advisory Committee, as a member of several National Academy of Sciences and Institute of Medicine committees, and as an Associate Editor of the International Journal of Exposure Science and Environmental Epidemiology.