The National Cancer Institute’s (NCI’s) Surveillance, Epidemiology, and End Results (SEER) Program collects and releases de-identified data on individual cancer diagnoses and outcomes in the United States. Patients’ geographic locations at the time of diagnosis are valuable for studying associations between area characteristics and cancer rates, for detecting cancer clusters, and for monitoring geospatial disparities in cancer burden.
Census tracts are a particularly useful geographic unit, but patients’ census tracts are not publicly available. The Synthetic California Breast Cancer Registry Data (SynCan) is a pilot data product that uses statistical models to synthesize the census tract of residence for each breast cancer patient diagnosed in California in 2012. A selected set of variables was modeled in a manner that potentially changes the census tract of every patient within county boundaries while preserving the covariate relationships between the census tracts and the selected variables. The purpose of SynCan is to give external users access to census tract data that are not publicly available because of confidentiality concerns. Without synthetic census tracts, much of the research that requires census tracts would be logistically difficult or impossible, because data permission must be obtained separately from each cancer registry and most (if not all) permissions require an IRB review. The burden is compounded in California: while most states within SEER catchment areas are supported by one Central Cancer Registry (CCR), California is supported by four CCRs. For straightforward descriptive analyses, SynCan has been shown to produce cancer statistics by census tract-based socioeconomic variables that are similar to those from the confidential data. Its usefulness for complex studies has yet to be established. More details about SynCan, the synthesis methodology, and the utility of SynCan are documented in Yu et al. (2017).
This announcement solicits proposals for publishable substantive and geospatial methodological studies of breast cancer in California that require census tract-level data and will use SynCan. A secondary purpose is to evaluate the usefulness of synthetic census tract data in supporting real-world studies. The NCI will select a few proposals that represent a wide range of analysis types. The proposals will be judged solely on the feasibility of the proposed analysis.
Once a proposal is accepted, the user will be given a SEER*Stat account, through which the data can be accessed and analyzed. SynCan includes all variables that are currently released in the SEER research dataset and several census tract-level ecologic variables, such as median household income, median house value, median rent, percent below poverty, and education level. Users may be allowed to export individual-level data to be analyzed by any methods deemed appropriate for the study of interest.
At the end of the study, the investigators will submit code for all statistical analyses (preferably written in SAS or R) to the NCI. NCI will run all analyses to be reported using the actual confidential data and return the results to the investigators for publication. NCI will review publications for any confidentiality issues prior to submission. NCI will compile the results across the studies and plans to publish comparisons of SynCan results and validated results in a peer-reviewed publication. This publication will likely be a summary across completed analyses from all investigators. Interested investigators will have the option to be collaborators on this publication.
Researchers are invited to submit study proposals by September 30, 2019, with decision notifications by October 31, 2019. Investigators are expected to submit code for final analyses to the NCI no later than June 30, 2020.
NCI agrees not to submit its validation study until all investigators have had the opportunity to publish their results independently, but will wait no later than March 31, 2021. For accepted proposals, extensions of the dates for submitting code or publishing results can be considered.
Please use the attached application form (PDF, 78 KB) to prepare the proposal. Investigators are also expected to sign the Data Use Agreement (PDF, 118 KB) before the validation study can proceed. To submit an application or send inquiries, please contact:
NCI SEER Synthetic Data Staff
Surveillance Research Program
Division of Cancer Control and Population Sciences
National Cancer Institute
National Institutes of Health
Yu M, Reiter JP, Zhu L, Liu B, Cronin KA, Feuer EJR. Protecting Confidentiality in Cancer Registry Data With Geographic Identifiers. Am J Epidemiol 2017 Jul 1;186(1):83-91.