The National Cancer Institute's (NCI) Surveillance, Epidemiology, and End Results (SEER) Program collects and releases de-identified data on individual cancer diagnoses and outcomes in the United States. The geographic locations of patients at the time of diagnosis are extremely valuable for studying associations between the characteristics of an area and cancer rates, as well as for detecting cancer clusters and monitoring geospatial disparities in cancer burden.

Census tracts are a very useful geographic unit to work with, but tract-level identifiers are not publicly available. The Synthetic California Breast Cancer Registry Data (SynCan) is a pilot data product that uses statistical models to synthesize the census tract of residence for each breast cancer patient diagnosed in California from 2006 to 2016. A selected set of variables, such as patients’ demographics, tumor characteristics, and census tract attributes, was modeled in a manner that potentially changes the census tracts of all patients within county boundaries while preserving the covariate relationships between the census tracts and the selected variables. The purpose of the SynCan is to provide external users with access to census tract data that are not publicly available because of confidentiality concerns. Without synthetic census tracts, much of the research that requires census tracts would be logistically difficult or impossible, as data permission must be obtained separately from each cancer registry and most (if not all) permissions require an IRB review. While most states within SEER catchment areas are supported by one Central Cancer Registry (CCR), the California Cancer Registry consists of one central cancer registry supported by 3 regional cancer registries. For straightforward descriptive analyses, the SynCan has been shown to produce cancer statistics by census tract-based socioeconomic variables that are similar to those from the actual data. The usefulness of SynCan for complex studies has yet to be established. More details about an earlier version of the SynCan, the synthesis methodology, and the utility of SynCan are documented in Yu et al. (2017).

This announcement solicits proposals for publishable substantive and geospatial methodological studies of breast cancer in California that require census tract-level data and will use SynCan. A secondary purpose is to evaluate the usefulness of synthetic census tract data in supporting real-world studies. The NCI will select a few proposals that represent a wide range of analysis types. The proposals will be judged solely on the feasibility of the proposed analysis. Once a proposal is accepted, the user will be given a SEER*Stat account, through which the data can be accessed and analyzed. The SynCan includes all variables that are currently released in the SEER research dataset and several census tract-level ecologic variables, such as median household income, median house value, median rent, percent below poverty, and education level. Users may be allowed to export individual data to be analyzed by any methods deemed appropriate for the study of interest. At the end of the study, the investigators will submit code for all statistical analyses to be reported (preferably written in SAS or R) to the NCI. NCI will run the analyses using the actual confidential data and report the results back to the investigators for publication. NCI will review publications for any confidentiality issues prior to submission. NCI will compile the results across the studies and plans to publish comparisons of SynCan results and validated results in a peer-reviewed publication. This publication will probably be a summary across completed analyses from all investigators. Interested investigators will have the option to be collaborators on this publication.
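As a rough illustration of the validation step described above, the sketch below runs one simple descriptive analysis on a synthetic extract and on a confidential extract, then reports the largest gap between the two result tables. It is only a minimal sketch under stated assumptions: the field names `poverty_group` and `late_stage`, the helper functions, and the toy records are all hypothetical and do not reflect the SynCan file layout or NCI's actual procedure. Python is used here for illustration only; the announcement asks for submitted code preferably in SAS or R.

```python
from collections import defaultdict


def share_by_group(records, group_key, flag_key):
    """Return {group: share of records in that group where flag_key is true}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        if rec[flag_key]:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


def max_abs_difference(synthetic, confidential):
    """Largest absolute gap between two result tables over their shared groups."""
    shared = synthetic.keys() & confidential.keys()
    return max(abs(synthetic[g] - confidential[g]) for g in shared)


# Toy records invented for illustration; they are not SEER or SynCan data.
synthetic_cases = [
    {"poverty_group": "low", "late_stage": False},
    {"poverty_group": "low", "late_stage": True},
    {"poverty_group": "high", "late_stage": True},
    {"poverty_group": "high", "late_stage": True},
]
confidential_cases = [
    {"poverty_group": "low", "late_stage": False},
    {"poverty_group": "low", "late_stage": False},
    {"poverty_group": "high", "late_stage": True},
    {"poverty_group": "high", "late_stage": False},
]

# The same analysis code is applied to both extracts, then compared.
synthetic_result = share_by_group(synthetic_cases, "poverty_group", "late_stage")
confidential_result = share_by_group(confidential_cases, "poverty_group", "late_stage")
gap = max_abs_difference(synthetic_result, confidential_result)
```

Keeping the analysis in a single function that takes the records as input means the identical code can be rerun on the confidential data without modification, which is the property the submit-and-validate arrangement relies on.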

The government's intent is to make up to 5 awards for this project. If you are interested in participating, submit a proposal priced per deliverable (please see section 9.0 of the Statement of Work) by 3:00 p.m. ET on May 31, 2020. Selection decision notifications will be sent by June 15, 2020, with awards made by June 30, 2020. Please see the Statement of Work (PDF, 236 KB) for additional details.

Under this agreement, investigators are required to submit analysis plans and table shells of analytic results to the NCI for approval by August 31, 2020, or two months after the award (whichever is later), and to submit code for approved analyses to the NCI no later than December 31, 2020, or six months after the award (whichever is later). NCI agrees not to submit its validation study until all selected investigators have had the opportunity to publish their results independently, but will wait no longer than December 31, 2021, or 18 months after the award (whichever is later). Investigators are also expected to sign the Data Use Agreement (PDF, 81 KB) before the validation study can proceed. Please use the attached application form (PDF, 111 KB) to prepare the proposal.

To submit an application or send inquiries, please contact:

NCI SEER Synthetic Data Staff
Surveillance Research Program
Division of Cancer Control and Population Sciences
National Cancer Institute
National Institutes of Health


Yu M, Reiter JP, Zhu L, Liu B, Cronin KA, Feuer EJR. Protecting Confidentiality in Cancer Registry Data With Geographic Identifiers. Am J Epidemiol 2017 Jul 1;186(1):83-91.