Executive Summary

The COVID-19 pandemic has caused more than 500 million confirmed SARS-CoV-2 infections and 6.72 million deaths globally. In response, several SARS-CoV-2 vaccines were rapidly developed and received Emergency Use Authorization from the Food and Drug Administration (FDA). However, limited data exist on the effectiveness of these vaccines in patient populations that may have an impaired immune response to vaccination or natural infection.

A critical population of interest is immunocompromised patients, who are at increased risk for COVID-19 infection, often with severe outcomes. This population includes patients with cancer, patients with systemic autoimmune rheumatic conditions, and transplant recipients. Data from clinical trials and prospective cohort studies on vaccine immunogenicity and efficacy in these populations are relatively scarce, as are data on serologic response to infection. Real-world data (RWD) provide an important opportunity to examine the potential protective effect of vaccination and natural infection, both at the population level and in more detail for specific higher-risk subpopulations.

A large-scale RWD infrastructure with standardized, immunologically relevant data, particularly from validated serology assays, would allow researchers to answer questions about evolving issues over time. Such an infrastructure is critical for developing evidence-based guidelines for the diagnostic evaluation of patients across risk strata, and for guidance on the role of boosters and/or treatment across those strata. These strata include patients undergoing cancer treatment, those who are immunocompromised from comorbid conditions such as diabetes or rheumatologic diseases, and those exposed to immunosuppressive medications. RWD may also be useful in an evolving situation (such as the COVID-19 pandemic) when clinical trials are not possible, or for patients unlikely to participate in clinical trials. These data can be accessible in near real time and can inform important questions such as the durability of protection against infection or re-infection, based on patients' vaccination status and/or prior exposure to SARS-CoV-2.

Project Objectives

The primary objective of the CRWDi is to develop an infrastructure that uses RWD from a broad set of data providers, linked through secure privacy-preserving record linkage (PPRL)¹ technology, to provide the most comprehensive longitudinal data possible in a single data platform. This platform will enable ongoing research on important and timely questions related to the SARS-CoV-2 pandemic. The specific goals of this project center on the impact of COVID-19 infection and of immunization against SARS-CoV-2 in the immunocompromised patient population, with a special focus on patients with cancer.
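
As an illustration of how privacy-preserving linkage can work in principle, the sketch below derives a keyed hash ("token") from normalized identifiers so that two data providers can match records without exchanging names or birth dates. This is not the project's or any vendor's actual tokenization scheme; the field choice, the `normalize` and `linkage_token` helpers, and the shared key are all assumptions made for the example.

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Lowercase, strip accents, and drop non-alphanumeric characters."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.lower() if ch.isalnum())


def linkage_token(first: str, last: str, dob: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 hash of normalized identifiers.

    Hypothetical scheme: real PPRL products use their own fields and keys.
    """
    message = "|".join([normalize(first), normalize(last), normalize(dob)])
    return hmac.new(secret_key, message.encode("utf-8"), hashlib.sha256).hexdigest()


# Assumed shared secret; in practice this would be managed by a trusted party.
KEY = b"example-shared-secret"

# The same person, recorded slightly differently by two data providers.
claims_token = linkage_token("José", "O'Neil", "1960-04-12", KEY)
registry_token = linkage_token("jose", "ONEIL ", "1960-04-12", KEY)
other_token = linkage_token("jose", "ONEIL", "1961-04-12", KEY)
```

Because only tokens are compared, records that share a token can be joined while the underlying identifiers stay with each provider. Exact-match hashing of this kind does not tolerate typos, which is one reason production systems typically generate several tokens per person.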

Data Partners

The primary data partners for this infrastructure include:

  • HealthVerity (including Medical Claims, Pharmacy Claims, lab data and vaccination data [representing 5.2 million records])
  • Cancer (SEER) Data – 48.9% of U.S. cancer patient data

Data Types

Medical Claims

Medical claims come from two primary types of sources: those that reflect a patient’s medical activity from the health plan perspective and those that measure activity from the data provider perspective. Each may be useful depending on the type of analysis being performed and the sample frame required. For this infrastructure, the patients included are primarily individuals with continuous enrollment in both medical and pharmacy benefits for at least 6 months. This ensures comprehensive capture of health care system utilization, including visits and treatments such as oral medications used to treat COVID-19 and comorbidities such as cancer or rheumatoid arthritis.
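
The continuous-enrollment criterion can be sketched as follows. This is a minimal illustration, not the project's actual cohort logic: it assumes enrollment is available as (start, end) date spans per benefit type, treats adjacent spans as continuous, and approximates "6 months" as 183 days. All function names are hypothetical.

```python
from datetime import date, timedelta


def overlap(medical, pharmacy):
    """Date ranges during which both benefit types are active."""
    both = []
    for m_start, m_end in medical:
        for p_start, p_end in pharmacy:
            start, end = max(m_start, p_start), min(m_end, p_end)
            if start <= end:
                both.append((start, end))
    return both


def longest_continuous_days(spans):
    """Longest run of covered days, merging overlapping or adjacent spans."""
    longest = 0
    run_start = run_end = None
    for start, end in sorted(spans):
        if run_end is not None and start <= run_end + timedelta(days=1):
            run_end = max(run_end, end)  # extend the current run
        else:
            run_start, run_end = start, end  # a gap: begin a new run
        longest = max(longest, (run_end - run_start).days + 1)
    return longest


def is_continuously_enrolled(medical, pharmacy, min_days=183):
    """True if medical and pharmacy coverage overlap for min_days in a row."""
    return longest_continuous_days(overlap(medical, pharmacy)) >= min_days


# Example: medical coverage all year; pharmacy coverage with a two-month gap.
medical = [(date(2021, 1, 1), date(2021, 12, 31))]
pharmacy_gap = [
    (date(2021, 1, 1), date(2021, 3, 31)),
    (date(2021, 6, 1), date(2021, 9, 30)),
]
pharmacy_full = [(date(2021, 2, 1), date(2021, 9, 30))]
```

Requiring the overlap of both benefit types, rather than checking each separately, matters because a patient with pharmacy coverage alone would have oral treatments captured but not visits, and vice versa.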

Pharmacy Claims

Pharmacy data may be generated from several different sources and, depending on the source type, may reflect a different view or capture rate of a patient’s pharmacy utilization. There are two primary source types: those that reflect a patient’s prescription activity from the health plan perspective (as described above) and those that measure dispensing from the pharmacy directly or indirectly (this includes pharmacies, pharmacy system vendors, clearinghouses, and Pharmacy Benefit Managers [PBMs]). For this RWD infrastructure, the pharmacy data included are as described above; in addition, any COVID-19 vaccination for patients meeting the above criteria will be included, whether or not the vaccination is covered under their health insurance.


Vaccination Data

Vaccine data are found in pharmacy claims as well as in the events represented in the claims data.


Lab Data

Lab results included in this infrastructure reflect testing for SARS-CoV-2 (diagnostic and/or serologic). The data come directly from LabCorp and Quest Diagnostics, the two largest commercial laboratories in the United States. Data include the COVID-19 test order, the specific test type, the result, and the interpretation and interpretive range for the result. The interpretation and range may be specific to the lab that ran the test. For an additional 0.46 million patients in the HealthVerity-hosted data set, SARS-CoV-2 test results (diagnostic and/or serologic) will be available from the laboratory of a major integrated delivery network (Northwell Health).

Cancer Data

Information on patients with cancer is derived from multiple sources. The primary source is data captured via the SEER registries, which includes cases diagnosed between 2010 and 2022. Key data include a detailed characterization of the tumor, stage at diagnosis, types of therapy in the initial course of treatment, and survival. In addition, medical and pharmacy claims will be linked to capture specific treatments, as well as comorbid conditions among these patients, for any treatment provided from 2018 through 2023.

Data available in this infrastructure will include a broad set of detailed data from 12/2018 through 12/2023. The cohort available to address specific study questions includes claims data for patients continuously enrolled in medical and pharmacy insurance programs, linked to COVID-19 laboratory test results (NAAT diagnostic testing and serologic testing) from large national clinical laboratories and to COVID-19 vaccination status as available.
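
The cohort assembly described above can be pictured as a left join from the enrolled population to lab and vaccination records, so that patients without a test or a recorded vaccination are kept rather than dropped. The sketch below is illustrative only: the `patient_token` key and the other field names are hypothetical and do not describe the platform's actual schema.

```python
from collections import defaultdict


def group_by_patient(rows):
    """Index a list of record dicts by their (hypothetical) patient_token."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["patient_token"]].append(row)
    return grouped


def build_cohort(enrolled_tokens, lab_results, vaccinations):
    """Attach labs and vaccinations to each enrolled patient, if available."""
    labs = group_by_patient(lab_results)
    vax = group_by_patient(vaccinations)
    return {
        token: {
            "labs": labs.get(token, []),          # empty list if never tested
            "vaccinations": vax.get(token, []),   # empty list if none recorded
        }
        for token in enrolled_tokens
    }


# Toy records with made-up tokens and values.
enrolled = ["tok_a", "tok_b"]
lab_rows = [
    {"patient_token": "tok_a", "test_type": "NAAT", "result": "negative"},
    {"patient_token": "tok_c", "test_type": "serology", "result": "positive"},
]
vax_rows = [{"patient_token": "tok_b", "dose_date": "2021-03-01"}]

cohort = build_cohort(enrolled, lab_rows, vax_rows)
```

Keeping patients with empty lab or vaccination lists preserves the denominator, which matters when estimating rates among all continuously enrolled patients rather than only among those tested.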

There will be three sets of targeted diseases representing patients with an immunocompromising condition in this dataset:

  1. Cancer patients: The SEER data are linked and provide more detailed information on cancer patients (including patients whose cancer may have been initially diagnosed as early as 2010). This information includes characterization of the cancer at diagnosis (stage, treatment initiation and type, and survival) and, as data are available, the current cancer status of the patient. For current status, the combined SEER and HealthVerity data will be leveraged to identify patients with a higher-risk diagnosis (such as leukemia or lymphoma), those who are undergoing active treatment with specific agents (e.g., rituximab), and/or those who may have metastatic disease.
  2. Patients with rheumatologic diseases (identified through claims data), including patients who are likely to be immunocompromised (due to diagnosis and/or specific treatment).
  3. Patients who have undergone a transplant, either a solid organ transplant (identified through the claims data) or a stem cell transplant (identified through both SEER and the claims data).

The focus for each of these three disease groups will be patients with continuous health plan enrollment for both medical and pharmacy claims, to support understanding of the chronologic relationships between their condition and COVID-19 status (inclusive of vaccination status, with chronology also informing the likely COVID-19 variant).

Additional cohorts available from this data infrastructure include pediatric patients (under age 19) and patients with evidence of Long COVID². While there is currently no specific definition for this sequela of COVID-19, it is an area of priority for the NIH and for these immunocompromised populations.

¹ PPRL (Privacy-Preserving Record Linkage): a strategy that allows records to be linked together without revealing identifying information.
² HHS Announces the Formation of the Office of Long COVID Research and Practice and Launch of Long COVID Clinical Trials Through the RECOVER Initiative