Considerations for Survival Cohort Definitions

When calculating limited-duration prevalence, SEER*Stat uses survival tables to adjust for cases lost to follow-up ("lost cases"). It is important to consider all variables in the data that could affect the survival of the lost cases. It is generally recommended to consider variables such as age at diagnosis, year of diagnosis, sex, race, and cancer site when defining survival cohorts. It is also strongly recommended to include any variable that is used as a display variable or used directly in the calculation of the requested statistics.

Many databases contain multiple variables that are related to one another; therefore, SEER*Stat may issue a warning message even if the variable is actually represented in the survival cohort definition. For example, if you are calculating prevalence percents by race, you must use the race variable that is also in the population data, e.g., "Race recode A" (All races, white, black, other, unknown). However, you may want to perform survival calculations by a more detailed race variable such as "Race/ethnicity", which includes more specific racial groups. In this case, you should use "Race recode A" as a table variable and "Race/ethnicity" in the survival cohort definition. SEER*Stat will not recognize that the table variable (Race recode A) represents the same type of data as the variable used in the survival cohort definition (Race/ethnicity), and will issue a warning message. In this situation, you can ignore the warning.

The three situations that cause SEER*Stat to issue the warning are:

SEER*Stat issues warnings in these situations because failure to consider a display or calculation variable in the definition of survival cohorts may cause large inaccuracies. Other situations may cause inaccuracies as well, but generally have a lesser impact.

Consider cancer site as an example. When the Surveillance Research Program calculates prevalence, survival cohorts are always defined by individual site recode values. When results are displayed only for all cancer sites combined, including site in the cohort definition has only a minor effect on the prevalence results. Without it, some lost cases would be assigned a survival that is too low (e.g., breast cancer, which has a higher observed survival than all cancer sites combined), while others would be assigned a survival that is too high (e.g., liver cancer, which has a lower observed survival than all cancer sites combined). In general, these inaccuracies cancel one another out, unless the likelihood of being lost to follow-up is related to the variable. However, if you display the prevalence statistics by cancer site, the effect of not using site-specific survival calculations could be much larger. For example, breast cancer estimates would be lower than they should be, since the survival used was worse than breast cancer survival, while liver cancer estimates would be higher than they should be, since the survival used was better than liver cancer survival.
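
The cancellation argument can be illustrated with a small numeric sketch. Everything in it is an assumption made for illustration: the survival proportions, the case counts, the numbers of lost cases, and the function names are hypothetical and are not taken from SEER data or from SEER*Stat. The calculation is a simplified stand-in for the idea (each lost case contributes its assumed probability of being alive), not SEER*Stat's actual algorithm.

```python
# Hypothetical numbers for illustration only; not SEER data.
SURVIVAL = {"breast": 0.90, "liver": 0.20}   # assumed site-specific survival
ALL_CASES = {"breast": 1000, "liver": 1000}  # assumed diagnosed cases per site


def combined_survival(survival, cases):
    """Case-weighted survival for all sites combined."""
    total = sum(cases.values())
    return sum(cases[s] * survival[s] for s in cases) / total


def expected_alive_by_site(lost, survival):
    """Expected lost cases still alive, using each site's own survival."""
    return {s: lost[s] * survival[s] for s in lost}


def expected_alive_pooled(lost, pooled):
    """Same estimate, but every lost case gets the all-sites survival."""
    return {s: lost[s] * pooled for s in lost}


def compare(lost):
    """Print site-specific and pooled estimates side by side."""
    pooled = combined_survival(SURVIVAL, ALL_CASES)
    by_site = expected_alive_by_site(lost, SURVIVAL)
    by_pool = expected_alive_pooled(lost, pooled)
    for s in lost:
        print(f"{s:>6}: site-specific {by_site[s]:.1f}, pooled {by_pool[s]:.1f}")
    print(f" total: site-specific {sum(by_site.values()):.1f}, "
          f"pooled {sum(by_pool.values()):.1f}")


if __name__ == "__main__":
    # Loss to follow-up unrelated to site: totals agree, per-site values do not.
    compare({"breast": 100, "liver": 100})
    # Loss to follow-up related to site: even the combined total is off.
    compare({"breast": 50, "liver": 150})
```

Under these assumed inputs, the first scenario gives the same combined total with either approach, while the pooled approach understates breast cancer and overstates liver cancer. In the second scenario, where liver cases are lost more often than breast cases, the combined totals no longer agree, which corresponds to the exception noted above.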