Example 1: Create a SEER*Stat Database using NAACCR Format

SEER*Stat requires case and population data to calculate age-adjusted incidence or mortality rates. Therefore, you must supply two types of files: incidence or mortality data files and population data files. These files must meet the requirements listed on Input File Formats.
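To see why both kinds of files are needed, here is a minimal sketch of a directly age-adjusted rate, which combines case counts and populations by age group with a set of standard-population weights. The function name and the three-group numbers are hypothetical and chosen only for illustration; SEER*Stat performs this calculation for you and its implementation may differ in detail.

```python
def age_adjusted_rate(cases, populations, std_weights, per=100_000):
    """Directly age-adjusted rate: a weighted sum of age-specific rates.

    cases, populations and std_weights are parallel sequences, one entry
    per age group. Weights are normalized so they sum to 1.
    """
    if not (len(cases) == len(populations) == len(std_weights)):
        raise ValueError("all inputs must cover the same age groups")
    total_weight = sum(std_weights)
    rate = 0.0
    for count, population, weight in zip(cases, populations, std_weights):
        if population <= 0:
            raise ValueError("each age group needs a positive population")
        rate += (weight / total_weight) * (count / population)
    return rate * per


# Hypothetical three-group example (not real data).
example = age_adjusted_rate([10, 20, 30], [10_000, 10_000, 10_000], [0.5, 0.3, 0.2])
print(round(example, 1))
```

Without the population denominators, the age-specific rates in the loop cannot be formed, which is why the case file alone is not enough.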

This example assumes that you have a November 2022 SEER (or NPCR or NAACCR) submission file in XML format for the state of Idaho which includes 2000-2021 data. You will use this data to create a SEER*Stat database to calculate rates by 19 age groups (< 1, 1-4, 5-9, …, 80-84, 85+), race (White, Black, American Indian/Alaska Native, Asian/Pacific Islander), and Hispanic Origin at the county level.

Step 1: Get Detailed Descriptions of the Case and Population Input Files

Start SEER*Prep.
Open the Database Description (DD) file. Select Open from the File menu to select the NAACCR version 22 Database Description file distributed with the software. The file name is "naaccr22.countypops.csv.d11302022.dd" or something similar (if an update is released, the date in the filename will differ).
When you select the DD file, SEER*Prep will load information for each variable into the box on the right side of the window. The list will be sorted by the CSV Column ID (for defined NAACCR fields, this is the naaccrXmlID) in the incidence data file. Click the Pop Start Col column header to sort by the variable location in the population data file (case-only variables will have a blank entry in this column).
Set the variables used When Linking to Populations.

Since this example assumes your data are for 19 age groups, the age variable should be "Age recode with < 1 year olds."
Since this example assumes your data are for 4 races (White, Black, American Indian/Alaska Native, Asian/Pacific Islander), and Hispanic Origin, the race variable should be changed to "Race recode (W, B, AI, API) / Origin recode NHIA (Hispanic, Non-Hisp)".
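As an illustration of the 19 age groups referred to above, the following sketch maps an age in whole years to a group. The function name is hypothetical, and the returned index is only a position in the list of groups; check the variable's valid values in the Edit Variable window for the codes SEER*Prep actually expects.

```python
def age_group_19(age_years):
    """Return (index, label) for the 19 groups: < 1, 1-4, 5-9, ..., 80-84, 85+."""
    if age_years < 0:
        raise ValueError("age must be non-negative")
    if age_years < 1:
        return 0, "< 1"
    if age_years <= 4:
        return 1, "1-4"
    if age_years >= 85:
        return 18, "85+"
    low = (age_years // 5) * 5      # lower bound of the 5-year band
    index = age_years // 5 + 1      # 5-9 -> 2, 10-14 -> 3, ..., 80-84 -> 17
    return index, f"{low}-{low + 4}"


for age in (0, 3, 5, 84, 85):
    print(age, age_group_19(age))
```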

Select a variable and open the Edit Variable window by double-clicking or using the Edit button. The Edit window contains a description of the variable including its valid values.

Edit the "Race recode (W, B, AI, API)" variable to view an example. Note that this field is “Derived” from another field (Race 1) in the incidence data, so it has no CSV Column ID.

Select Generate Input File Description from the File menu to create a text file containing detailed format information for the case and population files.

Step 2: Prepare your Incidence Data Files

Using software other than SEER*Prep, create an incidence data file according to the NAACCR version 22 CSV file format specified in the documentation created in Step 1.

The name of the file and the variable formats must adhere to the rules described in Input File Formats.
You may store the data in more than one file. SEER*Prep will process the data files sequentially and combine the data into one SEER*Stat database.
If using more than one CSV file, all CSV files must have the same columns/order of fields in the file (i.e., the header rows must match).
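If you split your data across several CSV files, a quick check outside SEER*Prep can confirm that the header rows match before you load them. This is a minimal sketch using only the Python standard library; the function names are hypothetical and the UTF-8 encoding is an assumption about your files.

```python
import csv


def read_header(path):
    """Return the first row of a CSV file as a list of column names."""
    with open(path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle), [])


def mismatched_headers(paths):
    """Return the paths whose header row differs from the first file's header.

    Both the column names and their order must match.
    """
    if not paths:
        return []
    reference = read_header(paths[0])
    return [path for path in paths[1:] if read_header(path) != reference]
```

An empty list from mismatched_headers means every file has the same columns in the same order as the first file.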

For this exercise, download the SAS code provided in the SEER*Prep Utilities. To use the SAS program, also download and unzip the naaccr-xml-utility-9.0+.
Modify the SAS code, updating all lines that have “NEED TO CHANGE THIS” comments. The only required changes are file paths, input file names, and the line that limits cases to your catchment area.
Review Section B of the SAS code that performs the following:
- Creates a state-county FIPS code variable that is used by SEER*Prep to create several fields, including labeled State-county and PRCDA 2020.
- Changes behavior from in situ (2) to malignant (3) for urinary bladder cases. This is typically done by SEER, NPCR, and NAACCR and is required to match statistics generated by these groups.
- Modifies the Race 1 variable used to create derived race recode fields in SEER*Prep. These changes are consistent with SEER, NPCR, and NAACCR practices. The original Race 1 variable is preserved in a new field (Race 1 original).
- Creates a “publicReleaseFlag” field. This field can be used to exclude some cases in SEER*Stat by default using the “Cases in Research Database” checkbox. The SAS code is set up to set this flag to 1 (in Research data) for 2000-2020 diagnoses for patients with sex = 1 or 2 (male or female). You can use any logic you like to create the flag, but be consistent when creating your population file. In this example, 2021 diagnoses and patients with sex other than male or female will be excluded by default, but the cases can be included by unchecking the standard selection in SEER*Stat.
- Creates a breast subtype field. This is included as an example to show how to add fields not pre-defined in the provided DD file.
The SAS code will split the output CSV files if the input data has more than 300,000 records. SEER*Prep and SEER*Stat work better with individual files of 300,000 or fewer records. You do not need to modify the SAS code for this scenario. In this example, only one file will be created because the Idaho submission has fewer than 300,000 records, but for some registries several files could be created. When adding the CSV files to SEER*Prep, add them in order of the number appended to the state abbreviation in the file names.
Execute the SAS code and confirm that one or more CSV files were created in the folder you specified in the SAS code.
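For readers who want to see the kind of logic Section B describes, the sketch below re-expresses three of the recodes in Python. It is not the SAS program itself: apart from publicReleaseFlag, the dictionary keys are hypothetical stand-ins for the fields in your data, and the provided SAS code may apply additional conditions (for example, to the bladder recode) that are not reproduced here.

```python
def apply_section_b_style_recodes(record):
    """Return a copy of a case record (dict of strings) with illustrative recodes."""
    out = dict(record)

    # State-county FIPS: 2-digit state FIPS followed by 3-digit county FIPS.
    out["stateCountyFips"] = out["stateFips"].zfill(2) + out["countyFips"].zfill(3)

    # Urinary bladder (primary site C67x): treat in situ (2) as malignant (3).
    if out["primarySite"].upper().startswith("C67") and out["behavior"] == "2":
        out["behavior"] = "3"

    # publicReleaseFlag: 1 for 2000-2020 diagnoses with sex 1 or 2, otherwise 0.
    year = int(out["yearOfDiagnosis"])
    in_research = 2000 <= year <= 2020 and out["sex"] in ("1", "2")
    out["publicReleaseFlag"] = "1" if in_research else "0"
    return out


# Hypothetical record for illustration only.
case = {"stateFips": "16", "countyFips": "1", "primarySite": "C679",
        "behavior": "2", "yearOfDiagnosis": "2021", "sex": "1"}
print(apply_section_b_style_recodes(case))
```

Whatever rule you use for the flag on the case side, the population file in Step 3 needs a flag built from the same year logic.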

Step 3: Prepare your Population Data Files

Create one or more population data files that meet the criteria documented in the report created in Step 1. The filename, record length, and variable formats must also adhere to the rules described in Input File Formats.
When making this file, you must include only the appropriate populations.

For this example, the population data file should contain records with populations for all individuals in the state of Idaho for years 2000-2021.
If you include more years or fewer years in the population file than in the incidence data, you will generate misleading statistics.
Because of delays in the 2020 census related to differential privacy concerns, 2021 populations are not yet available on the SEER website. For this exercise, we will copy the 2020 populations to 2021. We do not recommend publishing statistics based on 2021 data prepared this way, but it allows early investigation of the data.

If using SEER-provided populations:

Download and unzip the appropriate population file. In this case, we will use the 1990-2020 Idaho file with 19 ages: id.1990_2020.19ages.exe (EXE, 494 KB). This file has the appropriate race/origin fields. The 1969+ files are by race (White, Black, Other) only and do not include Hispanic origin.
Download the U.S. County Population SAS program (SAS, 2 KB) from SEER*Prep Utilities.
Make the following changes to the program (download example SAS code (SAS, 2 KB) with these changes):

Modify the input and output file names (the output file name must have a “.txd” extension).
Select 2000-2020.
Copy 2020 to 2021.
Create a field for publicReleaseFlag and set it to 1 (true) for 2000-2020 and 0 (false) for 2021. You do not need to consider the sex field for this flag as the populations are only available for males and females.
Modify the output statement to output the new publicReleaseFlag field in column 31. This column location is pre-determined by the DD file for this field.

Run the SAS program and confirm the *.txd file was created.
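The changes listed above can be sketched in Python as follows. This is not the provided SAS program: it assumes each input line is a fixed-width record of up to 30 characters with the 4-digit year in columns 1-4, and the function name is hypothetical. Confirm the column positions against the report created in Step 1 before relying on anything like this.

```python
def prepare_population_records(lines):
    """Yield 31-character records for 2000-2021 with the flag in column 31."""
    for line in lines:
        record = line.rstrip("\r\n")
        if not record.strip():
            continue                      # skip blank lines
        year = int(record[:4])            # assumption: year is in columns 1-4
        if not 2000 <= year <= 2020:
            continue                      # select 2000-2020 only
        # Pad to 30 characters so the flag lands in column 31 (1-based).
        yield record.ljust(30) + "1"
        if year == 2020:
            # Copy 2020 to 2021 and flag the copy as not in research data.
            yield ("2021" + record[4:]).ljust(30) + "0"


# Hypothetical 26-character records for illustration only.
sample = ["1999ID16001991010100000100", "2020ID16001991010100000120"]
for out in prepare_population_records(sample):
    print(repr(out))
```

Records are emitted in input order, with each 2021 copy directly after its 2020 source; re-sort the output if your workflow expects a different order.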

Step 4: Create a Database Description File for Your Database

The Database Description (DD) files supplied with the software are meant to be used as templates. Follow these steps to create a DD file for the exact specifications of your database:

Start SEER*Prep.
Reopen the DD file used in Step 1.
Add your incidence file or files to the Input Case Files control.
Add your population file or files to the Input Population Files control.
Provide a name for the SEER*Stat database to be created in the Database Name control. The name entered here will be shown in the list of databases on SEER*Stat's Data Tab.
Set the variables used to link the case data with the population data. These must be set to match the population file created in Step 3.
- The populations are for 19 age groups, therefore the age variable should be "Age recode with < 1 year olds."
- The populations are also by race and origin, so the race variable should be "Race recode (W, B, AI, API) / Origin recode NHIA (Hispanic, Non-Hisp)".
Use the Edit button next to the label, Study Cutoff Date for Survival to set the maximum and default month and year.
- The maximum date is used to validate several variables that SEER*Stat needs to perform survival analyses. This date must be set to the same date as is used to create the survival fields in your incidence data (e.g., based on extract or NAACCR Util). For the November/December 2022 submissions, this should be set to "December 2020".
- For the default date, you can set it to an earlier date if you do not have good death ascertainment through the maximum date. For example, if your last NDI was performed with deaths through 2019, you can set the default date to "December 2019". Most registries should use "December 2020" for both settings.
Remove non-derived, pre-defined fields that are not available in the input CSV file. To do this, sort the list of variables by the In Data column. Then select all fields that are blank in that column (not NA or Yes) and click the Hide button at the bottom of SEER*Prep.
Edit the “In Research Data” field to set it as a population variable. The easiest way to find the field is to first sort by the Variable column and then select and Edit the variable. On the edit dialog, check the box indicating that the field Is a Population Variable. The Population Starting Column will change to 31, which is the column specified in the population file creation step. When you click OK, the expected record length of the population file will change from 26 to 31.
Add a new field for breast subtype. To do this:

Click the Add... button.
On the Add New Variable dialog, sort the list by Variable name and you will see a few fields at the top that are in the CSV file but not defined by the DD file.
We will just define the breast subtype field. Select the field with CSV Col ID = "mol_subtype" (this is the name used in the incidence SAS code) and click OK.
On the Edit Variable dialog, specify a category (e.g., Site and Morphology) and then click the Edit... button to open the Edit Format dialog and enter the format as:
1=HR+/HER2+
2=HR-/HER2+
3=HR+/HER2-
4=HR-/HER2-
5=Unknown
9=NA due to site or year
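Before defining the format, it can help to confirm that the mol_subtype column in your CSV file contains only the codes listed above. The following is a minimal sketch using the Python standard library; the function name is hypothetical and the UTF-8 encoding is an assumption about your file.

```python
import csv
from collections import Counter

# Codes and labels exactly as entered in the Edit Format dialog.
SUBTYPE_FORMAT = {
    "1": "HR+/HER2+",
    "2": "HR-/HER2+",
    "3": "HR+/HER2-",
    "4": "HR-/HER2-",
    "5": "Unknown",
    "9": "NA due to site or year",
}


def undefined_subtype_values(csv_path, column="mol_subtype"):
    """Count values in the column that have no label in SUBTYPE_FORMAT."""
    unexpected = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            value = (row.get(column) or "").strip()
            if value not in SUBTYPE_FORMAT:
                unexpected[value] += 1
    return dict(unexpected)
```

An empty dictionary means every value has a label. Blank values are reported under the empty-string key.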

Set the database options. Open the database options dialog by selecting Database Options… from the File menu.
- On this dialog you can change many database options including the session types you would like the database available in and many standard selections.
- Be sure to review the Survival Follow-up Method options. Some registries will only use "Presumed Alive" survival and should uncheck the "Reported Alive" option. Do this if you do not routinely update date of last contact for alive patients. If you keep both options available, be sure to list the default one first.
- Check the SEER*Stat Session Option to “Enable In research database standard selection”.
Use the Save As function on the File menu to save the DD with a new name.

Step 5: Verify your Data Files

Use the checkmark on the toolbar or Verify Data from the Execute menu to create a Verify Report. SEER*Prep will generate a one-way frequency of every variable in your incidence and population files.
With CSV data, there are two new warning messages that appear when verifying or creating a database.

A list of fields that are defined by the DD file that do not exist in your CSV files. This is informational only – the fields will not be included in the database. You will not see this warning if you hid the fields as directed above.
A list of fields in your CSV file that are not defined in the DD file. If these are “analytic” fields and you would like to include them in your database, then you need to define them. To do this, you can cancel the database verification and use the Add... button below the Variable list. For each field that you want to add, select the CSV Column ID from the list and then define the variable (e.g., length, formats). We did this for the subtype field. The other fields were created as temporary variables in SAS and are not needed. If you create any additional fields in SAS, define them like we did for subtype.

Review the Verify Report and resolve any issues identified in the report.
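If you would like a rough preview of what a one-way frequency looks like before running Verify Data, the sketch below tabulates every column of a CSV file. It only approximates the idea and does not reproduce the Verify Report; the function name is hypothetical and the UTF-8 encoding is an assumption about your file.

```python
import csv
from collections import Counter, defaultdict


def one_way_frequencies(csv_path):
    """Return {column name: Counter of values} for every column in the file."""
    freqs = defaultdict(Counter)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            for field, value in row.items():
                if field is None:
                    continue              # extra cells beyond the header row
                freqs[field][value] += 1
    return freqs
```

For example, one_way_frequencies(path)["sex"].most_common() lists the values of a column named sex with their counts, which makes unexpected codes easy to spot.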

Step 6: Create Database

Click the lightning bolt on the toolbar or select Create Database from the Execute menu.
Enter a name for the Create Report.
Use the default for record exclusions (refer to the SEER*Prep Help for more information).
SEER*Prep will now create a SEER*Stat database. This database will contain your data converted to a binary format, indices, and dictionaries (format libraries) for SEER*Stat.
Review the Create Report, paying particular attention to any Notes/Warnings. These identify potential mismatches between your incidence and population files. For example, if your population file contained information for 1990-2021, you would get a warning because your incidence data is for 2000-2021.
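The year mismatch described above can also be checked ahead of time. This minimal sketch compares the years present in the case data with the years present in the population data; the function name and result keys are hypothetical, and extracting the years from your own files is left to you.

```python
def year_mismatches(case_years, population_years):
    """Report years that appear in only one of the two data sources."""
    cases, pops = set(case_years), set(population_years)
    return {
        "population_only": sorted(pops - cases),
        "cases_only": sorted(cases - pops),
    }


# Incidence data for 2000-2021 against a population file for 1990-2021.
result = year_mismatches(range(2000, 2022), range(1990, 2022))
print(result["population_only"])   # years 1990 through 1999
print(result["cases_only"])        # empty list
```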

Step 7: Use Your New Database in SEER*Stat

Exit SEER*Prep and start SEER*Stat. Your new database will be available in the list of databases on the Data Tab (for the appropriate sessions).

For more information, review the answers to the SEER*Prep Frequently Asked Questions and other materials provided in Getting Help.