---------------
SEER*Prep 3.0.3
---------------

The SEER*Prep software converts ASCII text data files to the SEER*Stat
database format, allowing you to analyze your cancer data using SEER*Stat.
SEER*Prep performs two main functions: it converts text data to the
specific binary format required by SEER*Stat, and it creates the SEER*Stat
data dictionary. For more information on SEER*Stat, please see
http://seer.cancer.gov/seerstat.

------------------------------------------
FILE SUPPORT: INCIDENCE AND MORTALITY DATA
------------------------------------------

Incidence          - NAACCR 4048 byte version 22 incidence-only record
                     type (dated Nov 30, 2022)
Incidence (Global) - SEER*Prep Global Incidence Format 334 byte record
                     type (developed specifically for this application by
                     the NCI)
Mortality          - SEER*Prep 58/66 byte (developed specifically for this
                     application by the SEER Program). Format is 66 bytes
                     if using summarized mortality counts rather than
                     individual death records.

With either incidence data format, you can prepare additional variables
for use in SEER*Stat. These are variables that are not part of the
documented data formats.

The NAACCR data format does not support every defined variable at this
time. The most important variables are included, but some that you want
may be missing. Please send email to the following address stating which
variables you need. Every attempt will be made to provide a new data
description (.dd file) within a few weeks of your request.

    seerprep@imsweb.com

------------------------------------------------
INPUT FILE NOMENCLATURE, COMPRESSION, AND FORMAT
------------------------------------------------

All files used as input to SEER*Prep must be named with either a .txd or
.txd.gz extension. These extensions signify "text data" and "compressed
text data" respectively.
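The input requirements in this section (the .txd/.txd.gz extension,
fixed length records, and numeric values zero filled on the left) can be
illustrated with a short Python sketch. This is not part of SEER*Prep and
is not the SEER*Prep utilities; the function names, the 20-character
record length, and the use of a single "\n" line terminator are
assumptions made only for illustration. Use the record length of the
format you are actually preparing.

```python
import gzip

# Hypothetical record length for illustration only; a real file must use
# the length of the supported record type.
RECORD_LENGTH = 20


def zero_fill(value: int, width: int) -> str:
    """Return a non-negative integer zero filled on the left to `width`."""
    if value < 0:
        raise ValueError("negative values cannot be zero filled")
    text = str(value).zfill(width)
    if len(text) > width:
        raise ValueError(f"{value} does not fit in {width} characters")
    return text


def fix_length(record: str, length: int) -> str:
    """Truncate a record that is too long, or pad a short one with blanks."""
    return record[:length].ljust(length)


def write_txd_gz(path: str, records, length: int = RECORD_LENGTH) -> int:
    """Write fixed length records to a gzip-compressed .txd.gz file.

    Returns the number of records written. The "\n" terminator is an
    assumption of this sketch.
    """
    if not path.endswith(".txd.gz"):
        raise ValueError("compressed input files must end in .txd.gz")
    count = 0
    # newline="" disables newline translation so exactly "\n" is written.
    with gzip.open(path, "wt", encoding="ascii", newline="") as out:
        for record in records:
            out.write(fix_length(record, length) + "\n")
            count += 1
    return count
```

For example, zero_fill(1, 2) returns "01" rather than " 1", and
fix_length("AB", 5) returns "AB" followed by three blanks.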
The freely available gzip utility, located at http://www.gzip.org, can be
used to compress your input files. This practice will greatly reduce the
resources required to store your SEER*Prep input data and will only
minimally detract from the performance of SEER*Prep.

SEER*Prep supports only fixed length text input records. If your records
are shorter than required, you must pad them with blanks. If they are
longer, you must truncate them to the supported length. For help preparing
your data to meet this requirement, see the SEER*Prep utilities.

    http://seer.cancer.gov/seerprep/utilities

All numeric values of the input records that are processed by SEER*Prep
must be zero filled on the left. In other words, for a variable of length
two, the value "01" must be used in the input file and not " 1".

-------------------
SYSTEM REQUIREMENTS
-------------------

- Pentium-based PC
- 32-bit Microsoft Windows with a text editor installed such as Notepad
- SEER*Stat 8.3.6 or later installed
- 32 MB of application RAM
- Approximately 4 MB of disk space

In addition, disk space on your PC or LAN is required for the compressed
version of your incidence, mortality, population and expected survival
rate (life table) data, and the indices that are all generated by
SEER*Prep. An upper bound on the required space is half that of the text
version (uncompressed) of your input data. In most cases, it will be less
than half.

--------------------
INSTALLING SEER*Prep
--------------------

To install SEER*Prep on your PC:

- Download the installation program named sp301.exe from the Internet and
  save it locally
- From the Start menu, choose Run...; assuming the local path for the
  installation program is C:\Temp, type C:\Temp\sp301.exe
- Follow the instructions on your screen

----------------
REVISION HISTORY
----------------

12/8/2023 - Version 3.0.3
  1. Modified parsing of input records from CSV files for efficiency.
  2. See also the changes for version 3.0.2.

9/18/2023 - Version 3.0.2
  1. Improved parsing of input records from CSV files to handle multiple
     quotes within a field that contains the delimiter.
  2. Handling of invalid floating point values in CSV input file.
  3. Added option to verify conversions and formats before Create.
  4. Added better message for input files with invalid record length.
  5. Modified Site Recode ICD-O-3/WHO 2008.
  6. Added support for CSV Population input files.

12/1/2022 - Version 3.0.1
  1. Minor upgrades to reading of input data from CSV file.
  2. Allow adding of CSV fields before input file is specified.
  3. Fixed handling of leading spaces and non-printable characters in
     labels for editable variables.
  4. Fixed setting of maximum value for fields 5 or more bytes long.
  5. Updated warning dialog for fields with no Column Id in CSV header.
  6. Added consistency check for headers in CSV input files.
  7. Added new Cause of death recode to include COVID-19.
  8. Updated Behavior Recode for Analysis to recode cases with primary
     site C72.3 and histology 9421 as borderline malignancy.

5/3/2022 - Version 3.0.0
  1. Upgraded support for reading input data from CSV file.
  2. Modified the process for adding new variables when CSV
  3. Added new Age Recode for single ages with 85-89 and 90+.

1/6/2021 - Version 2.6.0
  1. Added warning message for Expected Rate database with uneven counts
     in variable formats.
  2. Fixed the Verify report to show floating point ranges correctly.
  3. Added support for (future) reading input data from CSV file.

6/26/2020 - Version 2.5.8
  1. Fixed bug in loading of variables with very large number of formats
     (more than 32,767 - max for short integer). Convert to continuous
     variable if too many formats.
  2. Added a warning for Date of Last Contact that does not match the
     Study Cutoff Date. Updated the Study Cutoff dialog.
  3. Added checkboxes to the Variable Edit dialog to exclude field from
     Case Listings, or to make data suppression mandatory.
  4. Added support for setting suppression thresholds in the Database
     Options dialog.
  5. New Age recodes added with 75+, 90+, and 100+ groupings.
  6. Added new conversion to allow floating point data to be processed
     and stored as integers with a divisor.

5/28/2019 - Version 2.5.7
  1. New cancer site recode added for CI5 from ICD-10 codes.
  2. New Age recodes added with 80+ and 95+ groupings.
  3. Reclassified ICD-O-3/WHO 2008 histologies 9735 and 9597 as lymphoma.
  4. Fixed editing of numeric ranges to accept comma.
  5. New option to avoid check for sorting across files with
     corresponding warning in the reports.

5/7/2018 - Version 2.5.6 (see 2.5.4 and 2.5.5 release changes also)
  1. Modified Cause of Death recode from 3-Digit ICD codes to
     recategorize the ICD-10 code C86 as Non-Hodgkin Lymphoma.

4/2/2018 - Version 2.5.5 (see 2.5.4 limited release changes also)
  1. Modified Cause of Death recodes to recategorize the following ICD-10
     codes from Misc Malignant Cancer:
       C90.3        - Myeloma
       C91.6, C91.8 - Other Lymphocytic Leukemia
       C92.6, C92.8 - Acute Myeloid Leukemia
       C93.3        - Other Myeloid/Monocytic Leukemia

3/1/2018 - Version 2.5.4
  1. Added variable descriptions to the Verify report.
  2. Added database option with editable range of numeric values, e.g.,
     for setting Minimum Cases for various session types.
  3. Added support for editing Text fields, and added a Sort button to
     the format editor.
  4. Added support for processing summary files when SummaryCountField
     section is defined.
  5. Fixed Base11 fields to report invalid values that are convertible.
  6. Added new Base26 conversion for fields using the first 3 digits of
     ICD-10 codes.
  7. Added new Cause Of Death recode based on 3 digit ICD-9 and ICD-10
     codes.
  8. Added new RaceOrigin recode.
  9. Modified Cause of Death recodes to make some ICD-10 codes valid.

2/4/2016 - Version 2.5.3
  1. Added ability to specify and reorder person identifying variables.
  2. Added ability to select and reorder survival time/date variable
     groups.
  3. Fixed the handling of gzipped/compressed input files that are larger
     than 2 GB uncompressed.
  4. Fixed recode for COD recode with Kaposi and mesothelioma to handle
     grouping 252=State DC not available or state DC available but no COD.

3/27/2013 - Version 2.5.2
  Changed logic for:
    - ICCC site recode ICD-O-3/WHO 2008
    - ICCC site rec extended ICD-O-3/WHO 2008
    - ICCC site recode ICD-O-3
    - ICCC site recode extended ICD-O-3
  The change only affects benign and borderline cases. The recode had
  been classifying all benign and borderline tumors based on site/hist.
  Now, benign and borderline tumors can only be classified into:
    - III CNS and misc intracranial and intraspinal neoplasms (and all
      sub groups)
    - X(a) Intracranial & intraspinal germ cell tumors.
  Any other benign or borderline tumors will be in the not classified
  category.

3/19/2013 - Version 2.5.1 (see 2.5.0 limited release changes also)
  1. New cancer site recodes were added:
       - ICCC site recode ICD-O-3/WHO 2008
       - ICCC site rec extended ICD-O-3/WHO 2008
       - AYA site recode/WHO 2008
       - Lymphoma subtype recode/WHO 2008
     These will be used for SEER reporting starting with the November
     2012 submission. These fields handle the new hematopoietic
     histologies that started being collected for 2010 cases and are
     based on a 2008 WHO publication. As of 4/15/2013, the logic can be
     found at:
       http://www.seer.cancer.gov/analysis/incidence.html
  2. The default expected survival table used in the incidence dd file
     was changed to newly available US by state life tables (1990-2010).
     An option was also added to the Database Options dialog to allow
     the selection of the default survival table. The new life tables
     will be released on 4/15/2013.
  3. CHSDA 2012 was added to the NAACCR dd files.
  4. Added information to the database.ini section of NAACCR dd file to
     support survival calculations using pre-calculated survival months.
     SAS code and some hand-editing of the database.ini file are required
     to utilize this. This also involved adding a few new fields:
       - Presumed alive month of last contact recode
       - Presumed alive day of last contact recode
       - Presumed alive year of last contact recode
     Please see CalculateSurvivalTimeInMonth.Modified.for.SEERPrep.sas
     for more information.

2/04/2013 - Version 2.5.0 (limited release)
  1. A new site recode variable was added: Site recode ICD-O-3/WHO 2008.
     This version will be used for SEER reporting starting with the
     November 2012 submission. This field handles new hematopoietic
     histologies that started being collected for 2010 cases and is based
     on a 2008 WHO publication. The logic can be found at:
       http://www.seer.cancer.gov/siterecode/icdo3_dwhoheme/index.html
  2. A new behavior recode for analysis variable is now available which
     supports the new histologies.
  3. We removed the grouping 7=Other Unspecified (1991+) race category
     from all race recode variables. These cases now have race coded to
     9=Unknown in the recodes. SEER has always treated them the same as
     unknown in analyses.
  4. Nine new fields were added including new survival time variables,
     census tract poverty, and the day component of diagnosis, birth, and
     last contact.
  5. The byte offset for TNM clin stage group was corrected.
  6. Changed valid range of ICD-O-3 histologies from 8000-9989 to
     8000-9992.
  7. Added additional valid values for many fields including, but not
     limited to, Derived 7th ed T, N, and M; RX summ vars; CS version #s.

5/24/2012 - Version 2.4.6
  1. Increased maximum field name and label length from 40 and 60
     respectively to 80.
  2. Correction to site recode variables. In prior versions, skin cancers
     with histology = 8046 were coded as Other Non-Epithelial Skin. These
     are now excluded as non-reportable skin cancers.
  3. Maximum range for ICD-O-3 histology changed from 9989 to 9902.
  4. Update to logic of Behavior recode for analysis - derived. Newly
     reportable cases for 2010+ (histologies = 9724, 9751, 9759, 9831,
     9975, 9991, 9992) are now grouped with the only reportable in
     ICD-O-3 value.
  5. Cause of death derived variables updated to support new ICD-10 CODs.
  6. Correction to AYA recode. In prior versions sites C714 and C717 with
     histology 9480 were incorrectly coded as unclassified. These are now
     coded as 3.5 Other specified intracranial and intraspinal neoplasms.
  7. DD files were updated to include additional fields and values.

2/03/2011 - Version 2.4.5 (dd file changes only)
  1. Added new database option to NAACCR 12 incidence dd file to allow
     the user to select the default method for calculating cumulative
     expected survival. The functionality for this option is supported by
     SEER*Stat Version 7.0 or later (to be released in April 2011). The
     default set in the dd file is Ederer II, which also is only
     available in version 7.0 or later of SEER*Stat. SEER will be using
     Ederer II as their default starting with the November 2010
     submission of data. All databases created prior to the availability
     of this option will default to Ederer I.
  2. Fixed a problem in the incidence and incidence-based mortality
     NAACCR 12 dd file for the field CS version input current.
  3. Changes were also made to the mortality and NAACCR 11 dd files to
     shorten some formats that were > the 60 character maximum.

1/11/2011 - Version 2.4.5
  1. Added better messages to reports when data are not properly sorted.
  2. Added error messages to notify user if a label for a value is too
     long. The maximum length is 60 characters.
  3. Several changes were made to the NAACCR 12 dd files. These included
     adding new valid values for several fields as well as shortening
     some new labels that were too long, and other minor changes.

10/11/2010 - Version 2.4.4 (dd file changes only)
  - Fixed problems in NAACCR 12 dd files (standard and dd file for
    Incidence-Based mortality), including, but not limited to column
    position in case file of Addr at Dx--State, and population file byte
    offsets of Rural Urban codes.

9/30/2010 - Version 2.4.4
  1. NAACCR 12 dd files (standard and dd file for Incidence-Based
     mortality).
  2. Changed logic in all derived fields that use any month value in
     calculation such that blanks are now treated as unknown (previously
     '99').

11/12/2009 - Version 2.4.3 (dd file changes only)
  1. Add some additional valid values to census tract certainty and
     therapy date fields.

8/3/2009 - Version 2.4.3
  1. Added AYA site recode and Lymphoma subtype recode derived variables
     logic to software.
  2. Added AYA site recode and Lymphoma subtype recode derived variables
     to both NAACCR dd files.

3/25/2009 - Version 2.4.2
  1. Support for 64-bit installation.
  2. Modified survival time recode derived variables. These variables no
     longer require the fields Type of reporting source or SEER type of
     follow-up.
  3. Minor bug fixes and enhancements to the dd files.

5/21/2007 - Version 2.4.0
  1. The addition of variable Behavior recode for analysis derived.
  2. Support for the NAACCR 11.1 file format.
  3. We are no longer distributing NAACCR version 9 or 10 .dds. Also
     support for SEER 250 .dds has been eliminated.

5/24/2006 - Version 2.3.5
  - Correction to the generation of the derived race variables. Unknown
    race, value 99, was not being handled properly.

5/19/2006 - Version 2.3.4
  1. Changed names, formats, and algorithms for derived race variables.
     For more details, see:
       http://seer.cancer.gov/seerstat/variables/seer/yr1973_2003/race_ethnicity.
  2. Changed names and formats for derived Hispanic origin variables.
  3. ICCC recodes based on ICD-O-3.
  4. Slightly more stringent coding of site recodes if histology >= 9590.
  5. Ability to create population only databases.
  6. Ability to create incidence-based mortality databases.
  7. Add case and pop files dialog now defaults to show zipped and
     regular files.
  8. Several new options and defaults on the database options dialog.
  9. Minor fixes and enhancements to the dd files.

9/15/2005 - Version 2.3.3 (only available in one training class)
  1. Added support for large fonts.
  2. Fixed some issues with most recent used list for dd files.
  3. Laterality variable: changed one format in NAACCR 10 that was too
     long.
  4. Added Birthplace variable to NAACCR 10.
  5. Added Census tract variables and NHIA origin variable to
     RateProblemVars in NAACCR 10.
  6. Updated State-county in mort and NAACCR 10.
  7. Add description with link to the SEER Web site for the Behavior
     recode for analysis variable in NAACCR 10.

2/18/2005 - Version 2.3.2
  - Bug fix for the creation of derived variable Origin recode NHIA for
    the condition when the underlying variable NHIA Derived Hisp Origin
    is blank.

2/4/2005 - Version 2.3.1
  1. Origin recode NHIA derived variable.
  2. Ability to choose a database's default standard population.
  3. Additional user-specified variables and all user-specified variables
     can now be population defining.
  4. Change to population file format to match those distributed here:
     http://seer.cancer.gov/popdata.
  5. Ability to specify a database informational message.

9/29/2004 - Version 2.3.0
  1. Cause of death recode derived variable was updated to support new
     ICD-10 codes. See www.seer.cancer.gov/codrecode/1969+_d09172004.
  2. Changed SEER*Prep to exclude population, standard population, and
     expected survival records with blank counts.
  3. Added two 5-digit and two 6-digit user-specified variables in the
     NAACCR 10.1 dd file.
  4. Added functionality to support a dynamic population record length.
  5. In the NAACCR 10.1 dd file, changed all user-specified variables so
     they can be population defining.
  6. Corrected column positions for a few treatment date variables in the
     NAACCR 10.1 dd file.

4/20/2004 - Version 2.2
  1. Support for SEER*Stat's MP-SIR session.
  2. Logic changes in the definition of several non-cancer CODs when
     creating the derived COD recode variables. See
     www.seer.cancer.gov/codrecode.
  3. Support for the NAACCR 10.1 file format.
  4. Addition of Behavior recode for analysis to all incidence .dds. Note
     this is NOT a derived variable; it is just a defined column
     location.
  5. Ability to turn off SEER*Stat's select only malignant behavior
     feature (check box).
  6. Ability to specify which of three behavior variables is used with
     SEER*Stat's select only malignant behavior feature.
  7. Added a State-county variable to the NAACCR 10.1 .dd.
  8. Changed the representation of State-county in SEER*Prep's output
     files so SEER*Stat's "add all-using underlying data values" feature
     provides FIPS codes.
  9. Added additional counties to Virginia in the State-county variable.
     Also changed a handful of other counties to match the SEER provided
     U.S. populations. See www.seer.cancer.gov/popdata.
  10. Added COD recode derived variables to the incidence .dds.
  11. Added NAACCR item numbers to the description of every applicable
      variable in the NAACCR 10.1 .dd.

9/17/2003 - Version 2.1
  1. User can specify that the input case data are sorted to enable
     SEER*Stat's person selection features.
  2. Support for SEER*Stat's prevalence session.
  3. More user-specified variables.
  4. Support for user-specified variables as population variables.
  5. Switch to a single generic population file format.
  6. User can hide required variables.
  7. Age recode with single ages and 85+.
  8. User can specify which SEER*Stat sessions are applicable for the
     database being created.
  9. Ability to include totals and subtotals when inputting a
     user-specified variable's format.
  10. Users can select a category for user-specified variables.
  11. Ability to denote whether a user-specified variable should not be
      allowed for the generation of age-adjusted rates.
  12. Improved version checking between the software and .dds.
  13. Addition of options for Race recode Y and Race recode Z/Origin
      recode.

6/3/2003 - Version 2.0
  1. Interface changes allowing the user to choose race and age variables
     to link to the populations; this reduced the number of .dds.
  2. SEER's new logic for Site and COD recode. See
     www.seer.cancer.gov/siterecode and codrecode. ICD-O-3 is used for
     the creation of Site recode if present, else ICD-O-2 is used.
  3. NAACCR 9.1 .dds which include ICD-O-3 histology and behavior.
  4. Added a version number to the .dds and checking by the system.
  5. Added Alaska "counties" to the mortality .dd.

12/4/2002 - Version 1.9.1
  - Bug fix for the creation of derived variables ICCC site recode and
    SEER modified ICCC site recode. Specifically, if the input file
    contained all blanks for histology, SEER*Prep crashed when attempting
    to create either of these.

6/10/2002 - Version 1.9
  - Derived variables ICCC site recode and SEER modified ICCC site
    recode.

2/7/2002 - Version 1.8
  - Logic changes to Site recode (incidence) and Cause of death recode
    (mortality). Creation of Age recode with <1 year olds for use with
    19-age groups standard populations. Additions to SEER and NAACCR .dds
    to take advantage of SEER*Stat's cause-specific survival features.
    "Comments" in the .dds so users can take advantage of SEER*Stat's
    multiple primary (person selection) features. Expected rates can now
    be merged with incidence data by any number of user-specified
    variables. Year and age are the only two variables required in an
    expected rate table. Numerous updates to the .dds to better support
    SEER*Stat 4.x.

2/22/2001 - Version 1.7
  - Ability to rename user-specified variables, an unknown county for
    each state in the mortality .dds, and several enhancements when
    utilizing a user-supplied .zds.

2/8/2001 - Version 1.6
  - Inclusion of Alzheimers to COD recode. Generation of "old style"
    indices which take less disk space and consume more memory to
    generate. Bug fix to ICD-10 support where some invalid values caused
    an exception and others misrepresented themselves as a valid code.
    These were codes like all numbers or those starting with lower case
    letter.

1/10/2001 - Version 1.5
  - Support for ICD-10 and the ability to generate Cause of death recode
    from ICD-10. Correction to the definition of basal and squamous cell
    skin in the generation of Site recode.

6/20/2000 - Version 1.1
  - Support for standard populations and expected survival rate (life)
    tables. Output file and directory name changes so it is easier to
    share created databases with colleagues.

3/30/1999 - Version 1.0
  - Minor fixes over Beta 5. Conversion sections removed from .dd files
    which makes SEER*Stat 2.0 a requirement.

2/22/1999 - Version 1.0 (Beta 5)
  - Bug fix for numeric values occurring in Addr at DX--state variable of
    NAACCR .dds.

2/10/1999 - Version 1.0 (Beta 4)
  - Bug fixes, treatment variables in the NAACCR .dds, and additions to
    the .dds for SEER*Stat 2.0. Fixed conversion in NAACCR .dd for
    Histologic type.

9/23/1998 - Version 1.0 (Beta 3)
  - Added feature to update a database and improved wording on CR and LF
    exception. Fixed problem that caused creations to take longer towards
    the end of processing. Added support for gzipped (.txd.gz) input
    files and support for mortality data.

6/22/1998 - Version 1.0 (Beta 2)
  - Performance enhancements of 40-70%. Derived variable, Race recode A
    no longer dependent on Hispanic origin variable.

5/26/1998 - Version 1.0 (Beta 1)
  - Initial release.

-----------------
TECHNICAL SUPPORT
-----------------

Email: seerprep@imsweb.com
Web:   seer.cancer.gov/seerprep