Banff Processor Version 1 Migration Guide

Foreword

This document is intended for users of the SAS-based version of the Banff Processor. It supplements the main user guide and helps with the transition from SAS (Banff Processor Version 1) to Python (Banff Processor Version 2).

Migrating from SAS to Python

The SAS version of the Banff Processor dates back to 2008 and has been used at Statistics Canada for many years. The Python version is similar to the SAS version in many ways, and effort has been made to keep the metadata backwards compatible where possible. This guide highlights the main differences to help users migrate to the new version.

In general, XML metadata files from version 1 are compatible with version 2, with the following exceptions:

  • SAS-based custom programs (or user-defined programs) will not work; they must be replaced with plugins. Guidelines for developing plugins are available in the user guide.
  • Syntax in expressions must be reviewed; SAS-specific expressions must be converted to SQLite syntax.
  • The SAS-based processor included some default behaviour that has been removed to increase modularity and transparency. The past behaviour can be replicated with process controls; see the section on Process Controls in the user guide for details.

New Features

  • Process Blocks: Process Blocks allow users to call a job within a job. In the SAS Processor, the jobs table could contain multiple jobs, but now they can be chained together to create a master job. This is done by using the process called job and specifying the job id in the spec id column.
  • Process Controls: Process Controls allow modifications to be made to the inputs of a processing step without the risk of modifying the main working datasets. This was typically done in the SAS processor with user-defined processes, or the input modifications were hardcoded in the processor itself.
  • The Banff Processor now supports variable names up to 64 characters long. A variable name must still begin with a letter or underscore and consist only of alphabetic characters, digits, and underscores (spaces within variable names are not supported).
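
The naming rule in the last bullet can be checked before a job is run. The following is a minimal sketch, not part of the banffprocessor package: the function name is hypothetical, and it assumes that the first character must be an ASCII letter or underscore and that the 64-character limit includes that first character.

```python
import re

# Assumption: only ASCII letters count as alphabetic characters, and the
# 64-character limit includes the first character.
_VALID_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,63}")


def is_valid_variable_name(name: str) -> bool:
    """Return True if name satisfies the naming rule described above."""
    return _VALID_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    # Hypothetical variable names, for illustration only.
    for candidate in ["P53_05_1", "_total", "1st_value", "net income"]:
        print(candidate, is_valid_variable_name(candidate))
```

A check like this is most useful when variable names were previously shortened to fit the SAS limit and are being lengthened during migration.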

Behaviour Changes

  1. A key difference is that the SAS Banff Processor was a code generator. The processor created a SAS program and then executed it. There was an option to save the generated program, which could be run and used for debugging purposes. This is no longer the case with the Python processor; the process is executed dynamically in one pass.

  2. In the SAS Processor, the SAS work library was used to store and access temporary datasets. In Python there are a few alternatives. Generally, data is accessed in user-defined processes (plugins) through the ProcessorData object; however, plugin developers can choose to store diagnostic files in another location, such as a duckdb database or an appropriate folder. Using the tempfile library is an option, but by default it saves data in the user’s profile, which may not be appropriate in certain situations.

  3. The processor no longer automatically deletes records found on the reject file from the input data of an imputation step. If this behaviour is desired, the exclude_rejected process control can be used. Note that the rejected data can be accessed in a user-defined process with processor_data.get_output_dataset("outreject").

  4. When calling the ErrorLoc procedure, outlier status is no longer taken into consideration; only values flagged as FTI in the input status file are considered. If the previous behaviour is desired, a user-defined process (plugin) can be created. The medium-term plan is to replace this functionality with a process control/filter.
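
For the temporary storage described in item 2, the standard tempfile module accepts a dir argument, so working files can be placed in a chosen folder instead of the default location. The following is a minimal sketch; the helper name, the folder layout and the banff_diag_ prefix are assumptions made for illustration and are not part of the processor.

```python
import tempfile
from pathlib import Path


def make_diagnostics_dir(base_folder: Path, prefix: str = "banff_diag_") -> Path:
    """Create a uniquely named working folder under base_folder.

    Passing dir= keeps the files out of the default temporary location.
    """
    base_folder.mkdir(parents=True, exist_ok=True)
    return Path(tempfile.mkdtemp(prefix=prefix, dir=str(base_folder)))


if __name__ == "__main__":
    # Hypothetical project folder; replace with an appropriate location.
    workdir = make_diagnostics_dir(Path("diagnostics"))
    print("Diagnostic files can be written to", workdir)
```

Folders created with mkdtemp are not removed automatically, so the plugin or the calling program remains responsible for cleaning them up.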

Inputs

SAS Macro parameters

SAS Macro variables are now Python function parameters. With the new processor, these parameters can be specified in a JSON file. Alternatively, inputs can be specified directly when creating a Processor object.

Parameter names have changed to respect Python naming conventions and have been made more consistent and descriptive. Some have been replaced by more generic options or are no longer applicable. Also note that parameter names are now case sensitive.

| SAS | Python | Notes |
| --- | --- | --- |
| jobid | job_id | |
| id | unit_id | This change was required because id is a built-in Python function and is not recommended as a variable name. |
| dataLib | input_folder | In SAS, dataLib was a libref; in Python, a file folder is specified. However, datasets can also be specified directly when creating a Processor object. If no dataset is specified, the processor will look in the input folder for the file name specified for that input. |
| curFile | indata_filename | |
| outdataLib | output_folder | |
| auxFile | indata_aux_filename | |
| | instatus_filename | This is a new optional parameter in the Python Processor which allows initial status values to be specified. |
| histFile | indata_hist_filename | |
| histStatus | instatus_hist_filename | |
| custProgFref | user_plugins_folder | |
| flatfileFref | | Dropped from the Python Processor as this option was rarely used. |
| seed | seed | |
| logType | log_level | The log_level parameter provides functionality similar to logType. |
| editstatsOutputType | | Replaced by process_output_type. |
| estimatorOutputType | | Replaced by process_output_type. |
| massImputOutputType | | Replaced by process_output_type. |
| randnumvar | randnumvar | |
| genCode/fgenprog | | No longer applicable; program code is no longer generated and executed. |
| editGroupFilter | | Replaced by the EDIT_GROUP_FILTER process control. |
| tempLib | | No longer applicable. |
| bpOptions | | No longer applicable; these options were TIME, KEEPTEMP and NOBYGRPSTATS. |
| | save_format | This is a new option in the Python Processor; the SAS Processor produced SAS datasets. Parquet is currently the recommended save format (.parq); CSV is provided for testing and debugging purposes. |
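
When existing SAS macro calls are converted, the direct renames in the table above can be applied mechanically. The following is a minimal sketch under stated assumptions: the helper and the sample values are hypothetical, and the flat JSON object it prints is only an illustration, since the layout the processor expects is described in the main user guide. logType is left out of the rename map on purpose because log_level provides similar, not identical, functionality.

```python
import json

# Direct renames taken from the table above. Parameters that were dropped or
# replaced (flatfileFref, tempLib, bpOptions, ...) are deliberately absent so
# that they are reported for manual review.
SAS_TO_PYTHON = {
    "jobid": "job_id",
    "id": "unit_id",
    "dataLib": "input_folder",
    "curFile": "indata_filename",
    "outdataLib": "output_folder",
    "auxFile": "indata_aux_filename",
    "histFile": "indata_hist_filename",
    "histStatus": "instatus_hist_filename",
    "custProgFref": "user_plugins_folder",
    "seed": "seed",
    "randnumvar": "randnumvar",
}


def translate_parameters(sas_params: dict) -> tuple[dict, list]:
    """Rename SAS macro parameters and list those needing manual review."""
    # SAS macro variable names are case-insensitive; Python names are not.
    renames = {old.lower(): new for old, new in SAS_TO_PYTHON.items()}
    translated, needs_review = {}, []
    for name, value in sas_params.items():
        new_name = renames.get(name.lower())
        if new_name is None:
            needs_review.append(name)
        else:
            translated[new_name] = value
    return translated, needs_review


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    old = {"jobid": "j1", "id": "ident", "curFile": "current.parq", "tempLib": "work"}
    new, review = translate_parameters(old)
    print(json.dumps(new, indent=2))
    print("Review manually:", review)
```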

Input files

Input data files

In the SAS Processor, input data files were SAS datasets. There were essentially two data types: character (fixed-width strings) and numeric (64-bit floating point numbers). In the Python Processor, parquet files are the recommended file format. These files are read in and mainly stored as arrow tables, though in some cases data is converted to other formats such as pandas data frames or duckdb tables. Many data types are available; we generally recommend str (variable-length strings) and float64 (64-bit floating point numbers), although other types can be used (float32, float16, int8, int16, ...). Variable types should be verified and adjusted as necessary. CSV files are not recommended mainly because they lack the metadata needed to ensure that types are set correctly when reading and writing files.
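
The point about CSV files can be seen with the standard library alone. The following sketch writes one record to CSV and reads it back; the helper and the column names and values are hypothetical. Every field returns as a string, so a reader has to guess the intended types, and an identifier such as 00123 is at risk of being interpreted as the number 123.

```python
import csv
import io


def csv_round_trip(rows, fieldnames):
    """Write rows to CSV text in memory and read them back."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    buffer.seek(0)
    return list(csv.DictReader(buffer))


if __name__ == "__main__":
    # Hypothetical record: a string identifier, a float and an integer.
    original = [{"ident": "00123", "revenue": 1500.0, "employees": 12}]
    restored = csv_round_trip(original, ["ident", "revenue", "employees"])
    for name, value in restored[0].items():
        print(name, repr(value), type(value).__name__)
```

Parquet files store the type of each column with the data, which is why they avoid this ambiguity.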

Metadata Files

  • The structure of the metadata files in the Python Processor is essentially the same as in the SAS-based version; any new elements are optional. However, the contents of the metadata may need to be adjusted. In particular, expressions will likely need to be updated to reflect the new syntax.

  • Maximum lengths for most metadata columns have been increased. For example, many ID fields previously had a limit of 30 characters; this limit has been increased to 100 characters.

  • Though the metadata files still accept values of Yes/No or Y/N, they are stored in the processor as Boolean values and will therefore be converted to True/False values.

  • The Banff Processor still has an Excel template to help create the XML files it expects. However, the Excel macro has been removed from the template, and the banffprocessor package now includes a utility to convert the Excel workbook to XML. This utility can be called from the command line or from within a Python program. The command-line utility is called banffconvert; it is defined in banffprocessor.util.metadata_excel_to_xml.

| Metadata file | Notes |
| --- | --- |
| JOBS | Jobs has a new, optional element called controlid. This new column is used to link specifications in the process controls metadata. Also note that SEQNO can now have decimals; previously SEQNO could only be an integer. |
| USERVARS | The structure has not changed. |
| EDITS | The structure has not changed. The syntax for edits has not changed. |
| EDITGROUPS | No changes. |
| VERIFYEDITSPECS | No changes. |
| OUTLIERSPECS | No changes. |
| ERRORLOCSPECS | No changes. |
| DONORSPECS | No changes. |
| ESTIMATORSPECS | No changes. |
| PRORATESPECS | No changes. |
| MASSIMPUTATIONSPECS | No changes. |
| ALGORITHMS | User-defined algorithms can no longer override the algorithms of built-in estimators; a new name needs to be chosen. |
| ESTIMATORS | No changes. |
| EXPRESSIONS | The structure has not changed. However, expressions are now based on SQLite as implemented in duckdb. For example, string constants must be enclosed in single quotes as opposed to double quotes; P53_05_1="1" would need to be changed to P53_05_1='1'. |
| VARLISTS | No changes. |
| WEIGHTS | No changes. |
| PROCESSCONTROLS | This is a new metadata file that is used to create process control specifications. |
| PROCESSOUTPUTS | This is a new metadata file that is used to control what outputs are kept. It is used when process_output_type='Custom'. |
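
The quoting change noted for EXPRESSIONS is the kind of edit that can be partly automated when many expressions need converting. The following is a minimal sketch, not a complete SAS-to-SQL converter: the function name is hypothetical, and it assumes that every double-quoted run in the expression is a string constant and that the expression contains no SAS-style doubled quotes. Converted expressions should still be reviewed by hand.

```python
import re

# Assumption: anything between a pair of double quotes is a string constant.
_DOUBLE_QUOTED = re.compile(r'"([^"]*)"')


def to_single_quoted_constants(expression: str) -> str:
    """Rewrite "text" constants as 'text', doubling embedded single quotes."""

    def replace(match: re.Match) -> str:
        # In SQL, a single quote inside a string constant is written as ''.
        text = match.group(1).replace("'", "''")
        return f"'{text}'"

    return _DOUBLE_QUOTED.sub(replace, expression)


if __name__ == "__main__":
    print(to_single_quoted_constants('P53_05_1="1"'))
```

Other SAS-specific syntax, such as SAS functions or operators, is outside the scope of this helper.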

User-defined Processes (UDPs)

In the Python Processor, user-defined processes are commonly referred to as plugins. SAS-based UDPs were SAS program files executed via an include statement; user parameters defined in metadata were available in the program as global SAS macro variables. SAS UDPs will need to be reviewed and rewritten as plugins. Plugins are Python classes that implement the protocol defined in banffprocessor.procedures.procedure_interface; user parameters defined in metadata can be accessed through the ProcessorData object. Note that in some cases, tasks previously handled by UDPs will no longer be required or can be replaced by Process Controls, which are meant to reduce the need for UDPs that perform data management tasks.

| SAS | Python | Notes |
| --- | --- | --- |
| parmKeyVar, parmByList, parmSeqno, ... | processor_data.input_params.unit_id, processor_data.by_varlist, processor_data.current_job_step.seqno, ... | In the SAS Processor, SAS global macro variables were used to access input parameters; now input parameters are available through processor_data object attributes such as input_params and current_job_step. |
| work.jobs | processor_data.dbconn.sql("select * from Banff.JOBS").to_arrow_table() | Instead of accessing metadata tables as SAS datasets in the SAS work library, metadata tables are accessible in a duckdb database through processor_data.dbconn. |
| work.statusall | status_table = processor_data.get_dataset("status_file", table_format="arrow") | Instead of accessing data tables as SAS datasets in the SAS work library, datasets are accessible through the get_dataset function of processor_data. The dataset can be returned in arrow or pandas format. The set_dataset function can be used to save an output dataset. |
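
To make the first row of the table concrete, the sketch below reads the three attributes named there from a stand-in object. It is an illustration only: describe_step is a hypothetical helper rather than a plugin, the stand-in is built with SimpleNamespace instead of the real ProcessorData class, and the sample values (including the form of by_varlist) are assumptions. The actual plugin protocol is defined in banffprocessor.procedures.procedure_interface and described in the user guide.

```python
from types import SimpleNamespace


def describe_step(processor_data) -> str:
    """Read the attributes that replace the SAS global macro variables."""
    unit_id = processor_data.input_params.unit_id    # SAS: parmKeyVar
    by_vars = processor_data.by_varlist              # SAS: parmByList
    seqno = processor_data.current_job_step.seqno    # SAS: parmSeqno
    return f"step {seqno}: unit id '{unit_id}', by-variables {by_vars}"


if __name__ == "__main__":
    # Stand-in object with hypothetical values; in a real plugin the
    # processor supplies the ProcessorData object.
    stand_in = SimpleNamespace(
        input_params=SimpleNamespace(unit_id="ident"),
        by_varlist="prov_list",
        current_job_step=SimpleNamespace(seqno=3),
    )
    print(describe_step(stand_in))
```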

Output files

The SAS Processor wrote cumulative output files with the suffix all. In the Python Processor this suffix has been dropped.

| SAS | Python | Notes |
| --- | --- | --- |
| imputedfile | imputed_file | This output remains the same. |
| statusall | status_file | The columns on the status file have been reduced. The standard columns are the unit ID variable along with FIELDID, STATUS, VALUE, JOBID and SEQNO. VALUE is a new column; editgroupid, outlierstatus and by-variables have been removed. |
| cumulatifstatusall | status_log | Like the status file, the columns on this output have been standardized. |
| | time_store | This is a new dataset which stores the start time, end time and duration of each processing step along with the cumulative execution time. |
| acceptableall | outacceptable | |
| donormapall | outdonormap | |
| editapplicall | outedit_applic | |
| editstatusall | outedit_status | |
| estefall | outest_ef | |
| estlrall | outest_lr | |
| estparmsall | outest_parm | |
| globalstatusall | outglobal_status | |
| keditsstatusall | outk_edits_status | |
| matchfieldstatall | outmatching_fields | |
| outlierstatusall | outlier_status | |
| randomerrorall | outrand_err | |
| reducededitsall | outedits_reduced | |
| | outreject | The outreject file is the last outreject dataset generated by either Prorate or ErrorLoc. In the SAS version it was only a working dataset called rejected; in the Python version, it is included as an output dataset. |
| rejectedall | outreject_all | The rejected all file is a special case where the all suffix was retained, because Prorate and ErrorLoc process this dataset slightly differently. See the user guide for more information. |
| varsroleall | outvars_role | |
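
Downstream programs that refer to the old output names cannot simply strip the all suffix, because several names changed in other ways (statusall became status_file, for example). The following is a minimal sketch of an explicit lookup built from the table above; the function name is hypothetical and the lookup is not part of the banffprocessor package.

```python
# SAS output name -> Python output name, taken from the table above.
SAS_TO_PYTHON_OUTPUTS = {
    "imputedfile": "imputed_file",
    "statusall": "status_file",
    "cumulatifstatusall": "status_log",
    "acceptableall": "outacceptable",
    "donormapall": "outdonormap",
    "editapplicall": "outedit_applic",
    "editstatusall": "outedit_status",
    "estefall": "outest_ef",
    "estlrall": "outest_lr",
    "estparmsall": "outest_parm",
    "globalstatusall": "outglobal_status",
    "keditsstatusall": "outk_edits_status",
    "matchfieldstatall": "outmatching_fields",
    "outlierstatusall": "outlier_status",
    "randomerrorall": "outrand_err",
    "reducededitsall": "outedits_reduced",
    "rejectedall": "outreject_all",
    "varsroleall": "outvars_role",
}


def python_output_name(sas_name: str) -> str:
    """Return the Python output name for a SAS output dataset name."""
    try:
        # SAS dataset names are case-insensitive.
        return SAS_TO_PYTHON_OUTPUTS[sas_name.lower()]
    except KeyError:
        raise KeyError(f"No Python equivalent listed for {sas_name!r}") from None


if __name__ == "__main__":
    print(python_output_name("StatusAll"))
```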

Conclusion

For more information on how to use Banff and the Banff Processor, please consult the main user guides. Hopefully this information helps with the transition from version 1 to version 2.