Banff Processor Version 1 Migration Guide

Foreword

This document is intended for users of the SAS based version of the Banff Processor and serves as a supplement to the main user guide to help with the transition from SAS (Banff Processor Version 1) to Python (Banff Processor Version 2).

Migrating from SAS to Python

The SAS version of the Banff Processor dates back to 2008, it has been used at Statistics Canada for many years. The Python version is similar to the SAS version in many ways and effort has been made to make the metadata backwards compatible where possible. This guide highlights some of the main differences to help users migrate to the new version.

In general, XML metadata files from version 1 is compatible with version 2, with the following exceptions:

SAS-based custom programs (or user-defined programs) will not work; they must be replaced with plugins. Guidelines for developing plugins are available in the user guide.
Syntax in expressions must be reviewed; SAS specific expressions must be converted to SQL-Lite syntax.
The SAS-based processor included some default behaviour that has been removed to increase modularity and transparency. The past behaviour can be replicated with process controls; see the section on Process Controls in the user guide for details.

New Features

Process Blocks: Process Blocks allow users to call a job within a job. In the SAS Processor, the jobs table could contain multiple jobs, but now they can be chained together to create a master job. This is done by using the process called job and specifying the job id in the spec id column.
Process Controls: Process Controls allow modifications to be made to the inputs of a processing step without the risk of modifying the main working datasets. This was typically done in the SAS processor with user-defined processes, or the input modifications were hardcoded in the processor itself.
The Banff Processor now supports variable lengths up to 64 characters. A variable must still begin with a character or underscore and consist of only alphabetic characters, digits, and underscores (spaces with variable names are not supported).

Behaviour Changes

A key difference is that the SAS Banff Processor was a code generator. The processor created a SAS program and then the program was executed. There was an option to save the generated program, which could be run and used for debugging purposes. This is no longer the case with the Python processor, the process is executed dynamically in one pass.
In the SAS Processor, the SAS work library was used to store and access temporary datasets. In Python there are a few alternatives. Generally, data is accessed in user-defined processes (plugins) through the ProcessorData object, however, plugin developers can choose to store diagnostic files in another location such as a duckdb database or an appropriate folder. Using the tempfile library is an option, but it saves data in the user’s profile by default which may not be appropriate in certain situations.
The processor no longer automatically deletes records found on the reject file from the input data of an imputation step. If this behaviour is desired, the exclude_rejected process control can be used. Note that the rejected data can be accessed in a user defined process with processor_data.get_output_dataset("outreject").
When calling the ErrorLoc procedure, outlier status is no longer taken into consideration, only values flagged as FTI in the input status file. If this behaviour is desired, a user-defined process (plugin) can be created. The medium-term plan would be to replace this functionality with a process control/filter.

Inputs

SAS Macro parameters

SAS Macro variables are now Python function parameters. With the new processor, these parameters can be specified in a JSON file. Alternatively, inputs can be specified directly when creating a Processor object.

Parameter names have changed to respect Python naming conventions and improved to be more consistent and descriptive. Some have been replaced by more generic options or are no longer applicable. Also note that parameter names are now case sensitive.

SAS	Python	Notes
jobid	job_id
id	unit_id	This changed was required as id is a python function and not recommended to be used as a variable name.
dataLib	input_folder	In SAS, dataLib was a libref, in Python a file folder is specified. However, datasets can also be specified directly when creating a Processor object. If no dataset is specified, the processor will look in the input folder for the specified file name associated with input file.
curFile	indata_filename
outdataLib	output_folder
auxFile	indata_aux_filename
	instatus_filename	This is a new optional parameter in the Python Processor which allows initial status values to be specified.
histFile	indata_hist_filename
histStatus	instatus_hist_filename
custProgFref	user_plugins_folder
flatfileFref		Dropped from the Python Processor as this option was rarely used.
seed	seed
logType	log_level	The log_level parameter provides similar functionality as logType.
editstatsOutputType		Replaced by process_output_type.
estimatorOutputType		Replaced by process_output_type.
massImputOutputType		Replaced by process_output_type.
randnumvar	randnumvar
genCode/fgenprog		No longer applicable, a program code is no longer generated and executed.
editGroupFilter		Replaced by the EDIT_GROUP_FILTER process control
tempLib		No longer applicable
bpOptions		No longer applicable, these options were TIME, KEEPTEMP and NOBYGRPSTATS.
	save_format	This is a new option in the Python Processor, the SAS Processor produced SAS datasets. Parquet is currently the recommended save format (.parq), CSV is provided for testing and debugging purposes.

Input files

Input data files

In the SAS Processor, input data files were SAS datasets. There were essentially two data types: character (fixed width strings) and numeric (64-bit floating point numbers). In the Python Processor, parquet files are the recommended file format. These files are read in and mainly stored as arrow tables, though, in some cases, data is converted to other formats such as pandas data frames or duckdb tables. There are many different types, we generally recommend str (variable length strings) and float64 (64-bit floating point numbers), although various types can be used (float32, float16, int8, int16, ...). Variable types should be verified and adjusted as necessary. The main reason that CSV files are not recommended is due to the lack of metadata to ensure that types are set correctly when reading and writing files.

Metadata Files

The structure of the metadata files in the Python Processor are essentially the same as the SAS based version, any new elements are optional. However, the contents of the metadata may need to be adjusted. The expressions will likely need to be updated to reflect the new syntax.
Maximum lengths for most metadata columns have been increased. For example, previously, many ID fields had a limit of 30 characters, this limit has been increased to 100 characters.
Though the metadata files still except values of Yes/No or Y/N, They are stored in the processor as Boolean values and will therefore be converted to True/False values.
The Banff Processor still has an Excel template to help facilitate the creation of XML files as expected by the Banff Processor, however, the Excel Macro has been removed from the template and the banffprocessor package now includes a utility to convert the Excel workbook to XML. This utility can be called from the command line or from within a Python program. The command line utility is called banffconvert the utility is defined in banffprocessor.util.metadata_excel_to_xml

Metadata file	Notes
JOBS	Jobs has a new, optional element called controlid. This new column is used to link specifications in the process controls metadata. Also note that SEQNO can now have decimals, previously SEQNO could only be an integer.
USERVARS	The structure has not changed.
EDITS	The structure has not changed. The syntax for edits has not changed.
EDITGROUPS	No changes.
VERIFYEDITSPECS	No changes.
OUTLIERSPECS	No changes.
ERRORLOCSPECS	No changes.
DONORSPECS	No changes.
ESTIMATORSPECS	No changes.
PRORATESPECS	No changes.
MASSIMPUTATIONSPECS	No changes.
ALGORITHMS	User-defined algorithms can no longer override the algorithms of built-in estimators, a new name needs to be chosen.
ESTIMATORS	No changes.
EXPRESSIONS	The structure has not changed. However, expressions are now based on SQLite as implemented in duckdb. An example difference would be that string constants must be enclosed in single quotes as opposed to double quotes; `P53_05_1="1"` would need to be changed to `P53_05_1='1'`.
VARLISTS	No changes.
WEIGHTS	No changes.
PROCESSCONTROLS	This is a new metadata file that is used to create process control specfications.
PROCESSOUTPUTS	This is a new metadata file that is used to control what outputs are kept. It is used when process_output_type='Custom'

User-defined Processes (UDPs)

In the Python Processor user-defined processes are commonly referred to as plugins. SAS based UDPs were SAS program files that were executed via an include statement, user parameters defined in metadata were available in the program as global SAS macro variables. SAS UDPs will need to be reviewed and written as plugins. Plugins are Python classes that implement the protocol defined in banffprocessor.procedures.procedure_interface, user parameters defined in metadata can be accessed through the ProcessorData object. Note that in some cases, tasks previous handled by UDPs will no longer be required or can be replaced by Process Controls. Process Controls are meant to reduce the need for UDPs that perform data management tasks.

SAS	Python	Notes
parmKeyVar, parmByList, parmSeqno, ...	processor_data.input_params.unit_id, processor_data.by_varlist, processor_data.current_job_step.seqno, ...	In the SAS Processor, SAS global macro variables were used to access input paramters, now input parameters are available through processor_data object attributes such as input_params and current_job_step
work.jobs	processor_data.dbconn.sql("select * from Banff.JOBS").to_arrow_table()	Instead of accessing metadata tables as SAS datasets in the SAS work library, metadata tables are accessible in a duckdb database through processor_data.dbconn.
work.statusall	status_table = processor_data.get_dataset("status_file", table_format="arrow")	Instead of accessing data tables as SAS datasets in the SAS work library, datasets are accessible through the get_dataset function of processor_data. The dataset can be returned in arrow or pandas format. The set_dataset function can be used to save an output dataset.

Output files

The SAS Processor was outputting an accumulative file with the suffix all. This suffix has been dropped.

SAS	Python	Notes
imputedfile	imputed_file	This output remains the same.
statusall	status_file	The columns on the status file have been reduced. The standard columns are the unit ID variable along with FIELDID, STATUS, VALUE, JOBID and SEQNO. VALUE is a new column, editgroupid, outlierstatus and by-variables have been removed.
cumulatifstatusall	status_log	Like the status file, the columns on this output have been standardized.
	time_store	This is a new dataset which stores the start time, end time and duration of each processing step along with the accumulative execution time.
acceptableall	outacceptable
donormapall	outdonormap
editapplicall	outedit_applic
editstatusall	outedit_status
estefall	outest_ef
estlrall	outest_lr
estparmsall	outest_parm
globalstatusall	outglobal_status
keditsstatusall	outk_edits_status
matchfieldstatall	outmatching_fields
outlierstatusall	outlier_status
randomerrorall	outrand_err
reducededitsall	outedits_reduced
	outreject	The outreject file is the last outreject dataset generated by either Prorate or ErrorLoc. It was only a working dataset in the SAS version called `rejected`, in the Python version, it is included as an output data set.
rejectedall	outreject_all	The rejected all file is a special case where the `all` suffix was retained. This was because Prorate and ErrorLoc process this dataset slightly different. See the user guide for more information.
varsroleall	outvars_role

Conclusion

For more information on how to use Banff and the Banff Processor, please consult the main user guides. Hopefully this information helps with the transition from version 1 to version 2.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Banff Processor Version 1 Migration Guide

Foreword

Migrating from SAS to Python

New Features

Behaviour Changes

Inputs

SAS Macro parameters

Input files

Input data files

Metadata Files

User-defined Processes (UDPs)

Output files

Conclusion

FilesExpand file tree

migrating-from-sas-python.md

Latest commit

History

migrating-from-sas-python.md

File metadata and controls

Banff Processor Version 1 Migration Guide

Foreword

Migrating from SAS to Python

New Features

Behaviour Changes

Inputs

SAS Macro parameters

Input files

Input data files

Metadata Files

User-defined Processes (UDPs)

Output files

Conclusion