Update differ to ouptut MCF files#1998
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the import_differ tool to generate diffs in MCF format and a consolidated JSON summary, replacing the previous CSV-based outputs. It introduces a direct runner mode for executing a Java-based differ via subprocess and updates the validation logic to consume these new formats. Feedback from the review highlights a mismatch in the glob pattern used to locate MCF diff files, a regression in defensive error handling during JSON parsing, and confusing logic regarding the runner_mode flag mapping where the local mode triggers the Java runner instead of the native Python implementation.
d6cc92a to
866f95e
Compare
| path = os.path.join(tmp_dir, file) | ||
| else: | ||
| path = os.path.join(dest, file) | ||
| with open(path, mode='w', encoding='utf-8') as out_file: |
There was a problem hiding this comment.
You can use util/file_util.py:FileIO() to support GCS or local files transparently without needing extra GCS file handling code here:
with FileIO(path, mode='w', encoding='utf-8') as out_file:
...
| flags.DEFINE_string('file_format', 'mcf', | ||
| 'Format of the input data (mcf,tfrecord)') | ||
| flags.DEFINE_string('runner_mode', 'local', 'Runner mode (local/cloud)') | ||
| flags.DEFINE_string('runner_mode', 'native', |
There was a problem hiding this comment.
Can we add a brief description of the different modes in the file? The difference between native and direct is not clear by the term.
| for diff_type in [ | ||
| Diff.ADDED.name, Diff.DELETED.name, Diff.MODIFIED.name | ||
| ]: | ||
| df_type = diff_df[diff_df[Column.diff_type.name] == diff_type] |
There was a problem hiding this comment.
If the obs nodes are being filtered separately by type, can we write them into separate mcf files too?
That may help with other analysis and also skip the diffType property in the node.
| def get_val(base_name): | ||
| col_name = base_name + suffix | ||
| if col_name in row: | ||
| return str(row[col_name]) |
There was a problem hiding this comment.
Instead of str(), can we have a utility to convert it to a string?
for example, for lists, instead of [... ], MCFs use a,b,c.
Also if the value has spaces, it needs to be enclosed in double quotes.
def val_str(value) -> str:
if isinstance(value, list):
return ",".join([val_str(v) for v in value])
if value and isinstance(value, str) and " " in value and value[0].isalpha():
return '"' + value + '"'
return str(value)
| if 'StatVarObservation' in node.get(Column.typeOf.name): | ||
| values_to_combine = [] | ||
| keys_to_combine = [] | ||
| groupby_keys = [ |
There was a problem hiding this comment.
Can we make this a constant since it is used in multiple places?
No description provided.