Fix the provenance summary generation query#2024
Merged
Merged
Conversation
- Modify `aggregation_utils.py` to reconstruct valid JSON arrays from Spanner's custom Observations proto by unrolling the map using `UNNEST` and aggregating the key-value pairs. This allows BigQuery to parse it correctly, resolving the silent failure that resulted in an empty Cache table. - Wrap the place type JSON_OBJECT generation in an IF guard to check if the keys array is populated. This makes the aggregation script robust against test databases that are missing standard metadata (like missing `typeOf` edges for places), preventing it from crashing and allowing the run to safely succeed with a NULL place type summary. - End-to-end verified against the datcom-ci Spanner test instance, successfully populating the Cache table with 11 ProvenanceSummary rows.
vish-cs
approved these changes
May 21, 2026
Contributor
There was a problem hiding this comment.
Code Review
This pull request updates the run_provenance_summary_aggregation function in aggregation_utils.py to improve the handling of JSON serialization for observations and place type summaries. The implementation replaces simple casting with manual JSON construction and nested subqueries. Review feedback highlights that manual JSON construction is insecure and suggests using native TO_JSON_STRING and JSON_OBJECT functions to ensure proper character escaping. Additionally, it is recommended to simplify the place type summary logic using a HAVING clause and to follow naming conventions for count variables.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
aggregation_utils.pyto reconstruct valid JSON arrays from Spanner's custom Observations proto by unrolling the map usingUNNESTand aggregating the key-value pairs. This allows BigQuery to parse it correctly, resolving the silent failure that resulted in an empty Cache table.JSON_OBJECTgeneration in anIFguard to check if the keys array is populated. This makes the aggregation script robust against test databases that are missing standard metadata (like missingtypeOfedges for places), preventing it from crashing and allowing the run to safely succeed with a NULL place type summary.Verified against the datcom-ci Spanner test instance, successfully populating the Cache table with 11 ProvenanceSummary rows.