Fix the provenance summary generation query #2024

Merged
SandeepTuniki merged 2 commits into master from fix-spanner-proto-aggregation
May 21, 2026
Conversation

@SandeepTuniki
Contributor

  • Modify aggregation_utils.py to reconstruct valid JSON arrays from Spanner's custom Observations proto by unrolling the map with UNNEST and aggregating the key-value pairs. BigQuery can then parse the result, which resolves the silent failure that left the Cache table empty.
  • Wrap the place type JSON_OBJECT generation in an IF guard that checks whether the keys array is populated. This makes the aggregation script robust against test databases that lack standard metadata (such as typeOf edges for places): instead of crashing, the run succeeds with a NULL place type summary.

Verified end-to-end against the datcom-ci Spanner test instance; the run populated the Cache table with 11 ProvenanceSummary rows.
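
The two changes described above can be illustrated with a small Python analogy. This is only a sketch of the idea, not the SQL in aggregation_utils.py: the function names and the `date`/`value` field names are hypothetical, and the real query performs these steps with UNNEST, JSON_OBJECT, and IF.

```python
import json
from typing import Optional


def observations_to_json_array(observations: dict[str, str]) -> str:
    """Unroll a map into key-value pairs and serialize them as a JSON array.

    Hypothetical stand-in for unrolling the Observations map and
    aggregating the pairs; the field names "date" and "value" are assumptions.
    """
    pairs = [{"date": k, "value": v} for k, v in sorted(observations.items())]
    return json.dumps(pairs)


def place_type_summary(keys: list[str], counts: list[int]) -> Optional[str]:
    """Build a JSON object only when keys are present (the guard).

    Returns None (analogous to SQL NULL) when there is nothing to summarize,
    instead of failing on empty input.
    """
    if not keys:
        return None
    return json.dumps(dict(zip(keys, counts)))


if __name__ == "__main__":
    print(observations_to_json_array({"2021": "5", "2020": "3"}))
    print(place_type_summary([], []))
    print(place_type_summary(["City"], [3]))
```

The guard mirrors the behavior the description states: missing metadata yields an empty summary rather than a crash.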

@SandeepTuniki SandeepTuniki requested a review from vish-cs May 21, 2026 10:33
@SandeepTuniki SandeepTuniki enabled auto-merge (squash) May 21, 2026 10:35
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the run_provenance_summary_aggregation function in aggregation_utils.py to improve the handling of JSON serialization for observations and place type summaries. The implementation replaces simple casting with manual JSON construction and nested subqueries. Review feedback highlights that manual JSON construction is insecure and suggests using native TO_JSON_STRING and JSON_OBJECT functions to ensure proper character escaping. Additionally, it is recommended to simplify the place type summary logic using a HAVING clause and to follow naming conventions for count variables.
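
The escaping concern raised in this review can be shown with a short Python sketch. It uses Python's `json` module as a stand-in for a native serializer such as TO_JSON_STRING; the helper names are hypothetical and the snippet is not taken from the PR.

```python
import json


def manual_json_pair(key: str, value: str) -> str:
    """Build a JSON object by string concatenation (no escaping)."""
    return '{"' + key + '": "' + value + '"}'


def native_json_pair(key: str, value: str) -> str:
    """Build the same object with a serializer that escapes special characters."""
    return json.dumps({key: value})


def is_valid_json(text: str) -> bool:
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


if __name__ == "__main__":
    tricky = 'say "hi"'  # embedded double quotes break naive concatenation
    print(is_valid_json(manual_json_pair("note", tricky)))
    print(is_valid_json(native_json_pair("note", tricky)))
```

A value containing a double quote produces unparseable output from the concatenating version, while the serializer escapes it and the value round-trips intact.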

Comment thread import-automation/workflow/ingestion-helper/aggregation_utils.py
Comment thread import-automation/workflow/ingestion-helper/aggregation_utils.py
@SandeepTuniki SandeepTuniki merged commit 6808d38 into master May 21, 2026
9 checks passed
2 participants