Fix the provenance summary generation query by SandeepTuniki · Pull Request #2024 · datacommonsorg/data

SandeepTuniki · 2026-05-21T10:32:46Z

- Modify `aggregation_utils.py` to reconstruct valid JSON arrays from Spanner's custom Observations proto by unrolling the map with `UNNEST` and aggregating the key-value pairs. BigQuery can then parse the result correctly, which resolves the silent failure that left the Cache table empty.
- Wrap the place type `JSON_OBJECT` generation in an `IF` guard that checks whether the keys array is populated. This makes the aggregation script robust against test databases that lack standard metadata (such as `typeOf` edges for places): instead of crashing, the run succeeds with a NULL place type summary.
- Verified end-to-end against the datcom-ci Spanner test instance, successfully populating the Cache table with 11 ProvenanceSummary rows.
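The two changes described above can be illustrated with a plain-Python analogy. This is only a sketch of the idea, not the SQL in this PR: the function names `observations_to_json` and `place_type_summary` and the `"key"`/`"value"` field names are hypothetical, chosen for illustration.

```python
import json
from typing import Dict, List, Optional


def observations_to_json(observations: Dict[str, str]) -> str:
    """Unroll a map into a JSON array of key-value objects.

    Analogous to unrolling the map with UNNEST and aggregating the pairs.
    The "key"/"value" field names are assumptions for this sketch.
    """
    pairs = [{"key": k, "value": v} for k, v in sorted(observations.items())]
    return json.dumps(pairs)


def place_type_summary(keys: List[str], counts: List[int]) -> Optional[str]:
    """Build a JSON object only when the keys list is populated.

    Analogous to the IF guard: an empty keys list yields None (NULL)
    instead of attempting to build the object.
    """
    if not keys:
        return None
    return json.dumps(dict(zip(keys, counts)))
```

With this sketch, `place_type_summary([], [])` returns `None` rather than raising, which mirrors the run succeeding with a NULL place type summary when `typeOf` edges are missing.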

gemini-code-assist

Code Review

This pull request updates the run_provenance_summary_aggregation function in aggregation_utils.py to improve the handling of JSON serialization for observations and place type summaries. The implementation replaces simple casting with manual JSON construction and nested subqueries. Review feedback highlights that manual JSON construction is insecure and suggests using native TO_JSON_STRING and JSON_OBJECT functions to ensure proper character escaping. Additionally, it is recommended to simplify the place type summary logic using a HAVING clause and to follow naming conventions for count variables.
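The escaping concern raised in the review can be shown with a small Python analogy. It is a sketch only: Python's `json.dumps` stands in for a native serializer such as `TO_JSON_STRING`, and the helper names below are hypothetical rather than taken from `aggregation_utils.py`.

```python
import json


def manual_json_pair(key: str, value: str) -> str:
    # Naive string concatenation: no escaping of quotes or backslashes.
    return '{"' + key + '": "' + value + '"}'


def safe_json_pair(key: str, value: str) -> str:
    # Native serializer: escapes special characters in keys and values.
    return json.dumps({key: value})


def is_valid_json(text: str) -> bool:
    # Return True only if the text parses as JSON.
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False
```

A value containing a double quote, such as `say "hi"`, produces unparseable output from `manual_json_pair` but round-trips cleanly through `safe_json_pair`.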

SandeepTuniki requested a review from vish-cs May 21, 2026 10:33

vish-cs approved these changes May 21, 2026

SandeepTuniki enabled auto-merge (squash) May 21, 2026 10:35

gemini-code-assist Bot reviewed May 21, 2026

Comment thread import-automation/workflow/ingestion-helper/aggregation_utils.py

Merge branch 'master' into fix-spanner-proto-aggregation

1f1f366

SandeepTuniki merged commit 6808d38 into master May 21, 2026
9 checks passed

SandeepTuniki merged 2 commits into master from fix-spanner-proto-aggregation
