Skip to content

fix: use utf-8 for local tracking text files#790

Merged
andreahlert merged 5 commits into
apache:mainfrom
Ghraven:fix/local-backend-annotation-utf8
May 29, 2026
Merged

fix: use utf-8 for local tracking text files#790
andreahlert merged 5 commits into
apache:mainfrom
Ghraven:fix/local-backend-annotation-utf8

Conversation

@Ghraven
Copy link
Copy Markdown
Contributor

@Ghraven Ghraven commented May 26, 2026

Summary

Thanks for maintaining Burr. This PR makes the local tracking backend text-file reads and writes use explicit UTF-8 encoding.

What changed

  • Adds encoding="utf-8" to annotation JSONL reads and writes in LocalBackend
  • Adds encoding="utf-8" when reading children.jsonl
  • Leaves binary log/metadata reads unchanged

Before / after

Before, these JSONL text files used the platform default encoding, which can vary on Windows and other non-UTF-8 locale environments.

After, the local backend reads and writes these text files consistently as UTF-8.

Verification

  • python -m py_compile burr/tracking/server/backend.py
  • git diff --check

@github-actions github-actions Bot added the area/tracking Telemetry, tracing, OpenTelemetry label May 26, 2026
@andreahlert andreahlert added the kind/bug Something is broken label May 26, 2026
Copy link
Copy Markdown
Collaborator

@andreahlert andreahlert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix. One spot in the same file looks missed: backend.py:524 reads graph.json text-mode without encoding="utf-8". Same class of bug, and client writes it as UTF-8 (client.py:420).

async with aiofiles.open(graph_file, encoding="utf-8") as f:

Could you include that line in this PR as well?

@Ghraven
Copy link
Copy Markdown
Contributor Author

Ghraven commented May 27, 2026

Thank you for catching that missed spot. I added the same explicit encoding="utf-8" to the graph.json read in LocalBackend and pushed the update. I verified it with python -m py_compile burr/tracking/server/backend.py and git diff --check.

@skrawcz
Copy link
Copy Markdown
Contributor

skrawcz commented May 28, 2026

This looks like the right fix, but since this is correcting platform-dependent file encoding behavior, I think it would be worth adding a regression test that writes and reads non-ASCII content through LocalBackend (annotations / graph / children) to prove UTF-8 handling end-to-end and prevent missed call sites. Could you add that please?

@Ghraven
Copy link
Copy Markdown
Contributor Author

Ghraven commented May 29, 2026

Thank you for the suggestion. I added a focused regression test that writes non-ASCII content through LocalBackend annotations, graph.json, and children.jsonl, then reads it back through the backend APIs.\n\nVerified locally with:\n\n\npython -m pytest tests\\tracking\\test_local_tracking_client.py::test_local_backend_reads_utf8_annotations_graph_and_children -q\npython -m py_compile burr\\tracking\\server\\backend.py tests\\tracking\\test_local_tracking_client.py\ngit diff --check\n\n\nThe focused pytest passes; the remaining warnings are existing Pydantic deprecation warnings in the backend path.

Harden the local-backend UTF-8 regression test so it catches missed
call sites independently of the host's default encoding. A round-trip
assertion alone passes on a UTF-8 host even if a text open drops
encoding="utf-8". Wrap the backend's aiofiles.open and assert every
text-mode (non-binary) open is explicitly UTF-8; binary opens (mode
"rb") stay exempt. Also exercise non-ASCII (incl. CJK) content through
annotations, graph.json and children.jsonl end to end.

Signed-off-by: André Ahlert <andre@aex.partners>
Copy link
Copy Markdown
Collaborator

@andreahlert andreahlert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for adding the test and catching the graph.json spot.

I pushed one small follow-up commit (b142647) to harden the regression test. The round-trip assertion alone passes on a UTF-8 default host even if a text open drops encoding="utf-8", so it would not actually catch a future missed call site on CI. The follow-up wraps the backend's aiofiles.open and asserts every text-mode open is explicitly UTF-8 (binary rb opens stay exempt), which fails deterministically regardless of the host locale. Verified it goes red when the encoding arg is removed from any read.

Merging once CI is green.

The new regression test imports burr.tracking.server.backend, which
pulls aiofiles (tracking-server extra). The base test job only installed
tracking-client, so the top-level server import broke collection and
failed all test (3.9-3.12) jobs.

- Make the server backend/schema imports lazy and gate the test with
  pytest.importorskip("aiofiles"), matching the tracking-server-s3 test
  idiom, so the module still collects with only tracking-client present.
- Add the tracking-server extra to the base test job so aiofiles is
  available and the regression test actually runs across the matrix
  instead of being skipped everywhere.

Signed-off-by: André Ahlert <andre@aex.partners>
@github-actions github-actions Bot added the area/ci Workflows, build, release scripts label May 29, 2026
@andreahlert andreahlert merged commit 855bea6 into apache:main May 29, 2026
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/ci Workflows, build, release scripts area/tracking Telemetry, tracing, OpenTelemetry kind/bug Something is broken

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants