Skip to content

fix(fast-build): properly decode multi-byte UTF-8 in JSON string parser#7410

Merged
janechu merged 3 commits intomainfrom
users/janechu/fix-utf8-json-string-parser
Apr 9, 2026
Merged

fix(fast-build): properly decode multi-byte UTF-8 in JSON string parser#7410
janechu merged 3 commits intomainfrom
users/janechu/fix-utf8-json-string-parser

Conversation

@janechu
Copy link
Copy Markdown
Collaborator

@janechu janechu commented Apr 9, 2026

Pull Request

📖 Description

Fixes a bug in the hand-rolled JSON string parser (json.rs) where non-ASCII characters were corrupted during parsing.

The parse_string function advanced through the input byte-by-byte and used b as char for unescaped characters. For ASCII bytes this is correct, but for multi-byte UTF-8 sequences it casts each byte independently to a Unicode scalar, producing garbage. For example, ⭐ (U+2B50, UTF-8: 0xE2 0xAD 0x90) was rendered as â (U+00E2) followed by two invisible control characters.

The fix determines the byte-length of each UTF-8 sequence from the leading byte, reads the full sequence, and decodes it with std::str::from_utf8. ASCII bytes (< 0x80) continue to use the fast b as char path.

This explains the â SELECTED corruption observed in the observer-map fixture output (issue seen in #7388).

📑 Test Plan

New unit tests in json.rs:

  • test_parse_string_with_emoji — ⭐ round-trips correctly
  • test_parse_string_with_multibyte_chars — 2-byte (é), 3-byte (✓), and 4-byte (🎉) sequences
  • test_parse_string_emoji_in_object — emoji values inside a JSON object

New integration tests in tests/bindings.rs:

  • test_binding_emoji_from_state — emoji from JSON state renders correctly in a binding
  • test_binding_emoji_literal_in_template — emoji in template literal text is preserved verbatim
  • test_binding_multibyte_chars — general multi-byte character round-trip

All existing tests pass (cargo test).

✅ Checklist

General

  • I have included a change request file using $ npm run change
  • I have added tests for my changes.
  • I have tested my changes.
  • I have updated the project documentation to reflect my changes.
  • I have read the CONTRIBUTING documentation and followed the standards for this project.

@janechu janechu marked this pull request as ready for review April 9, 2026 05:11
@janechu janechu requested a review from Copilot April 9, 2026 05:12
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes crates/microsoft-fast-build’s hand-rolled JSON string parsing so that non-ASCII literal characters in JSON strings (e.g., emoji, accented letters) are decoded as proper multi-byte UTF-8 sequences instead of being corrupted by per-byte as char casting. This improves correctness for SSR/state-driven rendering paths that consume JSON.

Changes:

  • Update parse_string to decode multi-byte UTF-8 sequences as a whole (keeping a fast ASCII path).
  • Add unit tests in json.rs covering emoji and 2/3/4-byte UTF-8 sequences in strings and objects.
  • Add integration tests in tests/bindings.rs to ensure multi-byte characters round-trip through bindings/rendering.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
crates/microsoft-fast-build/src/json.rs Fix UTF-8 decoding in parse_string and add targeted unit tests.
crates/microsoft-fast-build/tests/bindings.rs Add integration tests to verify correct rendering/binding with non-ASCII content.
crates/microsoft-fast-build/DESIGN.md Document the updated parse_string behavior for literal non-ASCII UTF-8.

Replace manual lead-byte range inference with from_utf8 on the
remaining slice + chars().next(), then advance by ch.len_utf8().

This avoids misclassifying continuation bytes (0x80..0xBF) as 2-byte
lead bytes, and makes malformed-input behaviour unambiguous.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@janechu janechu merged commit 0f094e7 into main Apr 9, 2026
14 checks passed
@janechu janechu deleted the users/janechu/fix-utf8-json-string-parser branch April 9, 2026 05:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants