Strip Unicode Cf characters in PrintableString#4593
Strip Unicode Cf characters in PrintableString#4593tnull wants to merge 1 commit intolightningdevkit:mainfrom
Cf characters in PrintableString#4593Conversation
`PrintableString` is the sanitiser LDK uses to render untrusted strings
(node aliases, BOLT-12 invoice / offer text, `UntrustedString`, LSPS
messages, `lightning-invoice` descriptions) to logs and UI. It only
replaced `char::is_control` matches (Unicode general category `Cc`)
with U+FFFD, leaving the entire `Cf` (Format) category untouched.
That is the exact category covering the bidirectional override /
isolate codepoints (U+202A..U+202E, U+2066..U+2069) and zero-width
characters (U+200B..U+200D, U+FEFF) behind the "Trojan Source" attack
family (CVE-2021-42574): a peer can set its alias / invoice description
/ offer fields to e.g. `safe\u{202E}cipsxe.exe`, which previously
passed through verbatim while a human reader sees `safeexe.cips` —
defeating the threat model `PrintableString` exists to defend against.
Replace `Cf` codepoints alongside `Cc` ones. The `Cf` ranges are
inlined as a `matches!` table sourced from Unicode 16.0 to keep the
change `no_std`-friendly with no new dependencies.
Co-Authored-By: HAL 9000
|
I've assigned @valentinewallace as a reviewer! |
| | 0x110BD | ||
| | 0x110CD | ||
| | 0x13430..=0x13440 | ||
| | 0x1BCA0..=0x1BCA3 |
There was a problem hiding this comment.
Bug: Off-by-one — U+13440 EGYPTIAN HIEROGLYPH MIRROR HORIZONTALLY is general category Mn (Mark, Nonspacing) in Unicode 16.0, not Cf. The correct upper bound for the Egyptian Hieroglyph Format Controls Cf range is U+1343F.
This causes a valid combining mark to be incorrectly replaced with U+FFFD.
| | 0x1BCA0..=0x1BCA3 | |
| | 0x13430..=0x1343F |
Review SummaryInline comments posted
Minor observations (not bugs, no inline comment needed)
|
| use core::fmt::Write; | ||
| for c in self.0.chars() { | ||
| let c = if c.is_control() { core::char::REPLACEMENT_CHARACTER } else { c }; | ||
| let c = if c.is_control() || is_format_char(c) { |
There was a problem hiding this comment.
Could it make sense to simplify this to just use the replacement character if !is_alphanumeric && !is_whitespace, or similar?
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4593 +/- ##
==========================================
- Coverage 86.84% 86.24% -0.60%
==========================================
Files 161 159 -2
Lines 109260 109192 -68
Branches 109260 109192 -68
==========================================
- Hits 94882 94170 -712
- Misses 11797 12411 +614
- Partials 2581 2611 +30
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
PrintableStringis the sanitiser LDK uses to render untrusted strings (node aliases, BOLT-12 invoice / offer text,UntrustedString, LSPS messages,lightning-invoicedescriptions) to logs and UI. It only replacedchar::is_controlmatches (Unicode general categoryCc) with U+FFFD, leaving the entireCf(Format) category untouched.That is the exact category covering the bidirectional override / isolate codepoints (U+202A..U+202E, U+2066..U+2069) and zero-width characters (U+200B..U+200D, U+FEFF) behind the "Trojan Source" attack family (CVE-2021-42574): a peer can set its alias / invoice description / offer fields to e.g.
safe\u{202E}cipsxe.exe, which previously passed through verbatim while a human reader seessafeexe.cips— defeating the threat modelPrintableStringexists to defend against.Replace
Cfcodepoints alongsideCcones. TheCfranges are inlined as amatches!table sourced from Unicode 16.0 to keep the changeno_std-friendly with no new dependencies.Co-Authored-By: HAL 9000