fix(runtime): replace unpaired surrogates with U+FFFD in UTF-16 TextD…#5304
fix(runtime): replace unpaired surrogates with U+FFFD in UTF-16 TextD…#5304tkshsbcue wants to merge 2 commits intoboa-dev:mainfrom
Conversation
…ecoder The UTF-16 TextDecoder (both LE and BE) was passing raw code units directly to JsString, preserving unpaired surrogates instead of replacing them with U+FFFD as required by the WHATWG Encoding Standard. Route both decoders through a shared `decode_utf16_units` helper that uses `std::char::decode_utf16` with replacement, and simplify the UTF-16 BE decoder to borrow instead of taking ownership. Closes boa-dev#4612
Test262 conformance changes
Tested main commit: |
| const INVALID_UTF16_CASES: &[(&[u16], &[u16])] = &[ | ||
| // Lone high surrogate in the middle | ||
| ( | ||
| &[0x0061, 0x0062, 0xD800, 0x0077, 0x0078], | ||
| &[0x0061, 0x0062, 0xFFFD, 0x0077, 0x0078], | ||
| ), | ||
| // Lone high surrogate only | ||
| (&[0xD800], &[0xFFFD]), | ||
| // Two consecutive high surrogates | ||
| (&[0xD800, 0xD800], &[0xFFFD, 0xFFFD]), | ||
| // Lone low surrogate in the middle | ||
| ( | ||
| &[0x0061, 0x0062, 0xDFFF, 0x0077, 0x0078], | ||
| &[0x0061, 0x0062, 0xFFFD, 0x0077, 0x0078], | ||
| ), | ||
| // Low surrogate followed by high surrogate (wrong order) | ||
| (&[0xDFFF, 0xD800], &[0xFFFD, 0xFFFD]), | ||
| ]; | ||
|
|
||
| #[test] | ||
| fn decoder_utf16le_replaces_unpaired_surrogates() { | ||
| for (invalid, replaced) in INVALID_UTF16_CASES { | ||
| let mut input_bytes = Vec::with_capacity(invalid.len() * 2); | ||
| for &code_unit in *invalid { | ||
| input_bytes.extend_from_slice(&code_unit.to_le_bytes()); | ||
| } | ||
|
|
||
| let result = encodings::utf16le::decode(&input_bytes, false); | ||
| let expected = JsString::from(*replaced); | ||
| assert_eq!(result, expected, "utf16le failed for input {invalid:?}"); | ||
| } | ||
| } | ||
|
|
||
| #[test] | ||
| fn decoder_utf16be_replaces_unpaired_surrogates() { | ||
| for (invalid, replaced) in INVALID_UTF16_CASES { | ||
| let mut input_bytes = Vec::with_capacity(invalid.len() * 2); | ||
| for &code_unit in *invalid { | ||
| input_bytes.extend_from_slice(&code_unit.to_be_bytes()); | ||
| } | ||
|
|
||
| let result = encodings::utf16be::decode(&input_bytes, false); | ||
| let expected = JsString::from(*replaced); | ||
| assert_eq!(result, expected, "utf16be failed for input {invalid:?}"); | ||
| } | ||
| } | ||
|
|
||
| #[test] | ||
| fn decoder_utf16le_dangling_byte_produces_replacement() { | ||
| // Odd-length input: the last byte is truncated and replaced with U+FFFD | ||
| let input: &[u8] = &[0x41, 0x00, 0x42]; // 'A' (LE) + dangling byte | ||
| let result = encodings::utf16le::decode(input, false); | ||
| let expected = JsString::from(&[0x0041u16, 0xFFFD][..]); | ||
| assert_eq!(result, expected); | ||
| } | ||
|
|
||
| #[test] | ||
| fn decoder_utf16be_dangling_byte_produces_replacement() { | ||
| let input: &[u8] = &[0x00, 0x41, 0x42]; // 'A' (BE) + dangling byte | ||
| let result = encodings::utf16be::decode(input, false); | ||
| let expected = JsString::from(&[0x0041u16, 0xFFFD][..]); | ||
| assert_eq!(result, expected); | ||
| } |
There was a problem hiding this comment.
Do we need these tests? Might be better to enable the utf16 decoding tests from the wpt suite instead, since those basically compile to the same kind of tests
There was a problem hiding this comment.
oh yes, I've pushed a follow-up commit that enables the upstream WPT suite instead. Specifically:
->Enabled encoding/textdecoder-utf16-surrogates.any.js in tests/wpt/src/lib.rs, which covers both the replacement-mode cases and the fatal: true cases.
->Since the WPT test also exercises fatal, I implemented the fatal option on TextDecoder: added it to TextDecoderOptions, stored it on the struct, exposed the fatal
->getter, and threaded it through the decode helpers so decode() returns a TypeError on invalid input instead of inserting U+FFFD.
->Dropped the custom INVALID_UTF16_CASES Rust tests since the WPT file now covers the same ground.
…or UTF-16 surrogates Addresses review feedback on boa-dev#5304: replace hand-written UTF-16 surrogate tests with the upstream WPT suite. The WPT `textdecoder-utf16-surrogates.any.js` test exercises both the replacement path (already fixed in 70c9ccb) and the fatal mode, which requires the decoder to throw a TypeError on invalid input. Wire the fatal option through TextDecoderOptions and the decode helpers, and enable the WPT file in the encoding test list. - Add `fatal: Option<bool>` to TextDecoderOptions and a `fatal` field + getter on TextDecoder - Change encoding decode functions to return Result<JsString, ()> and propagate a TypeError from TextDecoder::decode when fatal is set - Drop the custom INVALID_UTF16_CASES tests now covered by WPT
fix(runtime): replace unpaired surrogates with U+FFFD in UTF-16 TextDecoder
Summary
Fixes #4612.
The UTF-16
TextDecoder(both LE and BE) passed raw code units directly toJsString, preserving unpaired surrogates (e.g.\ud800) instead of replacing them with U+FFFD as required by the WHATWG Encoding Standard.Changes
decode_utf16_units()helper usingstd::char::decode_utf16with.unwrap_or('\u{FFFD}')to produce well-formed outpututf16le::decodeandutf16be::decodethrough this helperutf16be::decodeto borrow (&[u8]) instead of taking ownership (Vec<u8>)Test plan
decoder_utf16le_replaces_unpaired_surrogates— lone high/low surrogates, consecutive, reversed pairsdecoder_utf16be_replaces_unpaired_surrogates— same for big-endiandecoder_utf16le_dangling_byte_produces_replacement— odd-length inputdecoder_utf16be_dangling_byte_produces_replacement— odd-length input (BE)cargo fmtandcargo clippyclean