Skip to content

Make V8 string to NSString conversions UTF-16 faithful#392

Open
adrian-niculescu wants to merge 1 commit into
NativeScript:mainfrom
adrian-niculescu:fix/utf16-string-conversion-fidelity
Open

Make V8 string to NSString conversions UTF-16 faithful#392
adrian-niculescu wants to merge 1 commit into
NativeScript:mainfrom
adrian-niculescu:fix/utf16-string-conversion-fidelity

Conversation

@adrian-niculescu

@adrian-niculescu adrian-niculescu commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

V8 strings are UTF-16, but a few of the V8 <-> NSString bridge points round-trip through UTF-8 first, and that loses data in two ways:

  • Lone surrogates get corrupted. A JS string can hold an unpaired surrogate (for example "\uD834"), which has no valid UTF-8 encoding, so the round-trip replaces it with U+FFFD. This affects every site that goes through the UTF-8 path, including tns::ToUtf16String itself.
  • Embedded NUL gets truncated at the sites that use C-string APIs. DictionaryAdapter builds strings with [NSString stringWithUTF8String:result.c_str()], and a couple of reverse-direction spots fold [obj UTF8String] into a std::string, both of which stop at the first \0.

The fix is to keep everything in UTF-16:

  • tns::ToUtf16String now reads the V8 string's native two-byte buffer directly via v8::String::Value instead of ToString + std::codecvt_utf8_utf16. As a bonus that drops the deprecated codecvt (there was already a // FIXME on it) and is faster.
  • The JS -> NSString sites in DictionaryAdapter, Interop and ArgConverter now build the NSString with stringWithCharacters:length: off ToUtf16String, instead of ToNSString / stringWithUTF8String:.
  • The two NSString -> V8 spots (DictionaryAdapter objectForKey: and the indexed getter in ArgConverter) pass the NSString straight to ToV8String instead of going through a UTF8String C string. ToV8String(NSString*) already exists and uses an explicit byte length, so it is NUL safe.

I left tns::ToVector alone on purpose. Its input is already a UTF-8 std::string, so it is a different conversion and not part of this bug.

Test: added a TestRunner case that bridges a JS string with a lone high surrogate to an NSString and checks the code unit survives. Without the fix it comes back as U+FFFD.

Formatting note: I matched each file's existing indentation (DictionaryAdapter.mm is 4-space, the others are 2-space) and did not run clang-format, to keep the diff to just the real changes.

We have shipped this in our downstream fork for a while.

Summary by CodeRabbit

  • Bug Fixes

    • Improved JavaScript-to-native string interop by using UTF-16 consistently, improving correctness for edge-case Unicode values (including surrogate pairs and lone surrogates).
    • Updated dictionary/map key and object property enumeration to preserve string values accurately during interop.
  • Tests

    • Added Jasmine coverage ensuring lone high (\uD834) and lone low (\uDC00) surrogate strings are preserved by verifying UTF-16 length and code unit values.

@coderabbitai

coderabbitai Bot commented Jun 13, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: ac4f49b9-3444-474a-8c05-5d5086d13e23

📥 Commits

Reviewing files that changed from the base of the PR and between 705533c and 8cc59bf.

📒 Files selected for processing (5)
  • NativeScript/runtime/ArgConverter.mm
  • NativeScript/runtime/DictionaryAdapter.mm
  • NativeScript/runtime/Helpers.mm
  • NativeScript/runtime/Interop.mm
  • TestRunner/app/tests/ApiTests.js
🚧 Files skipped from review as they are similar to previous changes (5)
  • NativeScript/runtime/Helpers.mm
  • TestRunner/app/tests/ApiTests.js
  • NativeScript/runtime/Interop.mm
  • NativeScript/runtime/ArgConverter.mm
  • NativeScript/runtime/DictionaryAdapter.mm

📝 Walkthrough

Walkthrough

This PR updates NativeScript iOS string bridging to use UTF-16 data directly in helper, adapter, and interop paths. It also adds tests confirming that JS strings with lone high and low surrogates preserve their UTF-16 code units across the V8-Objective-C boundary.

Changes

UTF-16 Bridging Conversion

Layer / File(s) Summary
UTF-16 helper core implementation
NativeScript/runtime/Helpers.mm
tns::ToUtf16String now reads V8's native UTF-16 buffer directly via v8::String::Value, handling empty values and StringObject wrappers, and removes the codecvt-based conversion path.
ArgConverter V8 to Objective-C string conversion
NativeScript/runtime/ArgConverter.mm
SetValue builds NSString from UTF-16 via the updated helper; IndexedPropertyGetterCallback passes NSString* directly to tns::ToV8String.
DictionaryAdapter key enumeration and access
NativeScript/runtime/DictionaryAdapter.mm
Map and object key enumeration, plus direct key lookup, switch to UTF-16-based NSString construction and direct NSString* forwarding.
Interop WriteValue and ToObject conversion
NativeScript/runtime/Interop.mm
WriteValue and ToObject replace tns::ToNSString with direct UTF-16 construction for string-like values.
Surrogate code unit preservation tests
TestRunner/app/tests/ApiTests.js
New tests verify that lone high and low surrogates survive bridging into NSString as individual UTF-16 code units.

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped through strings both old and new,
With UTF-16 shining bright and true.
No UTF-8 detours in sight today,
Lone surrogates hop safe along the way.
Hooray for code units, crisp and neat! 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely describes the main change: updating V8 string to NSString conversions to preserve UTF-16 fidelity instead of using lossy UTF-8 round-trips.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@adrian-niculescu adrian-niculescu force-pushed the fix/utf16-string-conversion-fidelity branch from 16439b5 to a37e7f2 Compare June 13, 2026 22:28

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
TestRunner/app/tests/ApiTests.js (1)

15-22: ⚡ Quick win

Test description in PR objectives claims "round trip" but implementation is one-way.

The PR objectives state this test verifies "a JS lone high surrogate survives the JS → NSString → JS round trip," but the test only converts JS → NSString and checks the NSString properties. To truly verify a round trip, the test should convert the NSString back to a JS string and assert it still equals "\uD834".

The current assertions (ns.length === 1 and UTF-8 byte length === 0) do prove the NSString preserves the lone surrogate internally, but a complete round-trip verification would strengthen confidence in both directions of the bridge.

🔄 Suggested enhancement to complete the round trip
 it("preserves a lone surrogate when bridging a JS string to NSString", function () {
     // A lone high surrogate is valid in a JS string but has no UTF-8 encoding.
     // Faithful UTF-16 bridging keeps it, so the NSString reports 0 UTF-8 bytes.
     // A UTF-8 round-trip would have replaced it with U+FFFD, which is 3 UTF-8 bytes.
-    var ns = NSString.stringWithString("\uD834");
+    var original = "\uD834";
+    var ns = NSString.stringWithString(original);
     expect(ns.length).toBe(1);
     expect(ns.lengthOfBytesUsingEncoding(NSUTF8StringEncoding)).toBe(0);
+    
+    // Complete the round trip: NSString → JS
+    var roundTripped = ns.toString();
+    expect(roundTripped).toBe(original);
+    expect(roundTripped.charCodeAt(0)).toBe(0xD834);
 });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@TestRunner/app/tests/ApiTests.js` around lines 15 - 22, The test function
"preserves a lone surrogate when bridging a JS string to NSString" verifies the
first direction of the round trip (JS string to NSString) but does not complete
the round trip by converting the NSString back to a JS string. After the
existing assertions on ns.length and ns.lengthOfBytesUsingEncoding, add an
assertion that converts the NSString (stored in variable ns) back to a JS string
and verifies it equals the original lone surrogate character "\uD834" to fully
validate the round-trip conversion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@TestRunner/app/tests/ApiTests.js`:
- Around line 15-22: The test function "preserves a lone surrogate when bridging
a JS string to NSString" verifies the first direction of the round trip (JS
string to NSString) but does not complete the round trip by converting the
NSString back to a JS string. After the existing assertions on ns.length and
ns.lengthOfBytesUsingEncoding, add an assertion that converts the NSString
(stored in variable ns) back to a JS string and verifies it equals the original
lone surrogate character "\uD834" to fully validate the round-trip conversion.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: a85c5090-e1e6-4cf0-b02b-428ae8103d44

📥 Commits

Reviewing files that changed from the base of the PR and between 16439b5 and a37e7f2.

📒 Files selected for processing (5)
  • NativeScript/runtime/ArgConverter.mm
  • NativeScript/runtime/DictionaryAdapter.mm
  • NativeScript/runtime/Helpers.mm
  • NativeScript/runtime/Interop.mm
  • TestRunner/app/tests/ApiTests.js
🚧 Files skipped from review as they are similar to previous changes (4)
  • NativeScript/runtime/Helpers.mm
  • NativeScript/runtime/Interop.mm
  • NativeScript/runtime/ArgConverter.mm
  • NativeScript/runtime/DictionaryAdapter.mm

@adrian-niculescu

Copy link
Copy Markdown
Contributor Author

@coderabbitai thanks, but I'm going to skip this suggestion, it would actually break the test.

Completing the round trip with ns.toString() would fail: the reverse NSString to JS direction still goes through UTF-8 (ToV8String(NSString*) uses [value UTF8String]), so it turns the lone surrogate back into U+FFFD. expect(ns.toString()).toBe("\uD834") and charCodeAt(0) === 0xD834 would both fail. I confirmed that by running the TestRunner suite locally.

That reverse path is intentionally not touched for surrogates in this PR. The reverse-direction edits here only fix NUL truncation (passing an explicit byte length instead of folding through a C string), not surrogate fidelity, so there is no faithful NSString to JS path to round-trip a lone surrogate through. That is exactly why the test observes the NSString with a numeric native query (lengthOfBytesUsingEncoding: is 0 for a faithfully kept lone surrogate, 3 once it has become U+FFFD) instead of converting back to a JS string.

The bidirectional NUL case is already covered by the existing Marshals NSString with null character test in NSStringTests.js.

@coderabbitai

coderabbitai Bot commented Jun 13, 2026

Copy link
Copy Markdown

Tip

For best results, initiate chat on the files or code changes.

@adrian-niculescu, thanks for the thorough explanation — that makes perfect sense. The ToV8String(NSString*) path still uses [value UTF8String] which mangles lone surrogates back to U+FFFD, so a round-trip assertion via ns.toString() would indeed be a false expectation at this stage. Using lengthOfBytesUsingEncoding(NSUTF8StringEncoding) === 0 as a numeric native proxy is a clean way to confirm the surrogate was kept intact without attempting the broken reverse path. I'll withdraw my suggestion.

(╯°□°)╯ 🐇 ← gently backing away from that suggestion


✏️ Learnings added
Learnt from: adrian-niculescu
URL: https://github.com/NativeScript/ios/pull/392

Timestamp: 2026-06-13T22:37:15.839Z
Learning: In NativeScript iOS (NativeScript/ios), the `ToV8String(NSString*)` function in `Helpers.mm` still goes through `[value UTF8String]`, which means lone (unpaired) UTF-16 surrogates get corrupted to U+FFFD on the NSString→JS direction. PR `#392` intentionally fixes only the JS→NSString direction (surrogate fidelity) and the NUL-truncation issue on both directions, but does NOT fix NSString→JS surrogate fidelity. Therefore, a test for a lone surrogate round-trip through NSString cannot assert `ns.toString() === "\uD834"` — it uses `ns.lengthOfBytesUsingEncoding(NSUTF8StringEncoding) === 0` as a native numeric proxy instead (faithfully stored lone surrogate yields 0 UTF-8 bytes; a corrupted U+FFFD would yield 3).

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

@adrian-niculescu adrian-niculescu force-pushed the fix/utf16-string-conversion-fidelity branch from a37e7f2 to 705533c Compare June 13, 2026 23:28

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
TestRunner/app/tests/ApiTests.js (1)

15-31: ⚡ Quick win

Consider adding a lone low surrogate test case for completeness.

The current test verifies a lone high surrogate (U+D834, range 0xD800-0xDBFF). Adding a similar test for a lone low surrogate (e.g., "\uDC00", range 0xDC00-0xDFFF) would strengthen coverage of the UTF-16 fidelity fix, since both surrogate ranges should be preserved but have different bit patterns.

📝 Optional additional test case
it("preserves a lone low surrogate when bridging a JS string to NSString", function () {
    var ns = NSString.stringWithString("\uDC00");
    expect(ns.length).toBe(1);
    
    var buffer = interop.alloc(interop.sizeof(interop.types.uint16));
    ns.getCharactersRange(buffer, NSMakeRange(0, 1));
    var codeUnit = new interop.Reference(interop.types.uint16, buffer).value;
    interop.free(buffer);
    
    expect(codeUnit).toBe(0xDC00);
});
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@TestRunner/app/tests/ApiTests.js` around lines 15 - 31, The test suite
currently only covers high surrogates (U+D834) but lacks coverage for low
surrogates in the UTF-16 bridging behavior. Add a new test case after the
existing "preserves a lone surrogate when bridging a JS string to NSString" test
that follows the same pattern but uses a low surrogate code unit (U+DC00 instead
of U+D834) to verify that lone low surrogates are also preserved correctly when
bridging JS strings to NSString. The new test should verify that the bridged
NSString preserves the low surrogate value as 0xDC00.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@TestRunner/app/tests/ApiTests.js`:
- Around line 15-31: The test suite currently only covers high surrogates
(U+D834) but lacks coverage for low surrogates in the UTF-16 bridging behavior.
Add a new test case after the existing "preserves a lone surrogate when bridging
a JS string to NSString" test that follows the same pattern but uses a low
surrogate code unit (U+DC00 instead of U+D834) to verify that lone low
surrogates are also preserved correctly when bridging JS strings to NSString.
The new test should verify that the bridged NSString preserves the low surrogate
value as 0xDC00.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e750c080-3819-4ab5-b5d6-c546b25801a3

📥 Commits

Reviewing files that changed from the base of the PR and between a37e7f2 and 705533c.

📒 Files selected for processing (5)
  • NativeScript/runtime/ArgConverter.mm
  • NativeScript/runtime/DictionaryAdapter.mm
  • NativeScript/runtime/Helpers.mm
  • NativeScript/runtime/Interop.mm
  • TestRunner/app/tests/ApiTests.js
🚧 Files skipped from review as they are similar to previous changes (4)
  • NativeScript/runtime/Interop.mm
  • NativeScript/runtime/Helpers.mm
  • NativeScript/runtime/DictionaryAdapter.mm
  • NativeScript/runtime/ArgConverter.mm

V8 strings are UTF-16, but several bridge points round-tripped them through
UTF-8 before building an NSString (and one reverse direction did too). That
path corrupts lone surrogates, which get replaced with U+FFFD, and where it
used C-string APIs (stringWithUTF8String:, [NSString UTF8String] folded into
a std::string) it truncated at an embedded NUL.

Rework tns::ToUtf16String to read the V8 string's native two-byte buffer
directly, which also drops the deprecated std::codecvt_utf8_utf16. Switch the
DictionaryAdapter, Interop and ArgConverter string sites to ToUtf16String +
stringWithCharacters:length:, and pass NSString straight to ToV8String instead
of going through a UTF8String C string.

Add TestRunner cases asserting lone surrogates survive the JS to NSString
bridge: each reads the bridged string's first UTF-16 code unit straight out of
its buffer and checks it is unchanged (high U+D834 and low U+DC00), whereas the
old UTF-8 round trip would have turned either into U+FFFD.
@adrian-niculescu adrian-niculescu force-pushed the fix/utf16-string-conversion-fidelity branch from 705533c to 8cc59bf Compare June 13, 2026 23:41
@adrian-niculescu

adrian-niculescu commented Jun 13, 2026

Copy link
Copy Markdown
Contributor Author

🧹 Nitpick comments (1)

TestRunner/app/tests/ApiTests.js (1)> 15-31: ⚡ Quick win

Consider adding a lone low surrogate test case for completeness.
The current test verifies a lone high surrogate (U+D834, range 0xD800-0xDBFF). Adding a similar test for a lone low surrogate (e.g., "\uDC00", range 0xDC00-0xDFFF) would strengthen coverage of the UTF-16 fidelity fix, since both surrogate ranges should be preserved but have different bit patterns.

📝 Optional additional test case

it("preserves a lone low surrogate when bridging a JS string to NSString", function () {
    var ns = NSString.stringWithString("\uDC00");
    expect(ns.length).toBe(1);
    
    var buffer = interop.alloc(interop.sizeof(interop.types.uint16));
    ns.getCharactersRange(buffer, NSMakeRange(0, 1));
    var codeUnit = new interop.Reference(interop.types.uint16, buffer).value;
    interop.free(buffer);
    
    expect(codeUnit).toBe(0xDC00);
});

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@TestRunner/app/tests/ApiTests.js` around lines 15 - 31, The test suite
currently only covers high surrogates (U+D834) but lacks coverage for low
surrogates in the UTF-16 bridging behavior. Add a new test case after the
existing "preserves a lone surrogate when bridging a JS string to NSString" test
that follows the same pattern but uses a low surrogate code unit (U+DC00 instead
of U+D834) to verify that lone low surrogates are also preserved correctly when
bridging JS strings to NSString. The new test should verify that the bridged
NSString preserves the low surrogate value as 0xDC00.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@TestRunner/app/tests/ApiTests.js`:
- Around line 15-31: The test suite currently only covers high surrogates
(U+D834) but lacks coverage for low surrogates in the UTF-16 bridging behavior.
Add a new test case after the existing "preserves a lone surrogate when bridging
a JS string to NSString" test that follows the same pattern but uses a low
surrogate code unit (U+DC00 instead of U+D834) to verify that lone low
surrogates are also preserved correctly when bridging JS strings to NSString.
The new test should verify that the bridged NSString preserves the low surrogate
value as 0xDC00.

ℹ️ Review info

@coderabbitai, re the low surrogate nitpick: added a lone low surrogate case (U+DC00) next to the high surrogate one in 8cc59bf, and renamed the original so the pair reads clearly. Both read the bridged code unit back as a number via getCharacters:range: instead of a UTF-8 conversion, since converting a lone surrogate to UTF-8 hangs the test runner on CI's iOS 17.5 simulator.

@coderabbitai

coderabbitai Bot commented Jun 13, 2026

Copy link
Copy Markdown

Tip

For best results, initiate chat on the files or code changes.

@adrian-niculescu, great — both high (\uD834) and low (\uDC00) surrogate cases are now covered, and the renaming makes the pair read cleanly together. The getCharacters:range: / numeric approach is exactly right given the CI constraint you surfaced.

Good to know about the iOS 17.5 simulator hang on UTF-8 lone surrogate conversion — I'll record that for future reference.


✏️ Learnings added
Learnt from: adrian-niculescu
URL: https://github.com/NativeScript/ios/pull/392

Timestamp: 2026-06-13T23:57:12.035Z
Learning: In NativeScript iOS (NativeScript/ios), attempting to convert a lone (unpaired) UTF-16 surrogate to UTF-8 (e.g., via `[NSString UTF8String]` or `lengthOfBytesUsingEncoding:NSUTF8StringEncoding` in a way that triggers actual string conversion) can **hang the test runner** on the iOS 17.5 simulator in CI. Tests that verify lone surrogate preservation (high surrogates U+D800–U+DBFF or low surrogates U+DC00–U+DFFF) must read the code unit back as a numeric value via `getCharacters:range:` into a `uint16` buffer rather than attempting any UTF-8 string conversion. This was discovered while implementing tests for PR `#392`.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


If you found this review helpful, would you consider giving us a shout-out on X?

Thank you for using CodeRabbit!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant