On Monday, 27 December 2021 at 07:12:24 UTC, rempas wrote:
> I don't understand that. Based on your calculations, the results should have been different. Also, how are the numbers fixed? Like you said, the number of bytes per character is not always the same for every encoding. Even if it were fixed, that would mean 2 bytes for each UTF-16 character and 4 bytes for each UTF-32 character, so the numbers still don't make sense to me. The "length" property should have been the same for every encoding, or at least for UTF-16 and UTF-32. So is the size of every character fixed or not?
Your string is represented by 8 code points. The number of code units needed to store them in memory depends on the encoding. D supports working with 3 different encodings (the Unicode standard defines more than these 3):
string utf8s = "Hello 😂\n";
wstring utf16s = "Hello 😂\n"w;
dstring utf32s = "Hello 😂\n"d;
Here is the canonical Unicode representation of your string:
H      e      l      l      o      (space) 😂      \n
U+0048 U+0065 U+006C U+006C U+006F U+0020  U+1F602 U+000A
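If you want to check those code points yourself, here is a small sketch (assuming a recent D compiler and Phobos): a foreach loop with a dchar loop variable decodes the string into code points, whatever its encoding in memory is.

import std.stdio : writefln;

void main()
{
    // foreach with a dchar loop variable decodes the UTF-8 literal
    // into Unicode code points, one per iteration.
    foreach (dchar c; "Hello 😂\n")
        writefln("U+%04X", cast(uint) c); // prints U+0048 ... U+1F602 U+000A
}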
Let's see how these 3 variables are represented in memory:
utf8s : 48 65 6C 6C 6F 20 F0 9F 98 82 0A
        11 chars in memory, using 11 bytes
utf16s: 0048 0065 006C 006C 006F 0020 D83D DE02 000A
        9 wchars in memory, using 18 bytes
utf32s: 00000048 00000065 0000006C 0000006C 0000006F 00000020 0001F602 0000000A
        8 dchars in memory, using 32 bytes
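You can verify these counts directly, since .length on a D string counts code units, not code points. A minimal sketch, using the same variables as above:

void main()
{
    string  utf8s  = "Hello 😂\n";
    wstring utf16s = "Hello 😂\n"w;
    dstring utf32s = "Hello 😂\n"d;

    // .length counts code units of the respective encoding,
    // not code points or characters.
    assert(utf8s.length  == 11); // 11 chars  (1 byte each)  -> 11 bytes
    assert(utf16s.length ==  9); //  9 wchars (2 bytes each) -> 18 bytes
    assert(utf32s.length ==  8); //  8 dchars (4 bytes each) -> 32 bytes
}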
As you can see, UTF-8 is generally the most compact form, which is why it is the preferred encoding for Unicode.
UTF-16 is supported mainly for legacy reasons: it is used in the Windows API and also internally in Java.
UTF-32 has one advantage: a 1-to-1 mapping between code points and array indices. In practice this is not much of an advantage, because code points and characters are distinct concepts. UTF-32 uses a lot of memory for practically no benefit (when you read on this forum about D's big auto-decoding mistake, this is what it refers to).
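Roughly, the auto-decoding issue looks like this (a small sketch using std.range.walkLength): most Phobos range algorithms treat a narrow string as a range of dchar and silently decode it, so iterating counts code points while .length counts code units.

import std.range : walkLength;

void main()
{
    string s = "Hello 😂\n";

    // .length is the number of UTF-8 code units (bytes)...
    assert(s.length == 11);

    // ...but range algorithms auto-decode the string to dchar,
    // so walkLength counts code points instead.
    assert(s.walkLength == 8);
}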