November 30, 2017
On Thursday, November 30, 2017 03:37:37 Walter Bright via Digitalmars-d wrote:
> On 11/30/2017 2:39 AM, Joakim wrote:
> > Java, .NET, Qt, Javascript, and a handful of others use UTF-16 too, some starting off with the earlier UCS-2:
> >
> > https://en.m.wikipedia.org/wiki/UTF-16#Usage
> >
> > Not saying either is better, each has their flaws, just pointing out it's more than just Windows.
>
> I stand corrected.

I get the impression that the stuff that uses UTF-16 is mostly stuff that picked an encoding early on in the Unicode game and thought that they picked one that guaranteed that a code unit would be an entire character. Many of them picked UCS-2 and then switched later to UTF-16, but once they picked a 16-bit encoding, they were kind of stuck.

Others - most notably C/C++ and the *nix world - picked UTF-8 for backwards compatibility, and once it became clear that UCS-2 / UTF-16 wasn't going to cut it for a code unit representing a character, most stuff that went Unicode went UTF-8.

Language-wise, I think that most of the UTF-16 is driven by the fact that Java went with UCS-2 / UTF-16, and C# followed them (both because they were copying Java and because the Win32 API had gone with UCS-2 / UTF-16). So, that's had a lot of influence on folks, though most others have gone with UTF-8 for backwards compatibility and because it typically takes up less space for non-Asian text. But the use of UTF-16 in Windows, Java, and C# does seem to have resulted in some folks thinking that wide characters mean Unicode and narrow characters mean ASCII.

I really wish that everything would just go to UTF-8 and that UTF-16 would die, but that would just break too much code. And if we were willing to do that, I'm sure that we could come up with a better encoding than UTF-8 (e.g. getting rid of Unicode normalization as a thing and never having multiple encodings for the same character), but _that_'s never going to happen.
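
To illustrate the normalization mess: the precomposed and decomposed forms of the same character are different sequences of code points until you explicitly normalize them. A quick sketch using std.uni.normalize (which defaults to NFC):

import std.uni : normalize;

void main()
{
    string precomposed = "\u00E9";   // é as a single, precomposed code point
    string decomposed  = "e\u0301";  // e followed by a combining acute accent

    assert(precomposed != decomposed);            // same character, two encodings
    assert(normalize(decomposed) == precomposed); // equal only after NFC normalization
}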

- Jonathan M Davis

November 30, 2017
On Thursday, 30 November 2017 at 17:40:08 UTC, Jonathan M Davis wrote:
>
> [...] And if you're not dealing with Asian languages, UTF-16 uses up more space than UTF-8.

Not even that in most cases. Only with unstructured text can it happen that UTF-16 needs less space than UTF-8. In most cases, the text is embedded in some sort of markup language (html, odf, docx, tmx, xliff, akoma ntoso, etc.), which tips the balance back in favor of UTF-8.
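
A rough back-of-the-envelope check (the markup snippet is made up):

import std.conv : to;
import std.stdio : writefln;

void main()
{
    // ASCII-heavy markup around a short run of Japanese text.
    string markup = `<p class="greeting">こんにちは世界</p>`;

    size_t utf8Bytes  = markup.length;                           // code units == bytes
    size_t utf16Bytes = markup.to!wstring.length * wchar.sizeof;

    writefln("UTF-8: %s bytes, UTF-16: %s bytes", utf8Bytes, utf16Bytes);
    // The ASCII tags double in size under UTF-16, while the CJK payload only
    // goes from 3 bytes to 2 per character, so UTF-8 usually wins overall.
}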
November 30, 2017
On Thursday, 30 November 2017 at 17:40:08 UTC, Jonathan M Davis wrote:
> English and thus don't as easily hit the cases where their code is wrong. For better or worse, UTF-16 hides it better than UTF-8, but the problem exists in both.
>

To give just an example of what can go wrong with UTF-16: reading a file in UTF-16 and converting it to something else like UTF-8 or UTF-32, reading block by block, and hitting an SMP codepoint exactly at the buffer limit, with the high surrogate at the end of the first buffer and the low surrogate at the start of the next. If you don't think about it, you get 2 invalid characters instead of your nice poop 💩 emoji (emojis are in the SMP and they are more and more frequent).
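
Roughly, the conversion loop has to hold back a trailing high surrogate and prepend it to the next block rather than decoding it on its own. An untested sketch (convertChunk is just a made-up helper):

import std.conv : to;

// Convert UTF-16 arriving in arbitrary blocks to UTF-8 without splitting a
// surrogate pair across two blocks. Illustrative only; real code would also
// reject lone low surrogates and flush any leftover carry at end of input.
string convertChunk(const(wchar)[] block, ref wchar[] carry)
{
    const(wchar)[] work = carry ~ block;  // prepend what the last block left over
    carry = null;

    // A high surrogate at the very end means its low half is in the next
    // block, so keep it for the next call instead of decoding it now.
    if (work.length && work[$ - 1] >= 0xD800 && work[$ - 1] <= 0xDBFF)
    {
        carry = work[$ - 1 .. $].dup;
        work = work[0 .. $ - 1];
    }
    return work.to!string;                // std.conv.to transcodes UTF-16 -> UTF-8
}

void main()
{
    wstring input = "x💩y"w;              // U+1F4A9 is a surrogate pair in UTF-16
    wchar[] carry;
    // Split right between the two surrogates, as a block boundary might.
    string part1 = convertChunk(input[0 .. 2], carry);
    string part2 = convertChunk(input[2 .. $], carry);
    assert(part1 ~ part2 == "x💩y");
}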
November 30, 2017
On Thursday, 30 November 2017 at 17:56:58 UTC, Jonathan M Davis wrote:
> On Thursday, November 30, 2017 03:37:37 Walter Bright via Digitalmars-d wrote:
>> On 11/30/2017 2:39 AM, Joakim wrote:
>> > Java, .NET, Qt, Javascript, and a handful of others use UTF-16 too, some starting off with the earlier UCS-2:
>> >
>> > https://en.m.wikipedia.org/wiki/UTF-16#Usage
>> >
>> > Not saying either is better, each has their flaws, just pointing out it's more than just Windows.
>>
>> I stand corrected.
>
> I get the impression that the stuff that uses UTF-16 is mostly stuff that picked an encoding early on in the Unicode game and thought that they picked one that guaranteed that a code unit would be an entire character.

I don't think that's true though. Haven't you always been able to combine two codepoints into one visual representation (Ä, for example)? To me it's still two characters to look for when going through the string, but the UI or text interpreter might choose to combine them. So in certain domains, such as trying to visually represent the character, yes, a codepoint is not a character, if what you mean by character is the visual representation. But what we are referring to as a character can kind of morph depending on context. When you are running through the data in the algorithm behind the scenes, though, you care about the *information*, and therefore the codepoint. And we are really just having a semantics battle if someone calls that a character.
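
For instance, counting both ways (a quick sketch):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    dstring precomposed = "\u00C4"d;   // Ä as one precomposed code point
    dstring combined    = "A\u0308"d;  // A followed by a combining diaeresis

    assert(precomposed.length == 1);   // one code point
    assert(combined.length == 2);      // two code points...
    assert(precomposed.byGrapheme.walkLength == 1);
    assert(combined.byGrapheme.walkLength == 1);   // ...but one visual character
}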

> Many of them picked UCS-2 and then switched later to UTF-16, but once they picked a 16-bit encoding, they were kind of stuck.
>
> Others - most notably C/C++ and the *nix world - picked UTF-8 for backwards compatibility, and once it became clear that UCS-2 / UTF-16 wasn't going to cut it for a code unit representing a character, most stuff that went Unicode went UTF-8.

That's only because C used ASCII and thus a char was a byte. UTF-8 is in line with this, so literally nothing needs to change to get pretty much the same behavior. It makes sense. With this in mind, it actually might make sense for D to use it.
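
And it really is byte-for-byte the same for ASCII, which is why nothing breaks; a trivial check:

import std.algorithm.searching : all;
import std.string : representation;

void main()
{
    // Plain ASCII is already valid UTF-8, one byte per character, so C-style
    // byte-oriented string handling keeps working unchanged.
    string s = "hello, world";
    assert(s.length == 12);
    assert(s.representation.all!(b => b < 0x80));   // no multi-byte sequences
}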

November 30, 2017
On Thursday, 30 November 2017 at 17:56:58 UTC, Jonathan M Davis wrote:
> On Thursday, November 30, 2017 03:37:37 Walter Bright via Digitalmars-d wrote:
> Language-wise, I think that most of the UTF-16 is driven by the fact that Java went with UCS-2 / UTF-16, and C# followed them (both because they were copying Java and because the Win32 API had gone with UCS-2 / UTF-16). So, that's had a lot of influence on folks, though most others have gone with UTF-8 for backwards compatibility and because it typically takes up less space for non-Asian text. But the use of UTF-16 in Windows, Java, and C# does seem to have resulted in some folks thinking that wide characters mean Unicode and narrow characters mean ASCII.

> - Jonathan M Davis

I think it also simplifies the logic. You are not always looking to represent the codepoints symbolically. You are just trying to see what information is in the string. Therefore, if you can practically treat a codepoint as the unit of data behind the scenes, it simplifies the logic.
November 30, 2017
On Thursday, November 30, 2017 18:32:46 A Guy With a Question via Digitalmars-d wrote:
> On Thursday, 30 November 2017 at 17:56:58 UTC, Jonathan M Davis
>
> wrote:
> > On Thursday, November 30, 2017 03:37:37 Walter Bright via
> > Digitalmars-d wrote:
> > Language-wise, I think that most of the UTF-16 is driven by the
> > fact that Java went with UCS-2 / UTF-16, and C# followed them
> > (both because they were copying Java and because the Win32 API
> > had gone with UCS-2 / UTF-16). So, that's had a lot of
> > influence on folks, though most others have gone with UTF-8 for
> > backwards compatibility and because it typically takes up less
> > space for non-Asian text. But the use of UTF-16 in Windows,
> > Java, and C# does seem to have resulted in some folks thinking
> > that wide characters mean Unicode and narrow characters
> > mean ASCII.
> >
> > - Jonathan M Davis
>
> I think it also simplifies the logic. You are not always looking to represent the codepoints symbolically. You are just trying to see what information is in the string. Therefore, if you can practically treat a codepoint as the unit of data behind the scenes, it simplifies the logic.

Even if that were true, UTF-16 code units are not code points. If you want to operate on code points, you have to go to UTF-32. And even if you're at UTF-32, you have to worry about Unicode normalization, otherwise the same information can be represented differently even if all you care about is code points and not graphemes. And of course, some stuff really does care about graphemes, since those are the actual characters.

Ultimately, you have to understand how code units, code points, and graphemes work and what you're doing with a particular algorithm so that you know at which level you should operate at and where the pitfalls are. Some code can operate on code units and be fine; some can operate on code points; and some can operate on graphemes. But there is no one-size-fits-all solution that makes it all magically easy and efficient to use.
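
D at least makes it easy to pick the level explicitly. A quick sketch of the same string seen at each level:

import std.conv : to;
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    string s = "💩";                        // U+1F4A9, outside the BMP

    assert(s.byCodeUnit.walkLength == 4);   // UTF-8 code units
    assert(s.to!wstring.length == 2);       // UTF-16 code units (a surrogate pair)
    assert(s.walkLength == 1);              // code points (narrow strings auto-decode)
    assert(s.byGrapheme.walkLength == 1);   // graphemes
}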

And UTF-16 does _nothing_ to improve any of this over UTF-8. It's just a different way to encode code points. And really, it makes things worse, because it usually takes up more space than UTF-8, and it makes it easier to miss when you screw up your Unicode handling, because more UTF-16 code units are valid code points than UTF-8 code units are, but they still aren't all valid code points. So, if you use UTF-8, you're more likely to catch your mistakes.

Honestly, I think that the only good reason to use UTF-16 is if you're interacting with existing APIs that use UTF-16, and even then, I think that in most cases, you're better off using UTF-8 and converting to UTF-16 only when you have to. Strings eat less memory that way, and mistakes are more easily caught. And if you're writing cross-platform code in D, then Windows is really the only place that you're typically going to have to deal with UTF-16, so it definitely works better in general to favor UTF-8 in D programs. But regardless, at least D gives you the tools to deal with the different Unicode encodings relatively cleanly and easily, so you can use whichever Unicode encoding you need to. Most D code is going to use UTF-8 though.
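
The pattern I mean is to keep everything in UTF-8 and only produce UTF-16 right at the call site, e.g. with std.utf.toUTF16z when calling into Win32. A rough sketch (showMessage is just a made-up wrapper):

version (Windows)
{
    import core.sys.windows.windows;
    import std.utf : toUTF16z;

    void showMessage(string text, string caption)
    {
        // Strings stay UTF-8 everywhere else; the conversion to a
        // null-terminated UTF-16 buffer happens only at the API boundary.
        MessageBoxW(null, text.toUTF16z, caption.toUTF16z, MB_OK);
    }
}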

- Jonathan M Davis

November 30, 2017
On 11/30/17 1:20 PM, Patrick Schluter wrote:
> On Thursday, 30 November 2017 at 17:40:08 UTC, Jonathan M Davis wrote:
>> English and thus don't as easily hit the cases where their code is wrong. For better or worse, UTF-16 hides it better than UTF-8, but the problem exists in both.
>>
> 
> To give just an example of what can go wrong with UTF-16: reading a file in UTF-16 and converting it to something else like UTF-8 or UTF-32, reading block by block, and hitting an SMP codepoint exactly at the buffer limit, with the high surrogate at the end of the first buffer and the low surrogate at the start of the next. If you don't think about it, you get 2 invalid characters instead of your nice poop 💩 emoji (emojis are in the SMP and they are more and more frequent).

iopipe handles this: http://schveiguy.github.io/iopipe/iopipe/textpipe/ensureDecodeable.html

-Steve
November 30, 2017
On 11/30/2017 5:22 AM, A Guy With a Question wrote:
> It's also worth mentioning that the more I think about it, the UTF8 vs. UTF16 thing was probably not worth mentioning with the rest of the things I listed out. It's pretty minor and more of a preference.

Both Windows and Java selected UTF16 before surrogates were added, so it was a reasonable decision made in good faith. But an awful lot of Windows/Java code has latent bugs in it because of not dealing with surrogates.

D is designed from the ground up to work smoothly with UTF8/UTF16 multi-codeunit encodings. If you do decide to use UTF16, please take advantage of this and deal with surrogates correctly. When you do decide to give up on UTF16 (!), your code will be easy to convert to UTF8.
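
For example, Phobos' range primitives and conversions already decode surrogate pairs for you (a quick check):

import std.conv : to;
import std.range : walkLength;

void main()
{
    wstring w = "💩"w;          // stored as a surrogate pair
    assert(w.length == 2);      // two UTF-16 code units
    assert(w.walkLength == 1);  // but decoded as a single code point

    // Round-tripping through UTF-8 keeps the pair intact.
    assert(w.to!string.to!wstring == w);
}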
December 01, 2017
On Thursday, 30 November 2017 at 19:37:47 UTC, Steven Schveighoffer wrote:
> On 11/30/17 1:20 PM, Patrick Schluter wrote:
>> On Thursday, 30 November 2017 at 17:40:08 UTC, Jonathan M Davis wrote:
>>> English and thus don't as easily hit the cases where their code is wrong. For better or worse, UTF-16 hides it better than UTF-8, but the problem exists in both.
>>>
>> 
>> To give just an example of what can go wrong with UTF-16: reading a file in UTF-16 and converting it to something else like UTF-8 or UTF-32, reading block by block, and hitting an SMP codepoint exactly at the buffer limit, with the high surrogate at the end of the first buffer and the low surrogate at the start of the next. If you don't think about it, you get 2 invalid characters instead of your nice poop 💩 emoji (emojis are in the SMP and they are more and more frequent).
>
> iopipe handles this: http://schveiguy.github.io/iopipe/iopipe/textpipe/ensureDecodeable.html
>

It was only to give an example. With UTF-8, people who implement the low-level code generally think about multiple code units at the buffer boundary; with UTF-16 it's often forgotten. UTF-16 also has two other common pitfalls that exist in UTF-8 as well but are less consciously acknowledged: overlong encodings and isolated codepoints. So UTF-16 has the same issues as UTF-8, plus some more: endianness and size.
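
Both of those are the kind of broken input std.utf.validate is there to catch, for what it's worth (a quick sketch):

import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    // An isolated high surrogate with no low half following it.
    wchar[] lone = [cast(wchar) 0xD83D];
    assertThrown!UTFException(validate(lone));

    // An overlong UTF-8 encoding of '/' (0x2F spread over two bytes).
    char[] overlong = [cast(char) 0xC0, cast(char) 0xAF];
    assertThrown!UTFException(validate(overlong));
}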

December 01, 2017
On Friday, 1 December 2017 at 06:07:07 UTC, Patrick Schluter wrote:
> On Thursday, 30 November 2017 at 19:37:47 UTC, Steven Schveighoffer wrote:
>> On 11/30/17 1:20 PM, Patrick Schluter wrote:
>>> On Thursday, 30 November 2017 at 17:40:08 UTC, Jonathan M Davis wrote:
>>>> English and thus don't as easily hit the cases where their code is wrong. For better or worse, UTF-16 hides it better than UTF-8, but the problem exists in both.
>>>>
>>> 
>>> To give just an example of what can go wrong with UTF-16: reading a file in UTF-16 and converting it to something else like UTF-8 or UTF-32, reading block by block, and hitting an SMP codepoint exactly at the buffer limit, with the high surrogate at the end of the first buffer and the low surrogate at the start of the next. If you don't think about it, you get 2 invalid characters instead of your nice poop 💩 emoji (emojis are in the SMP and they are more and more frequent).
>>
>> iopipe handles this: http://schveiguy.github.io/iopipe/iopipe/textpipe/ensureDecodeable.html
>>
>
> It was only to give an example. With UTF-8, people who implement the low-level code generally think about multiple code units at the buffer boundary; with UTF-16 it's often forgotten. UTF-16 also has two other common pitfalls that exist in UTF-8 as well but are less consciously acknowledged: overlong encodings and isolated codepoints. So UTF-16 has the same issues as UTF-8, plus some more: endianness and size.

Most problems with UTF16 are applicable to UTF8. The only issue that isn't is that if you are just dealing with ASCII, UTF16 is a bit of a waste of space.