December 01, 2017
On Friday, 1 December 2017 at 06:07:07 UTC, Patrick Schluter wrote:
> On Thursday, 30 November 2017 at 19:37:47 UTC, Steven Schveighoffer wrote:
>> On 11/30/17 1:20 PM, Patrick Schluter wrote:
>>> [...]
>>
>> iopipe handles this: http://schveiguy.github.io/iopipe/iopipe/textpipe/ensureDecodeable.html
>>
>
> It was only to give an example. With UTF-8, people who implement the low-level code generally think about multiple code units at the buffer boundary. With UTF-16 it's often forgotten. UTF-16 also has two other common pitfalls, which exist in UTF-8 as well but are less consciously acknowledged: overlong encodings and isolated codepoints.

I meant isolated code-units, of course.

> So UTF-16 has the same issues as UTF-8, plus some more: endianness and size.
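
To make the buffer-boundary pitfall concrete, here is a minimal D sketch of the check a decoder needs when a read may end in the middle of a multi-byte sequence. It is only an illustration of the idea; the function name and structure are mine, not iopipe's actual ensureDecodeable implementation.

// Returns the index where a possibly incomplete trailing UTF-8
// sequence begins, or buf.length if the buffer ends on a
// complete code point.
size_t incompleteTail(const(char)[] buf)
{
    size_t i = buf.length;
    foreach (_; 0 .. 4)                 // a sequence is at most 4 bytes
    {
        if (i == 0) break;
        --i;
        immutable b = cast(ubyte) buf[i];
        if ((b & 0x80) == 0)            // ASCII byte: buffer ends cleanly
            return buf.length;
        if ((b & 0xC0) == 0xC0)         // lead byte of a multi-byte sequence
        {
            immutable len = b >= 0xF0 ? 4 : b >= 0xE0 ? 3 : 2;
            // Complete only if all continuation bytes are present.
            return buf.length - i >= len ? buf.length : i;
        }
        // Otherwise a continuation byte: keep walking backwards.
    }
    return buf.length;                  // malformed; let the decoder report it
}

The same hazard exists in UTF-16, where a buffer can end on a high surrogate whose low surrogate has not been read yet.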

December 01, 2017
On Friday, 1 December 2017 at 12:21:22 UTC, A Guy With a Question wrote:
> On Friday, 1 December 2017 at 06:07:07 UTC, Patrick Schluter wrote:
>> On Thursday, 30 November 2017 at 19:37:47 UTC, Steven Schveighoffer wrote:
>>> On 11/30/17 1:20 PM, Patrick Schluter wrote:
>>>> [...]
>>>
>>> iopipe handles this: http://schveiguy.github.io/iopipe/iopipe/textpipe/ensureDecodeable.html
>>>
>>
>> It was only to give an example. With UTF-8, people who implement the low-level code generally think about multiple code units at the buffer boundary. With UTF-16 it's often forgotten. UTF-16 also has two other common pitfalls, which exist in UTF-8 as well but are less consciously acknowledged: overlong encodings and isolated codepoints. So UTF-16 has the same issues as UTF-8, plus some more: endianness and size.
>
> Most problems with UTF16 are applicable to UTF8. The only issue that isn't is that if you are just dealing with ASCII, it's a bit of a waste of space.

That's what I said. UTF-16 and UTF-8 have the same issues, but UTF-16 has two more: endianness and bloat for ASCII. All three encodings have their pluses and minuses; that's why D supports all three, but with a preference for UTF-8.
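
For readers new to D, the three encodings map directly onto the language's built-in string types. A small sketch (the lengths in the comments assume the exact literal shown):

void main()
{
    import std.stdio : writeln;

    // The same text in D's three native encodings.
    string  u8  = "h\u00E9llo \U0001F34E";   // UTF-8  (immutable(char)[])
    wstring u16 = "h\u00E9llo \U0001F34E"w;  // UTF-16 (immutable(wchar)[])
    dstring u32 = "h\u00E9llo \U0001F34E"d;  // UTF-32 (immutable(dchar)[])

    // .length counts code units, not characters.
    writeln(u8.length);   // 11: the accented e takes 2 bytes, the emoji 4
    writeln(u16.length);  //  8: the emoji needs a surrogate pair
    writeln(u32.length);  //  7: one dchar per code point
}
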
December 01, 2017
On 12/1/17 7:26 AM, Patrick Schluter wrote:
> On Friday, 1 December 2017 at 06:07:07 UTC, Patrick Schluter wrote:
>>
>>  isolated codepoints. 
> 
> I meant isolated code-units, of course.

Hehe, it's impossible for me to talk about code points and code units without having to pause and consider which one I mean :)

-Steve
December 01, 2017
On Friday, December 01, 2017 09:49:08 Steven Schveighoffer via Digitalmars-d wrote:
> On 12/1/17 7:26 AM, Patrick Schluter wrote:
> > On Friday, 1 December 2017 at 06:07:07 UTC, Patrick Schluter wrote:
> >>  isolated codepoints.
> >
> > I meant isolated code-units, of course.
>
> Hehe, it's impossible for me to talk about code points and code units without having to pause and consider which one I mean :)

What, you mean that Unicode can be confusing? No way! ;)

LOL. I have to be careful with that too. What bugs me even more though is that the Unicode spec talks about code points being characters, and then talks about combining characters for grapheme clusters - and this in spite of the fact that what most people would consider a character is a grapheme cluster and _not_ a code point. But they presumably had to come up with new terms for a lot of this nonsense, and that's not always easy.
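
D's std.uni makes the distinction visible directly; here is a small sketch using a combining character, where what a reader sees as one character is two code points but one grapheme cluster:

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // 'e' U+0065 followed by combining acute accent U+0301:
    // it renders as a single accented "character".
    string s = "e\u0301";

    writeln(s.walkLength);             // 2: code points (auto-decoded)
    writeln(s.byGrapheme.walkLength);  // 1: grapheme cluster
}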

Regardless, what they came up with is complicated enough that it's arguably a miracle whenever a program actually handles Unicode text 100% correctly. :|

- Jonathan M Davis

December 01, 2017
On Friday, 1 December 2017 at 18:31:46 UTC, Jonathan M Davis wrote:
> On Friday, December 01, 2017 09:49:08 Steven Schveighoffer via Digitalmars-d wrote:
>> On 12/1/17 7:26 AM, Patrick Schluter wrote:
>> > On Friday, 1 December 2017 at 06:07:07 UTC, Patrick Schluter wrote:
>> >>  isolated codepoints.
>> >
>> > I meant isolated code-units, of course.
>>
>> Hehe, it's impossible for me to talk about code points and code units without having to pause and consider which one I mean :)
>
> What, you mean that Unicode can be confusing? No way! ;)
>
> LOL. I have to be careful with that too. What bugs me even more though is that the Unicode spec talks about code points being characters, and then talks about combining characters for grapheme clusters - and this in spite of the fact that what most people would consider a character is a grapheme cluster and _not_ a code point. But they presumably had to come up with new terms for a lot of this nonsense, and that's not always easy.
>
> Regardless, what they came up with is complicated enough that it's arguably a miracle whenever a program actually handles Unicode text 100% correctly. :|
>
> - Jonathan M Davis

And dealing with that complexity can often introduce bugs in its own right, because it's hard to get right. That's why it's sometimes easier just to simplify things and exclude certain ways of looking at the string.
December 01, 2017
On 11/30/2017 9:23 AM, Kagamin wrote:
> On Tuesday, 28 November 2017 at 03:37:26 UTC, rikki cattermole wrote:
>> Be aware Microsoft is alone in thinking that UTF-16 was awesome. Everybody else standardized on UTF-8 for Unicode.
> 
> UCS2 was awesome. UTF-16 is used by Java, JavaScript, Objective-C, Swift, Dart, and MS tech, which is 28% of the TIOBE index.

"was" :-) Those are pretty much pre-surrogate pair designs, or based on them (Dart compiles to JavaScript, for example).

UCS2 has serious problems:

1. Most strings are in ASCII, meaning UCS2 doubles memory consumption. Strings in the executable file are twice the size.

2. The code doesn't work well with C. C doesn't even have a UCS2 type.

3. There's no reasonable way to audit the code to see if it handles surrogate pairs correctly. Surrogate pairs occur only rarely, so the code is never tested for it, and the bugs may remain latent for many, many years.

With UTF-8, multibyte code points are much more common, so bugs are detected much earlier.
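
Point 3 above is easy to demonstrate in D. A small sketch (the G clef is just a convenient non-BMP example): code written on the UCS2 assumption of one wchar per character silently miscounts, while correct UTF-16 handling must recognize the surrogate pair.

import std.stdio : writeln;
import std.utf : decode;

void main()
{
    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP,
    // so UTF-16 must encode it as a surrogate pair.
    wstring s = "\U0001D11E"w;

    // The UCS2-era assumption: one wchar == one character.
    writeln(s.length);           // 2 code units, not 1 character

    // Correct UTF-16 decoding consumes the whole pair.
    size_t i = 0;
    dchar c = decode(s, i);
    writeln(c == '\U0001D11E');  // true
    writeln(i);                  // 2: both code units consumed
}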
December 01, 2017
On Fri, Dec 01, 2017 at 03:04:44PM -0800, Walter Bright via Digitalmars-d wrote:
> On 11/30/2017 9:23 AM, Kagamin wrote:
> > On Tuesday, 28 November 2017 at 03:37:26 UTC, rikki cattermole wrote:
> > > Be aware Microsoft is alone in thinking that UTF-16 was awesome. Everybody else standardized on UTF-8 for Unicode.
> > 
> > UCS2 was awesome. UTF-16 is used by Java, JavaScript, Objective-C, Swift, Dart, and MS tech, which is 28% of the TIOBE index.
> 
> "was" :-) Those are pretty much pre-surrogate pair designs, or based
> on them (Dart compiles to JavaScript, for example).
> 
> UCS2 has serious problems:
> 
> 1. Most strings are in ASCII, meaning UCS2 doubles memory consumption. Strings in the executable file are twice the size.

This is not true in Asia, especially where the CJK block is extensively used. A CJK block character is 3 bytes in UTF-8, meaning that string sizes are 150% of the UCS2 encoding. If your code contains a lot of CJK text, that's a lot of bloat.

But then again, in non-Latin locales you'd generally store your strings separately from the executable (usually in l10n files), so this may not be that big an issue. But the blanket statement "Most strings are in ASCII" is not correct.
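
The arithmetic is easy to check in D; a quick sketch with a short CJK string:

import std.stdio : writeln;

void main()
{
    // "Japanese" (nihongo): three CJK code points, all in the BMP.
    string  u8  = "\u65E5\u672C\u8A9E";   // UTF-8
    wstring u16 = "\u65E5\u672C\u8A9E"w;  // UTF-16/UCS2

    writeln(u8.length);                  // 9 bytes: 3 per code point
    writeln(u16.length * wchar.sizeof);  // 6 bytes: 2 per code point
    // 9 / 6 = 150% of the UCS2 size, as stated above.
}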


T

-- 
Bare foot: (n.) A device for locating thumb tacks on the floor.
December 01, 2017
On 11/30/2017 9:56 AM, Jonathan M Davis wrote:
> I'm sure that we could come up with a better encoding than UTF-8 (e.g.
> getting rid of Unicode normalization as being a thing and never having
> multiple encodings for the same character), but _that_'s never going to
> happen.

UTF-8 is not the cause of that particular problem, it's caused by the Unicode committee being a committee. Other Unicode problems are caused by the committee trying to add semantic information to code points, which causes nothing but problems. I.e. the committee forgot that Unicode is a character set, and nothing more.

December 01, 2017
On Friday, December 01, 2017 15:54:31 Walter Bright via Digitalmars-d wrote:
> On 11/30/2017 9:56 AM, Jonathan M Davis wrote:
> > I'm sure that we could come up with a better encoding than UTF-8 (e.g. getting rid of Unicode normalization as being a thing and never having multiple encodings for the same character), but _that_'s never going to happen.
>
> UTF-8 is not the cause of that particular problem, it's caused by the Unicode committee being a committee. Other Unicode problems are caused by the committee trying to add semantic information to code points, which causes nothing but problems. I.e. the committee forgot that Unicode is a character set, and nothing more.

Oh, definitely. UTF-8 is arguably the best that Unicode has, but Unicode in general is what's broken, because the folks designing it made poor choices. And personally, I think that their worst decisions tend to be at the code point level (e.g. having the same character being representable by different combinations of code points).
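
The multiple-representations problem is easy to reproduce; a small D sketch using std.uni's normalization support:

import std.stdio : writeln;
import std.uni : normalize, NFC;

void main()
{
    string precomposed = "\u00E9";   // 'é' as one code point
    string decomposed  = "e\u0301";  // 'é' as 'e' + combining acute

    // The "same character", yet the strings compare unequal...
    writeln(precomposed == decomposed);  // false

    // ...until both are brought to a common normalization form.
    writeln(precomposed.normalize!NFC == decomposed.normalize!NFC);  // true
}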

Quite possibly the most depressing thing that I've run into with Unicode, though, was finding out that emojis had their own code points. Emojis are specifically representable by a sequence of existing characters (usually ASCII), because they came from folks trying to represent pictures with text. The fact that they're then trying to put those pictures into the Unicode standard just blatantly shows that the Unicode folks have lost sight of what they're up to. It's as if they started trying to add Unicode characters for words. It makes no sense. But unfortunately, we just have to live with it... :(

- Jonathan M Davis

December 02, 2017
On 12/1/2017 3:16 PM, H. S. Teoh wrote:
> This is not true in Asia, especially where the CJK block is extensively used.
> A CJK block character is 3 bytes in UTF-8, meaning that string sizes are
> 150% of the UCS2 encoding. If your code contains a lot of CJK text,
> that's a lot of bloat.
> 
> But then again, in non-Latin locales you'd generally store your strings
> separately from the executable (usually in l10n files), so this may not be
> that big an issue. But the blanket statement "Most strings are in ASCII"
> is not correct.

Are you sure about that? I know that Asian languages will be longer in UTF-8. But how much of the data that programs handle is in those languages? The language of business, science, programming, aviation, and engineering is English.

Of course, D itself is agnostic about that. The compiler, for example, accepts strings, identifiers, and comments in Chinese in UTF-16 format.