6 days ago
In my not-so-humble opinion, the introduction of "normalization" to Unicode was a huge mistake. It's not necessary and causes nothing but grief. They should have consulted with me first :-)
6 days ago
I ran into that as well with the 3 PRs I did:

fix array(String) to work with no autodecode
https://github.com/dlang/phobos/pull/7133

fix assocArray() unittests for no autodecode
https://github.com/dlang/phobos/pull/7134

fix unittests for array.join() for no autodecode
https://github.com/dlang/phobos/pull/7135

More specifically, the ElementType template returns a dchar for an autodecodable string, and char/wchar for a non-autodecodable string. I suspect that most people are not aware of this, and code that uses ElementType may already be subtly broken.
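
As a rough analogy for readers without a D compiler at hand, the code-point versus code-unit split can be sketched in Python. This is not Phobos code. It only assumes that iterating a Python `str` resembles the autodecoded (`dchar`) view and that iterating its UTF-8 `bytes` resembles the `char` view; the helper names `code_points` and `utf8_code_units` are made up for the illustration.

```python
def code_points(text: str) -> list:
    """Elements seen when iterating by code point (the decoded view)."""
    return [ord(ch) for ch in text]


def utf8_code_units(text: str) -> list:
    """Elements seen when iterating by UTF-8 code unit (the raw view)."""
    return list(text.encode("utf-8"))


sample = "h\u00e9llo"  # U+00E9 is one code point but two UTF-8 code units

points = code_points(sample)
units = utf8_code_units(sample)

# Generic code that counts or indexes "elements" gets a different answer
# depending on which of the two views it was handed.
print(len(points), len(units))
```

Generic code written against one element type and silently handed the other is the kind of subtle breakage described above.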

Note that the documentation for ElementType is also wrong,

  https://github.com/dlang/phobos/pull/

because isNarrowString is NOT THE SAME THING as an autodecoding string! The difference is isNarrowString excludes stringish aggregates and enums with a string base type, while autodecoding types include them.

Does this confuse anyone? It confuses me. I can never remember which is which.

Autodecoding is not only a conceptual mistake; the way it is implemented is a buggy, confusing disaster. (isNarrowString is often incorrectly used instead of isAutodecodableString in Phobos.)

I think the only solution is to "rip the band-aid off" and have ElementType give the code unit type when autodecoding is disabled.
6 days ago
On Thursday, 15 August 2019 at 19:11:14 UTC, Walter Bright wrote:
> In my not-so-humble opinion, the introduction of "normalization" to Unicode was a huge mistake. It's not necessary and causes nothing but grief. They should have consulted with me first :-)

I am not sure that you can go entirely without normalization for all languages in existence. But Unicode conflates semantic representation and rendering in ways that are effectively layering violations. The LTR and RTL control characters are nice examples of that. Why should a Unicode string be able to specify the displayed direction of the script? The same goes for the stylistic ligatures I pointed out. These should be handled exclusively by the font rendering subsystem. There's a substitution table in OpenType for that, FFS!

Well, I guess that Unicode is the best we have despite all this maddening cruft. Attempting to do better would just result in text encoding "standard" N+1. And we know how much the world needs that. ;)
6 days ago
On Thursday, August 15, 2019 1:11:14 PM MDT Walter Bright via Digitalmars-d wrote:
> In my not-so-humble opinion, the introduction of "normalization" to Unicode was a huge mistake. It's not necessary and causes nothing but grief. They should have consulted with me first :-)

IMHO, the fact that Unicode normalization is a thing is one of those things that proves that Unicode is unnecessarily complex. There should only be a single way to represent a given character. Unfortunately, that's definitely not the way they went, and we suffer that much more because of it. Honestly, I question that very many applications exist which actually handle Unicode fully correctly. Its level of complexity is way past the point that the average programmer has much chance of getting it right.
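
To make the "more than one way to represent a character" problem concrete, here is a small sketch using Python's standard `unicodedata` module (not D, and not anything from Phobos). The choice of "é" as the sample character is just an assumption for the example.

```python
import unicodedata

precomposed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT, two code points

# Both strings render as the same character, yet a plain comparison
# sees two different code point sequences.
raw_equal = precomposed == decomposed

# Normalizing both sides to the same form (NFC here) makes them compare equal.
nfc_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

print(raw_equal, nfc_equal)
```

Any code that compares, hashes, or searches text without normalizing first can treat these two spellings as different strings.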

- Jonathan M Davis



6 days ago
On 8/15/2019 12:38 PM, Gregor Mückl wrote:
> I am not sure that you can go entirely without normalization for all languages in existence. But Unicode conflates semantic representation and rendering in ways that are effectively layering violations. The LTR and RTL control characters are nice examples of that. Why should a Unicode string be able to specify the displayed direction of the script? The same goes for the stylistic ligatures I pointed out. These should be handled exclusively by the font rendering subsystem. There's a substitution table in OpenType for that, FFS!

Unicode also fouled up by adding semantic information that is invisible to the rendering. It should have stuck with the Unicode<=>print round-trip not losing information.

Naturally, people have already used such to trick people, track people, etc.
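
A minimal sketch of how such invisible content can be spotted, in Python with the standard `unicodedata` module. The specific characters (ZERO WIDTH SPACE, RIGHT-TO-LEFT OVERRIDE) and the file names are assumptions picked for illustration, and the helper `invisible_format_chars` is hypothetical. It only looks at general category Cf, so it is nowhere near a complete check.

```python
import unicodedata


def invisible_format_chars(text):
    """Return (index, name) for every code point in general category Cf."""
    return [
        (i, unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


plain = "invoice.txt"
tagged = "invoice\u200b.txt"      # hidden ZERO WIDTH SPACE, usable as a marker
spoofed = "invoice\u202etxt.exe"  # RIGHT-TO-LEFT OVERRIDE changes display order

for label, value in (("plain", plain), ("tagged", tagged), ("spoofed", spoofed)):
    print(label, invisible_format_chars(value))
```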

Another thing I hate about Unicode is there are articles about how people can get their vanity symbol into Unicode! And they do. They invent glyphs, and get them in. This goes on all the time.

Unicode started out as a cool idea, and turned rather quickly into a cesspool.
6 days ago
On 8/15/2019 12:44 PM, Jonathan M Davis wrote:
> There should only be a
> single way to represent a given character.

Exactly. And two glyphs that render identically should be the same code point.

After all, when we write:

   a + b = c

we don't use a separate code point for the letters. Also,

   a) one item
   b) another item

we don't use a separate code point, either. I've debated this point with Unicode people, and their arguments for separate glyphs fall to pieces when I point this out.
6 days ago
On 8/15/2019 5:09 AM, Vladimir Panteleev wrote:
> I sent a few PRs for the modules that I am listed as a code owner of.

Can you please add a link to those PRs in
https://github.com/dlang/phobos/pull/7130 ?

I think such references to how Phobos fixed its dependencies on autodecode will be valuable to programmers who want to fix theirs.
6 days ago
On Thursday, 15 August 2019 at 21:21:33 UTC, Walter Bright wrote:
> On 8/15/2019 5:09 AM, Vladimir Panteleev wrote:
>> I sent a few PRs for the modules that I am listed as a code owner of.
>
> Can you please add a link to those PRs in
> https://github.com/dlang/phobos/pull/7130 ?

I added a link to #7130 to the PR descriptions, which should do it. I noticed you added some comments just now doing the same.

6 days ago
On Thursday, 15 August 2019 at 19:59:34 UTC, Walter Bright wrote:
> On 8/15/2019 12:44 PM, Jonathan M Davis wrote:
>> There should only be a
>> single way to represent a given character.
>
> Exactly. And two glyphs that render identically should be the same code point.
>

If that was not sarcasm:
Different code points can refer to the same glyph, but not vice versa:
A (EN, \u0041), A (RU, \u0410), A (EL, \u0391)
Otherwise sorting for non-English languages would not work.

Even the ordering (A < B) would be wrong. For example, these RU glyphs
ABCEHKMOPTXacepuxy
correspond to the following English letters by sound or meaning:
AVSENKMORTHaserihu
As you can see, even the upper- and lowercase forms don't exist as pairs, and they have different meanings.
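
The same point as a small Python sketch using the standard `unicodedata` module (Python rather than D). The three code points are the ones listed above; the extra Cyrillic letters U+0411 and U+0412 are added here as an assumption to show the ordering and case-pair issue.

```python
import unicodedata

# Three code points that usually share one glyph shape.
lookalikes = ["\u0041", "\u0410", "\u0391"]
names = [unicodedata.name(ch) for ch in lookalikes]

# Their lowercase partners are three different letters.
lowered = [ch.lower() for ch in lookalikes]

# Cyrillic U+0412 looks like Latin 'B', but its lowercase form is U+0432,
# not 'b', and it is the third letter of the Russian alphabet, not the second.
cyrillic_ve_lower = "\u0412".lower()
russian_order = sorted(["\u0412", "\u0411", "\u0410"])

for ch, name in zip(lookalikes, names):
    print(f"U+{ord(ch):04X}", name)
```

If the look-alike capitals were merged into one code point, the lowercase mapping and the per-alphabet ordering shown here could no longer both be expressed by code point alone.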
6 days ago
On 15.08.19 21:54, Walter Bright wrote:
> Unicode also fouled up by adding semantic information that is invisible to the rendering. It should have stuck with the Unicode<=>print round-trip not losing information.
> 
> Naturally, people have already used such to trick people, track people, etc.

'I' and 'l' are (virtually) identical in many fonts.