May 20, 2016
On Tuesday, 17 May 2016 at 14:06:37 UTC, Jack Stouffer wrote:

Related discussion https://trello.com/c/4XmFdcp6/163-rediscuss-redundant-utf-8-string-validation.
June 02, 2016
On Tuesday, 17 May 2016 at 14:06:37 UTC, Jack Stouffer wrote:
>
> If you think there should be any more information included in the article, please let me know so I can add it.

I was a little confused by something in the main autodecoding thread, so I read your article again. Unfortunately, I don't think my confusion is resolved. I was trying one of your examples (full code I used below). You claim it works, but I keep getting assertion failures. I'm just running it with rdmd on Windows 7.


import std.algorithm : canFind;

void main()
{
	string s = "cassé";

	assert(s.canFind!(x => x == 'é'));
}
June 02, 2016
On 6/2/16 5:21 PM, jmh530 wrote:
> On Tuesday, 17 May 2016 at 14:06:37 UTC, Jack Stouffer wrote:
>>
>> If you think there should be any more information included in the
>> article, please let me know so I can add it.
>
> I was a little confused by something in the main autodecoding thread, so
> I read your article again. Unfortunately, I don't think my confusion is
> resolved. I was trying one of your examples (full code I used below).
> You claim it works, but I keep getting assertion failures. I'm just
> running it with rdmd on Windows 7.
>
>
> import std.algorithm : canFind;
>
> void main()
> {
>     string s = "cassé";
>
>     assert(s.canFind!(x => x == 'é'));
> }

If that é above is an e followed by a combining character, then you will get the error. This is because autodecoding does not auto normalize as well -- the code points have to match exactly.

-Steve
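
To make the point above concrete, here is a minimal sketch that builds both forms explicitly with escape sequences, so no editor or browser can change them on the way. The variable names are illustrative, and it is only an assumption that the failing source file contained the decomposed form. It relies on `normalize` and `NFC` from std.uni.

```d
import std.algorithm : canFind;
import std.uni : normalize, NFC;

void main()
{
    // Precomposed form: é is the single code point U+00E9.
    string composed = "cass\u00E9";
    // Decomposed form: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    string decomposed = "casse\u0301";

    // Autodecoding yields code points, so only the precomposed form matches.
    assert(composed.canFind!(x => x == '\u00E9'));
    assert(!decomposed.canFind!(x => x == '\u00E9'));

    // Normalizing to NFC first recombines the pair into U+00E9.
    assert(normalize!NFC(decomposed).canFind!(x => x == '\u00E9'));
}
```

Both strings render identically, which is why the failure is hard to spot by looking at the source.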
June 02, 2016
On Thursday, 2 June 2016 at 21:21:50 UTC, jmh530 wrote:
> I was a little confused by something in the main autodecoding thread, so I read your article again. Unfortunately, I don't think my confusion is resolved. I was trying one of your examples (full code I used below). You claim it works, but I keep getting assertion failures. I'm just running it with rdmd on Windows 7.
>
>
> import std.algorithm : canFind;
>
> void main()
> {
> 	string s = "cassé";
>
> 	assert(s.canFind!(x => x == 'é'));
> }

Your browser is turning the é in the string into two code points via normalization whereas it should be one. Try using \u00E9 instead.
June 02, 2016
On 6/2/16 5:27 PM, Steven Schveighoffer wrote:
> On 6/2/16 5:21 PM, jmh530 wrote:
>> On Tuesday, 17 May 2016 at 14:06:37 UTC, Jack Stouffer wrote:
>>>
>>> If you think there should be any more information included in the
>>> article, please let me know so I can add it.
>>
>> I was a little confused by something in the main autodecoding thread, so
>> I read your article again. Unfortunately, I don't think my confusion is
>> resolved. I was trying one of your examples (full code I used below).
>> You claim it works, but I keep getting assertion failures. I'm just
>> running it with rdmd on Windows 7.
>>
>>
>> import std.algorithm : canFind;
>>
>> void main()
>> {
>>     string s = "cassé";
>>
>>     assert(s.canFind!(x => x == 'é'));
>> }
>
> If that é above is an e followed by a combining character, then you will
> get the error. This is because autodecoding does not auto normalize as
> well -- the code points have to match exactly.
>
> -Steve

Indeed. FWIW I just copied OP's code from Thunderbird into Chrome (on OSX) and it worked: https://dpaste.dzfl.pl/09b9188d87a5

Should I assume some normalization occurred on the way?


Andrei
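
One way to check whether normalization happened along the way is to dump the code points of the string actually present in the source file. A rough sketch follows; `codePoints` is a hypothetical helper written for this illustration, not something from Phobos.

```d
import std.algorithm : map;
import std.array : array;
import std.format : format;

// Hypothetical helper: lists each code point of s as "U+XXXX".
// Iterating a string with map autodecodes it, so c is a dchar.
string[] codePoints(string s)
{
    return s.map!(c => format("U+%04X", cast(uint) c)).array;
}

void main()
{
    // Precomposed é is one code point.
    assert(codePoints("\u00E9") == ["U+00E9"]);
    // Decomposed é is 'e' plus the combining acute accent.
    assert(codePoints("e\u0301") == ["U+0065", "U+0301"]);
}
```

Running the helper on the pasted literal should show which of the two forms survived the copy.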

June 03, 2016
On Thursday, 2 June 2016 at 21:31:39 UTC, Jack Stouffer wrote:
> On Thursday, 2 June 2016 at 21:21:50 UTC, jmh530 wrote:
>> I was a little confused by something in the main autodecoding thread, so I read your article again. Unfortunately, I don't think my confusion is resolved. I was trying one of your examples (full code I used below). You claim it works, but I keep getting assertion failures. I'm just running it with rdmd on Windows 7.
>>
>>
>> import std.algorithm : canFind;
>>
>> void main()
>> {
>> 	string s = "cassé";
>>
>> 	assert(s.canFind!(x => x == 'é'));
>> }
>
> Your browser is turning the é in the string into two code points via normalization whereas it should be one. Try using \u00E9 instead.

That doesn't cause an assert to fail, but when I do writeln('\u00E9') I get é. So there might still be something wonky going on. I looked up \u00E9 online and I don't think there's an error with that.
June 03, 2016
On Thursday, 2 June 2016 at 21:33:02 UTC, Andrei Alexandrescu wrote:
>
> Should I assume some normalization occurred on the way?
>

I'm just looking over std.uni's section on normalization and realizing that I had basically no idea what it is or what's going on. The Wikipedia page on Unicode equivalence is a bit clearer.

I'm definitely nowhere near qualified to have an opinion on these issues.
June 03, 2016
On Fri, Jun 3, 2016 at 5:16 AM, jmh530 via Digitalmars-d-announce <digitalmars-d-announce@puremagic.com> wrote:
> On Thursday, 2 June 2016 at 21:33:02 UTC, Andrei Alexandrescu wrote:
>>
>>
>> Should I assume some normalization occurred on the way?
>>
>
> I'm just looking over std.uni's section on normalization and realizing that I had basically no idea what it is or what's going on. The Wikipedia page on Unicode equivalence is a bit clearer.
>
> I'm definitely nowhere near qualified to have an opinion on these issues.


This dpaste shows a couple of issues with combining chars in D.

https://dpaste.dzfl.pl/4b006959c5c0

The compiler actually can't handle a combining character literal either. See line 10.

R
June 03, 2016
On Friday, 3 June 2016 at 03:16:33 UTC, jmh530 wrote:
> I'm just looking over std.uni's section on normalization and realizing that I had basically no idea what it is or what's going on. The Wikipedia page on Unicode equivalence is a bit clearer.

This might help a bit, as well:
    https://dpaste.dzfl.pl/2ffb22b02842

June 03, 2016
On Friday, 3 June 2016 at 06:37:59 UTC, Rory McGuire wrote:
> This dpaste shows a couple of issues with combining chars in D.
>
> https://dpaste.dzfl.pl/4b006959c5c0
>
> The compiler actually can't handle a combining character literal either. See line 10.

Your paste behaves as expected: the "character" types in D are defined as single Unicode code units. By definition, the NFD form of "é" is not a single code unit. You would need to use a Grapheme or [w|d]string for that.

(Of course, one might reasonably question how useful our built-in character types actually are compared to ubyte/ushort/uint.)
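
As a small sketch of the distinction drawn above, the NFD form of é can be measured at three levels: UTF-8 code units, code points, and graphemes. This assumes `byGrapheme` and `normalize` from std.uni and `walkLength` from std.range; the variable name is illustrative.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFD;

void main()
{
    // Decompose U+00E9 into 'e' followed by U+0301.
    string nfd = normalize!NFD("\u00E9");

    assert(nfd.length == 3);                // UTF-8 code units: 1 for 'e', 2 for U+0301
    assert(nfd.walkLength == 2);            // code points, via autodecoding
    assert(nfd.byGrapheme.walkLength == 1); // one user-perceived character
}
```

Since the decomposed form is two code points, no single char, wchar, or dchar literal can hold it, whereas a string or a Grapheme can.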