D's Auto Decoding and You

On Tuesday, 17 May 2016 at 14:06:37 UTC, Jack Stouffer wrote:
>
> If you think there should be any more information included in the article, please let me know so I can add it.

I was a little confused by something in the main autodecoding thread, so I read your article again. Unfortunately, I don't think my confusion is resolved. I was trying one of your examples (full code I used below). You claim it works, but I keep getting assertion failures. I'm just running it with rdmd on Windows 7.

import std.algorithm : canFind;

void main()
{
    string s = "cassé";

    assert(s.canFind!(x => x == 'é'));
}

On 6/2/16 5:21 PM, jmh530 wrote:
> On Tuesday, 17 May 2016 at 14:06:37 UTC, Jack Stouffer wrote:
>>
>> If you think there should be any more information included in the
>> article, please let me know so I can add it.
>
> I was a little confused by something in the main autodecoding thread, so
> I read your article again. Unfortunately, I don't think my confusion is
> resolved. I was trying one of your examples (full code I used below).
> You claim it works, but I keep getting assertion failures. I'm just
> running it with rdmd on Windows 7.
>
>
> import std.algorithm : canFind;
>
> void main()
> {
>     string s = "cassé";
>
>     assert(s.canFind!(x => x == 'é'));
> }

If that é above is an e followed by a combining character, then you will get the error. This is because autodecoding does not auto normalize as well -- the code points have to match exactly.

-Steve

On Thursday, 2 June 2016 at 21:21:50 UTC, jmh530 wrote:
> I was a little confused by something in the main autodecoding thread, so I read your article again. Unfortunately, I don't think my confusion is resolved. I was trying one of your examples (full code I used below). You claim it works, but I keep getting assertion failures. I'm just running it with rdmd on Windows 7.
>
>
> import std.algorithm : canFind;
>
> void main()
> {
>     string s = "cassé";
>
>     assert(s.canFind!(x => x == 'é'));
> }

Your browser is turning the é in the string into two code points via normalization whereas it should be one. Try using \u00E9 instead.
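The explanation given in the replies can be reproduced with escape sequences, which removes any dependence on how a browser or editor stored the é. This is a minimal sketch, assuming the failing string held 'e' followed by U+0301 (the thread suggests this but does not confirm it); `hasEAcute` is a hypothetical helper name, not something from the article.

```d
import std.algorithm : canFind;
import std.uni : normalize, NFC;

// Hypothetical helper: true if `s` contains the precomposed é (U+00E9)
// as a single code point. Autodecoding yields one dchar per code point.
bool hasEAcute(string s)
{
    return s.canFind!(x => x == '\u00E9');
}

void main()
{
    // Precomposed form: é is the single code point U+00E9.
    string composed = "cass\u00E9";
    // Decomposed form: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    string decomposed = "casse\u0301";

    assert(hasEAcute(composed));
    // Autodecoding does not normalize, so no code point equals U+00E9 here.
    assert(!hasEAcute(decomposed));
    // Normalizing to NFC first composes the pair into U+00E9.
    assert(hasEAcute(decomposed.normalize!NFC));
}
```

Both strings render identically, so the only reliable way to tell them apart is to inspect the code points or to normalize before comparing.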

June 02, 2016

Re: D's Auto Decoding and You

Posted by Andrei Alexandrescu
in reply to Steven Schveighoffer


On 6/2/16 5:27 PM, Steven Schveighoffer wrote:
> On 6/2/16 5:21 PM, jmh530 wrote:
>> On Tuesday, 17 May 2016 at 14:06:37 UTC, Jack Stouffer wrote:
>>>
>>> If you think there should be any more information included in the
>>> article, please let me know so I can add it.
>>
>> I was a little confused by something in the main autodecoding thread, so
>> I read your article again. Unfortunately, I don't think my confusion is
>> resolved. I was trying one of your examples (full code I used below).
>> You claim it works, but I keep getting assertion failures. I'm just
>> running it with rdmd on Windows 7.
>>
>>
>> import std.algorithm : canFind;
>>
>> void main()
>> {
>>     string s = "cassé";
>>
>>     assert(s.canFind!(x => x == 'é'));
>> }
>
> If that é above is an e followed by a combining character, then you will
> get the error. This is because autodecoding does not auto normalize as
> well -- the code points have to match exactly.
>
> -Steve

Indeed. FWIW I just copied OP's code from Thunderbird into Chrome (on OSX) and it worked: https://dpaste.dzfl.pl/09b9188d87a5

Should I assume some normalization occurred on the way?


Andrei

On Thursday, 2 June 2016 at 21:31:39 UTC, Jack Stouffer wrote:
> On Thursday, 2 June 2016 at 21:21:50 UTC, jmh530 wrote:
>> I was a little confused by something in the main autodecoding thread, so I read your article again. Unfortunately, I don't think my confusion is resolved. I was trying one of your examples (full code I used below). You claim it works, but I keep getting assertion failures. I'm just running it with rdmd on Windows 7.
>>
>>
>> import std.algorithm : canFind;
>>
>> void main()
>> {
>>     string s = "cassé";
>>
>>     assert(s.canFind!(x => x == 'é'));
>> }
>
> Your browser is turning the é in the string into two code points via normalization whereas it should be one. Try using \u00E9 instead.

That doesn't cause an assert to fail, but when I do writeln('\u00E9') I get ├⌐. So there might still be something wonky going on. I looked up \u00E9 online and I don't think there's an error with that.
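One plausible reading of the ├⌐ output, assuming the Windows console was using code page 437 (the post does not say which code page was active): writeln emits é as its two UTF-8 code units, 0xC3 and 0xA9, and code page 437 displays those two bytes as ├ and ⌐. The sketch below only checks the byte values; `utf8Bytes` is a hypothetical helper name.

```d
import std.string : representation;

// Hypothetical helper: the raw UTF-8 code units of a string.
immutable(ubyte)[] utf8Bytes(string s)
{
    return s.representation;
}

void main()
{
    // U+00E9 is one code point but two UTF-8 code units.
    assert(utf8Bytes("\u00E9") == [0xC3, 0xA9]);
    assert("\u00E9".length == 2);
}
```

If that reading is right, the program's output is correct UTF-8 and only the console's interpretation of the bytes is off.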

On Thursday, 2 June 2016 at 21:33:02 UTC, Andrei Alexandrescu wrote:
>
> Should I assume some normalization occurred on the way?
>

I'm just looking over std.uni's section on normalization and realizing that I had basically no idea what it is or what's going on. The wikipedia page on unicode equivalence is a bit clearer.

I'm definitely nowhere near qualified to have an opinion on these issues.

On Fri, Jun 3, 2016 at 5:16 AM, jmh530 via Digitalmars-d-announce <digitalmars-d-announce@puremagic.com> wrote:
> On Thursday, 2 June 2016 at 21:33:02 UTC, Andrei Alexandrescu wrote:
>>
>>
>> Should I assume some normalization occurred on the way?
>>
>
> I'm just looking over std.uni's section on normalization and realizing that I had basically no idea what it is or what's going on. The wikipedia page on unicode equivalence is a bit clearer.
>
> I'm definitely nowhere near qualified to have an opinion on these issues.

This dpaste shows a couple of issues with combining chars in D.

https://dpaste.dzfl.pl/4b006959c5c0

The compiler actually can't handle a combining character literal either. see line 10.

R

On Friday, 3 June 2016 at 03:16:33 UTC, jmh530 wrote:
> I'm just looking over std.uni's section on normalization and realizing that I had basically no idea what it is or what's going on. The wikipedia page on unicode equivalence is a bit clearer.

This might help a bit, as well: https://dpaste.dzfl.pl/2ffb22b02842

On Friday, 3 June 2016 at 06:37:59 UTC, Rory McGuire wrote:
> This dpaste shows a couple of issues with combining chars in D.
>
> https://dpaste.dzfl.pl/4b006959c5c0
>
> The compiler actually can't handle a combining character literal either. see line 10.

Your paste behaves as expected: the "character" types in D are defined as single Unicode code units. By definition, the NFD form of "é" is not a single code unit. You would need to use a Grapheme or [w|d]string for that.

(Of course, one might reasonably question how useful our built-in character types actually are compared to ubyte/ushort/uint.)
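The difference between code units, code points, and graphemes for the NFD form of é can be sketched as follows. This is only an illustration, not the content of the linked paste; `graphemeCount` is a hypothetical helper name built on std.uni's byGrapheme.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: number of user-perceived characters in `s`.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // NFD form of é: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    string nfd = "e\u0301";

    assert(nfd.length == 3);           // UTF-8 code units: 1 for 'e', 2 for U+0301
    assert(nfd.walkLength == 2);       // code points, via autodecoding
    assert(graphemeCount(nfd) == 1);   // one grapheme
}
```

Because the sequence is two code points, it cannot be held in a char, wchar, or dchar, which is why a string or a Grapheme is needed.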
