char array weirdness (page 2) - D Programming Language Discussion Forum

March 28, 2016

Re: char array weirdness

Posted by Jack Stouffer
in reply to Jonathan M Davis

On Monday, 28 March 2016 at 23:07:22 UTC, Jonathan M Davis wrote:
> ...

Thanks for the detailed responses. I think I'll compile this info and put it in a blog post so people can just point to it when someone else is confused.

March 28, 2016

Re: char array weirdness

Posted by Steven Schveighoffer
in reply to Anon

On 3/28/16 7:06 PM, Anon wrote:
> The compiler doesn't know that, and it isn't true in general. You could
> have, for example, U+3042 in your char[]. That would be encoded as three
> chars. It wouldn't make sense (or be correct) for val.front to yield
> '\xe3' (the first byte of U+3042 in UTF-8).

I just want to interject to say that the compiler understands that char[] is an array of char code units just fine. It's Phobos that has a strange interpretation of it.

-Steve
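
A minimal sketch of that split, reusing the U+3042 example from the quote. The helper names `firstUnit` and `firstPoint` are made up for illustration; only `front` and the two traits come from Phobos.

```d
import std.range.primitives : front, ElementType, ElementEncodingType;

// Language level: indexing a char array yields a UTF-8 code unit.
char firstUnit(const(char)[] s) { return s[0]; }

// Library level: Phobos' front decodes a whole code point.
dchar firstPoint(const(char)[] s) { return s.front; }

void main()
{
    char[] val = "\u3042".dup;   // U+3042 encodes to three UTF-8 code units

    // The compiler sees plain code units.
    static assert(is(typeof(val[0]) == char));
    foreach (c; val)
        static assert(is(typeof(c) == char));
    assert(val.length == 3);
    assert(firstUnit(val) == 0xE3);

    // The Phobos range primitives see code points.
    static assert(is(ElementEncodingType!(char[]) == char));
    static assert(is(ElementType!(char[]) == dchar));
    assert(firstPoint(val) == 0x3042);
}
```

So `val[0]` and `foreach` stay at the code unit level, and only the range primitives imported from std.range.primitives decode.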

March 29, 2016

Re: char array weirdness

Posted by Jonathan M Davis

On Monday, March 28, 2016 16:29:50 H. S. Teoh via Digitalmars-d-learn wrote:
> On Mon, Mar 28, 2016 at 04:07:22PM -0700, Jonathan M Davis via Digitalmars-d-learn wrote: [...]
>
> > The range API considers all strings to have an element type of dchar. char, wchar, and dchar are UTF code units - UTF-8, UTF-16, and UTF-32 respectively. One or more code units make up a code point, which is actually something displayable but not necessarily what you'd call a character (e.g., it could be an accent). One or more code points then make up a grapheme, which is really what a displayable character is. When Andrei designed the range API, he didn't know about graphemes - just code units and code points, so he thought that code points were guaranteed to be full characters and decided that that's what we'd operate on for correctness' sake.
>
> [...]
>
> Unfortunately, the fact that the default is *not* to use graphemes makes working with non-European language strings pretty much just as ugly and error-prone as working with bare char's in European language strings.

Yeah. Operating at the code point level is correct for more text than operating at the code unit level (especially if you're dealing with char rather than wchar), but ultimately, it's still not correct, and there's plenty of text that will be processed incorrectly as code points.

> I argue that auto-decoding, as currently implemented, is a net minus, even though I realize this is unlikely to change in this lifetime. It charges a constant performance overhead yet still does not guarantee things will behave as the user would expect (i.e., treat the string as graphemes rather than code points).

I totally agree, and I think that _most_ of the Phobos devs agree at this point. It's Andrei that doesn't. But we have the twin problems of figuring out how to convince him and how to deal with the fact that changing it would break a lot of code. Unicode is disgusting to deal with if you want to deal with it correctly _and_ be efficient about it, but hiding it doesn't really work.

I think that the first steps are to make it so that the algorithms in Phobos will operate just fine on ranges of char and wchar in addition to dchar and move towards making it irrelevant wherever we can. Some functions (like filter) are going to have to be told what level to operate at and would be a serious problem if/when we switched away from auto-decoding, but many others (such as find) can be made not to care while still operating on Unicode correctly. And if we can get the amount of code impacted low enough (at least as far as Phobos goes), then maybe we can find a way to switch away from auto-decoding. Ultimately though, I fear that we're stuck with it and that we'll just have to figure out how to make it work well for those who know what they're doing while minimizing the performance impact of auto-decoding on those who don't know what they're doing as much as we reasonably can.

- Jonathan M Davis
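
For anyone following along, here is a rough sketch of the three levels described above, counted on the same string. It assumes a Phobos recent enough to ship std.utf.byCodeUnit and std.uni.byGrapheme. The three helper functions are illustrative names, not library API.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

// Count at each of the three levels discussed in the thread.
size_t codeUnits(string s)  { return s.byCodeUnit.walkLength; }
size_t codePoints(string s) { return s.walkLength; }   // auto-decoded
size_t graphemes(string s)  { return s.byGrapheme.walkLength; }

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // one displayed character, built from two code points.
    string s = "e\u0301";

    assert(codeUnits(s) == 3);   // 1 byte for 'e', 2 bytes for U+0301
    assert(codePoints(s) == 2);  // what the default range API iterates
    assert(graphemes(s) == 1);   // what a reader would call a character
}
```

The default (auto-decoded) view lands on the middle number, which is neither the cheapest one to compute nor the one a user would expect.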

March 30, 2016

Re: char array weirdness

Posted by Marco Leise
in reply to H. S. Teoh

Am Mon, 28 Mar 2016 16:29:50 -0700
schrieb "H. S. Teoh via Digitalmars-d-learn"
<digitalmars-d-learn@puremagic.com>:

> […] your diacritics may get randomly reattached to
> stuff they weren't originally attached to, or you may end up with wrong
> sequences of Unicode code points (e.g. diacritics not attached to any
> grapheme). Using filter() on Korean text, even with autodecoding, will
> pretty much produce garbage. And so on.

I'm on the same page here. If it ain't ASCII parsable, you *have* to make a conscious decision about whether you need code units or graphemes. I've yet to find out about the use cases for auto-decoded code-points though.

> So in short, we're paying a performance cost for something that's only arguably better but still not quite there, and this cost is attached to almost *everything* you do with strings, regardless of whether you need to (e.g., when you know you're dealing with pure ASCII data).

An unconscious decision made by the library that yields the least likely intended and expected result? Let me think ... mhhh ... that's worse than iterating by char. No talking back :p.

-- 
Marco
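
To make the quoted filter() complaint concrete, here is a small sketch. It uses a decomposed Latin "é" rather than Korean text purely to keep the example short. `dropLetterE` is a hypothetical helper, and the predicate is an arbitrary choice for the demonstration.

```d
import std.algorithm.iteration : filter;
import std.array : array;
import std.uni : isMark;

// Hypothetical helper: remove every 'e', iterating by auto-decoded code point.
dchar[] dropLetterE(string s)
{
    return s.filter!(c => c != 'e').array;
}

void main()
{
    // "é" written as 'e' + U+0301 (combining acute accent), then 'a'.
    auto result = dropLetterE("e\u0301a");

    // The base letter is gone but its accent survived,
    // so the output now starts with an orphaned combining mark.
    assert(result == "\u0301a"d);
    assert(isMark(result[0]));
}
```

Auto-decoding kept the UTF-8 well-formed here, but the text is still damaged: the filter worked on code points while the reader thinks in graphemes.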

March 29, 2016

Re: char array weirdness

Posted by Basile B.
in reply to Jack Stouffer

On Monday, 28 March 2016 at 22:34:31 UTC, Jack Stouffer wrote:
> void main () {
>     import std.range.primitives;
>     char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];
>     pragma(msg, ElementEncodingType!(typeof(val)));
>     pragma(msg, typeof(val.front));
> }
>
> prints
>
>     char
>     dchar
>
> Why?

I've seen you so many times as a reviewer on dlang that I believe this Q is a joke.
Even if obviously nobody can know everything...

https://www.youtube.com/watch?v=l97MxTx0nzs

seriously you didn't know that auto decoding is on and that it gives you a dchar...

March 29, 2016

Re: char array weirdness

Posted by Jack Stouffer
in reply to Basile B.

On Tuesday, 29 March 2016 at 23:15:26 UTC, Basile B. wrote:
> I've seen you so many times as a reviewer on dlang that I believe this Q is a joke.
> Even if obviously nobody can know everything...
>
> https://www.youtube.com/watch?v=l97MxTx0nzs
>
> seriously you didn't know that auto decoding is on and that it gives you a dchar...

It's not a joke. This is the first time I've run into this problem in my code. I just started using D more and more in my work and I've never written anything that was really string heavy.

March 29, 2016

Re: char array weirdness

Posted by H. S. Teoh
in reply to Basile B.

On Tue, Mar 29, 2016 at 11:15:26PM +0000, Basile B. via Digitalmars-d-learn wrote:
> On Monday, 28 March 2016 at 22:34:31 UTC, Jack Stouffer wrote:
> >void main () {
> >    import std.range.primitives;
> >    char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];
> >    pragma(msg, ElementEncodingType!(typeof(val)));
> >    pragma(msg, typeof(val.front));
> >}
> >
> >prints
> >
> >    char
> >    dchar
> >
> >Why?
> 
> I've seen you so many times as a reviewer on dlang that I believe this Q
> is a joke.
> Even if obviously nobody can know everything...
> 
> https://www.youtube.com/watch?v=l97MxTx0nzs
> 
> seriously you didn't know that auto decoding is on and that it gives you a dchar...

Believe it or not, it was only last year (IIRC, maybe the year before) that Walter "discovered" that Phobos does autodecoding, and got pretty upset over it.  If even Walter wasn't aware of this for that long...

I used to be in favor of autodecoding, but more and more, I'm seeing that it was a bad choice.  It's a special case to how ranges normally work, and this special case has caused a ripple of exceptional corner cases to percolate throughout all Phobos code, leaving behind a string(!) of bugs over the years that, certainly, eventually got addressed, but nevertheless it shows that something didn't quite fit in. It also left behind a trail of additional complexity to deal with these special cases that made Phobos harder to understand and maintain.

It's a performance bottleneck for string-processing code, which is a pity because D stood a chance of beating C/C++ at string processing (where code pays for extensive calls to strlen and strdup). But in spite of this heavy price we *still* don't guarantee correctness. On the spectrum of speed (don't decode at all) vs. correctness (segment by graphemes, not by code units or code points), autodecoding lands in the anemic middle where you get neither speed nor full correctness.

The saddest part of it all is that this is unlikely to change because people have gotten so uptight about the specter of breaking existing code, in spite of the repeated experiences of newbies (and not-so-newbies like Walter himself!) wondering why strings have ElementType == dchar instead of char, usually followed by concerns over the performance overhead.

T

-- 
Designer clothes: how to cover less by paying more.
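
For data known to be ASCII, one way to sidestep the decoding overhead mentioned above is to drop to code units explicitly. This is only a sketch: `countDigits` is a made-up helper, and whether the difference matters in practice is something to measure rather than assume.

```d
import std.algorithm.searching : count;
import std.range.primitives : ElementType;
import std.string : representation;
import std.utf : byCodeUnit;

// Hypothetical helper: count ASCII digits without decoding anything.
size_t countDigits(string s)
{
    // representation reinterprets the string as immutable(ubyte)[],
    // so the range primitives no longer auto-decode.
    return s.representation.count!(b => b >= '0' && b <= '9');
}

// By default a string is a range of dchar ...
static assert(is(ElementType!string == dchar));
// ... while the two opt-outs iterate raw code units.
static assert(is(ElementType!(typeof("".byCodeUnit)) : char));
static assert(is(ElementType!(typeof("".representation)) : ubyte));

void main()
{
    assert(countDigits("10h36m28s") == 6);
}
```

Both `byCodeUnit` and `representation` leave the caller responsible for knowing that code-unit processing is valid for the data at hand.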

March 29, 2016

Re: char array weirdness

Posted by Steven Schveighoffer
in reply to H. S. Teoh

On 3/29/16 7:42 PM, H. S. Teoh via Digitalmars-d-learn wrote:
> On Tue, Mar 29, 2016 at 11:15:26PM +0000, Basile B. via Digitalmars-d-learn wrote:
>> On Monday, 28 March 2016 at 22:34:31 UTC, Jack Stouffer wrote:
>>> void main () {
>>>     import std.range.primitives;
>>>     char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];
>>>     pragma(msg, ElementEncodingType!(typeof(val)));
>>>     pragma(msg, typeof(val.front));
>>> }
>>>
>>> prints
>>>
>>>     char
>>>     dchar
>>>
>>> Why?
>>
>> I've seen you so many times as a reviewer on dlang that I believe this Q
>> is a joke.
>> Even if obviously nobody can know everything...
>>
>> https://www.youtube.com/watch?v=l97MxTx0nzs
>>
>> seriously you didn't know that auto decoding is on and that it gives
>> you a dchar...
>
> Believe it or not, it was only last year (IIRC, maybe the year before)
> that Walter "discovered" that Phobos does autodecoding, and got pretty
> upset over it.  If even Walter wasn't aware of this for that long...

Phobos treats narrow strings (wchar[], char[]) as ranges of dchar. It was discovered that auto decoding strings isn't always the smartest thing to do, especially for performance.

So you get things like this: https://github.com/D-Programming-Language/phobos/blob/master/std/algorithm/searching.d#L1622

That's right. Phobos insists that auto decoding must happen for narrow strings. Except that's not the best thing to do so it inserts lots of exceptions -- for narrow strings.

Mind blown?

-Steve
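
The pattern being described looks roughly like the sketch below. This is not the Phobos source behind the link, just a hypothetical `containsAscii` written in the same spirit: a generic range algorithm that has to carve out a branch for narrow strings to undo the decoding the range API would otherwise force.

```d
import std.range.primitives;
import std.traits : isNarrowString;
import std.utf : byCodeUnit;

// Hypothetical generic search for a single ASCII character.
bool containsAscii(R)(R haystack, char needle)
if (isInputRange!R)
{
    assert(needle < 0x80, "needle must be ASCII");

    static if (isNarrowString!R)
    {
        // Special case: an ASCII value never occurs inside a multi-unit
        // UTF-8 or UTF-16 sequence, so scanning code units is safe
        // and skips the decoding step.
        foreach (unit; haystack.byCodeUnit)
            if (unit == needle)
                return true;
        return false;
    }
    else
    {
        // Generic path for every other input range.
        foreach (e; haystack)
            if (e == needle)
                return true;
        return false;
    }
}

void main()
{
    assert(containsAscii("10h36m28s", 'm'));   // narrow string branch
    assert(!containsAscii("\u3042", 'a'));     // multi-byte data, no match
    assert(containsAscii("abc"d, 'b'));        // dstring takes the generic path
}
```

Multiply that `static if` across a library's worth of algorithms and you get the maintenance cost the thread is complaining about.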

March 30, 2016

Re: char array weirdness

Posted by Basile B.
in reply to Steven Schveighoffer

On Wednesday, 30 March 2016 at 00:05:29 UTC, Steven Schveighoffer wrote:
> On 3/29/16 7:42 PM, H. S. Teoh via Digitalmars-d-learn wrote:
>> On Tue, Mar 29, 2016 at 11:15:26PM +0000, Basile B. via Digitalmars-d-learn wrote:
>>> On Monday, 28 March 2016 at 22:34:31 UTC, Jack Stouffer wrote:
>>>> void main () {
>>>>     import std.range.primitives;
>>>>     char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];
>>>>     pragma(msg, ElementEncodingType!(typeof(val)));
>>>>     pragma(msg, typeof(val.front));
>>>> }
>>>>
>>>> prints
>>>>
>>>>     char
>>>>     dchar
>>>>
>>>> Why?
>>>
>>> I've seen you so many times as a reviewer on dlang that I believe this Q
>>> is a joke.
>>> Even if obviously nobody can know everything...
>>>
>>> https://www.youtube.com/watch?v=l97MxTx0nzs
>>>
>>> seriously you didn't know that auto decoding is on and that it gives
>>> you a dchar...
>>
>> Believe it or not, it was only last year (IIRC, maybe the year before)
>> that Walter "discovered" that Phobos does autodecoding, and got pretty
>> upset over it.  If even Walter wasn't aware of this for that long...
>
> Phobos treats narrow strings (wchar[], char[]) as ranges of dchar. It was discovered that auto decoding strings isn't always the smartest thing to do, especially for performance.
>
> So you get things like this: https://github.com/D-Programming-Language/phobos/blob/master/std/algorithm/searching.d#L1622
>
> That's right. Phobos insists that auto decoding must happen for narrow strings. Except that's not the best thing to do so it inserts lots of exceptions -- for narrow strings.
>
> Mind blown?
>
> -Steve

https://www.youtube.com/watch?v=JKQwgpaLR6o

Listen to this then it'll be more clear.

March 30, 2016

Re: char array weirdness

Posted by Jack Stouffer
in reply to H. S. Teoh

On Tuesday, 29 March 2016 at 23:42:07 UTC, H. S. Teoh wrote:
> Believe it or not, it was only last year (IIRC, maybe the year before) that Walter "discovered" that Phobos does autodecoding, and got pretty upset over it.  If even Walter wasn't aware of this for that long...

The link (I think this is what you're referring to) to that discussion: http://forum.dlang.org/post/lfbg06$30kh$1@digitalmars.com

It's a shame Walter never got his way. Special casing ranges like this is a huge mistake.

Copyright © 1999-2021 by the D Language Foundation