char array weirdness
March 28, 2016
void main () {
    import std.range.primitives;
    char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];
    pragma(msg, ElementEncodingType!(typeof(val)));
    pragma(msg, typeof(val.front));
}

prints

    char
    dchar

Why?
March 28, 2016
On Monday, 28 March 2016 at 22:34:31 UTC, Jack Stouffer wrote:
> void main () {
>     import std.range.primitives;
>     char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];
>     pragma(msg, ElementEncodingType!(typeof(val)));
>     pragma(msg, typeof(val.front));
> }
>
> prints
>
>     char
>     dchar
>
> Why?

Unicode! `char` is UTF-8, which means a character can be from 1 to 4 bytes. val.front gives a `dchar` (UTF-32), consuming those bytes and giving you a sensible value.
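
A minimal sketch of what that decoding looks like in practice (assuming only std.range.primitives, nothing from the thread's code):

```d
void main()
{
    import std.range.primitives : front;

    // "é" is U+00E9: one code point, but two UTF-8 code units.
    char[] s = "é".dup;
    assert(s.length == 2);          // two chars in the array...
    assert(s.front == '\u00E9');    // ...but front decodes them into one dchar
    static assert(is(typeof(s.front) == dchar));
}
```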
March 28, 2016
On Monday, 28 March 2016 at 22:43:26 UTC, Anon wrote:
> On Monday, 28 March 2016 at 22:34:31 UTC, Jack Stouffer wrote:
>> void main () {
>>     import std.range.primitives;
>>     char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];
>>     pragma(msg, ElementEncodingType!(typeof(val)));
>>     pragma(msg, typeof(val.front));
>> }
>>
>> prints
>>
>>     char
>>     dchar
>>
>> Why?
>
> Unicode! `char` is UTF-8, which means a character can be from 1 to 4 bytes. val.front gives a `dchar` (UTF-32), consuming those bytes and giving you a sensible value.

But the value fits into a char; a dchar is a waste of space. Why on Earth would a different type be given for the front value than the type of the elements themselves?
March 28, 2016
On Mon, Mar 28, 2016 at 10:49:28PM +0000, Jack Stouffer via Digitalmars-d-learn wrote:
> On Monday, 28 March 2016 at 22:43:26 UTC, Anon wrote:
> >On Monday, 28 March 2016 at 22:34:31 UTC, Jack Stouffer wrote:
> >>void main () {
> >>    import std.range.primitives;
> >>    char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];
> >>    pragma(msg, ElementEncodingType!(typeof(val)));
> >>    pragma(msg, typeof(val.front));
> >>}
> >>
> >>prints
> >>
> >>    char
> >>    dchar
> >>
> >>Why?
> >
> >Unicode! `char` is UTF-8, which means a character can be from 1 to 4 bytes. val.front gives a `dchar` (UTF-32), consuming those bytes and giving you a sensible value.
> 
> But the value fits into a char; a dchar is a waste of space. Why on Earth would a different type be given for the front value than the type of the elements themselves?

Welcome to the world of auto-decoding.  Phobos ranges always treat any string / wstring / dstring as a range of dchar, even if it's encoded as UTF-8.

The pros and cons of auto-decoding have been debated to death several times already. Walter hates it and wishes to get rid of it, but so far Andrei has refused to budge.  Personally I lean on the side of killing auto-decoding, but it seems unlikely to change at this point.  (But you never know... if enough people revolt against it, maybe there's a small chance Andrei could be convinced...)

For the time being, I'd recommend std.utf.byCodeUnit as a workaround.
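A quick sketch of the workaround, using the thread's own array (byCodeUnit wraps the string so the range API stops auto-decoding):

```d
void main()
{
    import std.range.primitives : front;
    import std.utf : byCodeUnit;

    char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];
    static assert(is(typeof(val.front) == dchar));            // auto-decoded
    static assert(is(typeof(val.byCodeUnit.front) == char));  // raw code units
}
```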


T

-- 
Those who don't understand D are condemned to reinvent it, poorly. -- Daniel N
March 28, 2016
On Monday, 28 March 2016 at 22:49:28 UTC, Jack Stouffer wrote:
> On Monday, 28 March 2016 at 22:43:26 UTC, Anon wrote:
>> On Monday, 28 March 2016 at 22:34:31 UTC, Jack Stouffer wrote:
>>> void main () {
>>>     import std.range.primitives;
>>>     char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];
>>>     pragma(msg, ElementEncodingType!(typeof(val)));
>>>     pragma(msg, typeof(val.front));
>>> }
>>>
>>> prints
>>>
>>>     char
>>>     dchar
>>>
>>> Why?
>>
>> Unicode! `char` is UTF-8, which means a character can be from 1 to 4 bytes. val.front gives a `dchar` (UTF-32), consuming those bytes and giving you a sensible value.
>
> But the value fits into a char;

The compiler doesn't know that, and it isn't true in general. You could have, for example, U+3042 in your char[]. That would be encoded as three chars. It wouldn't make sense (or be correct) for val.front to yield '\xe3' (the first byte of U+3042 in UTF-8).
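
A short sketch of the U+3042 case (its UTF-8 encoding is the three bytes 0xE3 0x81 0x82):

```d
void main()
{
    import std.range.primitives : front;

    char[] s = "\u3042".dup;     // HIRAGANA LETTER A, three UTF-8 code units
    assert(s.length == 3);
    assert(s[0] == 0xE3);        // a lone code unit, not a character
    assert(s.front == '\u3042'); // front decodes all three into one dchar
}
```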

> a dchar is a waste of space.

If you're processing Unicode text, you *need* to use that space. Any because you're using ranges, it is only 3 extra bytes, anyway. It isn't going to hurt on modern systems.

> Why on Earth would a different type be given for the front value than the type of the elements themselves?

Unicode. A single char cannot hold a Unicode code point. A single dchar can.
March 28, 2016
On Monday, March 28, 2016 22:34:31 Jack Stouffer via Digitalmars-d-learn wrote:
> void main () {
>      import std.range.primitives;
>      char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];
>      pragma(msg, ElementEncodingType!(typeof(val)));
>      pragma(msg, typeof(val.front));
> }
>
> prints
>
>      char
>      dchar
>
> Why?

static assert(is(ElementType!(typeof(val)) == dchar));

The range API considers all strings to have an element type of dchar. char, wchar, and dchar are UTF code units - UTF-8, UTF-16, and UTF-32 respectively. One or more code units make up a code point, which is actually something displayable but not necessarily what you'd call a character (e.g. it could be an accent). One or more code points then make up a grapheme, which is really what a displayable character is. When Andrei designed the range API, he didn't know about graphemes - just code units and code points, so he thought that code points were guaranteed to be full characters and decided that that's what we'd operate on for correctness' sake.

In the case of UTF-8, a code point is made up of 1 - 4 code units of 8 bits each. In the case of UTF-16, a code point is made up of 1 - 2 code units of 16 bits each. And in the case of UTF-32, a code unit is guaranteed to be a single code point. So, by having the range API decode UTF-8 and UTF-16 to UTF-32, strings then become ranges of dchar and avoid having code points chopped up by stuff like slicing. So, while a code point is not actually guaranteed to be a full character, certain classes of bugs are prevented by operating on ranges of code points rather than code units. Of course, for full correctness, graphemes need to be taken into account, and some algorithms generally don't care whether they're operating on code units, code points, or graphemes (e.g. find on code units generally works quite well, whereas something like filter would be a complete disaster if you're not actually dealing with ASCII).

Arrays of char and wchar are termed "narrow strings" - hence isNarrowString is true for them (but not arrays of dchar) - and the range API does not consider them to have slicing, be random access, or have length, because as ranges of dchar, those operations would be O(n) rather than O(1). However, because of this mess of whether an algorithm works best when operating on code units or code points and the desire to avoid decoding to code points if unnecessary, many algorithms special case narrow strings in order to operate on them more efficiently. So, ElementEncodingType was introduced for such cases. ElementType gives you the element type of the range, and for everything but narrow strings ElementEncodingType is the same as ElementType, but in the case of narrow strings, whereas ElementType is dchar, ElementEncodingType is the actual element type of the array - hence why ElementEncodingType!(typeof(val)) is char in your code above.
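
The ElementType/ElementEncodingType distinction can be sketched directly with static asserts:

```d
void main()
{
    import std.range.primitives : ElementType, ElementEncodingType;

    // Narrow string: element type is the decoded dchar,
    // encoding type is the underlying code unit.
    static assert(is(ElementType!(char[]) == dchar));
    static assert(is(ElementEncodingType!(char[]) == char));

    // For a dchar[] (not a narrow string), the two agree.
    static assert(is(ElementType!(dchar[]) == dchar));
    static assert(is(ElementEncodingType!(dchar[]) == dchar));
}
```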

The correct way to deal with this is really to understand Unicode well enough to know when you should be dealing at the code unit, code point, or grapheme level and write your code accordingly, but that's not exactly easy. So, in some respects, just operating on strings as dchar simplifies things and reduces bugs relating to breaking up code points, but it does come with an efficiency cost, and it does make the range API more confusing when it comes to operating on narrow strings. And it isn't even fully correct, because it doesn't take graphemes into account. But it's what we're stuck with at this point.

std.utf provides byCodeUnit and byChar to iterate by code unit or specific character types, and std.uni provides byGrapheme for iterating by grapheme (along with plenty of other helper functions). So, the tools to deal with ranges of characters more precisely are there, but they do require some understanding of Unicode, and they don't always interact with the rest of Phobos very well, since they're newer (e.g. std.conv.to doesn't fully work with byCodeUnit yet, even though it works with ranges of dchar just fine).
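
A small sketch of the three levels, using "e" plus a combining acute accent (U+0301), which is one grapheme built from two code points and three UTF-8 code units:

```d
void main()
{
    import std.range : walkLength;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit;

    string s = "e\u0301";                 // 'e' + combining acute accent
    assert(s.byCodeUnit.walkLength == 3); // code units: 1 for 'e', 2 for U+0301
    assert(s.walkLength == 2);            // auto-decoded code points
    assert(s.byGrapheme.walkLength == 1); // one displayed character
}
```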

- Jonathan M Davis

March 28, 2016
On Monday, 28 March 2016 at 23:06:49 UTC, Anon wrote:
> Any because you're using ranges,

*And because you're using ranges,


March 29, 2016
On 29.03.2016 00:49, Jack Stouffer wrote:
> But the value fits into a char; a dchar is a waste of space. Why on
> Earth would a different type be given for the front value than the type
> of the elements themselves?

UTF-8 strings are decoded by the range primitives. That is, `front` returns one Unicode code point (type dchar) that's pieced together from up to four UTF-8 code units (type char). A code point does not fit into the 8 bits of a char.
March 28, 2016
On Monday, March 28, 2016 16:02:26 H. S. Teoh via Digitalmars-d-learn wrote:
> For the time being, I'd recommend std.utf.byCodeUnit as a workaround.

Yeah, though as I've started using it, I've quickly found that enough of Phobos doesn't support it yet that it's problematic. e.g.

https://issues.dlang.org/show_bug.cgi?id=15800

The situation will improve, but for the moment, the most reliable thing is still to use strings as ranges of dchar but special case functions for them so that they avoid decoding where necessary. The main problem is places like filter where if you _know_ that you're just dealing with ASCII but the code has to treat the string as a range of dchar anyway, because it has to decode to match what's expected of auto-decoding. To some extent, using std.string.representation gets around that, but it runs into problems similar to those of byCodeUnit.
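
A sketch of the representation workaround for known-ASCII data (representation exposes the raw ubyte[] so nothing gets decoded):

```d
void main()
{
    import std.algorithm : equal, filter;
    import std.string : representation;

    // Known-ASCII input: operate on raw bytes, skipping auto-decoding.
    string s = "10h36m28s";
    auto digits = s.representation.filter!(b => b >= '0' && b <= '9');
    assert(digits.equal("103628".representation));
}
```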

So, we have a ways to go.

- Jonathan M Davis

March 28, 2016
On Mon, Mar 28, 2016 at 04:07:22PM -0700, Jonathan M Davis via Digitalmars-d-learn wrote: [...]
> The range API considers all strings to have an element type of dchar. char, wchar, and dchar are UTF code units - UTF-8, UTF-16, and UTF-32 respectively. One or more code units make up a code point, which is actually something displayable but not necessarily what you'd call a character (e.g.  it could be an accent). One or more code points then make up a grapheme, which is really what a displayable character is. When Andrei designed the range API, he didn't know about graphemes - just code units and code points, so he thought that code points were guaranteed to be full characters and decided that that's what we'd operate on for correctness' sake.
[...]

Unfortunately, the fact that the default is *not* to use graphemes makes working with non-European language strings pretty much just as ugly and error-prone as working with bare char's in European language strings.

You gave the example of filter() returning wrong results when used with a range of chars (if we didn't have autodecoding), but the same can be said of using filter() *with* autodecoding on a string that contains combining diacritics: your diacritics may get randomly reattached to stuff they weren't originally attached to, or you may end up with wrong sequences of Unicode code points (e.g. diacritics not attached to any grapheme). Using filter() on Korean text, even with autodecoding, will pretty much produce garbage. And so on.
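
A sketch of that failure mode: filtering at the code-point level can strand a combining diacritic with nothing to attach to, even though every code point was handled "correctly":

```d
void main()
{
    import std.algorithm : filter;
    import std.conv : to;

    // "é" spelled as 'e' + combining acute accent (U+0301), followed by 'x'.
    string s = "e\u0301x";

    // Drop the 'e' at the code-point level (this is what auto-decoding gives us).
    auto kept = s.filter!(c => c != 'e').to!string;

    // The combining accent survives with no base character: garbage output.
    assert(kept == "\u0301x");
}
```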

So in short, we're paying a performance cost for something that's only arguably better but still not quite there, and this cost is attached to almost *everything* you do with strings, regardless of whether you need to (e.g., when you know you're dealing with pure ASCII data).  Even when dealing with non-ASCII Unicode data, in many cases autodecoding introduces a constant (and unnecessary!) overhead.  E.g., searching for a non-ASCII character is equivalent to a substring search on the encoded form of the character, and there is no good reason why Phobos couldn't have done this instead of autodecoding every character while scanning the string.  Regexes on Unicode strings could possibly be faster if the regex engine internally converted literals in the regex into their equivalent encoded forms and did the scanning without decoding. (IIRC Dmitry did remark in some PR some time ago, to the effect that the regex engine has been optimized to the point where the cost of autodecoding is becoming visible, and the next step might be to bypass autodecoding.)

I argue that auto-decoding, as currently implemented, is a net minus, even though I realize this is unlikely to change in this lifetime. It charges a constant performance overhead yet still does not guarantee things will behave as the user would expect (i.e., treat the string as graphemes rather than code points).


T

-- 
We are in class, we are supposed to be learning, we have a teacher... Is it too much that I expect him to teach me??? -- RL