ElementType!string
August 25, 2013
Apparently, ElementType!string evaluates to dchar. I would have expected char. Why is that?
August 25, 2013
On Sunday, 25 August 2013 at 19:25:08 UTC, qznc wrote:
> Apparently, ElementType!string evaluates to dchar. I would have expected char. Why is that?

I think it's because strings are iterated by ranges as dchar, which is equivalent to iterating over the Unicode symbols.

If they were iterated by char, you would get the individual pieces of the UTF-8 encoding during iteration, and usually that is not what a user expects.

Note, on the other hand, that static assert( is(typeof(" "[0]) == immutable(char)) ) holds, so you can iterate the string by chars using the index.
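A minimal sketch of the two views (the literal here is just illustrative):

```d
import std.range : front; // range primitives for strings

void main()
{
    string s = "Maß";

    // Indexing yields raw UTF-8 code units...
    static assert(is(typeof(s[0]) == immutable(char)));

    // ...while range iteration decodes to code points.
    static assert(is(typeof(s.front) == dchar));
    assert(s.front == 'M');
}
```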

- Paolo Invernizzi
August 25, 2013
On Sunday, 25 August 2013 at 19:38:52 UTC, Paolo Invernizzi wrote:
> On Sunday, 25 August 2013 at 19:25:08 UTC, qznc wrote:
>> Apparently, ElementType!string evaluates to dchar. I would have expected char. Why is that?
>
> I think it's because strings are iterated by ranges as dchar, which is equivalent to iterating over the Unicode symbols.
>
> If they were iterated by char, you would get the individual pieces of the UTF-8 encoding during iteration, and usually that is not what a user expects.
>
> Note, on the other hand, that static assert( is(typeof(" "[0]) == immutable(char)) ) holds, so you can iterate the string by chars using the index.
>
> - Paolo Invernizzi

Thanks, somewhat unintuitive.

This also seems to be the explanation for why the types documentation describes char as "unsigned 8 bit UTF-8", which is different from ubyte's "unsigned 8 bit".

Confirmed by this unittest:

import std.range : walkLength;

string raw = "Maß"; // 'ß' encodes as two UTF-8 code units
assert(raw.length == 4);      // code units
assert(walkLength(raw) == 3); // code points
August 25, 2013
On Sunday, 25 August 2013 at 19:25:08 UTC, qznc wrote:
> Apparently, ElementType!string evaluates to dchar. I would have expected char. Why is that?

It is mentioned in the documentation of `ElementType`. Use `std.range.ElementEncodingType` or `std.traits.ForeachType` to get `char` and `wchar` when given arrays of those two types.
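To illustrate the difference (a quick sketch):

```d
import std.range : ElementType, ElementEncodingType;
import std.traits : ForeachType;

// For narrow strings, the range element type is the decoded code point...
static assert(is(ElementType!string == dchar));
static assert(is(ElementType!wstring == dchar));

// ...while the encoding/foreach type is the raw code unit.
static assert(is(ElementEncodingType!string == immutable(char)));
static assert(is(ForeachType!string == immutable(char)));

// Non-string arrays are unaffected.
static assert(is(ElementType!(int[]) == int));
```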

As for the rationale:

`string`, being an alias for `immutable(char)[]`, is an array of UTF-8 code units - an array of `char`s. However, it is indeed a forward range of code points (represented as a UTF-32 code unit - `dchar`). It's a (slightly controversial) choice that was made to make Unicode-correct code the easiest and most intuitive to write, as code points are much more useful than code units.

Note that it is not a random-access range. UTF-8 is a variable length encoding, so several code units can be required to encode a single code point. Hence, a non-trivial search is required to get the n'th code point in a UTF-8 or UTF-16 string.
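A small sketch of what that walk looks like in practice (the literal is just an example):

```d
import std.range : drop, front, walkLength;

void main()
{
    string s = "Maß!"; // 'ß' is two code units, so 5 code units total

    // s[2] would give a raw code unit, not the third character.
    // To reach the n'th code point, the string must be walked:
    assert(s.drop(2).front == 'ß');

    // Counting code points is likewise O(n):
    assert(walkLength(s) == 4); // code points, while s.length == 5
}
```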

Another name for a code point is "character" (technically, a character is what the code point translates to in the UCS). However, it can be a deceptive name - the units we see on screen when rendered are "graphemes", as Unicode characters can be combining, zero-width etc.

To get a range of UTF-8 or UTF-16 code units, the code units have to be represented as something other than `char` and `wchar`. For example, you can cast your string to immutable(ubyte)[] to operate on that, then cast it back at a later point.
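For instance (a sketch of the round trip):

```d
void main()
{
    string s = "Maß";

    // View the raw code units as plain bytes; this is a reinterpreting
    // cast, no copying takes place.
    auto bytes = cast(immutable(ubyte)[]) s;
    assert(bytes.length == 4);
    assert(bytes[0] == 'M');

    // Cast back once done operating on code units.
    string t = cast(string) bytes;
    assert(t == s);
}
```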
August 25, 2013
qznc:

> Apparently, ElementType!string evaluates to dchar. I would have expected char. Why is that?

Try also ForeachType :-)

Bye,
bearophile
August 26, 2013
On Sunday, 25 August 2013 at 19:51:50 UTC, qznc wrote:
> Thanks, somewhat unintuitive.

Yes, but un-intuitive... to the un-initiated. By default, it's also safer. A string *is*, conceptually, a sequence of Unicode code points. The fact that it is made of UTF-8 code units is really just a low-level implementation detail.

Thanks to this behavior, things like:
string s = "日本語";
// search for '本' (front/popFront for strings come from std.range)
for ( ; s.front != '本'; s.popFront())
{}

Well, they *just work* (TM).

Now... If you *know* what you are doing, then by all means, iterate on the UTF8 code units. But be warned, you must really know what you are doing.

Back to your original subject, you can use:
ElementEncodingType!S

ElementEncodingType works just like ElementType, but for strings it *really* takes the array's element type. This is usually *not* the default you want.

Also related, foreach naturally iterates on code units by default (for some weird reason). I recommend explicitly iterating on dchar.
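For example, the element type in the foreach declaration controls which view you get:

```d
void main()
{
    string s = "Maß";

    size_t units;
    foreach (char c; s)  // explicit char: iterates code units (also the default)
        ++units;
    assert(units == 4);

    size_t points;
    foreach (dchar c; s) // explicit dchar: decodes to code points
        ++points;
    assert(points == 3);
}
```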
August 27, 2013
On Sunday, 25 August 2013 at 19:38:52 UTC, Paolo Invernizzi wrote:

> Thanks, somewhat unintuitive.

It is a trap for the unwary, but in this case the benefits outweigh the costs.

On Sunday, 25 August 2013 at 19:56:34 UTC, Jakob Ovrum wrote:

> To get a range of UTF-8 or UTF-16 code units, the code units have to be represented as something other than `char` and `wchar`. For example, you can cast your string to immutable(ubyte)[] to operate on that, then cast it back at a later point.

To have to use ubyte would seem to defeat the purpose of having char. If I were to have this:

  auto no_convert(C)(C[] s) if (isSomeChar!C)
  {
    struct No
    {
      private C[] s;
      this(C[] _s) { s = _s; }

      @property bool empty() { return s.length == 0; }
      @property C front() in{ assert(s.length != 0); } body{ return s[0]; }
      void popFront() in{ assert(s.length != 0); } body{ s = s[1..$]; }
    }
    return No(s);
  }

its element type would be char for strings. Would this still result in conversions if I used it with other algorithms?
August 27, 2013
On Tuesday, 27 August 2013 at 11:43:29 UTC, Jason den Dulk wrote:
> On Sunday, 25 August 2013 at 19:38:52 UTC, Paolo Invernizzi wrote:
>
>> Thanks, somewhat unintuitive.
>
> It is a trap for the unwary, but in this case the benefits outweigh the costs.
>
> On Sunday, 25 August 2013 at 19:56:34 UTC, Jakob Ovrum wrote:
>
>> To get a range of UTF-8 or UTF-16 code units, the code units have to be represented as something other than `char` and `wchar`. For example, you can cast your string to immutable(ubyte)[] to operate on that, then cast it back at a later point.
>
> To have to use ubyte would seem to defeat the purpose of having char. If I were to have this:
>
>   auto no_convert(C)(C[] s) if (isSomeChar!C)
>   {
>     struct No
>     {
>       private C[] s;
>       this(C[] _s) { s = _s; }
>
>       @property bool empty() { return s.length == 0; }
>       @property C front() in{ assert(s.length != 0); } body{ return s[0]; }
>       void popFront() in{ assert(s.length != 0); } body{ s = s[1..$]; }
>     }
>     return No(s);
>   }
>
> its element type would be char for strings. Would this still result in conversions if I used it with other algorithms?

That should work. It's the functions in std.array that make ranges out of arrays by providing empty, front and popFront. As long as you don't use these, everything is fine.

Actually I think that your wrapper should do the conversion and std.array should not, but that train is long gone.
August 27, 2013
On Tuesday, 27 August 2013 at 11:43:29 UTC, Jason den Dulk wrote:
> On Sunday, 25 August 2013 at 19:38:52 UTC, Paolo Invernizzi wrote:
>
>> Thanks, somewhat unintuitive.
>
> It is a trap for the unwary, but in this case the benefits outweigh the costs.
>
> On Sunday, 25 August 2013 at 19:56:34 UTC, Jakob Ovrum wrote:
>
>> To get a range of UTF-8 or UTF-16 code units, the code units have to be represented as something other than `char` and `wchar`. For example, you can cast your string to immutable(ubyte)[] to operate on that, then cast it back at a later point.
>
> To have to use ubyte would seem to defeat the purpose of having char. If I were to have this:
>
>   auto no_convert(C)(C[] s) if (isSomeChar!C)
>   {
>     struct No
>     {
>       private C[] s;
>       this(C[] _s) { s = _s; }
>
>       @property bool empty() { return s.length == 0; }
>       @property C front() in{ assert(s.length != 0); } body{ return s[0]; }
>       void popFront() in{ assert(s.length != 0); } body{ s = s[1..$]; }
>     }
>     return No(s);
>   }
>
> its element type would be char for strings. Would this still result in conversions if I used it with other algorithms?

It might, but that range of yours is underwhelming: no indexing, no length, no nothing.

Why would you want to do *that* though? Is it because you have an ASCII string? In that case, you should be interested in std.encoding.AsciiChar and std.encoding.AsciiString.