ElementType!string
August 25, 2013
Apparently, ElementType!string evaluates to dchar. I would have expected char. Why is that?
August 25, 2013
On Sunday, 25 August 2013 at 19:25:08 UTC, qznc wrote:
> Apparently, ElementType!string evaluates to dchar. I would have expected char. Why is that?

I think it's because strings are iterated by ranges as dchar, which is equivalent to iterating over the Unicode symbols.

If they were iterated by char, you would get the individual pieces of the UTF-8 encoding during iteration, and usually that is not what a user expects.

Note, on the other hand, that static assert( is(typeof(" "[0]) == immutable(char)) ) holds, so you can iterate the string by chars using the index.
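A minimal sketch of the two views (the literal here is just illustrative):

```d
import std.range : front; // range primitives for strings

void main()
{
    string s = "Maß";

    // Indexing yields raw UTF-8 code units...
    static assert(is(typeof(s[0]) == immutable(char)));

    // ...while range iteration decodes to code points.
    static assert(is(typeof(s.front) == dchar));
    assert(s.front == 'M');
}
```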

- Paolo Invernizzi
August 25, 2013
On Sunday, 25 August 2013 at 19:38:52 UTC, Paolo Invernizzi wrote:
> On Sunday, 25 August 2013 at 19:25:08 UTC, qznc wrote:
>> Apparently, ElementType!string evaluates to dchar. I would have expected char. Why is that?
>
> I think it's because strings are iterated by ranges as dchar, which is equivalent to iterating over the Unicode symbols.
>
> If they were iterated by char, you would get the individual pieces of the UTF-8 encoding during iteration, and usually that is not what a user expects.
>
> Note, on the other hand, that static assert( is(typeof(" "[0]) == immutable(char)) ) holds, so you can iterate the string by chars using the index.
>
> - Paolo Invernizzi

Thanks, somewhat unintuitive.

This also seems to be the explanation for why the types documentation describes char as "unsigned 8 bit UTF-8", which is different from ubyte's "unsigned 8 bit".

Confirmed by this unittest:

import std.range : walkLength;

string raw = "Maß"; // 'ß' encodes as two UTF-8 code units
assert(raw.length == 4);      // code units
assert(walkLength(raw) == 3); // code points
August 25, 2013
On Sunday, 25 August 2013 at 19:25:08 UTC, qznc wrote:
> Apparently, ElementType!string evaluates to dchar. I would have expected char. Why is that?

It is mentioned in the documentation of `ElementType`. Use `std.range.ElementEncodingType` or `std.traits.ForeachType` to get `char` and `wchar` when given arrays of those two types.
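To illustrate the difference (a quick sketch):

```d
import std.range : ElementType, ElementEncodingType;
import std.traits : ForeachType;

// For narrow strings, the range element type is the decoded code point...
static assert(is(ElementType!string == dchar));
static assert(is(ElementType!wstring == dchar));

// ...while the encoding/foreach type is the raw code unit.
static assert(is(ElementEncodingType!string == immutable(char)));
static assert(is(ForeachType!string == immutable(char)));

// Non-string arrays are unaffected.
static assert(is(ElementType!(int[]) == int));
```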

As for the rationale:

`string`, being an alias for `immutable(char)[]`, is an array of UTF-8 code units - an array of `char`s. However, it is indeed a forward range of code points (represented as a UTF-32 code unit - `dchar`). It's a (slightly controversial) choice that was made to make Unicode-correct code the easiest and most intuitive to write, as code points are much more useful than code units.

Note that it is not a random-access range. UTF-8 is a variable length encoding, so several code units can be required to encode a single code point. Hence, a non-trivial search is required to get the n'th code point in a UTF-8 or UTF-16 string.
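A small sketch of what that walk looks like in practice (the literal is just an example):

```d
import std.range : drop, front, walkLength;

void main()
{
    string s = "Maß!"; // 'ß' is two code units, so 5 code units total

    // s[2] would give a raw code unit, not the third character.
    // To reach the n'th code point, the string must be walked:
    assert(s.drop(2).front == 'ß');

    // Counting code points is likewise O(n):
    assert(walkLength(s) == 4); // code points, while s.length == 5
}
```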

Another name for a code point is "character" (technically, a character is what the code point translates to in the UCS). However, it can be a deceptive name - the units we see on screen when rendered are "graphemes", as Unicode characters can be combining, zero-width etc.

To get a range of UTF-8 or UTF-16 code units, the code units have to be represented as something other than `char` and `wchar`. For example, you can cast your string to immutable(ubyte)[] to operate on that, then cast it back at a later point.
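For instance (a sketch of the round trip):

```d
void main()
{
    string s = "Maß";

    // View the raw code units as plain bytes; this is a reinterpreting
    // cast, no copying takes place.
    auto bytes = cast(immutable(ubyte)[]) s;
    assert(bytes.length == 4);
    assert(bytes[0] == 'M');

    // Cast back once done operating on code units.
    string t = cast(string) bytes;
    assert(t == s);
}
```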
August 25, 2013
qznc:

> Apparently, ElementType!string evaluates to dchar. I would have expected char. Why is that?

Try also ForeachType :-)

Bye,
bearophile
August 26, 2013
On Sunday, 25 August 2013 at 19:51:50 UTC, qznc wrote:
> Thanks, somewhat unintuitive.

Yes, but un-intuitive... to the un-initiated. By default, it's also safer. A string *is*, conceptually, a sequence of Unicode code points. The fact that it is made of UTF-8 code units is really just a low-level implementation detail.

Thanks to this behavior, things like:
string s = "日本語";
// search for '本' (front/popFront for strings come from std.range)
for ( ; s.front != '本'; s.popFront())
{}

Well, they *just work* (TM).

Now... If you *know* what you are doing, then by all means, iterate on the UTF8 code units. But be warned, you must really know what you are doing.

Back to your original subject, you can use:
ElementEncodingType!S

ElementEncodingType works just like ElementType, but for strings it *really* takes the array's element type. This is usually *not* the default you want.

Also related, foreach naturally iterates on code units by default (for some weird reason). I recommend explicitly iterating on dchar.
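For example, the element type in the foreach declaration controls which view you get:

```d
void main()
{
    string s = "Maß";

    size_t units;
    foreach (char c; s)  // explicit char: iterates code units (also the default)
        ++units;
    assert(units == 4);

    size_t points;
    foreach (dchar c; s) // explicit dchar: decodes to code points
        ++points;
    assert(points == 3);
}
```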
August 27, 2013
On Sunday, 25 August 2013 at 19:38:52 UTC, Paolo Invernizzi wrote:

> Thanks, somewhat unintuitive.

It is a trap for the unwary, but in this case the benefits outweigh the costs.

On Sunday, 25 August 2013 at 19:56:34 UTC, Jakob Ovrum wrote:

> To get a range of UTF-8 or UTF-16 code units, the code units have to be represented as something other than `char` and `wchar`. For example, you can cast your string to immutable(ubyte)[] to operate on that, then cast it back at a later point.

To have to use ubyte would seem to defeat the purpose of having char. If I were to have this:

  auto no_convert(C)(C[] s) if (isSomeChar!C)
  {
    struct No
    {
      private C[] s;
      this(C[] _s) { s = _s; }

      @property bool empty() { return s.length == 0; }
      @property C front() in{ assert(s.length != 0); } body{ return s[0]; }
      void popFront() in{ assert(s.length != 0); } body{ s = s[1..$]; }
    }
    return No(s);
  }

its element type would be char for strings. Would this still result in conversions if I used it with other algorithms?
August 27, 2013
On Tuesday, 27 August 2013 at 11:43:29 UTC, Jason den Dulk wrote:
> On Sunday, 25 August 2013 at 19:38:52 UTC, Paolo Invernizzi wrote:
>
>> Thanks, somewhat unintuitive.
>
> It is a trap for the unwary, but in this case the benefits outweigh the costs.
>
> On Sunday, 25 August 2013 at 19:56:34 UTC, Jakob Ovrum wrote:
>
>> To get a range of UTF-8 or UTF-16 code units, the code units have to be represented as something other than `char` and `wchar`. For example, you can cast your string to immutable(ubyte)[] to operate on that, then cast it back at a later point.
>
> To have to use ubyte would seem to defeat the purpose of having char. If I were to have this:
>
>   auto no_convert(C)(C[] s) if (isSomeChar!C)
>   {
>     struct No
>     {
>       private C[] s;
>       this(C[] _s) { s = _s; }
>
>       @property bool empty() { return s.length == 0; }
>       @property C front() in{ assert(s.length != 0); } body{ return s[0]; }
>       void popFront() in{ assert(s.length != 0); } body{ s = s[1..$]; }
>     }
>     return No(s);
>   }
>
> its element type would be char for strings. Would this still result in conversions if I used it with other algorithms?

That should work. It's the functions in std.array that make ranges out of arrays by providing empty, front and popFront. As long as you don't use these, everything is fine.

Actually I think that your wrapper should do the conversion and std.array should not, but that train is long gone.
August 27, 2013
On Tuesday, 27 August 2013 at 11:43:29 UTC, Jason den Dulk wrote:
> On Sunday, 25 August 2013 at 19:38:52 UTC, Paolo Invernizzi wrote:
>
>> Thanks, somewhat unintuitive.
>
> It is a trap for the unwary, but in this case the benefits outweigh the costs.
>
> On Sunday, 25 August 2013 at 19:56:34 UTC, Jakob Ovrum wrote:
>
>> To get a range of UTF-8 or UTF-16 code units, the code units have to be represented as something other than `char` and `wchar`. For example, you can cast your string to immutable(ubyte)[] to operate on that, then cast it back at a later point.
>
> To have to use ubyte would seem to defeat the purpose of having char. If I were to have this:
>
>   auto no_convert(C)(C[] s) if (isSomeChar!C)
>   {
>     struct No
>     {
>       private C[] s;
>       this(C[] _s) { s = _s; }
>
>       @property bool empty() { return s.length == 0; }
>       @property C front() in{ assert(s.length != 0); } body{ return s[0]; }
>       void popFront() in{ assert(s.length != 0); } body{ s = s[1..$]; }
>     }
>     return No(s);
>   }
>
> its element type would be char for strings. Would this still result in conversions if I used it with other algorithms?

It might, but that range of yours is underwhelming: no indexing, no length, no nothing.

Why would you want to do *that* though? Is it because you have an ASCII string? In that case, you should be interested in std.encoding.AsciiChar and std.encoding.AsciiString.