typeof(string.front) should be char - D Programming Language Discussion Forum

March 03, 2012

typeof(string.front) should be char

Posted by Piotr Szturmaj

Hello,

For this code:

    import std.range; // provides front for arrays

    auto c = "test"c;
    auto w = "test"w;
    auto d = "test"d;
    pragma(msg, typeof(c.front));
    pragma(msg, typeof(w.front));
    pragma(msg, typeof(d.front));

compiler prints:

dchar
dchar
immutable(dchar)

IMO it should print this:

immutable(char)
immutable(wchar)
immutable(dchar)

Is it a bug?

March 03, 2012

Re: typeof(string.front) should be char

Posted by Ali Çehreli
in reply to Piotr Szturmaj

On 03/02/2012 06:30 PM, Piotr Szturmaj wrote:
> Hello,
>
> For this code:
>
> auto c = "test"c;
> auto w = "test"w;
> auto d = "test"d;
> pragma(msg, typeof(c.front));
> pragma(msg, typeof(w.front));
> pragma(msg, typeof(d.front));
>
> compiler prints:
>
> dchar
> dchar
> immutable(dchar)
>
> IMO it should print this:
>
> immutable(char)
> immutable(wchar)
> immutable(dchar)
>
> Is it a bug?

No, that's by design. When used as input ranges, slices of any character type are exposed as ranges of dchar.
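As a rough sketch of what that means at the type level, the checks below restate the observed output as compile-time assertions. They assume a Phobos version in which narrow strings are still auto-decoded by the range primitives; indexing is shown only for contrast.

```d
import std.range : front, ElementType;

void main()
{
    auto c = "test"c;
    auto w = "test"w;
    auto d = "test"d;

    // The range primitive front decodes narrow strings to dchar.
    static assert(is(typeof(c.front) == dchar));
    static assert(is(typeof(w.front) == dchar));
    // A dstring needs no decoding, so front exposes the stored element.
    static assert(is(typeof(d.front) == immutable(dchar)));

    // ElementType follows front, not the array's element type.
    static assert(is(ElementType!string == dchar));

    // Plain indexing still yields the code unit.
    static assert(is(typeof(c[0]) == immutable(char)));
}
```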

Ali

March 03, 2012

Re: typeof(string.front) should be char

Posted by Jonathan M Davis
in reply to Ali Çehreli

On Friday, March 02, 2012 20:41:35 Ali Çehreli wrote:
> On 03/02/2012 06:30 PM, Piotr Szturmaj wrote:
>  > Hello,
>  >
>  > For this code:
>  >
>  > auto c = "test"c;
>  > auto w = "test"w;
>  > auto d = "test"d;
>  > pragma(msg, typeof(c.front));
>  > pragma(msg, typeof(w.front));
>  > pragma(msg, typeof(d.front));
>  >
>  > compiler prints:
>  >
>  > dchar
>  > dchar
>  > immutable(dchar)
>  >
>  > IMO it should print this:
>  >
>  > immutable(char)
>  > immutable(wchar)
>  > immutable(dchar)
>  >
>  > Is it a bug?
> 
> No, that's by design. When used as InputRange ranges, slices of any character type are exposed as ranges of dchar.

Indeed.

Strings are always treated as ranges of dchar, because it generally makes no sense to operate on individual chars or wchars. A char is a UTF-8 code unit. A wchar is a UTF-16 code unit. And a dchar is a UTF-32 code unit. The _only_ one of those which is guaranteed to be a code point is dchar, since in UTF-32, all code points are a single code unit. If you were to operate on individual chars or wchars, you'd be operating on pieces of characters rather than whole characters, which wreaks havoc with Unicode.
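A small illustration of the code unit versus code point distinction, as a sketch only: the sample strings are arbitrary, .length counts code units, and std.range's walkLength counts what the range interface yields (code points, assuming auto-decoding).

```d
import std.range : walkLength;

void main()
{
    string  s = "h\u00E9";       // 'h' plus U+00E9 (e with acute accent)
    wstring w = "\U0001F600"w;   // a code point outside the BMP
    dstring d = "\U0001F600"d;

    assert(s.length == 3);       // UTF-8: 1 + 2 code units
    assert(s.walkLength == 2);   // but only 2 code points

    assert(w.length == 2);       // UTF-16: one surrogate pair
    assert(w.walkLength == 1);

    assert(d.length == 1);       // UTF-32: one code unit per code point
}
```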

Now, technically speaking, a code point isn't necessarily a full character, since you can also combine code points (e.g. adding a subscript to a letter), and a full character is what's called a grapheme, and unfortunately, at the moment, Phobos doesn't have a way to operate on graphemes, but operating on code points is _far_ more correct than operating on code units. It's also more efficient.

Unfortunately, in order to code completely efficiently with Unicode, you have to understand quite a bit about it, which most programmers don't, but by operating on ranges of code points, Phobos manages to be correct in the majority of cases.

So, yes. It's very much on purpose that all strings are treated as ranges of dchar.

- Jonathan M Davis

March 03, 2012

Re: typeof(string.front) should be char

Posted by Jacob Carlborg
in reply to Piotr Szturmaj

On 2012-03-03 03:30, Piotr Szturmaj wrote:
> Hello,
>
> For this code:
>
> auto c = "test"c;
> auto w = "test"w;
> auto d = "test"d;
> pragma(msg, typeof(c.front));
> pragma(msg, typeof(w.front));
> pragma(msg, typeof(d.front));
>
> compiler prints:
>
> dchar
> dchar
> immutable(dchar)

I thought all these would be either "dchar" or "immutable(dchar)". Why are they of different types?

> IMO it should print this:
>
> immutable(char)
> immutable(wchar)
> immutable(dchar)
>
> Is it a bug?


-- 
/Jacob Carlborg

March 03, 2012

Re: typeof(string.front) should be char

Posted by Piotr Szturmaj
in reply to Jonathan M Davis

Jonathan M Davis wrote:
> On Friday, March 02, 2012 20:41:35 Ali Çehreli wrote:
>> On 03/02/2012 06:30 PM, Piotr Szturmaj wrote:
>>   >  Hello,
>>   >
>>   >  For this code:
>>   >
>>   >  auto c = "test"c;
>>   >  auto w = "test"w;
>>   >  auto d = "test"d;
>>   >  pragma(msg, typeof(c.front));
>>   >  pragma(msg, typeof(w.front));
>>   >  pragma(msg, typeof(d.front));
>>   >
>>   >  compiler prints:
>>   >
>>   >  dchar
>>   >  dchar
>>   >  immutable(dchar)
>>   >
>>   >  IMO it should print this:
>>   >
>>   >  immutable(char)
>>   >  immutable(wchar)
>>   >  immutable(dchar)
>>   >
>>   >  Is it a bug?
>>
>> No, that's by design. When used as InputRange ranges, slices of any
>> character type are exposed as ranges of dchar.
>
> Indeed.
>
> Strings are always treated as ranges of dchar, because it generally makes no
> sense to operate on individual chars or wchars. A char is a UTF-8 code unit. A
> wchar is a UTF-16 code unit. And a dchar is a UTF-32 code unit. The _only_ one
> of those which is guaranteed to be a code point is dchar, since in UTF-32, all
> code points are a single code unit. If you were to operate on individual chars
> or wchars, you'd be operating on pieces of characters rather than whole
> characters, which wreaks havoc with Unicode.
>
> Now, technically speaking, a code point isn't necessarily a full character,
> since you can also combine code points (e.g. adding a subscript to a letter),
> and a full character is what's called a grapheme, and unfortunately, at the
> moment, Phobos doesn't have a way to operate on graphemes, but operating on
> code points is _far_ more correct than operating on code units. It's also more
> efficient.
>
> Unfortunately, in order to code completely efficiently with Unicode, you have to
> understand quite a bit about it, which most programmers don't, but by
> operating on ranges of code points, Phobos manages to be correct in the
> majority of cases.

I know about Unicode, code units/points and their encoding.

> So, yes. It's very much on purpose that all strings are treated as ranges of
> dchar.

Foreach gives the opportunity to handle any string by char, wchar, or dchar; the default dchar is appropriate here, but why for ranges?

I was afraid it was on purpose, because it has some bad consequences. It breaks genericity when dealing with ranges. Consider a custom range of char:
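For reference, a minimal sketch of how foreach picks the iteration unit from the declared type of the loop variable. The helper name countAs is made up for this example, and the sample string is arbitrary.

```d
// Hypothetical helper: counts how many items foreach yields
// when the loop variable is declared with type T.
size_t countAs(T)(string s)
{
    size_t n;
    foreach (T c; s)
        ++n;
    return n;
}

void main()
{
    string s = "h\u00E9";   // 'h' plus U+00E9, which takes two UTF-8 code units

    assert(countAs!char(s)  == 3);   // raw UTF-8 code units
    assert(countAs!wchar(s) == 2);   // transcoded to UTF-16 on the fly
    assert(countAs!dchar(s) == 2);   // decoded code points

    // With no explicit type, the loop variable takes the array's element type.
    foreach (c; s)
        static assert(is(typeof(c) == immutable(char)));
}
```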

struct CharRange
{
    @property bool empty();
    @property char front();
    void popFront();
}

typeof(CharRange.front) and ElementType!CharRange both yield _char_, while for string they yield _dchar_. This discrepancy pushes the range writer to handle strings as a special case. I'm currently trying to write a ByDchar range:

template ByDchar(R)
     if (isInputRange!R && isSomeChar!(ElementType!R))
{
    alias ElementType!R E;
    static if (is(E == dchar))
        alias R ByDchar;
    else static if (is(E == char))
    {
        struct ByDchar
        {
            ...
        }
    }
    else static if (is(E == wchar))
    {
        ...
    }
}

The problem with that range is that when it takes a string type, it aliases itself to that type, because ElementType!R yields dchar. This is why I'm talking about "bad consequences": I just want to iterate a string by _char_, not _dchar_.
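To make the discrepancy concrete, here is a sketch with a filled-in CharRange (the data field and the method bodies are assumptions for illustration, not part of the declaration above). It also shows std.utf.byCodeUnit, a Phobos function added after this thread, as one possible way to iterate a string by code unit through the range interface.

```d
import std.range : isInputRange, ElementType;
import std.traits : Unqual;
import std.utf : byCodeUnit;

// Hypothetical concrete version of the CharRange declared above.
struct CharRange
{
    const(char)[] data;

    @property bool empty() const { return data.length == 0; }
    @property char front() const { return data[0]; }
    void popFront() { data = data[1 .. $]; }
}

static assert(isInputRange!CharRange);

// The custom range reports char, the built-in string reports dchar.
static assert(is(ElementType!CharRange == char));
static assert(is(ElementType!string == dchar));

// Wrapping the string exposes its code units instead.
static assert(is(Unqual!(ElementType!(typeof("test".byCodeUnit()))) == char));

void main()
{
    size_t n;
    foreach (c; "h\u00E9".byCodeUnit)
        ++n;
    assert(n == 3);   // three UTF-8 code units, no decoding
}
```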

March 03, 2012

Re: typeof(string.front) should be char

Posted by Ali Çehreli
in reply to Jacob Carlborg

On 03/03/2012 04:36 AM, Jacob Carlborg wrote:
> On 2012-03-03 03:30, Piotr Szturmaj wrote:
>> Hello,
>>
>> For this code:
>>
>> auto c = "test"c;
>> auto w = "test"w;
>> auto d = "test"d;
>> pragma(msg, typeof(c.front));
>> pragma(msg, typeof(w.front));
>> pragma(msg, typeof(d.front));
>>
>> compiler prints:
>>
>> dchar
>> dchar
>> immutable(dchar)
>
> I thought all these would be either "dchar" or "immutable(dchar)". Why
> are they of different types?

In the case of char and wchar slices, the "elements" are decoded as the iteration happens. In other words, the returned values are not actual elements of the ranges.
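A sketch of that decoding step, assuming auto-decoding Phobos: for a narrow string, front produces the same value as an explicit std.utf.decode call, and popFront skips every code unit of that code point. The hasLvalueElements checks reflect that the decoded value is a temporary rather than a stored element.

```d
import std.range : front, popFront, hasLvalueElements;
import std.utf : decode;

// front on a narrow string returns a freshly decoded value,
// while a dstring hands out its stored elements.
static assert(!hasLvalueElements!string);
static assert( hasLvalueElements!dstring);

void main()
{
    string s = "\u00E9x";   // U+00E9 (two UTF-8 code units) followed by 'x'

    size_t i = 0;
    dchar decoded = decode(s, i);   // explicit decode, advances i
    assert(decoded == s.front);     // same code point that front reports
    assert(i == 2);                 // two code units were consumed

    s.popFront();                   // drops both code units at once
    assert(s == "x");
}
```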

>
>> IMO it should print this:
>>
>> immutable(char)
>> immutable(wchar)
>> immutable(dchar)
>>
>> Is it a bug?
>
>

Ali

March 03, 2012

Re: typeof(string.front) should be char

Posted by Ali Çehreli
in reply to Piotr Szturmaj

On 03/03/2012 05:57 AM, Piotr Szturmaj wrote:
> Consider a custom range of
> char:
>
> struct CharRange
> {
> @property bool empty();
> @property char front();
> void popFront();
> }
>
> typeof(CharRange.front) and ElementType!CharRange both return _char_

Yes, and I would expect both to be the same type.

> while for string they return _dchar_. This discrepancy pushes the range
> writer to handle special string cases.

Yes, Phobos faces the same issues.

> I'm currently trying to write
> ByDchar range:
>
> template ByDchar(R)
> if (isInputRange!R && isSomeChar!(ElementType!R))
> {
> alias ElementType!R E;
> static if (is(E == dchar))
> alias R ByDchar;
> else static if (is(E == char))
> {
> struct ByDchar
> {
> ...
> }
> }
> else static if (is(E == wchar))
> {
> ...
> }
> }
>
> The problem with that range is when it takes a string type, it aliases
> this type with itself, because ElementType!R yields dchar. This is why
> I'm talking about "bad consequences", I just want to iterate string by
> _char_, not _dchar_.

In case you don't know already, there are std.traits.isNarrowString, std.traits.ForeachType, etc., which may be useful.
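A sketch of how those traits could be combined to recover the code unit type. The CodeUnit template is a made-up helper for illustration; std.range's ElementEncodingType, shown at the end, appears to serve the same purpose.

```d
import std.range : ElementType, ElementEncodingType;
import std.traits : ForeachType, isNarrowString, Unqual;

static assert( isNarrowString!string);
static assert( isNarrowString!wstring);
static assert(!isNarrowString!dstring);

// ForeachType reports what a plain foreach yields: the code unit.
static assert(is(ForeachType!string == immutable(char)));
// ElementType reports what front yields: the decoded code point.
static assert(is(ElementType!string == dchar));

// Hypothetical helper: the unqualified code unit type of a range or string.
template CodeUnit(R)
{
    static if (isNarrowString!R)
        alias CodeUnit = Unqual!(ForeachType!R);
    else
        alias CodeUnit = Unqual!(ElementType!R);
}

static assert(is(CodeUnit!string  == char));
static assert(is(CodeUnit!wstring == wchar));
static assert(is(CodeUnit!dstring == dchar));

// Phobos offers a similar trait out of the box.
static assert(is(ElementEncodingType!string == immutable(char)));

void main() {}
```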

Ali

March 03, 2012

Re: typeof(string.front) should be char

Posted by Timon Gehr
in reply to Jonathan M Davis

On 03/03/2012 09:40 AM, Jonathan M Davis wrote:
> ...  but operating on
> code points is _far_ more correct than operating on code units. It's also more
> efficient.
> [snip.]

No, it is less efficient.

March 03, 2012

Re: typeof(string.front) should be char

Posted by Jacob Carlborg
in reply to Ali Çehreli

On 2012-03-03 15:10, Ali Çehreli wrote:
> On 03/03/2012 04:36 AM, Jacob Carlborg wrote:
>> On 2012-03-03 03:30, Piotr Szturmaj wrote:
>>> Hello,
>>>
>>> For this code:
>>>
>>> auto c = "test"c;
>>> auto w = "test"w;
>>> auto d = "test"d;
>>> pragma(msg, typeof(c.front));
>>> pragma(msg, typeof(w.front));
>>> pragma(msg, typeof(d.front));
>>>
>>> compiler prints:
>>>
>>> dchar
>>> dchar
>>> immutable(dchar)
>>
>> I thought all these would be either "dchar" or "immutable(dchar)". Why
>> are they of different types?
>
> In the case of char and wchar slices, the "elements" are decoded as the
> iteration happens. In other words, the returned values are not actual
> elements of the ranges.

Ah, I see, thanks.

-- 
/Jacob Carlborg

March 03, 2012

Re: typeof(string.front) should be char

Posted by Jonathan M Davis
in reply to Timon Gehr

On Saturday, March 03, 2012 18:38:44 Timon Gehr wrote:
> On 03/03/2012 09:40 AM, Jonathan M Davis wrote:
> > ...  but operating on
> > code points is _far_ more correct than operating on code units. It's also
> > more efficient.
> > [snip.]
> 
> No, it is less efficient.

Operating on code points is more efficient than operating on graphemes is what I meant. I can see that I wasn't clear enough on that.

It's more correct than operating on code units and less correct than operating on graphemes, while it's less efficient than operating on code units and more efficient than operating on graphemes.
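The three levels can be put side by side in a short sketch. It relies on std.uni.byGrapheme, which was added to Phobos after this thread, and the sample string is an arbitrary choice: an 'e' followed by a combining acute accent.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "e\u0301";   // 'e' plus U+0301 (combining acute accent)

    assert(s.length == 3);                  // code units: O(1), no decoding
    assert(s.walkLength == 2);              // code points: decodes while walking
    assert(s.byGrapheme.walkLength == 1);   // graphemes: also groups combining marks
}
```

Each step up does more work per element, which matches the ordering described above.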

- Jonathan M Davis

Copyright © 1999-2021 by the D Language Foundation