[Issue 6458] New: Multibyte char literals shouldn't implicitly convert to char
August 09, 2011
http://d.puremagic.com/issues/show_bug.cgi?id=6458

           Summary: Multibyte char literals shouldn't implicitly convert
                    to char
           Product: D
           Version: D2
          Platform: Other
        OS/Version: Windows
            Status: NEW
          Severity: normal
          Priority: P2
         Component: DMD
        AssignedTo: nobody@puremagic.com
        ReportedBy: clugdbug@yahoo.com.au


--- Comment #0 from Don <clugdbug@yahoo.com.au> 2011-08-08 21:43:38 PDT ---
The code below should either be rejected, or work correctly.
The particularly problematic case is:   s[0..2] = 'ä', which looks perfectly
reasonable, but creates garbage.
I'm a bit confused about non-ASCII char literals, since although they are typed
as 'char', they can't be stored in a char... This just seems wrong.

----
int bug6458()
{
    char [] s = "abcdef".dup;
    s[0] = 'ä';
    assert(s == "äcdef");
    return 34;
}
void main()
{
    bug6458();
}

Surely this has been reported before, but I can't find it.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
August 09, 2011
http://d.puremagic.com/issues/show_bug.cgi?id=6458


Jonathan M Davis <jmdavisProg@gmx.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jmdavisProg@gmx.com


--- Comment #1 from Jonathan M Davis <jmdavisProg@gmx.com> 2011-08-08 21:53:05 PDT ---
Personally, I think that all character literals should be typed as dchar, since it's generally a _bad_ idea to operate on individual chars or wchars. Normally, the only places where chars or wchars should be used are in ranges of chars or wchars (which would normally be arrays). But making character literals dchar by default might break too much code at this point. Though, since it should be possible to use range propagation to verify whether a particular code point will fit in a particular code unit, the breakage might be minimal.

Regardless, I actually never would have expected s[0 .. 2] = 'ä' to work, since you're assigning a character to multiple characters as far as types go, though I can see why you might think that it would work or why it arguably _should_ work. Obviously though, if the compiler is allowing you to assign a code point to multiple code units like that, it should only compile if it can verify that the code point will fit exactly in those code units, and if it does compile, it should work correctly rather than generate garbage. So, there are several issues at work here, it seems.

August 09, 2011
http://d.puremagic.com/issues/show_bug.cgi?id=6458


Don <clugdbug@yahoo.com.au> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |accepts-invalid


--- Comment #2 from Don <clugdbug@yahoo.com.au> 2011-08-08 22:27:32 PDT ---
(In reply to comment #1)
> Personally, I think that all character literals should be typed as dchar, since it's generally a _bad_ idea to operate on individual chars or wchars. Normally, the only places where chars or wchars should be used are in ranges of chars or wchars (which would normally be arrays). But making character literals dchar by default might break too much code at this point. Though, since it should be possible to use range propagation to verify whether a particular code point will fit in a particular code unit, the breakage might be minimal.

Oddly, this passes:
static assert('ä'.sizeof == 2);
So there's something a bit nonsensical about the whole thing.

> Regardless, I actually never would have expected s[0 .. 2] = 'ä' to work, since you're assigning a character to multiple characters as far as types go,

It's more subtle. This is block assignment.
s[0..4] = 'a'; works, and creates "aaaa".
s[0..4] = 'ä' is expected to fill the string with ä, creating "ää".
Instead, it fills it with four copies of the first UTF-8 byte of ä, creating an
invalid string.
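
The block-assignment behaviour described above can be sketched roughly as follows. This is an illustration rather than compiler output. The helper names (fillAscii, fillFirstByte, isValidUtf8) are hypothetical. The lead byte 0xC3 is written out by hand, on the assumption that ä means the precomposed code point U+00E4, whose UTF-8 encoding is 0xC3 0xA4.

```d
import std.utf : validate, UTFException;

// Block assignment with an ASCII literal: every element of the slice
// receives the same single code unit.
char[] fillAscii()
{
    char[] s = "abcdef".dup;
    s[0 .. 4] = 'a';
    return s;
}

// Hypothetical reconstruction of the reported result: the slice is
// filled with four copies of the first UTF-8 code unit of U+00E4.
char[] fillFirstByte()
{
    char[] s = "abcdef".dup;
    s[0 .. 4] = cast(char) 0xC3;
    return s;
}

// Returns true when s is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

unittest
{
    assert(fillAscii() == "aaaaef");
    // A lead byte followed by another lead byte is not well-formed UTF-8.
    assert(!isValidUtf8(fillFirstByte()));
}
```

Compiling with -unittest -main runs the checks. The second assertion shows why the result is garbage: each 0xC3 announces a two-unit sequence that never gets its continuation unit.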

August 09, 2011
http://d.puremagic.com/issues/show_bug.cgi?id=6458



--- Comment #3 from Jonathan M Davis <jmdavisProg@gmx.com> 2011-08-08 22:33:20 PDT ---
Ah, yes. I forgot that you could assign a single value to every element in an array like that. That being the case, it should just fail to compile given that the code point is not going to fit in each of the elements of the array. But regardless, something odd is definitely going on here given that 'ä'.sizeof == 2. It's probably an edge case which wasn't caught, since the only types which take up multiple elements like that are char and wchar.

August 09, 2011
http://d.puremagic.com/issues/show_bug.cgi?id=6458


changlon <changlon@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |changlon@gmail.com


--- Comment #4 from changlon <changlon@gmail.com> 2011-08-08 23:13:53 PDT ---
s[0..3] = 'a';

Should this raise an exception?

August 09, 2011
http://d.puremagic.com/issues/show_bug.cgi?id=6458



--- Comment #5 from changlon <changlon@gmail.com> 2011-08-08 23:14:35 PDT ---
(In reply to comment #4)
> s[0..3] = 'a';
> 
> Should this raise an exception?

Sorry, I mean s[0..3] = 'ä';

August 09, 2011
http://d.puremagic.com/issues/show_bug.cgi?id=6458



--- Comment #6 from Jonathan M Davis <jmdavisProg@gmx.com> 2011-08-08 23:19:15 PDT ---
It shouldn't even compile, because the types don't match. Even with range propagation, the best that you'll do with 'ä' is fit it in a wchar, so it won't fit in a char, and so you _can't_ assign it to each element of s[0 .. 3] like that. s[0 .. 3] = "ä"[] should work, but s[0 .. 3] = 'ä' definitely shouldn't.
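
A minimal sketch of the string-slice form of the assignment, assuming ä is the precomposed code point U+00E4 (two UTF-8 code units). The function name overwriteWithString is hypothetical. Note that D's array copy requires both sides to have the same length, so the sketch uses a two-element slice rather than s[0 .. 3].

```d
// Copies the UTF-8 code units of a one-character string over the start
// of a mutable char array. "\u00E4" is two code units long (0xC3, 0xA4),
// so the destination slice must be exactly two elements long as well.
char[] overwriteWithString()
{
    char[] s = "abcdef".dup;
    s[0 .. 2] = "\u00E4";
    return s;
}

unittest
{
    char[] r = overwriteWithString();
    assert(r == "\u00E4cdef");
    assert(r.length == 6);              // still six code units
    assert(cast(ubyte) r[0] == 0xC3);
    assert(cast(ubyte) r[1] == 0xA4);
}
```

Here the types match (an array of char copied into an array of char), which is the distinction being drawn from assigning a single character literal to every element.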

August 09, 2011
http://d.puremagic.com/issues/show_bug.cgi?id=6458


Jacob Carlborg <doob@me.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |doob@me.com


--- Comment #7 from Jacob Carlborg <doob@me.com> 2011-08-08 23:44:22 PDT ---
As far as I can see, D uses the smallest type necessary to fit a character literal. So all non-ASCII character literals will be either wchar or dchar. Both of the following pass, as expected.

static assert(is(typeof('ä') == wchar));
static assert(is(typeof('a') == char));

But I don't know why the compiler allows assigning a wchar to a char array element. That doesn't seem right.

August 09, 2011
http://d.puremagic.com/issues/show_bug.cgi?id=6458



--- Comment #8 from Don <clugdbug@yahoo.com.au> 2011-08-09 00:09:02 PDT ---
(In reply to comment #7)
> As far as I can see, D uses the smallest type necessary to fit a character literal. So all non-ASCII character literals will be either wchar or dchar. Both of the following pass, as expected.
> 
> static assert(is(typeof('ä') == wchar));
> static assert(is(typeof('a') == char));

That's good news. Seems like it's only a few cases where it behaves stupidly.

> But I don't know why the compiler allows assigning a wchar to a char array element. That doesn't seem right.

It's more general than that:

   wchar w = 'ä';
   char c = w; // Error: cannot implicitly convert expression (w) of type wchar to char


   char c = 'ä'; // passes!!!

January 31, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=6458


yebblies <yebblies@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |yebblies@gmail.com
           Platform|Other                       |All
         AssignedTo|nobody@puremagic.com        |yebblies@gmail.com
         OS/Version|Windows                     |All


--- Comment #10 from yebblies <yebblies@gmail.com> 2012-01-31 15:24:30 EST ---
(In reply to comment #9)
> 
> The compiler complains about the code above, just as it should, because a long won't fit in an int. Don't know why character literals are treated differently.

They aren't.  The problem is that 'ä' evaluates to 0x00E4, and a bug in integer range propagation thinks this is ok to convert back to a char.
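
To illustrate why the numeric value 0x00E4 is not the same thing as a UTF-8 code unit, here is a small sketch using std.utf.encode. The helper name utf8Units is hypothetical, and this only shows the encoding itself, not how the compiler's range propagation is implemented.

```d
import std.utf : encode;

// Returns the UTF-8 code units that encode the code point c.
char[] utf8Units(dchar c)
{
    char[4] buf;
    immutable n = encode(buf, c);   // number of code units written
    return buf[0 .. n].dup;
}

unittest
{
    // ASCII code points occupy a single code unit.
    assert(utf8Units('a').length == 1);

    // U+00E4 needs two code units, 0xC3 0xA4, neither of which is 0xE4.
    char[] units = utf8Units('\u00E4');
    assert(units.length == 2);
    assert(cast(ubyte) units[0] == 0xC3);
    assert(cast(ubyte) units[1] == 0xA4);
}
```

The value 0xE4 fits numerically in eight bits, which is presumably why a purely integer-based range check accepts the conversion, but a lone 0xE4 stored in a char is not a valid encoding of ä.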
