[Issue 6458] New: Multibyte char literals shouldn't implicitly convert to char - D Programming Language Discussion Forum


Thread overview

[Issue 6458] New: Multibyte char literals shouldn't implicitly convert to char
Aug 09, 2011 Don
Aug 09, 2011 Jonathan M Davis
Aug 09, 2011 Don
Aug 09, 2011 Jonathan M Davis
Aug 09, 2011 changlon
Aug 09, 2011 changlon
Aug 09, 2011 Jonathan M Davis
Aug 09, 2011 Jacob Carlborg
Aug 09, 2011 Don
Jan 31, 2012 yebblies
Jan 31, 2012 yebblies
Feb 01, 2012 yebblies
Apr 20, 2012 SomeDude
Jul 19, 2012 Kenji Hara
Jul 21, 2012 Kenji Hara
Jul 21, 2012 Kenji Hara

August 09, 2011

[Issue 6458] New: Multibyte char literals shouldn't implicitly convert to char

Posted by Don

http://d.puremagic.com/issues/show_bug.cgi?id=6458

           Summary: Multibyte char literals shouldn't implicitly convert
                    to char
           Product: D
           Version: D2
          Platform: Other
        OS/Version: Windows
            Status: NEW
          Severity: normal
          Priority: P2
         Component: DMD
        AssignedTo: nobody@puremagic.com
        ReportedBy: clugdbug@yahoo.com.au


--- Comment #0 from Don <clugdbug@yahoo.com.au> 2011-08-08 21:43:38 PDT ---
The code below should either be rejected, or work correctly.
The particularly problematic case is:   s[0..2] = 'ä', which looks perfectly
reasonable, but creates garbage.
I'm a bit confused about non-ASCII char literals, since although they are typed
as 'char', they can't be stored in a char... This just seems wrong.

----
int bug6458()
{
    char [] s = "abcdef".dup;
    s[0] = 'ä';
    assert(s == "äcdef");
    return 34;
}
void main()
{
    bug6458();
}

Surely this has been reported before, but I can't find it.
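For reference, a minimal sketch of why 'ä' cannot be stored in a single char: encoded as UTF-8 it takes two code units. The helper name `utf8Length` is hypothetical; `std.utf.encode` is the Phobos function used to do the encoding.

```d
import std.utf : encode;

// Hypothetical helper: number of UTF-8 code units needed for one code point.
size_t utf8Length(dchar c)
{
    char[4] buf;
    return encode(buf, c); // encode returns how many code units it wrote
}

void main()
{
    assert(utf8Length('a') == 1);
    assert(utf8Length('ä') == 2);

    // U+00E4 is encoded as the two bytes 0xC3 0xA4.
    char[4] buf;
    immutable n = encode(buf, 'ä');
    assert(buf[0 .. n] == "ä");
    assert(buf[0] == 0xC3 && buf[1] == 0xA4);
}
```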

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

August 09, 2011

[Issue 6458] Multibyte char literals shouldn't implicitly convert to char

Posted by Jonathan M Davis
in reply to Don


Jonathan M Davis <jmdavisProg@gmx.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jmdavisProg@gmx.com


--- Comment #1 from Jonathan M Davis <jmdavisProg@gmx.com> 2011-08-08 21:53:05 PDT ---
Personally, I think that all character literals should be typed as dchar, since it's generally a _bad_ idea to operate on individual chars or wchars. Normally, the only places where chars or wchars should be used are ranges of chars or wchars (which would normally be arrays). But making character literals dchar by default might break too much code at this point. Though, since it should be possible to use range propagation to verify whether a particular code point will fit in a particular code unit, the breakage might be minimal.

Regardless, I actually never would have expected s[0 .. 2] = 'ä' to work, since you're assigning a character to multiple characters as far as types go, though I can see why you might think that it would work or why it arguably _should_ work. Obviously though, if the compiler is allowing you to assign a code point to multiple code units like that, it should only compile if it can verify that the code point will fit exactly in those code units, and if it does compile, it should work correctly rather than generate garbage. So, there seem to be several issues at work here.


August 09, 2011

[Issue 6458] Multibyte char literals shouldn't implicitly convert to char

Posted by Don
in reply to Don


Don <clugdbug@yahoo.com.au> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |accepts-invalid


--- Comment #2 from Don <clugdbug@yahoo.com.au> 2011-08-08 22:27:32 PDT ---
(In reply to comment #1)
> Personally, I think that all character literals should be typed as dchar, since it's generally a _bad_ idea to operate on individual chars or wchars. Normally, the only places where chars or wchars should be used are ranges of chars or wchars (which would normally be arrays). But making character literals dchar by default might break too much code at this point. Though, since it should be possible to use range propagation to verify whether a particular code point will fit in a particular code unit, the breakage might be minimal.

Oddly, this passes:
static assert('ä'.sizeof == 2);
So there's something a bit nonsensical about the whole thing.

> Regardless, I actually never would have expected s[0 .. 2] = 'ä' to work, since you're assigning a character to multiple characters as far as types go,

It's more subtle. This is block assignment.
s[0..4] = 'a'; works, and creates "aaaa".
s[0..4] = 'ä' is expected to fill the string with ä, creating "ää".
Instead, it fills it with four copies of the first UTF-8 byte of ä, creating an
invalid string.
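A rough sketch of the block assignment described above. To avoid depending on how any particular compiler version treats s[0..4] = 'ä', it passes the first UTF-8 byte of ä (0xC3) explicitly; the helper name `fillPrefix` is hypothetical.

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

// Hypothetical helper: fill the first n code units of a copy of "abcdef".
char[] fillPrefix(char unit, size_t n)
{
    char[] s = "abcdef".dup;
    s[0 .. n] = unit; // block assignment: every element gets the same code unit
    return s;
}

void main()
{
    // With an ASCII code unit the result is what one would expect.
    assert(fillPrefix('a', 4) == "aaaaef");

    // 0xC3 is only the lead byte of 'ä'; repeating it is not valid UTF-8.
    char[] broken = fillPrefix(cast(char) 0xC3, 4);
    assertThrown!UTFException(validate(broken));
}
```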


August 09, 2011

[Issue 6458] Multibyte char literals shouldn't implicitly convert to char

Posted by Jonathan M Davis
in reply to Don


--- Comment #3 from Jonathan M Davis <jmdavisProg@gmx.com> 2011-08-08 22:33:20 PDT ---
Ah, yes. I forgot that you could assign a single value to every element in an array like that. That being the case, it should just fail to compile given that the code point is not going to fit in each of the elements of the array. But regardless, something odd is definitely going on here given that 'ä'.sizeof == 2. It's probably an edge case which wasn't caught, since the only types which take up multiple elements like that are char and wchar.


August 09, 2011

[Issue 6458] Multibyte char literals shouldn't implicitly convert to char

Posted by changlon
in reply to Don


changlon <changlon@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |changlon@gmail.com


--- Comment #4 from changlon <changlon@gmail.com> 2011-08-08 23:13:53 PDT ---
s[0..3] = 'a';

Should this raise an exception?


August 09, 2011

[Issue 6458] Multibyte char literals shouldn't implicitly convert to char

Posted by changlon
in reply to Don


--- Comment #5 from changlon <changlon@gmail.com> 2011-08-08 23:14:35 PDT ---
(In reply to comment #4)
> s[0..3] = 'a';
> 
> Should this raise an exception?

Sorry, I mean s[0..3] = 'ä';


August 09, 2011

[Issue 6458] Multibyte char literals shouldn't implicitly convert to char

Posted by Jonathan M Davis
in reply to Don


--- Comment #6 from Jonathan M Davis <jmdavisProg@gmx.com> 2011-08-08 23:19:15 PDT ---
It shouldn't even compile, because the types don't match. Even with range propagation, the best that you'll do with 'ä' is fit it in a wchar, so it won't fit in a char, and so you _can't_ assign it to each element of s[0 .. 3] like that. s[0 .. 3] = "ä"[] should work, but s[0 .. 3] = 'ä' definitely shouldn't.
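A small sketch of the string-to-slice form. One caveat worth keeping in mind: a D array copy needs both sides to have the same length, and "ä" is two UTF-8 code units, so this sketch sizes the destination slice from the source rather than hard-coding 3. The helper name `overwritePrefix` is hypothetical.

```d
// Hypothetical helper: overwrite the start of a copy of "abcdef"
// with the UTF-8 code units of `text`.
char[] overwritePrefix(string text)
{
    char[] s = "abcdef".dup;
    s[0 .. text.length] = text[]; // array copy; lengths match by construction
    return s;
}

void main()
{
    assert("ä".length == 2);                 // two UTF-8 code units
    assert(overwritePrefix("ä") == "äcdef"); // replaces "ab" with the bytes of ä
    assert(overwritePrefix("ää") == "ääef");
}
```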


August 09, 2011

[Issue 6458] Multibyte char literals shouldn't implicitly convert to char

Posted by Jacob Carlborg
in reply to Don


Jacob Carlborg <doob@me.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |doob@me.com


--- Comment #7 from Jacob Carlborg <doob@me.com> 2011-08-08 23:44:22 PDT ---
As far as I can see, D uses the smallest type necessary to fit a character literal. So all non-ASCII character literals will be either wchar or dchar. Both of the following pass, as expected.

static assert(is(typeof('ä') == wchar));
static assert(is(typeof('a') == char));

But I don't know why the compiler allows assigning a wchar to a char array element. That doesn't seem right.
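Extending those two checks a little, here is a sketch that also covers a literal outside the Basic Multilingual Plane and the sizes involved. The choice of U+1F600 is just an example code point.

```d
// Types the compiler picks for character literals, checked at compile time.
static assert(is(typeof('a') == char));           // ASCII: one UTF-8 code unit
static assert(is(typeof('ä') == wchar));          // U+00E4: one UTF-16 code unit
static assert(is(typeof('\U0001F600') == dchar)); // \U escapes are typed as dchar

// The sizes follow from the types.
static assert('a'.sizeof == 1);
static assert('ä'.sizeof == 2);
static assert('\U0001F600'.sizeof == 4);

void main() {}
```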


August 09, 2011

[Issue 6458] Multibyte char literals shouldn't implicitly convert to char

Posted by Don
in reply to Don


--- Comment #8 from Don <clugdbug@yahoo.com.au> 2011-08-09 00:09:02 PDT ---
(In reply to comment #7)
> As far as I can see, D uses the smallest type necessary to fit a character literal. So all non-ASCII character literals will be either wchar or dchar. Both of the following pass, as expected.
> 
> static assert(is(typeof('ä') == wchar));
> static assert(is(typeof('a') == char));

That's good news. Seems like it's only a few cases where it behaves stupidly.

> But I don't know why the compiler allows assigning a wchar to a char array element. That doesn't seem right.

It's more general than that:

   wchar w = 'ä';
   char c = w; // Error: cannot implicitly convert expression (w) of type wchar to char


   char c = 'ä'; // passes!!!
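To illustrate what is lost when a wchar holding 'ä' is narrowed to a char, here is a sketch using an explicit cast, which compiles regardless of how the implicit case is handled. The helper name `narrow` is hypothetical; `std.utf.toUTF8` is shown as one way to get the correct two-byte encoding instead.

```d
import std.exception : assertThrown;
import std.utf : UTFException, toUTF8, validate;

// Hypothetical helper: force a wchar into a char, keeping only the low 8 bits.
char narrow(wchar w)
{
    return cast(char) w;
}

void main()
{
    wchar w = 'ä';      // U+00E4
    char c = narrow(w);
    assert(c == 0xE4);  // the numeric value survives, but it is not UTF-8 for 'ä'

    // A lone 0xE4 byte is the start of a three-byte sequence, so it is invalid.
    char[] one = [c];
    assertThrown!UTFException(validate(one));

    // Transcoding instead of narrowing yields the two code units 0xC3 0xA4.
    wchar[] ws = [w];
    assert(toUTF8(ws) == "ä");
}
```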


January 31, 2012

[Issue 6458] Multibyte char literals shouldn't implicitly convert to char

Posted by yebblies
in reply to Don


yebblies <yebblies@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |yebblies@gmail.com
           Platform|Other                       |All
         AssignedTo|nobody@puremagic.com        |yebblies@gmail.com
         OS/Version|Windows                     |All


--- Comment #10 from yebblies <yebblies@gmail.com> 2012-01-31 15:24:30 EST ---
(In reply to comment #9)
> 
> The compiler complains about the code above, just as it should, because a long won't fit in an int. Don't know why character literals are treated differently.

They aren't.  The problem is that 'ä' evaluates to 0x00E4, and a bug in integer range propagation thinks this is ok to convert back to a char.
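One way to read this, as a sketch rather than a description of the compiler's actual logic: 0x00E4 is numerically within char's 0..0xFF range, so a purely numeric range check accepts it, even though anything above 0x7F cannot be a complete UTF-8 sequence in one code unit. Both helper names below are hypothetical.

```d
// Hypothetical check mirroring a purely numeric range test.
bool fitsNumerically(dchar c)
{
    return c <= char.max; // char.max is 0xFF
}

// Hypothetical check for "is a complete UTF-8 sequence in one code unit".
bool fitsAsSingleUtf8Unit(dchar c)
{
    return c <= 0x7F; // only ASCII is encoded in a single byte
}

void main()
{
    static assert('ä' == 0x00E4);

    assert(fitsNumerically('ä'));       // the numeric test says yes
    assert(!fitsAsSingleUtf8Unit('ä')); // the encoding test says no
    assert(fitsNumerically('a') && fitsAsSingleUtf8Unit('a'));
}
```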
