Implicit encoding conversion on string ~= int ? - D Programming Language Discussion Forum

June 23, 2013

Implicit encoding conversion on string ~= int ?

Posted by Marco Leise

I've seen some C code that does something like string[i] = int, which seems to implicitly cast the int to a char. In D, to get it running, I just wrote string ~= int and wondered why the resulting code page 850 string looked correct on the UTF-8 terminal. Then I asserted that 'string' only ever grows by one byte for each append, and the assertion failed. So there is a hidden conversion from some charset (probably Windows or Latin-1?) to a UTF-8 multi-byte string going on.

While it is convenient, this code uses some form of LZ77 and will from time to time append copies of previous parts of 'string' to it. In that case the byte offsets wouldn't match any more and the result would be garbage.

Eventually I'd have looked over the code and created the CP850 string in a temporary ubyte[], but in the meantime I wonder what the rationale behind this automatic conversion is and whether we want to keep it like that. Is this documented behavior?
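
A minimal sketch of the ubyte[] approach mentioned above. The helper copyBack and its signature are hypothetical, standing in for whatever the LZ77-style code actually does. With a ubyte[] buffer every append adds exactly one byte, so byte offsets into earlier output stay valid:

```d
import std.stdio : writeln;

// Hypothetical helper: append `length` bytes that start `distance` bytes
// before the current end of `buf` (an LZ77-style back-reference).
// Copying one byte at a time keeps overlapping references correct.
void copyBack(ref ubyte[] buf, size_t distance, size_t length)
{
    foreach (i; 0 .. length)
        buf ~= buf[$ - distance];
}

void main()
{
    ubyte[] buf;
    int value = 228;            // a raw byte value, not a code point

    buf ~= cast(ubyte) value;   // explicit narrowing, no UTF-8 encoding
    assert(buf.length == 1);    // grows by exactly one byte

    buf ~= cast(ubyte) 'a';
    copyBack(buf, 2, 4);        // repeat the last two bytes twice
    writeln(buf);
}
```

The bytes can be converted to a UTF-8 string once at the end, after all back-references have been resolved.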

-- 
Marco

June 23, 2013

Re: Implicit encoding conversion on string ~= int ?

Posted by Adam D. Ruppe
in reply to Marco Leise

I think what's happening is the compiler considers chars to be integral types (like they were in C), which means some implicit conversions between char, int, dchar, and others happen.

So

char[] a;
int b = 1000;
a ~= b;

the "a ~= b" is more like "a ~= cast(dchar) b", and then dchar -> char means it may be multibyte encoded, going from utf-32 to utf-8.
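
A minimal sketch of that effect, with the cast written out explicitly. The helper appendCodePoint is hypothetical and only exists for this illustration. Appending a dchar to a char[] stores its UTF-8 encoding, so the array can grow by more than one code unit per append:

```d
import std.stdio : writeln;

// Hypothetical helper: append one code point to an empty char[] and
// return the resulting UTF-8 code units as raw bytes.
ubyte[] appendCodePoint(int value)
{
    char[] a;
    a ~= cast(dchar) value;   // dchar appended to char[] is UTF-8 encoded
    return cast(ubyte[]) a;
}

void main()
{
    writeln(appendCodePoint(65));    // ASCII: one code unit
    writeln(appendCodePoint(228));   // above 127: two code units
    writeln(appendCodePoint(1000));  // also two code units
}
```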

June 23, 2013

Re: Implicit encoding conversion on string ~= int ?

Posted by bearophile
in reply to Adam D. Ruppe

Adam D. Ruppe:

> char[] a;
> int b = 1000;
> a ~= b;
>
> the "a ~= b" is more like "a ~= cast(dchar) b", and then dchar -> char means it may be multibyte encoded, going from utf-32 to utf-8.

I didn't know that, is this already in Bugzilla?

Bye,
bearophile

June 23, 2013

Re: Implicit encoding conversion on string ~= int ?

Posted by Marco Leise
in reply to bearophile

On Sun, 23 Jun 2013 18:37:16 +0200, "bearophile" <bearophileHUGS@lycos.com> wrote:

> Adam D. Ruppe:
> 
> > char[] a;
> > int b = 1000;
> > a ~= b;
> >
> > the "a ~= b" is more like "a ~= cast(dchar) b", and then dchar -> char means it may be multibyte encoded, going from utf-32 to utf-8.

No no no, this is not what happens. In my case it was:
string a;
int b = 228;  // CP850 value for 'ä'. Note: fits in a single byte!
a ~= b;

Maybe it goes as follows:
o compiler sees ~= to a string and becomes "aware" of wchar and dchar
  conversions to char
o appended value is only checked for size (type and signedness are lost)
  and maps int to dchar
o this dchar value is now checked for Unicode conformity and fails the test
o the dchar value is now assumed to be Latin-1, Windows-1252 or similar
  and a conversion routine invoked
o the dchar value is converted to utf-8 and...
o appended as a multi-byte string to variable "a".

That still doesn't sound right to me though. What if the dchar value is not valid Unicode AND >= 256?

> I didn't know that, is this already in Bugzilla?
> 
> Bye,
> bearophile

I don't know what exactly is supposed to happen here.

-- 
Marco

June 23, 2013

Re: Implicit encoding conversion on string ~= int ?

Posted by Adam D. Ruppe
in reply to bearophile

On Sunday, 23 June 2013 at 16:37:18 UTC, bearophile wrote:
> I didn't know that, is this already in Bugzilla?

I don't know, but if it is, it is probably marked as won't fix. I'm pretty sure this has come up before, and it is actually by design, because a char in C is considered an integral type too.

June 23, 2013

Re: Implicit encoding conversion on string ~= int ?

Posted by Adam D. Ruppe
in reply to Marco Leise

On Sunday, 23 June 2013 at 17:12:41 UTC, Marco Leise wrote:
> int b = 228;  // CP850 value for 'ä'. Note: fits in a single byte!

228 (e4 in hex) is also the Unicode code point for ä, which is [195, 164] when encoded as UTF-8. see: http://www.utf8-chartable.de/unicode-utf8-table.pl?number=512&utf8=dec

While the number 228 would normally fit in a byte, UTF-8 uses the high bits as markers that a byte is part of a multibyte sequence (this helps with ASCII compatibility), so any code point > 127 will always be a multibyte sequence in UTF-8. See: http://en.wikipedia.org/wiki/UTF-8#Description
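
To make the marker bits visible, here is a small sketch that encodes code point 228 with std.utf.encode and prints each code unit in binary. The helper isContinuation is a hypothetical name introduced for this example:

```d
import std.stdio : writefln;
import std.utf : encode;

// Hypothetical helper: true if the code unit is a UTF-8 continuation
// byte, i.e. its top two bits are 10.
bool isContinuation(char c)
{
    return (c & 0xC0) == 0x80;
}

void main()
{
    char[4] buf;
    immutable n = encode(buf, cast(dchar) 228);  // number of code units written

    foreach (c; buf[0 .. n])
        writefln("%3d = %08b (%s)", cast(ubyte) c, cast(ubyte) c,
                 isContinuation(c) ? "continuation" : "lead");
}
```

The lead byte of a two-byte sequence starts with the bits 110 and the following byte starts with 10, which is why neither can be mistaken for an ASCII character.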

June 23, 2013

Re: Implicit encoding conversion on string ~= int ?

Posted by Marco Leise
in reply to Marco Leise

On Sun, 23 Jun 2013 19:12:21 +0200, Marco Leise <Marco.Leise@gmx.de> wrote:

> On Sun, 23 Jun 2013 18:37:16 +0200, "bearophile" <bearophileHUGS@lycos.com> wrote:
> 
> > Adam D. Ruppe:
> > 
> > > char[] a;
> > > int b = 1000;
> > > a ~= b;
> > >
> > > the "a ~= b" is more like "a ~= cast(dchar) b", and then dchar -> char means it may be multibyte encoded, going from utf-32 to utf-8.
> 
> No no no, this is not what happens. In my case it was:
> string a;
> int b = 228;  // CP850 value for 'ä'. Note: fits in a single byte!
> a ~= b;
> 
> Maybe it goes as follows:
> o compiler sees ~= to a string and becomes "aware" of wchar and dchar
>   conversions to char
> o appended value is only checked for size (type and signedness are lost)
>   and maps int to dchar
> o this dchar value is now checked for Unicode conformity and fails the test
> o the dchar value is now assumed to be Latin-1, Windows-1252 or similar
>   and a conversion routine invoked
> o the dchar value is converted to utf-8 and...
> o appended as a multi-byte string to variable "a".
> 
> That still doesn't sound right to me though. What if the dchar value is not valid Unicode AND >= 256?

Actually you were 100% right, Adam. I was distracted by the
fact that the source was CP850.
UTF-32 maps all of Latin-1 in a 1:1 correspondence and most of
CP850 has the same code in Latin-1. So yes, all the compiler
was doing is to append a dchar value.
And with char/ubyte I do find it convenient to mix them
without casting. E.g. "if (someChar < 0x80)" and similar code.

As confusing as it was for me, I agree with "WONT FIX".
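
A short sketch of the 1:1 Latin-1 mapping and the unit-level comparison mentioned above. Both function names are hypothetical. The sketch assumes the input really is Latin-1 (ISO 8859-1); for any other code page a table lookup would be needed instead of the plain cast:

```d
import std.stdio : writeln;

// Hypothetical helper. Assumes `bytes` is Latin-1: each byte value is
// then identical to its Unicode code point, so a cast to dchar suffices.
string latin1ToUtf8(const(ubyte)[] bytes)
{
    string result;
    foreach (b; bytes)
        result ~= cast(dchar) b;   // encoded as UTF-8 on append
    return result;
}

// Hypothetical helper: counts code units below 0x80 (plain ASCII),
// comparing a char directly against an integer literal.
size_t countAscii(const(char)[] s)
{
    size_t n;
    foreach (char c; s)
        if (c < 0x80)
            ++n;
    return n;
}

void main()
{
    auto s = latin1ToUtf8([0x61, 0xE4]);   // 'a' followed by 'ä'
    writeln(s, " has ", s.length, " code units, ",
            countAscii(s), " of them ASCII");
}
```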

-- 
Marco

June 24, 2013

Re: Implicit encoding conversion on string ~= int ?

Posted by Jonathan M Davis
in reply to Adam D. Ruppe

On Sunday, June 23, 2013 19:25:41 Adam D. Ruppe wrote:
> On Sunday, 23 June 2013 at 16:37:18 UTC, bearophile wrote:
> > I didn't know that, is this already in Bugzilla?
> 
> I don't know, but if it is, it is probably marked as won't fix. I'm pretty sure this has come up before, and it is actually by design, because a char in C is considered an integral type too.

This is definitely by design. Walter is definitely in the camp that thinks that chars are integral types, so they follow all of the various integral conversion rules. In some cases this is nice. In others, it's bug-prone, but I think that we're stuck with it regardless of whether it's ultimately a good idea or not. I don't think that we even succeeded at coming close to convincing Walter that _bool_ isn't an integral type and shouldn't be treated as such (when it was discussed right before DConf), and that should be a far more clear-cut case.

- Jonathan M Davis

June 24, 2013

Re: Implicit encoding conversion on string ~= int ?

Posted by Marco Leise
in reply to Jonathan M Davis

On Sun, 23 Jun 2013 17:50:01 -0700, Jonathan M Davis <jmdavisProg@gmx.com> wrote:

> I don't think that we even succeeded at coming close to convincing Walter that _bool_ isn't an integral type and shouldn't be treated as such (when it was discussed right before DConf), and that should be a far more clear-cut case.
> 
> - Jonathan M Davis

You can take bool to int promotion out of my...

// best way to toggle forth and back between 0 and 1. "!" returns a bool.
value = !value

// don't ask, I've seen this :)
arr[someBool]

// sometimes the bool has just the value you need
length -= boolRemoveTerminator
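
For reference, a runnable sketch of the three idioms above, relying on the implicit bool to int promotion. The function names and parameters are hypothetical and chosen only for this illustration:

```d
import std.stdio : writeln;

// "!" yields a bool, which is implicitly promoted back to int.
int toggle(int value)
{
    return !value;
}

// A bool used directly as an array index (false -> 0, true -> 1).
string pick(bool b)
{
    immutable names = ["no", "yes"];
    return names[b];
}

// A bool used as the amount to subtract from a length.
size_t trimmedLength(const(char)[] s, bool removeTerminator)
{
    return s.length - removeTerminator;
}

void main()
{
    writeln(toggle(0), " ", toggle(1));
    writeln(pick(true));
    writeln(trimmedLength("abc\0", true));
}
```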

-- 
Marco

June 24, 2013

Re: Implicit encoding conversion on string ~= int ?

Posted by Jonathan M Davis
in reply to Marco Leise

On Monday, June 24, 2013 07:20:10 Marco Leise wrote:
> On Sun, 23 Jun 2013 17:50:01 -0700, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> > I don't think that we even succeeded at coming close to
> > convincing Walter that _bool_ isn't an integral type and shouldn't be
> > treated as such (when it was discussed right before DConf), and that
> > should be a far more clear-cut case.
> > 
> > - Jonathan M Davis
> 
> You can take bool to int promotion out of my...
> 
> // best way to toggle forth and back between 0 and 1. "!" returns a bool.
> value = !value
> 
> // don't ask, I've seen this :)
> arr[someBool]
> 
> // sometimes the bool has just the value you need
> length -= boolRemoveTerminator

And in all those cases, you can cast to int to get the value you want. The case that brought up the big discussion on it a couple of months ago was when you had

auto foo(bool b) {...}
auto foo(long l) {...}

Which one does foo(1) call? It calls the bool version, because of how the integer conversion rules work. IMHO, this is _very_ broken, but Walter's response is that the solution is to add the overload

auto foo(int i) {...}

And that does fix the code in question, but it means that bool is _not_ strongly typed in D, and you get a variety of weird cases that cause bugs because of such implicit conversions. I would strongly argue that the case where you want bool to act like an integer is by far the rarer case and that casting fixes that problem nicely. Plenty of others agree with me. But no one has been able to convince Walter.
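
A self-contained sketch of the overload behavior described above. The bodies that return the overload's name are an assumption added so the result can be observed, and the second set is named bar only so that both variants fit in one file:

```d
import std.stdio : writeln;

// The two overloads from the example: the literal 1 fits in a bool,
// so the bool overload is considered the more specialized match.
string foo(bool b) { return "bool"; }
string foo(long l) { return "long"; }

// The same pair plus the suggested int overload, which is an exact
// match for an int literal.
string bar(bool b) { return "bool"; }
string bar(long l) { return "long"; }
string bar(int i)  { return "int"; }

void main()
{
    writeln(foo(1));   // selects the bool overload
    writeln(foo(2));   // 2 does not fit in a bool, so long is selected
    writeln(bar(1));   // the int overload takes precedence
}
```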

You can read the thread here:

http://forum.dlang.org/post/klc5r7$3c4$1@digitalmars.com

- Jonathan M Davis


Copyright © 1999-2021 by the D Language Foundation