Thread overview
What is the legal range of chars?
Jun 19, 2013
monarch_dodra
Jun 19, 2013
Ali Çehreli
Jun 19, 2013
monarch_dodra
Jun 19, 2013
anonymous
Jun 19, 2013
Jonathan M Davis
Jun 19, 2013
monarch_dodra
Jun 19, 2013
Jonathan M Davis
Jul 28, 2013
monarch_dodra
June 19, 2013
I know a "binary" char can hold the values 0 to 0xFF. However, I'm wondering about the cases where a codepoint can fit inside a char. For example, 'ç' is represented by 0xe7, which technically fits inside a char.

This is illegal:
char c = 'ç';
But this works:
char c = cast(char)'ç';
assert(c == 'ç');

... it "works"... but is it legal?

--------
The root of the question, though, is actually this: if I have a string, and somebody asks me to find the character "char c" in that string, is it legal to iterate over the string char by char until I find c exactly, or do I have to take into account that some troll may have decided to put a wchar inside my char...?

Basically:
string myFind(string s, char c)
{
    foreach(i, char sc ; s)
        if(sc == c)
            return s[i .. $];
    return s[$ .. $];
}
assert(myFind("aça", cast(char)'ç') == "ça");

The assert above will fail. But whose fault is it? Is it a wrong call, or a wrong implementation?
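
One way to sidestep the question is to search by code point rather than by code unit. A minimal sketch, assuming the caller can pass a dchar instead of a char (myFindCodePoint is a hypothetical name, not a Phobos function):

```d
// Hypothetical variant of myFind that searches for a code point.
// foreach with a dchar loop variable decodes the UTF-8 string on the fly,
// and i is the index of the first code unit of each decoded code point.
string myFindCodePoint(string s, dchar c)
{
    foreach (i, dchar sc; s)
        if (sc == c)
            return s[i .. $];
    return s[$ .. $];
}

void main()
{
    assert(myFindCodePoint("aça", 'ç') == "ça");
    assert(myFindCodePoint("aça", 'a') == "aça");
    assert(myFindCodePoint("abc", 'ç') == "");
}
```

With this signature no cast is needed at the call site, so a truncated code point cannot be passed in by accident.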
June 19, 2013
On 06/19/2013 05:34 AM, monarch_dodra wrote:

> I know a "binary" char can hold the values 0 to 0xFF. However, I'm
> wondering about the cases where a codepoint can fit inside a char. For
> example, 'ç' is represented by 0xe7, which technically fits inside a char.

'ç' is represented by 0xe7 in an encoding that is not UTF-8. :)

That would be a special agreement between the producer and the consumer of that string. Otherwise, 0xe7 is not 'ç'. I recommend ubyte[] for those cases.

In UTF-8, 0xe7 is the first byte of a 3-byte code point:

import std.stdio;

void main()
{
    char[] a = [ 'a', 'b', 'c', 0xe7, 0x80, 0x80 ];
    writeln(a);
}

Prints a Chinese character:

abc瀀
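
For comparison, a small sketch of how 'ç' itself is stored in a UTF-8 string, using std.utf.encode and std.utf.stride. The byte values in the assertions follow from the UTF-8 encoding rules:

```d
import std.utf : encode, stride;

void main()
{
    // U+00E7 ('ç') takes two code units in UTF-8: 0xC3 0xA7.
    string s = "ç";
    assert(s.length == 2);
    assert(s[0] == 0xC3 && s[1] == 0xA7);

    // encode writes the same two code units into a buffer.
    char[4] buf;
    immutable n = encode(buf, 'ç');
    assert(n == 2 && buf[0] == 0xC3 && buf[1] == 0xA7);

    // U+7000 ('瀀') takes three code units, starting with 0xE7.
    string t = "瀀";
    assert(t.length == 3);
    assert(t[0] == 0xE7 && stride(t, 0) == 3);
}
```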

Ali

June 19, 2013
On Wednesday, 19 June 2013 at 15:13:23 UTC, Ali Çehreli wrote:
> On 06/19/2013 05:34 AM, monarch_dodra wrote:
>
> > I know a "binary" char can hold the values 0 to 0xFF. However, I'm
> > wondering about the cases where a codepoint can fit inside a char. For
> > example, 'ç' is represented by 0xe7, which technically fits inside a char.
>
> 'ç' is represented by 0xe7 in an encoding that is not UTF-8. :)
>
> That would be a special agreement between the producer and the consumer of that string. Otherwise, 0xe7 is not 'ç'. I recommend ubyte[] for those cases.
>
> In UTF-8, 0xe7 is the first byte of a 3-byte code point:
>
> import std.stdio;
>
> void main()
> {
>     char[] a = [ 'a', 'b', 'c', 0xe7, 0x80, 0x80 ];
>     writeln(a);
> }
>
> Prints a Chinese character:
>
> abc瀀
>
> Ali

Hum... well, that's true for UTF-8 strings, if the _codeunit_ 0xe7 appears, it is not 'ç'.

But when handling a 'char', there is no encoding, it "should" be raw _codepoint_.

I'm not really sure *if* these cases should be handled, nor how :/
June 19, 2013
On Wednesday, 19 June 2013 at 16:54:01 UTC, monarch_dodra wrote:
> Hum... well, that's true for UTF-8 strings, if the _codeunit_ 0xe7 appears, it is not 'ç'.
>
> But when handling a 'char', there is no encoding, it "should" be raw _codepoint_.

No, char is a UTF8 code unit.
Code unit and code point become synonymous in UTF32, so dchar is
a code point.
June 19, 2013
On Wednesday, June 19, 2013 19:02:55 anonymous wrote:
> On Wednesday, 19 June 2013 at 16:54:01 UTC, monarch_dodra wrote:
> > Hum... well, that's true for UTF-8 strings, if the _codeunit_ 0xe7 appears, it is not 'ç'.
> > 
> > But when handling a 'char', there is no encoding, it "should" be raw _codepoint_.
> 
> No, char is a UTF8 code unit.
> Code unit and code point become synonymous in UTF32, so dchar is
> a code point.

Exactly. char, wchar, and dchar are all code _units_, and dchar (UTF-32) is the only case where a code unit is guaranteed to be a code point. For both char (UTF-8) and wchar (UTF-16), the number of code units in a code point is variable, and in the case of UTF-8, any code point which isn't an ASCII character is multiple code units. Wikipedia and TDPL both have a nice chart showing the valid values for UTF-8 and how many code units are in a code point for each set of values:

http://en.wikipedia.org/wiki/UTF-8#Description
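
The chart boils down to four ranges. A minimal sketch (utf8Length is a hypothetical helper written for illustration; Phobos provides std.utf.codeLength for the same purpose, used here only as a cross-check):

```d
import std.utf : codeLength;

// Hypothetical helper: number of UTF-8 code units needed for code point c.
// Assumes c is a valid code point (no check for surrogates or for values
// above 0x10FFFF).
size_t utf8Length(dchar c)
{
    if (c < 0x80)    return 1; // ASCII
    if (c < 0x800)   return 2;
    if (c < 0x10000) return 3;
    return 4;                  // up to 0x10FFFF
}

void main()
{
    assert(utf8Length('a') == 1);
    assert(utf8Length('ç') == 2);
    assert(utf8Length('瀀') == 3);
    assert(utf8Length('\U0001F600') == 4);

    // Cross-check against Phobos.
    foreach (dchar c; ['a', 'ç', '瀀', '\U0001F600'])
        assert(utf8Length(c) == codeLength!char(c));

    // UTF-16 and UTF-32 for comparison.
    assert(codeLength!wchar('ç') == 1);
    assert(codeLength!wchar('\U0001F600') == 2);
    assert(codeLength!dchar('\U0001F600') == 1);
}
```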

- Jonathan M Davis
June 19, 2013
On Wednesday, 19 June 2013 at 17:48:49 UTC, Jonathan M Davis wrote:
> On Wednesday, June 19, 2013 19:02:55 anonymous wrote:
>> On Wednesday, 19 June 2013 at 16:54:01 UTC, monarch_dodra wrote:
>> > Hum... well, that's true for UTF-8 strings, if the _codeunit_
>> > 0xe7 appears, it is not 'ç'.
>> > 
>> > But when handling a 'char', there is no encoding, it "should"
>> > be raw _codepoint_.
>> 
>> No, char is a UTF8 code unit.
>> Code unit and code point become synonymous in UTF32, so dchar is
>> a code point.
>
> Exactly. char, wchar, and dchar are all code _units_, and dchar (UTF-32) is
> the only case where a code unit is guaranteed to be a code point. For both
> char (UTF-8) and wchar (UTF-16), the number of code units in a code point is
> variable, and in the case of UTF-8, any code point which isn't an ASCII
> character is multiple code units. Wikipedia and TDPL both have a nice chart
> showing the valid values for UTF-8 and how many code units are in a code point
> for each set of values:
>
> http://en.wikipedia.org/wiki/UTF-8#Description
>
> - Jonathan M Davis

Well, there is still ambiguity when you have a standalone char if it is holding a (partially truncated) code unit, or a partial code point.

If I write:
    char  c = '\xDF'; //0b11011111; //Lead UTF-8 2 byte encoding
    wchar w = 'ß';    //0b11011111; \u00DF
    assert(c == w);

The assert passes. Yet 'c' is just the partial of a 2 byte sequence, and not 'ß'.

In any case, this conversation gave me the answers I was looking for in the context of the original question.
June 19, 2013
On Wednesday, June 19, 2013 21:22:00 monarch_dodra wrote:
> Well, there is still ambiguity when you have a standalone char if it is holding a (partially truncated) code unit, or a partial code point.
> 
> If I write:
> char c = '\xDF'; //0b11011111; //Lead UTF-8 2 byte encoding
> wchar w = 'ß'; //0b11011111; \u00DF
> assert(c == w);
> 
> The assert passes. Yet 'c' is just the partial of a 2 byte sequence, and not 'ß'.

Well, it's fundamentally broken to compare char and wchar unless you know that both of the values being compared are ASCII values. They're different encodings.
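
A minimal sketch of that rule, assuming the goal is only to compare a char and a wchar safely (sameAsciiChar is a hypothetical helper; std.ascii.isASCII is the Phobos check for values below 0x80):

```d
import std.ascii : isASCII;

// Hypothetical helper: a char and a wchar are only known to denote the
// same character when both are ASCII. Any other char is one unit of a
// multi-byte UTF-8 sequence and cannot be compared on its own.
bool sameAsciiChar(char c, wchar w)
{
    return isASCII(c) && isASCII(w) && c == w;
}

void main()
{
    char  c = '\xDF'; // lead byte of a 2-byte UTF-8 sequence
    wchar w = 'ß';    // U+00DF
    assert(c == w);                // the raw comparison passes...
    assert(!sameAsciiChar(c, w));  // ...but the guarded one rejects it
    assert(sameAsciiChar('a', 'a'));
}
```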

> In any case, this conversation gave me the answers I was looking for in the context of the original question.

Good to hear.

- Jonathan M Davis
July 28, 2013
On Wednesday, 19 June 2013 at 20:09:54 UTC, Jonathan M Davis wrote:
> Good to hear.
>
> - Jonathan M Davis

Resurrecting this thread for a related question: What is the legal range a dchar can hold?

Is it 0 .. 0x110000, or basically just 0 .. 2^^32?

For example, writing this:
dchar d = 0x110000;
Will result in:
Error: cannot implicitly convert expression (1114112) of type int to dchar.

Yet:
uint v = uint.max;
dchar d = v; //Perfectly fine.

So the question is: while I know you *can* put anything you want in a dchar, is it actually legal? Is the "dchar d = 0x110000;" error just "value range propagation" being weird? A bit more background would be nice.
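
For reference, a small sketch of how validity can be checked at run time. dchar.max is 0x10FFFF, and std.utf.isValidDchar reports whether a value is a valid code point. Whether the compile-time error above comes from value range propagation against dchar.max is an assumption here, not something confirmed in this thread:

```d
import std.utf : isValidDchar;

void main()
{
    // The language defines the largest dchar value as the last code point.
    static assert(dchar.max == 0x10FFFF);

    assert(isValidDchar('a'));
    assert(isValidDchar(cast(dchar) 0x10FFFF));

    // A dchar can physically hold these values, but they are not
    // valid code points.
    assert(!isValidDchar(cast(dchar) 0x110000));
    assert(!isValidDchar(cast(dchar) 0xD800));   // UTF-16 surrogate
    assert(!isValidDchar(cast(dchar) uint.max));
}
```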