June 19, 2013 What is the legal range of chars?

I know a "binary" char can hold the values 0 to 0xFF. However, I'm wondering about the cases where a codepoint can fit inside a char. For example, 'ç' is represented by 0xe7, which technically fits inside a char.

This is illegal:
    char c = 'ç';

But this works:
    char c = cast(char)'ç';
    assert(c == 'ç');

... it "works"... but is it legal?

--------

The root of the question though is actually this: If I have a string, and somebody asks me to find the character "char c" in that string, is it legal to iterate on the string char by char until I find c exactly, or do I have to take into account that some troll may have decided to put a wchar inside my char...?

Basically:
    string myFind(string s, char c)
    {
        foreach(i, char sc ; s)
            if(sc == c) return s[i .. $];
        return s[$ .. $];
    }
    assert(myFind("aça", cast(char)'ç') == "ça");

The assert above will fail. But whose fault is it? Is it a wrong call, or a wrong implementation?
June 19, 2013 Re: What is the legal range of chars?
Posted in reply to monarch_dodra
On 06/19/2013 05:34 AM, monarch_dodra wrote:
> I know a "binary" char can hold the values 0 to 0xFF. However, I'm
> wondering about the cases where a codepoint can fit inside a char. For
> example, 'ç' is represented by 0xe7, which technically fits inside a char.

'ç' is represented by 0xe7 in an encoding that is not UTF-8. :)

That would be a special agreement between the producer and the consumer of that string. Otherwise, 0xe7 is not 'ç'. I recommend ubyte[] for those cases.

In UTF-8, 0xe7 is the first byte of a 3-byte code point:

    import std.stdio;

    void main()
    {
        char[] a = [ 'a', 'b', 'c', 0xe7, 0x80, 0x80 ];
        writeln(a);
    }

Prints a Chinese character:

    abc瀀

Ali
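A minimal sketch (not part of the thread) of what Ali describes: in a UTF-8 string, 'ç' is stored as the two code units 0xC3 0xA7, and the byte sequence 0xE7 0x80 0x80 decodes to a single code point, U+7000. It assumes the source file is saved as UTF-8 and uses `std.utf.decode`; the helper name `codePointAt` is hypothetical.

```d
import std.utf : decode;

// Hypothetical helper: decode the code point whose first code unit is at `start`.
dchar codePointAt(const(char)[] s, size_t start)
{
    size_t i = start;      // decode advances this index past the whole sequence
    return decode(s, i);
}

void main()
{
    string s = "ç";                        // U+00E7 stored as UTF-8
    assert(s.length == 2);                 // two code units, not one
    assert(s[0] == 0xC3 && s[1] == 0xA7);  // the byte 0xE7 does not appear

    string a = "abc\xe7\x80\x80";          // same bytes as in the example above
    assert(codePointAt(a, 3) == 0x7000);   // one code point, '瀀'
}
```

So a lone `char` holding 0xE7 is only the start of a longer sequence in UTF-8, which is why a byte-oriented type such as `ubyte[]` fits data in other encodings better.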
June 19, 2013 Re: What is the legal range of chars?
Posted in reply to Ali Çehreli
On Wednesday, 19 June 2013 at 15:13:23 UTC, Ali Çehreli wrote:
> On 06/19/2013 05:34 AM, monarch_dodra wrote:
>
> > I know a "binary" char can hold the values 0 to 0xFF.
> However, I'm
> > wondering about the cases where a codepoint can fit inside a
> char. For
> > example, 'ç' is represented by 0xe7, which technically fits
> inside a char.
>
> 'ç' is represented by 0xe7 in an encoding that is not UTF-8. :)
>
> That would be a special agreement between the producer and the consumer of that string. Otherwise, 0xe7 is not 'ç'. I recommend ubyte[] for those cases.
>
> In UTF-8, 0xe7 is the first byte of a 3-byte code point:
>
> import std.stdio;
>
> void main()
> {
> char[] a = [ 'a', 'b', 'c', 0xe7, 0x80, 0x80 ];
> writeln(a);
> }
>
> Prints a Chinese character:
>
> abc瀀
>
> Ali
Hum... well, that's true for UTF-8 strings: if the _codeunit_ 0xe7 appears, it is not 'ç'.
But when handling a 'char', there is no encoding, it "should" be a raw _codepoint_.
I'm not really sure *if* these cases should be handled, nor how :/
June 19, 2013 Re: What is the legal range of chars?
Posted in reply to monarch_dodra
On Wednesday, 19 June 2013 at 16:54:01 UTC, monarch_dodra wrote:
> Hum... well, that's true for UTF-8 strings, if the _codeunit_ 0xe7 appears, it is not 'ç'.
>
> But when handling a 'char', there is no encoding, it "should" be raw _codepoint_.
No, char is a UTF8 code unit.
Code unit and code point become synonymous in UTF32, so dchar is
a code point.
June 19, 2013 Re: What is the legal range of chars?
Posted in reply to anonymous
On Wednesday, June 19, 2013 19:02:55 anonymous wrote:
> On Wednesday, 19 June 2013 at 16:54:01 UTC, monarch_dodra wrote:
> > Hum... well, that's true for UTF-8 strings, if the _codeunit_ 0xe7 appears, it is not 'ç'.
> >
> > But when handling a 'char', there is no encoding, it "should" be raw _codepoint_.
>
> No, char is a UTF8 code unit.
> Code unit and code point become synonymous in UTF32, so dchar is
> a code point.

Exactly. char, wchar, and dchar are all code _units_, and dchar (UTF-32) is the only case where a code unit is guaranteed to be a code point. For both char (UTF-8) and wchar (UTF-16), the number of code units in a code point is variable, and in the case of UTF-8, any code point which isn't an ASCII character is multiple code units. Wikipedia and TDPL both have a nice chart showing the valid values for UTF-8 and how many code units are in a code point for each set of values:

http://en.wikipedia.org/wiki/UTF-8#Description

- Jonathan M Davis
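As an illustration of the code unit versus code point distinction, here is a sketch (an assumption about how one might rewrite the original `myFind`, not something proposed in the thread) that takes the needle as a `dchar` and lets `foreach` decode the string. It also uses `std.utf.codeLength` to show how many code units a code point needs in each encoding.

```d
import std.utf : codeLength;

// Variant of myFind that searches for a code point instead of a code unit.
string myFind(string s, dchar c)
{
    // With a dchar loop variable, foreach decodes the UTF-8 string;
    // i is the index of the first code unit of each decoded code point.
    foreach (i, dchar sc; s)
        if (sc == c) return s[i .. $];
    return s[$ .. $];
}

void main()
{
    assert(myFind("aça", 'ç') == "ça");
    assert(myFind("abc", 'ç') == "");

    assert(codeLength!char('a') == 1);   // ASCII: one UTF-8 code unit
    assert(codeLength!char('ç') == 2);   // U+00E7: two UTF-8 code units
    assert(codeLength!wchar('ç') == 1);  // but a single UTF-16 code unit
    assert(codeLength!char('瀀') == 3);  // U+7000: three UTF-8 code units
}
```

Searching by `char` remains reasonable when the needle is known to be ASCII, since ASCII bytes never occur inside a multi-byte UTF-8 sequence.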
June 19, 2013 Re: What is the legal range of chars?
Posted in reply to Jonathan M Davis
On Wednesday, 19 June 2013 at 17:48:49 UTC, Jonathan M Davis wrote:
> On Wednesday, June 19, 2013 19:02:55 anonymous wrote:
>> On Wednesday, 19 June 2013 at 16:54:01 UTC, monarch_dodra wrote:
>> > Hum... well, that's true for UTF-8 strings, if the _codeunit_
>> > 0xe7 appears, it is not 'ç'.
>> >
>> > But when handling a 'char', there is no encoding, it "should"
>> > be raw _codepoint_.
>>
>> No, char is a UTF8 code unit.
>> Code unit and code point become synonymous in UTF32, so dchar is
>> a code point.
>
> Exactly. char, wchar, and dchar are all code _units_, and dchar (UTF-32) is
> the only case where a code unit is guaranteed to be a code point. For both
> char (UTF-8) and wchar (UTF-16), the number of code units in a code point is
> variable, and in the case of UTF-8, any code point which isn't an ASCII
> characters is multiple code units. Wikipedia and TDPL both have a nice chart
> showing the valid values for UTF-8 and how many code units are in a code point
> for each set of values:
>
> http://en.wikipedia.org/wiki/UTF-8#Description
>
> - Jonathan M Davis
Well, there is still ambiguity when you have a standalone char as to whether it is holding a (partially truncated) code unit, or a partial code point.
If I write:
char c = '\xDF'; //0b11011111; //Lead UTF-8 2 byte encoding
wchar w = 'ß'; //0b11011111; \u00DF
assert(c == w);
The assert passes. Yet 'c' is just the partial of a 2 byte sequence, and not 'ß'.
In any case, this conversation gave me the answers I was looking for in the context of the original question.
June 19, 2013 Re: What is the legal range of chars?
Posted in reply to monarch_dodra
On Wednesday, June 19, 2013 21:22:00 monarch_dodra wrote:
> Well, there is still ambiguity when you have a standalone char if it is holding a (paritally truncated) code unit, or a partial code point.
>
> If I write:
> char c = '\xDF'; //0b11011111; //Lead UTF-8 2 byte encoding
> wchar w = 'ß'; //0b11011111; \u00DF
> assert(c == w);
>
> The assert passes. Yet 'c' is just the partial of a 2 byte sequence, and not 'ß'.

Well, it's fundamentally broken to compare char and wchar unless you know that both of the values being compared are ASCII values. They're different encodings.

> In any case, this conversation gave me the answers I was looking for in the context of the original question.

Good to hear.

- Jonathan M Davis
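A short sketch of why the raw comparison is misleading and one possible alternative. It assumes Phobos's `std.algorithm.equal`, which iterates strings as ranges of decoded code points; the helper name `sameText` is hypothetical and not from the thread.

```d
import std.algorithm : equal;

// Hypothetical helper: compare two strings of any UTF encoding by code point.
bool sameText(S1, S2)(S1 a, S2 b)
{
    // Both arguments are iterated as decoded dchar values, not as code units.
    return equal(a, b);
}

void main()
{
    char c = '\xDF';   // lead byte of a 2-byte UTF-8 sequence
    wchar w = 'ß';     // U+00DF as a complete UTF-16 code unit
    assert(c == w);    // passes: only the raw numeric values are compared

    assert("ß".length == 2);   // UTF-8: two code units
    assert("ß"w.length == 1);  // UTF-16: one code unit

    assert(sameText("ß", "ß"w));    // equal once both sides are decoded
    assert(!sameText("ss", "ß"w));
}
```

Comparing decoded code points sidesteps the encoding mismatch, at the cost of decoding both inputs.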
July 28, 2013 Re: What is the legal range of chars?
Posted in reply to Jonathan M Davis
On Wednesday, 19 June 2013 at 20:09:54 UTC, Jonathan M Davis wrote:
> Good to hear.
>
> - Jonathan M Davis
Resurrecting this thread for a related question: What is the legal range a dchar can hold?
Is it 0 .. 0x110000, or basically, just 0 .. 2^^32?
For example, writing this:
dchar d = 0x110000;
Will result in:
Error: cannot implicitly convert expression (1114112) of type int to dchar.
Yet:
uint v = uint.max;
dchar d = v; //Perfectly fine.
So the question is: While I know you *can* put anything you want in a dchar, is it actually legal? Is the "dchar d = 0x110000;" thing just the whole "value range propagation" thing being weird? A bit more background on the thing would be nice.
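The thread ends without a reply to this question. As a hedged illustration only, the sketch below shows what the language and Phobos report: `dchar.max` is 0x10FFFF, and `std.utf.isValidDchar` rejects both values above that limit and the surrogate range. A cast can still store any 32-bit value in a `dchar`. The helper name `isLegalCodePoint` is hypothetical.

```d
import std.utf : isValidDchar;

// Hypothetical helper: check whether a raw 32-bit value is a valid code point.
bool isLegalCodePoint(uint v)
{
    // The cast always succeeds; validity is a separate, explicit check.
    return isValidDchar(cast(dchar) v);
}

void main()
{
    static assert(dchar.max == 0x10FFFF);  // largest Unicode code point

    assert(isLegalCodePoint(0x10FFFF));
    assert(!isLegalCodePoint(0x110000));   // beyond the Unicode range
    assert(!isLegalCodePoint(0xD800));     // surrogates are not valid code points
    assert(!isLegalCodePoint(uint.max));
}
```

Under that reading, a `dchar` can physically hold values up to `uint.max`, but only the values accepted by `isValidDchar` are valid UTF-32.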