What is the legal range of chars?

Jun 19, 2013

monarch_dodra

Jun 19, 2013

Ali Çehreli

Jul 28, 2013

I know a "binary" char can hold the values 0 to 0xFF. However, I'm wondering about the cases where a codepoint can fit inside a char. For example, 'ç' is represented by 0xe7, which technically fits inside a char.

This is illegal:

    char c = 'ç';

But this works:

    char c = cast(char)'ç';
    assert(c == 'ç');

... it "works"... but is it legal?

--------

The root of the question though is actually this: if I have a string, and somebody asks me to find the character "char c" in that string, is it legal to iterate on the string char by char until I find c exactly, or do I have to take into account that some troll may have decided to put a wchar inside my char...?

Basically:

    string myFind(string s, char c)
    {
        foreach(i, char sc ; s)
            if(sc == c) return s[i .. $];
        return s[$ .. $];
    }
    assert(myFind("aça", cast(char)'ç') == "ça");

The assert above will fail. But whose fault is it? Is it a wrong call, or a wrong implementation?

On 06/19/2013 05:34 AM, monarch_dodra wrote:

> I know a "binary" char can hold the values 0 to 0xFF. However, I'm
> wondering about the cases where a codepoint can fit inside a char. For
> example, 'ç' is represented by 0xe7, which technically fits inside a char.

'ç' is represented by 0xe7 in an encoding that is not UTF-8. :)

That would be a special agreement between the producer and the consumer of that string. Otherwise, 0xe7 is not 'ç'. I recommend ubyte[] for those cases.

In UTF-8, 0xe7 is the first byte of a 3-byte code point:

    import std.stdio;

    void main()
    {
        char[] a = [ 'a', 'b', 'c', 0xe7, 0x80, 0x80 ];
        writeln(a);
    }

Prints a Chinese character:

    abc瀀

Ali
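Editorial sketch, not part of the original post: the point that 0xe7 is not 'ç' in UTF-8 can be checked directly. The example below assumes nothing beyond std.stdio and inspects the code units that a string literal containing 'ç' actually holds. 'ç' is U+00E7, and UTF-8 encodes it as the two bytes 0xC3 0xA7.

```d
import std.stdio;

void main()
{
    // A string literal is stored as UTF-8 code units (immutable char).
    string s = "ç";

    // U+00E7 takes two code units in UTF-8: 0xC3 0xA7.
    assert(s.length == 2);
    assert(s[0] == 0xC3);
    assert(s[1] == 0xA7);

    // No single char in the string is equal to 0xE7.
    foreach (char cu; s)
        assert(cu != 0xE7);

    // Print the code units in hex: c3 a7
    foreach (char cu; s)
        writef("%02x ", cast(ubyte) cu);
    writeln();
}
```

This is why comparing each char of "aça" against cast(char)'ç' never matches: the cast yields 0xE7, which does not occur in the UTF-8 encoding of the string.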

On Wednesday, 19 June 2013 at 15:13:23 UTC, Ali Çehreli wrote:
> On 06/19/2013 05:34 AM, monarch_dodra wrote:
>
> > I know a "binary" char can hold the values 0 to 0xFF. However, I'm
> > wondering about the cases where a codepoint can fit inside a char. For
> > example, 'ç' is represented by 0xe7, which technically fits inside a char.
>
> 'ç' is represented by 0xe7 in an encoding that is not UTF-8. :)
>
> That would be a special agreement between the producer and the consumer of that string. Otherwise, 0xe7 is not 'ç'. I recommend ubyte[] for those cases.
>
> In UTF-8, 0xe7 is the first byte of a 3-byte code point:
>
>     import std.stdio;
>
>     void main()
>     {
>         char[] a = [ 'a', 'b', 'c', 0xe7, 0x80, 0x80 ];
>         writeln(a);
>     }
>
> Prints a Chinese character:
>
>     abc瀀
>
> Ali

Hum... well, that's true for UTF-8 strings: if the _codeunit_ 0xe7 appears, it is not 'ç'.

But when handling a 'char', there is no encoding; it "should" be a raw _codepoint_.

I'm not really sure *if* these cases should be handled, nor how :/

On Wednesday, 19 June 2013 at 16:54:01 UTC, monarch_dodra wrote:
> Hum... well, that's true for UTF-8 strings, if the _codeunit_ 0xe7 appears, it is not 'ç'.
>
> But when handling a 'char', there is no encoding, it "should" be raw _codepoint_.

No, char is a UTF8 code unit.
Code unit and code point become synonymous in UTF32, so dchar is a code point.
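Editorial sketch, not part of the original post: the difference between code units and code points shows up as soon as the same text is stored in the three string types. The example assumes only std.range.walkLength, which counts elements by iterating the range (for narrow strings, that means decoded code points).

```d
import std.range : walkLength;

// Each character type is a code unit of a fixed size.
static assert(char.sizeof == 1);   // UTF-8 code unit
static assert(wchar.sizeof == 2);  // UTF-16 code unit
static assert(dchar.sizeof == 4);  // UTF-32 code unit

void main()
{
    string  s8  = "aça";   // UTF-8
    wstring s16 = "aça"w;  // UTF-16
    dstring s32 = "aça"d;  // UTF-32

    // .length counts code units, not code points.
    assert(s8.length == 4);   // 'ç' needs two UTF-8 code units
    assert(s16.length == 3);  // 'ç' fits in one UTF-16 code unit
    assert(s32.length == 3);  // one code unit per code point

    // Walking the UTF-8 string decodes it, giving three code points.
    assert(s8.walkLength == 3);
}
```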

On Wednesday, June 19, 2013 19:02:55 anonymous wrote:
> On Wednesday, 19 June 2013 at 16:54:01 UTC, monarch_dodra wrote:
> > Hum... well, that's true for UTF-8 strings, if the _codeunit_ 0xe7 appears, it is not 'ç'.
> >
> > But when handling a 'char', there is no encoding, it "should" be raw _codepoint_.
>
> No, char is a UTF8 code unit.
> Code unit and code point become synonymous in UTF32, so dchar is
> a code point.

Exactly. char, wchar, and dchar are all code _units_, and dchar (UTF-32) is the only case where a code unit is guaranteed to be a code point. For both char (UTF-8) and wchar (UTF-16), the number of code units per code point is variable, and in the case of UTF-8, any code point which isn't an ASCII character is encoded as multiple code units.

Wikipedia and TDPL both have a nice chart showing the valid values for UTF-8 and how many code units are in a code point for each set of values:

http://en.wikipedia.org/wiki/UTF-8#Description

- Jonathan M Davis
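Editorial sketch, not part of the original thread: given the answers above, one way to make the original myFind behave as the asker expected is to take the needle as a dchar and let foreach decode the string. This variant is an assumption about what the asker wanted, not an API from Phobos. std.utf.stride is used only to illustrate the "how many code units per code point" chart mentioned above.

```d
import std.utf : stride;

// Hypothetical variant of myFind: the needle is a code point (dchar),
// and foreach decodes the UTF-8 string one code point at a time.
// The index i is the position of the first code unit of that code point.
string myFind(string s, dchar c)
{
    foreach (i, dchar sc; s)
        if (sc == c) return s[i .. $];
    return s[$ .. $];
}

void main()
{
    // No cast needed: 'ç' converts implicitly to dchar.
    assert(myFind("aça", 'ç') == "ça");
    assert(myFind("aça", 'z') == "");

    // stride reports how many code units the sequence at an index uses.
    assert(stride("aça", 0) == 1);    // 'a' is ASCII
    assert(stride("aça", 1) == 2);    // 'ç' starts with 0xC3
    assert(stride("abc瀀", 3) == 3);  // '瀀' starts with 0xE7
}
```

Searching char by char remains fine when the needle is known to be ASCII, since an ASCII code unit never appears inside a multi-unit UTF-8 sequence.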