January 13, 2011
"Andrei Alexandrescu" <SeeWebsiteForEmail@erdani.org> wrote in message news:ignon1$2p4k$1@digitalmars.com...
>
> This may sometimes not be what the user expected; most of the time they'd care about the code points.
>

I dunno, spir has successfully convinced me that most of the time it's graphemes the user cares about, not code points. Using code points is just as misleading as using UTF-16 code units.


January 14, 2011
On 2011-01-13 14:11:44 -0500, spir <denis.spir@gmail.com> said:

> In Cocoa, string sorting and case-insensitive comparison is also dependent on the user's locale settings, although you can also specify your own locale if the user's locale is not what you want. (See KDE trying to invent a, hum, "natural", way of sorting file names...)

Mac OS has sorted file names in a "natural" way for a very long time (since Mac OS 8, I believe). By natural, I mean that numbers inside the file name are sorted in numeric order while the rest is sorted character by character. For instance, "My File 2" will go before "My File 10" in file listings because 2 is less than 10.
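
For illustration, here is a minimal sketch of such a "natural" comparison in D. The name naturalCmp is hypothetical, and this is not how Mac OS or NSString implement it: it only recognises ASCII digits and compares everything else code unit by code unit, with no locale awareness.

```d
import std.algorithm : sort;
import std.ascii : isDigit;
import std.stdio : writeln;

/// Hypothetical helper: returns a negative, zero or positive value.
/// Runs of ASCII digits compare by numeric value; the rest compares
/// code unit by code unit.
int naturalCmp(string a, string b)
{
    size_t i = 0, j = 0;
    while (i < a.length && j < b.length)
    {
        if (isDigit(a[i]) && isDigit(b[j]))
        {
            // Skip leading zeros so that "007" and "7" have the same value.
            while (i < a.length && a[i] == '0') ++i;
            while (j < b.length && b[j] == '0') ++j;
            // Find the end of each digit run.
            size_t ie = i, je = j;
            while (ie < a.length && isDigit(a[ie])) ++ie;
            while (je < b.length && isDigit(b[je])) ++je;
            // More significant digits means a larger number.
            if (ie - i != je - j)
                return (ie - i) < (je - j) ? -1 : 1;
            // Same number of digits: compare them left to right.
            for (; i < ie; ++i, ++j)
                if (a[i] != b[j])
                    return a[i] < b[j] ? -1 : 1;
        }
        else
        {
            if (a[i] != b[j])
                return a[i] < b[j] ? -1 : 1;
            ++i;
            ++j;
        }
    }
    // Whichever string still has text left sorts last.
    if (i < a.length) return 1;
    if (j < b.length) return -1;
    return 0;
}

void main()
{
    auto names = ["My File 10", "My File 2", "My File 1"];
    sort!((a, b) => naturalCmp(a, b) < 0)(names);
    writeln(names); // ["My File 1", "My File 2", "My File 10"]
}
```

A plain character-by-character sort would put "My File 10" before "My File 2", since '1' is less than '2'.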

There's an option in NSString comparison methods to use this ordering, but it's not the default.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 14, 2011
On 2011-01-13 15:39:14 -0500, "Nick Sabalausky" <a@a.a> said:

> "Andrej Mitrovic" <andrej.mitrovich@gmail.com> wrote in message
> news:mailman.604.1294932704.4748.digitalmars-d@puremagic.com...
>> OT: Spir, do you know if I can change the syntax highlighting settings
>> on bitbucket? I can't see anything with these gray on dark-gray
>> colors: http://i.imgur.com/SmLk1.jpg
> 
> I'm getting the same problem too.

I bypassed the problem by fetching the files from the repository. But I agree it's very annoying.

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 14, 2011
On 01/13/2011 11:00 PM, Nick Sabalausky wrote:
> "Andrei Alexandrescu"<SeeWebsiteForEmail@erdani.org>  wrote in message
> news:ignon1$2p4k$1@digitalmars.com...
>>
>> This may sometimes not be what the user expected; most of the time they'd
>> care about the code points.
>>
>
> I dunno, spir has successfully convinced me that most of the time it's
> graphemes the user cares about, not code points. Using code points is just
> as misleading as using UTF-16 code units.

You are right that those two issues are really analogous. In practice, once universal text is truly and commonly used, I guess problems with codes-do-not-represent-characters may become far more obvious, and also far more serious, because (logical) errors can easily pass unseen.
[In fact, how can a programmer even know, for instance, that a search routine missed its target or returned a false positive when dealing with characters from unknown languages? Indeed, there are test data sets, but they are useless if the tools one uses just ignore the issues.]
Using a 16-bit representation, and thus ignoring a fair number of code points, is maybe less problematic, because there is little chance of randomly meeting characters outside the BMP (Basic Multilingual Plane, the part of UCS whose code points are < 0x10000).
Outside the BMP are writing systems of less commonly studied archaeological languages, and various sets of images such as alchemical symbols, playing cards or domino tiles. I doubt they'll ever be commonly used, except in specialised apps whose programmers know perfectly well what they are dealing with.
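
To make the UTF-16 point concrete, here is a small sketch in D, assuming a current Phobos. It shows that a code point beyond the BMP occupies two UTF-16 code units (a surrogate pair), so code that treats each wchar as a character mishandles it. The helper name utf16Units is hypothetical.

```d
import std.range : walkLength;
import std.stdio : writefln;
import std.utf : codeLength;

/// Hypothetical helper: number of UTF-16 code units needed for one code point.
size_t utf16Units(dchar c)
{
    return codeLength!wchar(c);
}

void main()
{
    wstring inside  = "\u00E9"w;      // U+00E9, inside the BMP
    wstring outside = "\U00010000"w;  // U+10000, first code point beyond the BMP

    // .length counts code units; walkLength decodes and counts code points.
    writefln("U+00E9:  %s code unit(s), %s code point(s)",
             inside.length, inside.walkLength);   // 1 and 1
    writefln("U+10000: %s code unit(s), %s code point(s)",
             outside.length, outside.walkLength); // 2 and 1

    // The two code units form a surrogate pair.
    writefln("%04X %04X", cast(uint) outside[0], cast(uint) outside[1]); // D800 DC00
}
```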

A list of UCS blocks with pointers to detailed content can be found here:
http://www.fileformat.info/info/unicode/block/index.htm
Blocks beyond the BMP start with the line:
Linear B Syllabary 	U+10000 	U+1007F 	(88)

Denis
_________________
vita es estrany
spir.wikidot.com

January 14, 2011
"Michel Fortin" <michel.fortin@michelf.com> wrote in message news:igo5v2$gq2$1@digitalmars.com...
> On 2011-01-13 14:11:44 -0500, spir <denis.spir@gmail.com> said:
>
>> In Cocoa, string sorting and case-insensitive comparison is also dependent on the user's locale settings, although you can also specify your own locale if the user's locale is not what you want. (See KDE trying to invent a, hum, "natural", way of sorting file names...)
>
> Mac OS has sorted file names in a "natural" way for a very long time (since Mac OS 8, I believe). By natural, I mean that numbers inside the file name are sorted in numeric order while the rest is sorted character by character. For instance, "My File 2" will go before "My File 10" in file listings because 2 is less than 10.
>

XP's explorer does that too. It's a very nice feature.


January 14, 2011
On 2011-01-13 15:51:00 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> said:

> On 1/13/11 11:35 AM, Steven Schveighoffer wrote:
>> On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu
>> <SeeWebsiteForEmail@erdani.org> wrote:
>>> Let's take a look:
>>> 
>>> // Incorrect string code
>>> void fun(string s) {
>>>     foreach (i; 0 .. s.length) {
>>>         writeln("The character in position ", i, " is ", s[i]);
>>>     }
>>> }
>>> 
>>> // Incorrect string_t code
>>> void fun(string_t!char s) {
>>>     foreach (i; 0 .. s.codeUnits) {
>>>         writeln("The character in position ", i, " is ", s[i]);
>>>     }
>>> }
>>> 
>>> Both functions are incorrect, albeit in different ways. The only
>>> improvement I'm seeing is that the user needs to write codeUnits
>>> instead of length, which may make her think twice. Clearly, however,
>>> copiously incorrect code can be written with the proposed interface
>>> because it tries to hide the reality that underneath a variable-length
>>> encoding is being used, but doesn't hide it completely (albeit for
>>> good efficiency-related reasons).
>> 
>> You might be looking at my previous version. The new version (recently
>> posted) will throw an exception for that code if a multi-code-unit
>> code-point is found.
> 
> I was looking at your latest. It's code that compiles and runs, but dynamically fails on some inputs. I agree that it's often better to fail noisily instead of silently, but in a manner of speaking the string-based code doesn't fail at all - it correctly iterates the code units of a string. This may sometimes not be what the user expected; most of the time they'd care about the code points.

That's forgetting that most of the time people care about graphemes (user-perceived characters), not code points.


>> It also supports this:
>> 
>> foreach(i, d; s)
>> {
>>     writeln("The character in position ", i, " is ", d);
>> }
>> 
>> where i is the index (might not be sequential)
> 
> Well string supports that too, albeit with the nit that you need to specify dchar.

Except it breaks with combining characters. For instance, take the string "t̃", which is two code points -- 't' followed by combining tilde (U+0303) -- and you'll get the following output:

	The character in position 0 is t
	The character in position 1 is ̃

(Note that the tilde becomes combined with the preceding space character.)

The conception of character that normal people have does not match the notion of code points when combining characters enter the equation.
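
As a rough sketch of the mismatch, the D snippet below counts the same string three ways. It assumes a current Phobos, where std.uni provides byGrapheme (it was not available when this thread was written); the helper name graphemeCount is hypothetical.

```d
import std.range : walkLength;
import std.stdio : writefln;
import std.uni : byGrapheme;

/// Hypothetical helper: number of user-perceived characters in s.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    string s = "t\u0303"; // 't' followed by U+0303 COMBINING TILDE

    writefln("code units:  %s", s.length);         // 3 (UTF-8)
    writefln("code points: %s", s.walkLength);     // 2
    writefln("graphemes:   %s", graphemeCount(s)); // 1

    // Iterating by dchar visits the base letter and the combining mark separately.
    foreach (i, dchar d; s)
        writefln("code point at index %s: U+%04X", i, cast(uint) d);

    // Iterating by grapheme visits the pair as one unit.
    foreach (g; s.byGrapheme)
        writefln("grapheme made of %s code point(s)", g.length);
}
```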


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 14, 2011
On 1/13/11 7:09 PM, Michel Fortin wrote:
> On 2011-01-13 15:51:00 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> said:
>
>> On 1/13/11 11:35 AM, Steven Schveighoffer wrote:
>>> On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu
>>> <SeeWebsiteForEmail@erdani.org> wrote:
>>>> Let's take a look:
>>>>
>>>> // Incorrect string code
>>>> void fun(string s) {
>>>>     foreach (i; 0 .. s.length) {
>>>>         writeln("The character in position ", i, " is ", s[i]);
>>>>     }
>>>> }
>>>>
>>>> // Incorrect string_t code
>>>> void fun(string_t!char s) {
>>>>     foreach (i; 0 .. s.codeUnits) {
>>>>         writeln("The character in position ", i, " is ", s[i]);
>>>>     }
>>>> }
>>>>
>>>> Both functions are incorrect, albeit in different ways. The only
>>>> improvement I'm seeing is that the user needs to write codeUnits
>>>> instead of length, which may make her think twice. Clearly, however,
>>>> copiously incorrect code can be written with the proposed interface
>>>> because it tries to hide the reality that underneath a variable-length
>>>> encoding is being used, but doesn't hide it completely (albeit for
>>>> good efficiency-related reasons).
>>>
>>> You might be looking at my previous version. The new version (recently
>>> posted) will throw an exception for that code if a multi-code-unit
>>> code-point is found.
>>
>> I was looking at your latest. It's code that compiles and runs, but
>> dynamically fails on some inputs. I agree that it's often better to
>> fail noisily instead of silently, but in a manner of speaking the
>> string-based code doesn't fail at all - it correctly iterates the code
>> units of a string. This may sometimes not be what the user expected;
>> most of the time they'd care about the code points.
>
> That's forgetting that most of the time people care about graphemes
> (user-perceived characters), not code points.

I'm not so sure about that. What do you base this assessment on? Denis wrote a library that according to him does grapheme-related stuff nobody else does. So apparently graphemes are not what people care about (although they might be what people should care about).

>>> It also supports this:
>>>
>>> foreach(i, d; s)
>>> {
>>>     writeln("The character in position ", i, " is ", d);
>>> }
>>>
>>> where i is the index (might not be sequential)
>>
>> Well string supports that too, albeit with the nit that you need to
>> specify dchar.
>
> Except it breaks with combining characters. For instance, take the
> string "t̃", which is two code points -- 't' followed by combining tilde
> (U+0303) -- and you'll get the following output:
>
> The character in position 0 is t
> The character in position 1 is ̃
>
> (Note that the tilde becomes combined with the preceding space character.)
>
> The conception of character that normal people have does not match the
> notion of code points when combining characters enter the equation.

This might be a good time to see whether we need to address graphemes systematically. Could you please post a few links that would educate me and others in the mysteries of combining characters?


Thanks,

Andrei

January 14, 2011
"Andrei Alexandrescu" <SeeWebsiteForEmail@erdani.org> wrote in message news:igoj6s$17r6$1@digitalmars.com...
>
> I'm not so sure about that. What do you base this assessment on? Denis wrote a library that according to him does grapheme-related stuff nobody else does. So apparently graphemes are not what people care about (although they might be what people should care about).
>

It's what they want, they just don't know it.

Graphemes are what many people *think* code points are.

>
> This might be a good time to see whether we need to address graphemes systematically. Could you please post a few links that would educate me and others in the mysteries of combining characters?
>

Maybe someone else has a link to an explanation (I don't), but it's basically just this:

Three levels of abstraction from lowest to highest:
- Code Unit (ie, encoding)
- Code Point (ie, what Unicode assigns distinct numbers to)
- Grapheme (ie, what we think of as a "character")

A code-point can be made up of one or more code-units. Likewise, a grapheme can be made up of one or more code-points.

There are (at least) two types of code points:

- Regular ones, such as letters, digits, and punctuation.

- "Combining Characters", such as accent marks (or if you're familiar with Japanese, the little things in the upper-right corner that change an "s" to a "z" or an "h" to a "p". Or like German's umlaut - the two dots above a vowel). Ie, things that are not characters in their own right, but merely modify other characters. These can often (always?) be thought of as being like overlays.

If a code point representing a "combining character" exists in a string, then instead of being displayed as a character it merely modifies whatever code-point came before it.

So, for instance, if you want to store the German word for five (in all lower-case), there are two ways to do it:

[ 'f', {u with the umlaut}, 'n', 'f' ]

Or:

[ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

Those *both* get rendered exactly the same, and both represent the same four-letter sequence. In the second example, the 'u' and the {umlaut combining character} combine to form one grapheme. The f's and n's just happen to be single-code-point graphemes.
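
Here is a hedged sketch of what this means for comparisons, written in D against a current Phobos (std.uni.normalize was not available when this thread was written). The two spellings differ as raw strings but compare equal once both are brought to the same normalization form. The helper name sameText is hypothetical.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : NFC, NFD, byGrapheme, normalize;

/// Hypothetical helper: compares two strings after normalizing both to NFC.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string precomposed = "f\u00FCnf";  // 'f', U+00FC (u with umlaut), 'n', 'f'
    string decomposed  = "fu\u0308nf"; // 'f', 'u', U+0308 COMBINING DIAERESIS, 'n', 'f'

    writeln(precomposed == decomposed);        // false: different code points
    writeln(sameText(precomposed, decomposed)); // true: same text after NFC
    writeln(normalize!NFD(precomposed) == normalize!NFD(decomposed)); // true

    writeln(precomposed.walkLength, " vs ", decomposed.walkLength); // 4 vs 5
    // Both spellings contain four graphemes.
    writeln(precomposed.byGrapheme.walkLength, " vs ",
            decomposed.byGrapheme.walkLength);                      // 4 vs 4
}
```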

Note that while some characters exist in pre-combined form (such as the {u with the umlaut} above), legend has it there are others that can only be represented using a combining character.

It's also my understanding, though I'm not certain, that sometimes multiple combining characters can be used together on the same "root" character.

Caveat: There may very well be further complications that I'm not aware of. Heck, knowing Unicode, there probably are.


January 14, 2011
On 1/13/11 10:26 PM, Nick Sabalausky wrote:
[snip]
> [ 'f', {u with the umlaut}, 'n', 'f' ]
>
> Or:
>
> [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]
>
> Those *both* get rendered exactly the same, and both represent the same
> four-letter sequence. In the second example, the 'u' and the {umlaut
> combining character} combine to form one grapheme. The f's and n's just
> happen to be single-code-point graphemes.
>
> Note that while some characters exist in pre-combined form (such as the {u
> with the umlaut} above), legend has it there are others that can only be
> represented using a combining character.
>
> It's also my understanding, though I'm not certain, that sometimes multiple
> combining characters can be used together on the same "root" character.

Thanks. One further question is: in the above example with u-with-umlaut, there is one code point that corresponds to the entire combination. Are there combinations that do not have a unique code point?

Andrei


January 14, 2011
"Andrei Alexandrescu" <SeeWebsiteForEmail@erdani.org> wrote in message news:igoqrm$1n5r$1@digitalmars.com...
> On 1/13/11 10:26 PM, Nick Sabalausky wrote:
> [snip]
>> [ 'f', {u with the umlaut}, 'n', 'f' ]
>>
>> Or:
>>
>> [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]
>>
>> Those *both* get rendered exactly the same, and both represent the same four-letter sequence. In the second example, the 'u' and the {umlaut combining character} combine to form one grapheme. The f's and n's just happen to be single-code-point graphemes.
>>
>> Note that while some characters exist in pre-combined form (such as the
>> {u with the umlaut} above), legend has it there are others that can only be
>> represented using a combining character.
>>
>> It's also my understanding, though I'm not certain, that sometimes multiple
>> combining characters can be used together on the same "root" character.
>
> Thanks. One further question is: in the above example with u-with-umlaut, there is one code point that corresponds to the entire combination. Are there combinations that do not have a unique code point?
>

My understanding is "yes". At least that's what I've heard, and I've never heard any claims of "no". I don't know of any specific ones offhand, though. Actually, it might be possible to use any combining character with any old letter or number (like maybe a 7 with an umlaut), though I'm not certain.
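
One way to check this is with a current Phobos (std.uni.normalize and byGrapheme, neither of which existed when this thread was written). In the sketch below, NFC composes 'u' plus the combining diaeresis into one code point but leaves "7 with an umlaut" as two, because no precomposed code point exists for it. The helper name codePointsAfterNFC is hypothetical.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : NFC, byGrapheme, normalize;

/// Hypothetical helper: how many code points remain after NFC composition.
size_t codePointsAfterNFC(string s)
{
    return normalize!NFC(s).walkLength;
}

void main()
{
    string uUmlaut = "u\u0308";       // composes to U+00FC
    string sevenUmlaut = "7\u0308";   // no precomposed form
    string stacked = "a\u0308\u0301"; // 'a' + diaeresis + acute accent

    writeln(codePointsAfterNFC(uUmlaut));     // 1
    writeln(codePointsAfterNFC(sevenUmlaut)); // 2

    // Each is still a single grapheme, including the one with two stacked marks.
    writeln(sevenUmlaut.byGrapheme.walkLength); // 1
    writeln(stacked.byGrapheme.walkLength);     // 1
}
```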

FWIW, the Wikipedia article might help, or at least link to other things that might help: http://en.wikipedia.org/wiki/Combining_character

Michel or spir might have better links though.