January 13, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
"Andrei Alexandrescu" <SeeWebsiteForEmail@erdani.org> wrote in message 
news:ignon1$2p4k$1@digitalmars.com...
>
> This may sometimes not be what the user expected; most of the time they'd 
> care about the code points.
>

I dunno, spir has successfully convinced me that most of the time it's 
graphemes the user cares about, not code points. Using code points is just 
as misleading as using UTF-16 code units.
January 14, 2011
Re: Unicode's proper level of abstraction? [was: Re: VLERange:...]
On 2011-01-13 14:11:44 -0500, spir <denis.spir@gmail.com> said:

> In Cocoa, string sorting and case-insensitive comparison are also 
> dependent on the user's locale settings, although you can also specify 
> your own locale if the user's locale is not what you want. See kde 
> trying to invent a, hum, "natural", way of sorting file names...)

Mac OS has sorted file names in a "natural" way for a very long time 
(since Mac OS 8 I believe). By natural, I mean that numbers inside the 
file name are sorted in numeric order while the rest is sorted 
character by character. For instance "My File 2" will go before "My 
File 10" in file listings because "2" is less than "10".

There's an option in NSString comparison methods to use this ordering, 
but it's not the default.
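
A rough sketch of the ordering described above, written in D. `naturalLess` is a hypothetical helper name, not the Cocoa or Finder implementation: it compares digit runs by numeric value and everything else code unit by code unit, and it assumes each digit run fits in a `ulong`.

```d
import std.algorithm.sorting : sort;
import std.ascii : isDigit;
import std.conv : to;

/// Hypothetical helper: true if `a` sorts before `b` in "natural" order.
bool naturalLess(string a, string b)
{
    size_t i = 0, j = 0;
    while (i < a.length && j < b.length)
    {
        if (isDigit(a[i]) && isDigit(b[j]))
        {
            // Find the end of the digit run in each string.
            size_t ie = i, je = j;
            while (ie < a.length && isDigit(a[ie])) ++ie;
            while (je < b.length && isDigit(b[je])) ++je;

            // Compare the runs as numbers (assumption: they fit in a ulong).
            immutable x = a[i .. ie].to!ulong;
            immutable y = b[j .. je].to!ulong;
            if (x != y) return x < y;
            i = ie;
            j = je;
        }
        else
        {
            // Outside digit runs, compare code unit by code unit.
            if (a[i] != b[j]) return a[i] < b[j];
            ++i;
            ++j;
        }
    }
    // One string is a prefix of the other: the shorter one sorts first.
    return (a.length - i) < (b.length - j);
}

void main()
{
    auto files = ["My File 10", "My File 2", "My File 1"];
    sort!naturalLess(files);
    assert(files == ["My File 1", "My File 2", "My File 10"]);
}
```

A plain code-unit comparison would put "My File 10" before "My File 2", because '1' is less than '2'. Locale-aware collation is deliberately left out of this sketch.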


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 14, 2011
Re: Unicode's proper level of abstraction? [was: Re: VLERange:...]
On 2011-01-13 15:39:14 -0500, "Nick Sabalausky" <a@a.a> said:

> "Andrej Mitrovic" <andrej.mitrovich@gmail.com> wrote in message
> news:mailman.604.1294932704.4748.digitalmars-d@puremagic.com...
>> OT: Spir, do you know if I can change the syntax highlighting settings
>> on bitbucket? I can't see anything with these gray on dark-gray
>> colors: http://i.imgur.com/SmLk1.jpg
> 
> I'm getting the same problem too.

I bypassed the problem by fetching the files from the repository. But I 
agree it's very annoying.

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 14, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/13/2011 11:00 PM, Nick Sabalausky wrote:
> "Andrei Alexandrescu"<SeeWebsiteForEmail@erdani.org>  wrote in message
> news:ignon1$2p4k$1@digitalmars.com...
>>
>> This may sometimes not be what the user expected; most of the time they'd
>> care about the code points.
>>
>
> I dunno, spir has successfully convinced me that most of the time it's
> graphemes the user cares about, not code points. Using code points is just
> as misleading as using UTF-16 code units.

You are right that those two issues are really analogous. In practice, 
once universal text is truly and commonly used, I guess problems with 
codes-do-not-represent-characters may become far more obvious; and also 
far more serious, because (logical) errors can easily pass unseen.
[In fact, how can a programmer even know, for instance, that a search 
routine missed its target or returned a false positive, when dealing 
with characters from unknown languages? Indeed, there are test data 
sets, but they are useless if the tools one uses just ignore the issues.]
Using a 16-bit representation, and thus ignoring a fair number of 
code points, is maybe less problematic because there is rather little 
chance of randomly meeting characters outside the BMP (Basic 
Multilingual Plane, the part of UCS whose code points are < 0x10000).
Outside the BMP are the scripts of less commonly studied 
archaeological languages, and various sets of images such as alchemical 
symbols, playing cards or domino tiles. I doubt they'll ever be commonly 
used, except in specialised apps where the programmer knows perfectly 
well what they deal with.
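
To make the 16-bit point concrete, here is a small D sketch. `utf16Units` is a hypothetical helper wrapping `std.utf.codeLength`; it shows that a code point below 0x10000 takes one UTF-16 code unit, while the first code point above the BMP (U+10000, where the Linear B Syllabary block starts) takes two.

```d
import std.range : walkLength;
import std.utf : codeLength;

/// Hypothetical helper: number of UTF-16 code units needed for one code point.
size_t utf16Units(dchar c)
{
    return codeLength!wchar(c);
}

void main()
{
    wstring inBmp = "\u00E9"w;        // U+00E9, inside the BMP
    wstring aboveBmp = "\U00010000"w; // first code point above the BMP

    assert(inBmp.length == 1);        // one code unit
    assert(aboveBmp.length == 2);     // two code units (a surrogate pair)
    assert(aboveBmp.walkLength == 1); // still a single code point

    // The two units are a high and a low surrogate, not characters on their own.
    assert(aboveBmp[0] == 0xD800 && aboveBmp[1] == 0xDC00);
}
```

Code that indexes a `wstring` and treats each element as a character is therefore wrong only for text above the BMP, which is why the error can go unnoticed for a long time.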

A list of UCS blocks with pointers to detailed content can be found here:
http://www.fileformat.info/info/unicode/block/index.htm
Blocks over the BMP start with the line:
Linear B Syllabary 	U+10000 	U+1007F 	(88)

Denis
_________________
vita es estrany
spir.wikidot.com
January 14, 2011
Re: Unicode's proper level of abstraction? [was: Re: VLERange:...]
"Michel Fortin" <michel.fortin@michelf.com> wrote in message 
news:igo5v2$gq2$1@digitalmars.com...
> On 2011-01-13 14:11:44 -0500, spir <denis.spir@gmail.com> said:
>
>> In Cocoa, string sorting and case-insensitive comparison are also 
>> dependent on the user's locale settings, although you can also specify 
>> your own locale if the user's locale is not what you want. See kde trying 
>> to invent a, hum, "natural", way of sorting file names...)
>
> Mac OS has sorted file names in a "natural" way for a very long time (since 
> Mac OS 8 I believe). By natural, I mean that numbers inside the file name 
> are sorted in numeric order while the rest is sorted character by 
> character. For instance "My File 2" will go before "My File 10" in file 
> listings because "2" is less than "10".
>

XP's explorer does that too. It's a very nice feature.
January 14, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 2011-01-13 15:51:00 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail@erdani.org> said:

> On 1/13/11 11:35 AM, Steven Schveighoffer wrote:
>> On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu
>> <SeeWebsiteForEmail@erdani.org> wrote:
>>> Let's take a look:
>>> 
>>> // Incorrect string code
>>> void fun(string s) {
>>>     foreach (i; 0 .. s.length) {
>>>         writeln("The character in position ", i, " is ", s[i]);
>>>     }
>>> }
>>> 
>>> // Incorrect string_t code
>>> void fun(string_t!char s) {
>>>     foreach (i; 0 .. s.codeUnits) {
>>>         writeln("The character in position ", i, " is ", s[i]);
>>>     }
>>> }
>>> 
>>> Both functions are incorrect, albeit in different ways. The only
>>> improvement I'm seeing is that the user needs to write codeUnits
>>> instead of length, which may make her think twice. Clearly, however,
>>> copiously incorrect code can be written with the proposed interface
>>> because it tries to hide the reality that underneath a variable-length
>>> encoding is being used, but doesn't hide it completely (albeit for
>>> good efficiency-related reasons).
>> 
>> You might be looking at my previous version. The new version (recently
>> posted) will throw an exception for that code if a multi-code-unit
>> code-point is found.
> 
> I was looking at your latest. It's code that compiles and runs, but 
> dynamically fails on some inputs. I agree that it's often better to 
> fail noisily instead of silently, but in a manner of speaking the 
> string-based code doesn't fail at all - it correctly iterates the code 
> units of a string. This may sometimes not be what the user expected; 
> most of the time they'd care about the code points.

That's forgetting that most of the time people care about graphemes 
(user-perceived characters), not code points.


>> It also supports this:
>> 
>> foreach(i, d; s)
>> {
>>     writeln("The character in position ", i, " is ", d);
>> }
>> 
>> where i is the index (might not be sequential)
> 
> Well string supports that too, albeit with the nit that you need to 
> specify dchar.

Except it breaks with combining characters. For instance, take the 
string "t̃", which is two code points -- 't' followed by combining 
tilde (U+0303) -- and you'll get the following output:

	The character in position 0 is t
	The character in position 1 is ̃

(Note that the tilde becomes combined with the preceding space character.)

The conception of character that normal people have does not match the 
notion of code points when combining characters enter the equation.
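
The three ways of counting the same string can be sketched in D as follows. This assumes a current Phobos, where `std.uni.byGrapheme` is available; `graphemeCount` is a hypothetical helper name used only for illustration.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

/// Hypothetical helper: number of user-perceived characters in `s`.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    string s = "t\u0303"; // 't' followed by the combining tilde U+0303

    assert(s.length == 3);        // UTF-8 code units: 1 for 't', 2 for U+0303
    assert(s.walkLength == 2);    // code points
    assert(graphemeCount(s) == 1); // graphemes

    // Iterating by grapheme keeps the base letter and its mark together.
    foreach (g; s.byGrapheme)
        writeln("one grapheme, made of ", g.length, " code point(s)");
}
```

Iterating with `foreach (dchar d; s)` visits two items here, which is the broken output shown above; iterating by grapheme visits one.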


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 14, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/13/11 7:09 PM, Michel Fortin wrote:
> On 2011-01-13 15:51:00 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> said:
>
>> On 1/13/11 11:35 AM, Steven Schveighoffer wrote:
>>> On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu
>>> <SeeWebsiteForEmail@erdani.org> wrote:
>>>> Let's take a look:
>>>>
>>>> // Incorrect string code
>>>> void fun(string s) {
>>>>     foreach (i; 0 .. s.length) {
>>>>         writeln("The character in position ", i, " is ", s[i]);
>>>>     }
>>>> }
>>>>
>>>> // Incorrect string_t code
>>>> void fun(string_t!char s) {
>>>>     foreach (i; 0 .. s.codeUnits) {
>>>>         writeln("The character in position ", i, " is ", s[i]);
>>>>     }
>>>> }
>>>>
>>>> Both functions are incorrect, albeit in different ways. The only
>>>> improvement I'm seeing is that the user needs to write codeUnits
>>>> instead of length, which may make her think twice. Clearly, however,
>>>> copiously incorrect code can be written with the proposed interface
>>>> because it tries to hide the reality that underneath a variable-length
>>>> encoding is being used, but doesn't hide it completely (albeit for
>>>> good efficiency-related reasons).
>>>
>>> You might be looking at my previous version. The new version (recently
>>> posted) will throw an exception for that code if a multi-code-unit
>>> code-point is found.
>>
>> I was looking at your latest. It's code that compiles and runs, but
>> dynamically fails on some inputs. I agree that it's often better to
>> fail noisily instead of silently, but in a manner of speaking the
>> string-based code doesn't fail at all - it correctly iterates the code
>> units of a string. This may sometimes not be what the user expected;
>> most of the time they'd care about the code points.
>
> That's forgetting that most of the time people care about graphemes
> (user-perceived characters), not code points.

I'm not so sure about that. What do you base this assessment on? Denis 
wrote a library that according to him does grapheme-related stuff nobody 
else does. So apparently graphemes are not what people care about 
(although it might be what they should care about).

>>> It also supports this:
>>>
>>> foreach(i, d; s)
>>> {
>>>     writeln("The character in position ", i, " is ", d);
>>> }
>>>
>>> where i is the index (might not be sequential)
>>
>> Well string supports that too, albeit with the nit that you need to
>> specify dchar.
>
> Except it breaks with combining characters. For instance, take the
> string "t̃", which is two code points -- 't' followed by combining tilde
> (U+0303) -- and you'll get the following output:
>
> The character in position 0 is t
> The character in position 1 is ̃
>
> (Note that the tilde becomes combined with the preceding space character.)
>
> The conception of character that normal people have does not match the
> notion of code points when combining characters enter the equation.

This might be a good time to see whether we need to address graphemes 
systematically. Could you please post a few links that would educate me 
and others in the mysteries of combining characters?


Thanks,

Andrei
January 14, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
"Andrei Alexandrescu" <SeeWebsiteForEmail@erdani.org> wrote in message 
news:igoj6s$17r6$1@digitalmars.com...
>
> I'm not so sure about that. What do you base this assessment on? Denis 
> wrote a library that according to him does grapheme-related stuff nobody 
> else does. So apparently graphemes are not what people care about (although 
> it might be what they should care about).
>

It's what they want, they just don't know it.

Graphemes are what many people *think* code points are.

>
> This might be a good time to see whether we need to address graphemes 
> systematically. Could you please post a few links that would educate me 
> and others in the mysteries of combining characters?
>

Maybe someone else has a link to an explanation (I don't), but it's 
basically just this:

Three levels of abstraction from lowest to highest:
- Code Unit (ie, encoding)
- Code Point (ie, what Unicode assigns distinct numbers to)
- Grapheme (ie, what we think of as a "character")

A code-point can be made up of one or more code-units. Likewise, a grapheme 
can be made up of one or more code-points.

There are (at least) two types of code points:

- Regular ones, such as letters, digits, and punctuation.

- "Combining Characters", such as accent marks (or if you're familiar with 
Japanese, the little things in the upper-right corner that change an "s" to 
a "z" or an "h" to a "p". Or like German's umlaut - the two dots above a 
vowel). Ie, things that are not characters in their own right, but merely 
modify other characters. These can often (always?) be thought of as being 
like overlays.

If a code point representing a "combining character" exists in a string, 
then instead of being displayed as a character it merely modifies whatever 
code-point came before it.

So, for instance, if you want to store the German word for five (in all 
lower-case), there are two ways to do it:

[ 'f', {u with the umlaut}, 'n', 'f' ]

Or:

[ 'f', 'u', {umlaut combining character}, 'n', 'f' ]

Those *both* get rendered exactly the same, and both represent the same 
four-letter sequence. In the second example, the 'u' and the {umlaut 
combining character} combine to form one grapheme. The f's and n's just 
happen to be single-code-point graphemes.
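
The two spellings above can be compared in D roughly as follows. This is a sketch that assumes a current Phobos with `std.uni.normalize`; `canonicallyEqual` is a hypothetical helper name, not a library function.

```d
import std.range : walkLength;
import std.uni : normalize, NFC, NFD;

/// Hypothetical helper: true if both strings are the same text once
/// brought to the same normalization form (NFC here).
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string precomposed = "f\u00FCnf"; // 'f', {u with the umlaut}, 'n', 'f'
    string decomposed = "fu\u0308nf"; // 'f', 'u', {combining umlaut}, 'n', 'f'

    assert(precomposed.walkLength == 4); // four code points
    assert(decomposed.walkLength == 5);  // five code points
    assert(precomposed != decomposed);   // a plain comparison says "different"

    // Normalizing either way maps one spelling onto the other.
    assert(normalize!NFC(decomposed) == precomposed);
    assert(normalize!NFD(precomposed) == decomposed);
    assert(canonicallyEqual(precomposed, decomposed));
}
```

Searching or comparing without normalizing first is how the same word, rendered identically, can fail to match itself.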

Note that while some characters exist in pre-combined form (such as the {u 
with the umlaut} above), legend has it there are others that can only be 
represented using a combining character.

It's also my understanding, though I'm not certain, that sometimes multiple 
combining characters can be used together on the same "root" character.

Caveat: There may very well be further complications that I'm not aware of. 
Heck, knowing Unicode, there probably are.
January 14, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/13/11 10:26 PM, Nick Sabalausky wrote:
[snip]
> [ 'f', {u with the umlaut}, 'n', 'f' ]
>
> Or:
>
> [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]
>
> Those *both* get rendered exactly the same, and both represent the same
> four-letter sequence. In the second example, the 'u' and the {umlaut
> combining character} combine to form one grapheme. The f's and n's just
> happen to be single-code-point graphemes.
>
> Note that while some characters exist in pre-combined form (such as the {u
> with the umlaut} above), legend has it there are others that can only be
> represented using a combining character.
>
> It's also my understanding, though I'm not certain, that sometimes multiple
> combining characters can be used together on the same "root" character.

Thanks. One further question is: in the above example with 
u-with-umlaut, there is one code point that corresponds to the entire 
combination. Are there combinations that do not have a unique code point?

Andrei
January 14, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
"Andrei Alexandrescu" <SeeWebsiteForEmail@erdani.org> wrote in message 
news:igoqrm$1n5r$1@digitalmars.com...
> On 1/13/11 10:26 PM, Nick Sabalausky wrote:
> [snip]
>> [ 'f', {u with the umlaut}, 'n', 'f' ]
>>
>> Or:
>>
>> [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]
>>
>> Those *both* get rendered exactly the same, and both represent the same
>> four-letter sequence. In the second example, the 'u' and the {umlaut
>> combining character} combine to form one grapheme. The f's and n's just
>> happen to be single-code-point graphemes.
>>
>> Note that while some characters exist in pre-combined form (such as the {u
>> with the umlaut} above), legend has it there are others that can only be
>> represented using a combining character.
>>
>> It's also my understanding, though I'm not certain, that sometimes multiple
>> combining characters can be used together on the same "root" character.
>
> Thanks. One further question is: in the above example with u-with-umlaut, 
> there is one code point that corresponds to the entire combination. Are 
> there combinations that do not have a unique code point?
>

My understanding is "yes". At least that's what I've heard, and I've never 
heard any claims of "no". I don't know of any specific ones offhand, though. 
Actually, it might be possible to use any combining character with any old 
letter or number (like maybe a 7 with an umlaut), though I'm not certain.
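
The "7 with an umlaut" case can be checked with a short D sketch, again assuming a current Phobos (`std.uni.normalize`, `std.uni.byGrapheme`). `nfcCodePoints` is a hypothetical helper name. Composing to NFC collapses 'u' plus the combining umlaut into one code point, but leaves '7' plus the same mark as two, because no precomposed code point exists for it.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

/// Hypothetical helper: code points left after composing as far as possible.
size_t nfcCodePoints(string s)
{
    return normalize!NFC(s).walkLength;
}

void main()
{
    // 'u' + combining umlaut has a precomposed form (U+00FC).
    assert(nfcCodePoints("u\u0308") == 1);

    // '7' + combining umlaut has none, so it stays two code points...
    assert(nfcCodePoints("7\u0308") == 2);
    // ...yet it is still a single grapheme.
    assert("7\u0308".byGrapheme.walkLength == 1);

    // Several combining marks can stack on one base character:
    // 'a' + combining umlaut + combining acute is three code points, one grapheme.
    string stacked = "a\u0308\u0301";
    assert(stacked.walkLength == 3);
    assert(stacked.byGrapheme.walkLength == 1);
}
```

So normalization alone does not reduce every grapheme to a single code point; grapheme segmentation is a separate step.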

FWIW, the Wikipedia article might help, or at least link to other things 
that might help: http://en.wikipedia.org/wiki/Combining_character

Michel or spir might have better links though.