January 14, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/14/2011 01:52 PM, Daniel Gibson wrote:
> Am 14.01.2011 07:26, schrieb Nick Sabalausky:
>> "Andrei Alexandrescu"<SeeWebsiteForEmail@erdani.org> wrote in message
>> news:igoj6s$17r6$1@digitalmars.com...
>>>
>>> I'm not so sure about that. What do you base this assessment on? Denis
>>> wrote a library that according to him does grapheme-related stuff nobody
>>> else does. So apparently graphemes is not what people care about
>>> (although
>>> it might be what they should care about).
>>>
>>
>> It's what they want, they just don't know it.
>>
>> Graphemes are what many people *think* code points are.
>>
>
> Agreed. Up until spir mentioned graphemes in this newsgroup I always
> thought that one Unicode code point == one character on the screen.
>
> I guess in the majority of use cases you want to operate on
> user-perceived characters.

That's what makes sense for the user in 99.9% of cases, thus that's what 
makes sense for the programmer, thus that's what makes sense for the 
language/type/lib designer.

denis
_________________
vita es estrany
spir.wikidot.com
January 14, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/14/2011 02:37 PM, Steven Schveighoffer wrote:
>
> * I don't even know how to make a grapheme that is more than one
> code-unit, let alone more than one code-point :)  Every time I try, I
> get 'invalid utf sequence'.
>
> I feel significantly ignorant on this issue, and I'm slowly getting
> enough knowledge to join the discussion, but being a dumb American who
> only speaks English, I have a hard time grasping how this shit all works.

1. See my text at 
https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/U%20missing%20level%20of%20abstraction

2.
    writeln ("A\u0308\u0330");
<A + umlaut (diaeresis) above + tilde below>
If it does not display properly, either set your terminal to UTF* or use 
a more unicode-aware font (eg DejaVu series).
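
To make the three levels concrete, here is a minimal sketch that counts code units, code points and graphemes for the string above. It assumes a present-day Phobos in which std.uni provides byGrapheme (this was not available when the thread was written); Counts and countLevels are hypothetical names used only for illustration.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

/// Hypothetical helper type: sizes of one string at three levels.
struct Counts
{
    size_t codeUnits;
    size_t codePoints;
    size_t graphemes;
}

/// Hypothetical helper: measures a UTF-8 string at each level.
Counts countLevels(string s)
{
    return Counts(
        s.length,                 // UTF-8 code units (bytes)
        s.walkLength,             // code points (strings decode to dchar)
        s.byGrapheme.walkLength); // user-perceived characters
}

void main()
{
    // 'A' followed by U+0308 (combining diaeresis) and U+0330 (combining tilde below).
    auto c = countLevels("A\u0308\u0330");
    writeln("code units:  ", c.codeUnits);  // 5
    writeln("code points: ", c.codePoints); // 3
    writeln("graphemes:   ", c.graphemes);  // 1
}
```

The single visible character therefore occupies five bytes and three code points, which is the gap the thread keeps coming back to.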

The point is not to play like that with Unicode's flexibility, but rather 
that composite characters are just normal thingies in most languages of 
the world. Actually, on this point, English is a rare exception (leaving 
aside letters imported from foreign languages, like French 'à'), to the 
point of being, I guess, the only western language without any diacritics.


Denis
_________________
vita es estrany
spir.wikidot.com
January 14, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On Fri, 14 Jan 2011 08:59:35 -0500, spir <denis.spir@gmail.com> wrote:

> On 01/14/2011 02:37 PM, Steven Schveighoffer wrote:
>>
>> * I don't even know how to make a grapheme that is more than one
>> code-unit, let alone more than one code-point :)  Every time I try, I
>> get 'invalid utf sequence'.
>>
>> I feel significantly ignorant on this issue, and I'm slowly getting
>> enough knowledge to join the discussion, but being a dumb American who
>> only speaks English, I have a hard time grasping how this shit all  
>> works.
>
> 1. See my text at  
> https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/U%20missing%20level%20of%20abstraction

I can't read that document, it's black background with super-dark-grey  
text.

> 2.
>      writeln ("A\u0308\u0330");
> <A + umlaut (diaeresis) above + tilde below>
> If it does not display properly, either set your terminal to UTF* or use  
> a more unicode-aware font (eg DejaVu series).

OK, I'll have to remember this so I can use it to test my string type ;)

> The point is not to play like that with Unicode's flexibility, but rather  
> that composite characters are just normal thingies in most languages of  
> the world. Actually, on this point, English is a rare exception (leaving  
> aside letters imported from foreign languages, like French 'à'), to the  
> point of being, I guess, the only western language without any diacritics.

Is it common to have multiple modifiers on a single character?  The  
problem I see with using decomposed canonical form for strings is that we  
would have to return a dchar[] for each 'element', which severely  
complicates code that, for instance, only expects to handle English.

I was hoping to lazily transform a string into its composed canonical  
form, allowing the (hopefully rare) exception when a composed character  
does not exist.  My thinking was that this at least gives a useful string  
representation for 90% of usages, leaving the remaining 10% of usages to  
find a more complex representation (like your Text type).  If we only get  
like 20% or 30% there by making dchar the element type, then we haven't  
made it useful enough.

Either way, we need a string type that can be compared canonically for  
things like searches or opEquals.

-Steve
January 14, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 2011-01-13 23:23:10 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail@erdani.org> said:

> On 1/13/11 7:09 PM, Michel Fortin wrote:
>> That's forgetting that most of the time people care about graphemes
>> (user-perceived characters), not code points.
> 
> I'm not so sure about that. What do you base this assessment on? Denis 
> wrote a library that according to him does grapheme-related stuff 
> nobody else does. So apparently graphemes is not what people care about 
> (although it might be what they should care about).

Apple implemented all these things in the NSString class in Cocoa. They 
did all this work on Unicode at the beginning of Mac OS X, at a time 
when making such changes wouldn't break anything.

It's a hard thing to change later when you have code that depends on the 
old behaviour. It's a complicated matter and not so many people will 
understand the issues, so it's no wonder many languages just deal with 
code points.


> This might be a good time to see whether we need to address graphemes 
> systematically. Could you please post a few links that would educate me 
> and others in the mysteries of combining characters?

As usual, Wikipedia offers a good summary and a couple of references. 
Here's the part about combining characters: 
<http://en.wikipedia.org/wiki/Combining_character>.

There are basically four ranges of code points which are combining:
- Combining Diacritical Marks (0300–036F)
- Combining Diacritical Marks Supplement (1DC0–1DFF)
- Combining Diacritical Marks for Symbols (20D0–20FF)
- Combining Half Marks (FE20–FE2F)

A code point followed by one or more code points in these ranges is 
conceptually a single character (a grapheme).
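
As a rough sketch of that rule, the following checks a code point against the four ranges listed above and counts "base plus combining marks" clusters. The function names are hypothetical, and the approach is only an approximation: Unicode defines combining characters outside these four blocks as well, and full grapheme segmentation has more rules than this.

```d
import std.stdio : writeln;

/// Hypothetical helper: true if `c` lies in one of the four blocks
/// listed above. Not an exhaustive test for combining characters.
bool inListedCombiningBlock(dchar c)
{
    return (c >= 0x0300 && c <= 0x036F)   // Combining Diacritical Marks
        || (c >= 0x1DC0 && c <= 0x1DFF)   // ... Supplement
        || (c >= 0x20D0 && c <= 0x20FF)   // ... for Symbols
        || (c >= 0xFE20 && c <= 0xFE2F);  // Combining Half Marks
}

/// Hypothetical helper: counts clusters, where a cluster is one code
/// point followed by any number of code points from the listed blocks.
size_t countClusters(string text)
{
    size_t clusters = 0;
    bool first = true;
    foreach (dchar c; text) // foreach with dchar decodes the UTF-8 input
    {
        // A new cluster starts at the first code point and at every
        // code point that is not one of the listed combining marks.
        if (first || !inListedCombiningBlock(c))
            ++clusters;
        first = false;
    }
    return clusters;
}

void main()
{
    writeln(countClusters("A\u0308\u0330")); // 1
    writeln(countClusters("expose\u0301"));  // 6
}
```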

But to compare strings correctly, you need to determine canonical 
equivalence. Wikipedia describes it in its Unicode normalization 
article <http://en.wikipedia.org/wiki/Unicode_normalization>. The full 
algorithm specification can be found here: 
<http://unicode.org/reports/tr15/>. The canonical form 
has both a composed and a decomposed variant, the first using 
pre-combined characters when possible, the second not using any 
pre-combined characters. Combining marks are not the only concern: there 
are a few single-code-point characters which have a duplicate somewhere 
else in the code point table.

Also, there are two normalizations: the canonical one (described above) 
and the compatibility one, which is more lax (making the ligature "ﬂ" 
equivalent to "fl", for instance). If a user searches for some 
text in a document, it's probably better to search using the 
compatibility normalization so that "ﬂower" (with ligature) and 
"flower" (without ligature) can match each other. If you want to search 
case-insensitively, then you'll need to implement the collation 
algorithm, but that's getting further afield.
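
The difference between the two normalizations can be sketched with std.uni.normalize from a present-day Phobos (an assumption; it did not exist when this was posted). canonicallyEqual and compatibilityEqual are hypothetical helper names.

```d
import std.stdio : writeln;
import std.uni : normalize, NFC, NFKC;

/// Hypothetical helper: equal after canonical composition (NFC).
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

/// Hypothetical helper: equal after compatibility composition (NFKC).
bool compatibilityEqual(string a, string b)
{
    return normalize!NFKC(a) == normalize!NFKC(b);
}

void main()
{
    string precomposed = "\u00FC";  // ü as a single code point
    string decomposed  = "u\u0308"; // u + combining diaeresis

    writeln(precomposed == decomposed);                 // false
    writeln(canonicallyEqual(precomposed, decomposed)); // true

    // U+FB02 is the "fl" ligature; only the compatibility form
    // treats it as the two letters f and l.
    writeln(canonicallyEqual("\uFB02ower", "flower"));   // false
    writeln(compatibilityEqual("\uFB02ower", "flower")); // true
}
```

Normalizing both operands on every comparison is the simplest thing that works; a real string type would probably cache or check for already-normalized input first.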

If you're wondering which direction to take, this official FAQ seems 
like a good resource (especially the first few questions):
<http://www.unicode.org/faq/normalization.html>

One important thing to note is that most of the time, strings already 
come in the normalized pre-composed form. So the normalization algorithm 
should be optimized for the case where it has nothing to do. That's 
what is said in section 1.3 Description of the Normalization Algorithm 
in the specification: 
<http://www.unicode.org/reports/tr15/#Description_Norm>.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 14, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 2011-01-14 01:44:19 -0500, "Nick Sabalausky" <a@a.a> said:

> "Andrei Alexandrescu" <SeeWebsiteForEmail@erdani.org> wrote in message
> news:igoqrm$1n5r$1@digitalmars.com...
>> Thanks. One further question is: in the above example with u-with-umlaut,
>> there is one code point that corresponds to the entire combination. Are
>> there combinations that do not have a unique code point?
> 
> My understanding is "yes". At least that's what I've heard, and I've never
> heard any claims of "no". I don't know of any specific ones offhand, though.
> Actually, it might be possible to use any combining character with any old
> letter or number (like maybe a 7 with an umlaut), though I'm not certain.

Correct, there are a lot of combinations with no pre-combined form. This 
should be no surprise given that you can apply any number of combining 
marks to any character.

	mythical 7 with an umlaut: 7̈
	mythical 7 with umlaut, ring above, and acute accent: 7̈̊́

I can't guarantee your news reader will display the above correctly, but 
it works as described in mine (Unison on Mac OS X). In fact, it should 
work in all Cocoa-based applications. This probably includes iOS-based 
devices too, but I haven't tested there.
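
One way to see which combinations have a pre-combined form is to compose them and count the code points that remain. This is only a sketch, assuming std.uni.normalize from a present-day Phobos; codePointsAfterNFC is a hypothetical helper name.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : normalize, NFC;

/// Hypothetical helper: number of code points left after canonical
/// composition (NFC). A result of 1 means a pre-combined form exists.
size_t codePointsAfterNFC(string s)
{
    return normalize!NFC(s).walkLength;
}

void main()
{
    writeln(codePointsAfterNFC("u\u0308")); // 1: composes to U+00FC
    writeln(codePointsAfterNFC("7\u0308")); // 2: no pre-combined "7 with umlaut"
    writeln(codePointsAfterNFC("7\u0308\u030A\u0301")); // 4
}
```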


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 14, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 2011-01-14 09:34:55 -0500, "Steven Schveighoffer" 
<schveiguy@yahoo.com> said:

> On Fri, 14 Jan 2011 08:59:35 -0500, spir <denis.spir@gmail.com> wrote:
> 
>> The point is not playing like that with Unicode flexibility. Rather 
>> that  composite characters are just normal thingies in most languages 
>> of the  world. Actually, on this point, english is a rare exception 
>> (discarding  letters imported from foreign languages like french 'à'); 
>> to the point  of beeing, I guess, the only western language without any 
>> diacritic.
> 
> Is it common to have multiple modifiers on a single character?

Not to my knowledge. But I rarely deal with non-Latin texts; there are 
probably some scripts out there that take advantage of this.


> The  problem I see with using decomposed canonical form for strings is 
> that we  would have to return a dchar[] for each 'element', which 
> severely  complicates code that, for instance, only expects to handle 
> English.

Actually, returning a sliced char[] or wchar[] could also be valid. 
User-perceived characters are basically a substring of one or more code 
points. I'm not sure it complicates the semantics of the language that 
much -- what's complicated about writing str.front == "a" instead 
of str.front == 'a'? -- although it probably would complicate the 
generated code and make it a little slower.

In the case of NSString in Cocoa, you can only access the 'characters' 
in their UTF-16 form. But everything from comparison to search for 
substring is done using graphemes. It's like they implemented 
specialized Unicode-aware algorithms for these functions. There's no 
genericness about how it handles graphemes.

I'm not sure yet about what would be the right approach for D.


> I was hoping to lazily transform a string into its composed canonical  
> form, allowing the (hopefully rare) exception when a composed character 
>  does not exist.  My thinking was that this at least gives a useful 
> string  representation for 90% of usages, leaving the remaining 10% of 
> usages to  find a more complex representation (like your Text type).  
> If we only get  like 20% or 30% there by making dchar the element type, 
> then we haven't  made it useful enough.
> 
> Either way, we need a string type that can be compared canonically for  
> things like searches or opEquals.

I wonder if normalized string comparison shouldn't be built directly into 
the char[], wchar[] and dchar[] types instead. Combine that with the idea 
above that iterating on a string would yield graphemes as char[], and this 
code would work perfectly irrespective of whether you used combining 
characters:

	foreach (grapheme; "exposé") {
		if (grapheme == "é")
			break;
	}

I think a good standard to evaluate our handling of Unicode is to see 
how easy it is to do things the right way. In the above, foreach would 
slice the string grapheme by grapheme, and the == operator would 
perform a normalized comparison. While it works correctly, it's 
probably not the most efficient way to do things, however.
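
The proposed semantics can be approximated in library code. The sketch below is not the built-in behaviour described above; it assumes byGrapheme and normalize from a present-day std.uni, and containsGrapheme is a hypothetical helper name.

```d
import std.array : array;
import std.conv : to;
import std.stdio : writeln;
import std.uni : byGrapheme, normalize, NFC;

/// Hypothetical helper: true if `text` contains a grapheme that is
/// canonically equivalent to `wanted`.
bool containsGrapheme(string text, string wanted)
{
    // Normalize the needle once, as UTF-32 for easy comparison.
    auto target = normalize!NFC(wanted.to!dstring);

    foreach (g; text.byGrapheme)
    {
        // g[] is the range of code points making up this grapheme.
        dchar[] cluster = g[].array;
        if (normalize!NFC(cluster) == target)
            return true;
    }
    return false;
}

void main()
{
    // Decomposed text, pre-combined needle, and the other way round.
    writeln(containsGrapheme("expose\u0301", "\u00E9")); // true
    writeln(containsGrapheme("expos\u00E9", "e\u0301")); // true
    writeln(containsGrapheme("expose", "\u00E9"));       // false
}
```

As noted, this is not efficient: every grapheme is copied and normalized separately.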

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 14, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
"spir" <denis.spir@gmail.com> wrote in message 
news:mailman.619.1295012086.4748.digitalmars-d@puremagic.com...
>
> If anyone finds a pointer to such an explanation, bravo, and thank you. 
> (You will certainly not find it in Unicode literature, for instance.)
> Nick's explanation below is good and concise. (Just 2 notes added.)

Yea, most Unicode explanations seem to talk all about "code-units vs 
code-points" and then they'll just have a brief note like "There's also 
other things like digraphs and combining codes." And that'll be all they 
mention.

You're right about the Unicode literature. It's the usual standards-body 
documentation, same as W3C: "Instead of only some people understanding how 
this works, let's encode the documentation in legalese (and have twenty 
only-slightly-different versions) to make sure that nobody understands how 
it works."

> You can also say there are 2 kinds of characters: simple like "u" & 
> composite "ü" or "ü??". The former are coded with a single (base) code, 
> the latter with one (rarely more) base codes and an arbitrary number of 
> combining codes.

Couple questions about the "more than one base codes":

- Do you know an example offhand?

- Does that mean like a ligature where the base codes form a single glyph, 
or does it mean that the combining code either spans or operates over 
multiple glyphs? Or can it go either way?

> For a majority of _common_ characters made of 2 or 3 codes (western 
> language letters, korean Hangul syllables,...), precombined codes have 
> been added to the set. Thus, they can be coded with a single code like 
> simple characters.
>

Out of curiosity, how do decomposed Hangul characters work? (Or do you 
know?) Not actually knowing any Korean, my understanding is that they're a 
set of 1 to 4 phonetic glyphs that are then combined into one glyph. So, is 
it like a series of base codes that automatically combine, or are there 
combining characters involved?
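
For reference, a pre-combined Hangul syllable decomposes canonically into two or three conjoining jamo code points (leading consonant, vowel, optional trailing consonant), computed arithmetically rather than looked up. A minimal sketch of that arithmetic, using the constants from the Unicode Standard's Hangul syllable decomposition; the function name is hypothetical.

```d
import std.stdio : writefln;

/// Hypothetical helper: canonical decomposition of one pre-combined
/// Hangul syllable into conjoining jamo. Other code points are
/// returned unchanged.
dchar[] decomposeHangulSyllable(dchar s)
{
    enum uint SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
    enum uint VCount = 21, TCount = 28;
    enum uint NCount = VCount * TCount; // 588 syllables per leading consonant
    enum uint SCount = 19 * NCount;     // 11172 pre-combined syllables

    if (s < SBase || s >= SBase + SCount)
        return [s];

    uint sIndex = s - SBase;
    dchar[] jamo = [
        cast(dchar)(LBase + sIndex / NCount),            // leading consonant
        cast(dchar)(VBase + (sIndex % NCount) / TCount), // vowel
    ];
    uint tIndex = sIndex % TCount;
    if (tIndex != 0)
        jamo ~= cast(dchar)(TBase + tIndex);             // trailing consonant
    return jamo;
}

void main()
{
    // U+D55C decomposes to U+1112 U+1161 U+11AB.
    foreach (dchar c; decomposeHangulSyllable(0xD55C))
        writefln("U+%04X", cast(uint) c);
}
```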

> [Also note, to avoid things be too simple ;-), some (few) combining codes 
> called "prepend" come _before_ the base in raw code sequence...]
>

Fun!
January 14, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
"spir" <denis.spir@gmail.com> wrote in message 
news:mailman.624.1295013588.4748.digitalmars-d@puremagic.com...
>
> If it does not display properly, either set your terminal to UTF* or use a 
> more unicode-aware font (eg DejaVu series).
>

How to do that on the Windows (XP) command prompt, for anyone who doesn't 
know:

Step 1:

Right-click title bar, "Properties", "Font" tab, set font to "Lucida 
Console" (It'll look weird at first, but you get used to it.)

Step 2 (I had to google this step):

For just the current terminal session: Run "chcp 65001" (i.e. "CHange Code 
Page"). Also, you can run "chcp" to just see what codepage you're already set 
to.

To make it work permanently: Put "chcp 65001" into the registry key 
"HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor\Autorun"
January 14, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
"Nick Sabalausky" <a@a.a> wrote in message 
news:igq9u6$1bqu$1@digitalmars.com...
>
> Step 2 (I had to google this step):
>
> For just the current terminal session: Run "chcp 65001" (i.e. "CHange Code 
> Page"). Also, you can run "chcp" to just see what codepage you're already 
> set to.
>
> To make it work permanently: Put "chcp 65001" into the registry key 
> "HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor\Autorun"
>

Forget that step 2, that causes "Active code page: 65001" to be sent to 
stdout *every* time system() is invoked. We shouldn't be relying on that. 
*This* is what should be done (and this really should be done in all D 
command line apps - or better yet, put into the runtime):

import std.stdio;

version(Windows)
{
   import std.c.windows.windows;
   extern(Windows) export BOOL SetConsoleOutputCP(UINT);
}

void main()
{
   version(Windows) SetConsoleOutputCP(65001);

   writeln("HuG says: Fukken Über Death Terminal");
}

See also: http://d.puremagic.com/issues/show_bug.cgi?id=1448
January 14, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/14/11, Nick Sabalausky <a@a.a> wrote:
> import std.stdio;
>
> version(Windows)
> {
>     import std.c.windows.windows;
>     extern(Windows) export BOOL SetConsoleOutputCP(UINT);
> }
>
> void main()
> {
>     version(Windows) SetConsoleOutputCP(65001);
>
>     writeln("HuG says: Fukken Über Death Terminal");
> }
>

Does that work for you? I get back:
HuG says: Fukken Ãœber Death Terminal
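
That output is consistent with the UTF-8 bytes of "Ü" being displayed under a single-byte code page: in Windows-1252, byte 0xC3 is "Ã" and byte 0x9C is "œ". A small sketch to inspect the bytes; utf8Bytes is a hypothetical helper name, and the code page explanation is an assumption about this particular console.

```d
import std.stdio : writef, writeln;
import std.string : representation;

/// Hypothetical helper: the raw UTF-8 code units of a string.
immutable(ubyte)[] utf8Bytes(string s)
{
    return s.representation;
}

void main()
{
    // U+00DC is "Ü"; written as an escape so the source encoding
    // does not matter.
    foreach (b; utf8Bytes("\u00DC"))
        writef("%02X ", b); // C3 9C
    writeln();
}
```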