Thread overview: Questions about Unicode, particularly Japanese
  Jun 08, 2010  Nick Sabalausky
  Jun 08, 2010  bearophile
  Jun 08, 2010  Matti Niemenmaa
  Jun 08, 2010  Nick Sabalausky
  Jun 08, 2010  Matti Niemenmaa
  Jun 09, 2010  Nick Sabalausky
  Jun 08, 2010  Ruslan Nikolaev
  Jun 09, 2010  Nick Sabalausky
  Jun 08, 2010  Michel Fortin
  Re: Another Q about Unicode, Folding Greek edition!
  Jun 09, 2010  Nick Sabalausky
  Jun 09, 2010  Don
June 08, 2010
The "Wide character support in D" thread got me to question and double-check some of my assumptions about Unicode. From double-checking the UTF-8 encoding and looking at the charts at http://www.unicode.org/charts/ , I realized that Japanese, Chinese and Korean characters are almost entirely (if not entirely) 3 bytes in UTF-8. For some reason I had been under the impression that the Japanese kanas and at least a few of the Chinese characters were 2 bytes in UTF-8. Turns out that's not the case. I thought I'd share that in case anyone else didn't know. Also, FWIW, Cyrillic (ex: Russian, AIUI) and Greek appear to be primarily, if not entirely, 2 bytes in UTF-8.
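A quick way to see those byte counts for yourself (Python used here purely for illustration, since it makes the chart numbers concrete):

```python
# Byte lengths of a few characters from different scripts in UTF-8.
samples = [
    ("a", "ASCII letter"),
    ("\u03A9", "Greek capital omega"),
    ("\u0414", "Cyrillic capital De"),
    ("\u305D", "hiragana 'so'"),
    ("\u65E5", "CJK ideograph 'sun/day'"),
    ("\uAC00", "Hangul syllable 'ga'"),
]
for ch, desc in samples:
    print(f"U+{ord(ch):04X} ({desc}): {len(ch.encode('utf-8'))} bytes")
# ASCII is 1 byte; Greek and Cyrillic are 2; kana, CJK and Hangul are 3,
# because everything from U+0800 through U+FFFF takes 3 bytes in UTF-8.
```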

But then I noticed something on the charts for the Japanese kanas (ex: http://www.unicode.org/charts/PDF/U3040.pdf ). Umm, first of all, for those unfamiliar with Japanese: there are two phonetic alphabets, hiragana and katakana (in addition to the Chinese characters), and they're based more on syllables than on the individual sounds of western-style letters. Also, some of the sounds are formed by adding a modifier to the symbol for a similar sound. For instance: そ (U+305D, hiragana "so") is the sound "so", and to make "zo" you add what looks like a double-quote to it: ぞ (U+305E, hiragana "zo"). (You may need to increase your font size to see it well.) That same modifier converts most of the "s"s to "z"s, any of the "h"s to "b"s, etc. And there's also another modifier (it looks like a little circle) that converts the "h"s to "p"s.

The thing is, there appear to also be Unicode code points for these modifiers by themselves (U+3099 and U+309A). Maybe I'm understanding it wrong, but according to page 3 in the document I linked to above, it looks like these are intended to be used in conjunction with the regular letters in order to modify them. So, it seems that there are two valid ways to encode a single character like ぞ ("zo"): either (U+305E) or (U+305D, U+3099).
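The two spellings can be checked directly. A Python sketch (used here just for illustration, since its unicodedata module exposes the Unicode equivalence machinery):

```python
import unicodedata

precomposed = "\u305E"        # ぞ as a single code point
combining   = "\u305D\u3099"  # そ followed by the combining mark

print(precomposed == combining)  # False: the raw code points differ
# Canonical normalization maps the two spellings onto each other:
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == combining)  # True
```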

I think these are what people call "combining characters", but every explanation of Unicode I've ever seen that actually mentions such things just hand-waves them away with "oh, yeah, and then there's something called 'combining characters' that can complicate things", and that's all they ever say.

So, my questions:

1. Am I correct in all of that?

2. Is there a proper way to encode that modifier character by itself? For instance, if you wanted to write "Japanese has a (the modifier by itself here) that changes a sound".

3. A text editor, for instance, is intended to treat something like (U+305D, U+3099) as a single character, right?

4. When comparing strings, are (U+305E) and (U+305D, U+3099) intended to compare as equal?

5. Does Phobos/Tango correctly abide by whatever the answer to #4 is?

6. Are there other languages with similar things for which the answers to #3 and #4 are different? (And if so, how does Phobos/Tango handle it?)

7. I assume Unicode doesn't have any provisions for Furigana, right? I assume that would be outside the scope of Unicode, but I thought I'd ask.


June 08, 2010
Nick Sabalausky:

> 3. A text editor, for instance, is intended to treat something like (U+305D, U+3099) as a single character, right?

Languages are a product of biology, and in biology it's usually hard to put absolute limits between things; all definitions must be flexible and a little fuzzy if they want to grasp enough of reality and be useful. So I think the answer to this question is yes.
When you iterate with D's foreach over a string that contains those, what is the right way to split the chars? Returning a single "char" 8 bytes long (that is, a string of two 32-bit chars) that contains them both is not wrong (but probably not expected) :-)
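That kind of grouping can be sketched roughly like this (Python here, as neither Phobos nor Tango offers such a primitive; note this only approximates full grapheme segmentation, which UAX #29 specifies in more detail):

```python
import unicodedata

def clusters(s):
    """Group each base code point with the combining marks that follow it.
    A rough approximation of grapheme clusters, not the full UAX #29
    segmentation algorithm."""
    group = ""
    for ch in s:
        # A nonzero combining class means ch attaches to the previous base.
        if group and unicodedata.combining(ch) == 0:
            yield group
            group = ""
        group += ch
    if group:
        yield group

# そ + combining dakuten + た: two clusters, not three code points.
print(list(clusters("\u305D\u3099\u305F")))
```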

Bye,
bearophile
June 08, 2010
On 2010-06-08 22:27, Nick Sabalausky wrote:
<snip>
>
> 1. Am I correct in all of that?

Yes. In particular, the three-byteness of CJK characters is an often-cited reason to use UTF-16 instead of UTF-8.

> 2. Is there a proper way to encode that modifier character by itself? For
> instance, if you wanted to write "Japanese has a (the modifier by itself
> here) that changes a sound".

You can combine it with a space, but yes: that mark, called the dakuten or voicing mark, can be encoded by itself as U+309B.

I recommend http://rishida.net/scripts/uniview/ for searching through Unicode.

> 3. A text editor, for instance, is intended to treat something like (U+305D,
> U+3099) as a single character, right?

Yes, I'd say so. I suppose it could allow for removing only the modifier (or the modified), but that doesn't seem like it should be the default behaviour.

> 4. When comparing strings, are (U+305E) and (U+305D, U+3099) intended to
> compare as equal?

Yes. You might want to read about equivalence and normalization in Unicode:

http://en.wikipedia.org/wiki/Unicode_equivalence

> 5. Does Phobos/Tango correctly abide by whatever the answer to #4 is?

AFAIK, neither supports normalization of any kind.

> 6. Are there other languages with similar things for which the answers to #3
> and #4 are different? (And if so, how does Phobos/Tango handle it?)

Factor has pretty good support for Unicode:

http://docs.factorcode.org/content/article-unicode.html

> 7. I assume Unicode doesn't have any provisions for Furigana, right? I
> assume that would be outside the scope of Unicode, but I thought I'd ask.

There's:

U+FFF9  INTERLINEAR ANNOTATION ANCHOR
U+FFFA  INTERLINEAR ANNOTATION SEPARATOR
U+FFFB  INTERLINEAR ANNOTATION TERMINATOR

But it's usually recommended to use some kind of ruby markup instead. See:

http://en.wikipedia.org/wiki/Ruby_character#Ruby_in_Unicode

-- 
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
June 08, 2010
"Matti Niemenmaa" <see_signature@for.real.address> wrote in message news:hum6ft$2jar$1@digitalmars.com...
> On 2010-06-08 22:27, Nick Sabalausky wrote:
> <snip>

Thanks for the helpful response :)


>
> I recommend http://rishida.net/scripts/uniview/ for searching through Unicode.
>

Ahh, I'd been wanting a good Unicode equivalent to an ASCII chart. That seems to do nicely.


>> 6. Are there other languages with similar things for which the answers to
>> #3
>> and #4 are different? (And if so, how does Phobos/Tango handle it?)
>
> Factor has pretty good support for Unicode:
>
> http://docs.factorcode.org/content/article-unicode.html
>

Actually, I meant other human languages. Like, are there combining characters for some language other than Japanese that are intended to compare as unequal to their corresponding single-code-point version?


>> 7. I assume Unicode doesn't have any provisions for Furigana, right? I assume that would be outside the scope of Unicode, but I thought I'd ask.
>
> There's:
>
> U+FFF9  INTERLINEAR ANNOTATION ANCHOR
> U+FFFA  INTERLINEAR ANNOTATION SEPARATOR
> U+FFFB  INTERLINEAR ANNOTATION TERMINATOR
>
> But it's usually recommended to use some kind of ruby markup instead. See:
>
> http://en.wikipedia.org/wiki/Ruby_character#Ruby_in_Unicode
>

Thanks. I was wondering about those being there but not being recommended, so I followed that link and the footnote, and found the following very helpful explanation:

http://www.unicode.org/reports/tr20/#Interlinear

Their explanation is easy to understand, but basically, those code points are there as a convenience for internal use by an application. They don't provide other information that would normally be important for markup, such as where to position the annotation. And they're not easily displayable in plain-text-only modes without the risk of subtly changing the meaning.

Any idea if "Ruby markup" has anything to do with the Ruby programming language? It's not clear from that Wikipedia article.


June 08, 2010
Sorry if this shows up as a top-post in your mail clients again. I'll try to figure out what's going on later today.


> 
> 1. Am I correct in all of that?

Yes. That's the reason I was saying that UTF-16 is *NOT* a lousy encoding. It really depends on the situation. The advantage is not only space but also faster processing (even for 2-byte letters: Greek, Cyrillic, etc.), since those 2 bytes can be read in one memory access, as opposed to UTF-8. Also, consider another thing: it's easier (and cheaper) to convert from ANSI to UTF-16, since a direct table can be created. Whereas for UTF-8, you have to do some shifts to build the multi-byte sequence for non-ASCII letters (even for Latin ones).

What encoding is better depends on your taste, language, applications, etc. I was simply pointing out that it's quite nice to have a universal 'tchar' type. My argument was never about which encoding is better; that's hard to tell in general. Besides, many people still use ANSI and not UTF-8.
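The direct-table conversion described above can be sketched like this (a Python illustration, with cp1251 standing in for an arbitrary "ANSI" code page; the names here are made up for the example):

```python
# Hypothetical sketch: a 256-entry table mapping every byte of a
# single-byte "ANSI" code page (cp1251 here) straight to a UTF-16
# code unit, so conversion is one table lookup per byte.
TABLE = [bytes([b]).decode("cp1251", errors="replace") for b in range(256)]

def cp1251_to_utf16_units(data: bytes) -> list:
    """Convert cp1251 bytes to a list of UTF-16 code units."""
    return [ord(TABLE[b]) for b in data]

units = cp1251_to_utf16_units("Привет".encode("cp1251"))
print(units == [ord(c) for c in "Привет"])  # True
```

Building the equivalent byte-oriented table for UTF-8 output is messier, since each entry would be a variable-length byte sequence rather than a fixed-width code unit.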




June 08, 2010
On 2010-06-08 23:16, Nick Sabalausky wrote:
> "Matti Niemenmaa"<see_signature@for.real.address>  wrote in message
> news:hum6ft$2jar$1@digitalmars.com...
>> On 2010-06-08 22:27, Nick Sabalausky wrote:
>>> 6. Are there other languages with similar things for which the answers to
>>> #3
>>> and #4 are different? (And if so, how does Phobos/Tango handle it?)
>>
>> Factor has pretty good support for Unicode:
>>
>> http://docs.factorcode.org/content/article-unicode.html
>>
>
> Actually, I meant other human-languages. Like, are there other combining
> characters for some language other than Japanese that are indended to be
> compared as unequal to their corresponding singe-code-point version?

Ah, sorry for the misunderstanding. :-)

I don't think so, no. The Unicode FAQ at http://www.unicode.org/faq/normalization.html says "Programs should always compare canonical-equivalent Unicode strings as equal".

> Any idea if "Ruby markup" has anything to do with the Ruby programming
> language? It's not clear from that Wikipedia article.

No, they're completely unrelated.

-- 
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi
June 08, 2010
On Tue, 08 Jun 2010 16:18:54 -0400, Ruslan Nikolaev <nruslan_devel@yahoo.com> wrote:

> Sorry, if it's again top post in your mail clients. I'll try to figure out what's going on later today.

It appears as a top-post in my newsreader too.

>
>
>>
>> 1. Am I correct in all of that?
>
> Yes. That's the reason I was saying that UTF-16 is *NOT* a lousy encoding. It really depends on the situation. The advantage is not only space but also faster processing (even for 2-byte letters: Greek, Cyrillic, etc.), since those 2 bytes can be read in one memory access, as opposed to UTF-8. Also, consider another thing: it's easier (and cheaper) to convert from ANSI to UTF-16, since a direct table can be created. Whereas for UTF-8, you have to do some shifts to build the multi-byte sequence for non-ASCII letters (even for Latin ones).
>
> What encoding is better depends on your taste, language, applications, etc. I was simply pointing out that it's quite nice to have universal 'tchar' type. My argument was never about which encoding is better - it's hard to tell in general. Besides, many people still use ANSI and not UTF-8.

Wouldn't this suggest that the decision of what character type to use would be more suited to what language you speak than what OS you are running?

-Steve
June 08, 2010
On 2010-06-08 15:27:10 -0400, "Nick Sabalausky" <a@a.a> said:

> So, my questions:
> 
> 1. Am I correct in all of that?

Yes. Note that combining characters exist for a variety of glyphs. There is, for instance, a "combining acute accent" that can be combined with an "e", so you could use two code points to write "é" if you wanted, instead of the single "pre-combined" code point.
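That "é" example can be poked at directly (a Python illustration):

```python
import unicodedata

precomposed = "\u00E9"    # é as one code point
decomposed  = "e\u0301"   # 'e' followed by the combining acute accent

# Both display as "é", but they differ at the code-point level:
print(len(precomposed), len(decomposed))   # 1 2
print(unicodedata.name(precomposed))       # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.name("\u0301"))          # COMBINING ACUTE ACCENT
```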

> 2. Is there a proper way to encode that modifier character by itself? For
> instance, if you wanted to write "Japanese has a (the modifier by itself
> here) that changes a sound".

Sometimes there is a separate (non-combining) character for that. For instance, there is a non-combining acute accent as a standalone character. Perhaps you can combine the combining character with a no-break space?

> 3. A text editor, for instance, is intended to treat something like (U+305D,
> U+3099) as a single character, right?

Yes. The two spellings are canonically equivalent: they normalize to the same sequence.

> 4. When comparing strings, are (U+305E) and (U+305D, U+3099) intended to
> compare as equal?

Yes, well, it depends on what you're trying to do. Say you're searching for "é" in a text editor, it should match both the normal and the combining version. In your code, it depends on what you want to do (if you want to replace U+305D U+3099 with U+305E, then obviously you search by code point).

I think the proper way to do this is to perform Unicode normalization on both strings before comparing code points.
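That normalize-then-compare approach might look like this (a Python sketch; `canon_equal` is a made-up helper name for the example):

```python
import unicodedata

def canon_equal(a: str, b: str) -> bool:
    """Compare two strings under canonical (NFC) equivalence."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(canon_equal("\u305E", "\u305D\u3099"))  # True: zo vs so + dakuten
print(canon_equal("\u00E9", "e\u0301"))       # True: precomposed vs combining
print("\u305E" == "\u305D\u3099")             # False without normalization
```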

> 5. Does Phobos/Tango correctly abide by whatever the answer to #4 is?

Probably not. But again, in some cases making a literal code-point search might be what you want.

It'd be interesting if someone could make a unicode normalizer in the form of a range in Phobos 2. That way you could compare both strings by comparing code points from the normalizer ranges, all this without having to create a normalized copy.

> 6. Are there other languages with similar things for which the answers to #3
> and #4 are different? (And if so, how does Phobos/Tango handle it?)

Not all combinations have a pre-combined form, so you can't always convert them to a single code point. But besides that, when there is a pre-combined form, the two should be treated as equivalent.

> 7. I assume Unicode doesn't have any provisions for Furigana, right? I
> assume that would be outside the scope of Unicode, but I thought I'd ask.

I'm pretty sure furigana is out of scope.

Reference:
<http://en.wikipedia.org/wiki/Combining_character>
<http://en.wikipedia.org/wiki/Unicode_normalization>

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

June 09, 2010
"Matti Niemenmaa" <see_signature@for.real.address> wrote in message news:hum8us$2o7m$1@digitalmars.com...
>
>> Any idea if "Ruby markup" has anything to do with the Ruby programming language? It's not clear from that Wikipedia article.
>
> No, they're completely unrelated.
>

Heh, you know, that would have been perfectly obvious from the article if I had just scrolled up a bit :)


June 09, 2010
"Ruslan Nikolaev" <nruslan_devel@yahoo.com> wrote in message news:mailman.138.1276028343.24349.digitalmars-d@puremagic.com...
> Sorry, if it's again top post in your mail clients. I'll try to figure out what's going on later today.
>
>
>>
>> 1. Am I correct in all of that?
>
> Yes. That's the reason I was saying that UTF-16 is *NOT* a lousy encoding. It really depends on the situation. The advantage is not only space but also faster processing (even for 2-byte letters: Greek, Cyrillic, etc.), since those 2 bytes can be read in one memory access, as opposed to UTF-8. Also, consider another thing: it's easier (and cheaper) to convert from ANSI to UTF-16, since a direct table can be created. Whereas for UTF-8, you have to do some shifts to build the multi-byte sequence for non-ASCII letters (even for Latin ones).
>

Yea, I need to remember not to try to post late at night ;)

