September 27, 2004
> >I'm testing and programming in D using UTF-8 under linux to encode the Vietnamese character set.
>
> Cool.
>
>
> >I have some trouble with the way D handles the char[].length property.
>
> length does what it does. What you need is a character count, which is something different.

I see. If that is the way it is, then I'll use functions operating on strings.

> >Therefore I still need to use a strlen function to determine the correct string length.
>
> Okay, here's one:

Thanks for the code examples.
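[Editorial aside: a character count over a UTF-8 char[] can be sketched with Phobos's std.utf.decode, which reads one code point and advances an index. This is an illustrative sketch, not the snipped example from the original post.]

```d
import std.utf;

// Count Unicode code points (characters), not bytes, in a UTF-8 string.
size_t charCount(const(char)[] s)
{
    size_t count = 0;
    size_t i = 0;
    while (i < s.length)
    {
        decode(s, i); // decodes one code point and steps i past its bytes
        count++;
    }
    return count;
}
```

So "Việt".length is 6 (bytes, because ệ takes three bytes in UTF-8) while charCount("Việt") is 4 (characters).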

>
> Phobos is not really geared up for Unicode yet. The string handling functions are defined to work only for ASCII.

I noticed. I'll use David's (Spotted Tiger) stringw.d and supplement it if necessary.

> >3. convert everything internally to UTF-16, convert it back to UTF-8 before output.
>
> Option 3 won't work in general. In general, you'll need to convert everything internally to UTF-32, not UTF-16. Of course, if it's just for Vietnamese, UTF-16 will be fine.

Strangely enough, the Win32 API uses UTF-16 as its encoding.
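[Editorial aside: the round trip of option 3 with UTF-32 internally, as suggested above, can be sketched with Phobos's std.utf conversion functions. This is my own illustration, not code from the thread.]

```d
import std.utf;

void main()
{
    char[] input = "Tiếng Việt".dup; // UTF-8 from the outside world
    auto wide = toUTF32(input);      // dchar[]: one element per code point
    assert(wide.length == 10);       // character count, not byte count
    // ... manipulate `wide` with ordinary array indexing ...
    auto output = toUTF8(wide);      // back to UTF-8 before output
    assert(output == input);
}
```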

> That's just not possible. A char is a UTF-8 fragment, not a Unicode character. They're just not the same.

I understand the issues, and UTF-8 in particular was actually designed with backwards compatibility in mind. (C uses the zero char as the terminator; had the world programmed in Pascal, we probably wouldn't have UTF-8.)

> >or
> >2. Implement a special String datatype (has been discussed earlier and Walter is against it.)
>
> This will happen anyway in time - by accident! ICU has a class called UnicodeString, so D will get that once ICU is ported.

But (if my memory serves me well) this is exactly what Walter wanted to prevent: a multitude of String classes, all doing the same thing but having slightly different interfaces. Should this be part of the language or of the Phobos library, don't you think? UTF-8 will always require a class of some sort...
I'm not trying to add fuel to the fire, but isn't this an important aspect for version 1.0?

Jaap

--
D Programming from Vietnam.


September 27, 2004
"Ben Hinkle" <bhinkle@mathworks.com> wrote in message news:cj7eb6$ole$1@digitaldaemon.com...
>
>
> That is by design. Out of curiosity, what are you doing with your strings that requires the number of characters? Usually one just deals with string fragments and it doesn't matter how long it is (either in characters or in bytes). In a perfect world your expectation of having a one-to-one mapping between array indexing and character indexing would clearly be nice to have.
> But the current design is (in Walter's opinion - and I agree with him) the best we can do given the imperfect world we find ourselves in and given D's design goals.
>
If this is by design, then fine. Who am I to change it? It is just that I need to insert characters into existing strings.
I see. Moreover, if char[] behaves the way it currently does it will be fast, but it probably wouldn't be if it had to interpret the array as UTF-8.
But then I see little difference between byte[] and char[]. They are basically the same and can be interpreted ambiguously. Something that Walter wanted to prevent, if I remember correctly.
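[Editorial aside: for insertion at a character position, one workable (if allocation-heavy) pattern is to round-trip through dchar[], where each array slot is exactly one code point. A sketch assuming Phobos's std.utf conversions; insertChar is an illustrative name of mine.]

```d
import std.utf;

// Insert a character at a *character* index (not a byte index) by
// converting to dchar[], splicing, and converting back to UTF-8.
char[] insertChar(const(char)[] s, size_t charIndex, dchar c)
{
    dchar[] wide = toUTF32(s);
    wide = wide[0 .. charIndex] ~ c ~ wide[charIndex .. $];
    return toUTF8(wide).dup; // .dup yields a mutable char[]
}
```

For example, inserting 'ệ' at character index 2 of "Vit" yields "Việt", even though the new character occupies three bytes in the result.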

Jaap


September 27, 2004
In article <cj90u4$b46$1@digitaldaemon.com>, Jaap Geurts says...

>Strangely enough, the Win32 API uses UTF-16 as its encoding.

Most Unicode platforms use UTF-16, including the ICU library. It follows logically, therefore, that on these platforms - /including/ the Windows API - you cannot use array indexing to find the nth character.

But there is method in this madness. The Unicode characters from U+10000 upwards are not characters from living languages. By and large, it is generally considered "harmless" to regard such characters as if they were two characters. For example, consider the character U+1D11E (musical symbol G clef) - does it really /matter/ if your application perceives instead U+D834 followed by U+DD1E? It won't affect casing, sorting or anything like that because the character isn't part of any living language script. From the point of view of most general purpose algorithms, it's just another shape to draw, like a WingDings symbol. So UTF-16 is simply the best space/speed compromise for the majority of real-life languages.
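[Editorial aside: the surrogate arithmetic behind that pair, for anyone checking: subtract 0x10000, then split the remaining 20 bits. A sketch of my own:]

```d
// How a code point above U+FFFF is encoded as a UTF-16 surrogate pair.
void toSurrogates(dchar c, out wchar hi, out wchar lo)
{
    uint v = c - 0x10000;                   // 20 significant bits remain
    hi = cast(wchar)(0xD800 + (v >> 10));   // high (lead) surrogate
    lo = cast(wchar)(0xDC00 + (v & 0x3FF)); // low (trail) surrogate
}
```

For U+1D11E this gives 0xD834 followed by 0xDD1E.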


>I understand the issues, and UTF-8 in particular was actually designed with backwards compatibility in mind. (C uses the zero char as the terminator; had the world programmed in Pascal, we probably wouldn't have UTF-8.)

The compatibility is with ASCII, not with C. There is no Unicode meaning of U+0000, apart from "some sort of application-dependent control character".



>> This will happen anyway in time - by accident! ICU has a class called UnicodeString, so D will get that once ICU is ported.
>
>But (if my memory serves me well) this is exactly what Walter wanted to prevent: a multitude of String classes, all doing the same thing but having slightly different interfaces. Should this be part of the language or of the Phobos library, don't you think?

Well, ICU is not really anything to do with D. It was originally a Java API, then got ported to C and C++. We'll have it in D, too, eventually. It's not my fault if ICU defines a string class. But I don't think Walter will be complaining - the ICU class isn't a simple "replacement" or "alternative" to char[] - it provides full Unicode functionality, in a way that char[] doesn't.

I don't think we'll be seeing "a multitude of String classes" either. To be honest, I don't think even ICU's UnicodeString class will ever become any kind of D "standard", because you won't be able to do implicit casting to/from it.



>UTF-8 will always require a class of some sort...

Well, I'm more inclined to the view that truly internationalized software just won't use UTF-8 at all. UTF-16 is much more manageable for this sort of thing. UTF-8 can do the job, but it's mainly intended for text which is mostly ASCII.

Arcane Jill


September 27, 2004
Arcane Jill schrieb:
> But there is method in this madness. The Unicode characters from U+10000
> upwards are not characters from living languages. By and large, it is
> generally considered "harmless" to regard such characters as if they were
> two characters. For example, consider the character U+1D11E (musical
> symbol G clef) - does it really /matter/ if your application perceives
> instead U+D834 followed by U+DD1E? It won't affect casing, sorting or
> anything like that because the character isn't part of any living
> language script.

Guess you missed the extended CJK part. There are names of living persons that can only be encoded using post-U+FFFF stuff. As a consequence it does affect sorting and "character"/"glyph"/"grapheme"/"codepoint"/"what-have-you" count algorithms.

Thomas




September 27, 2004
[snip]
> >> This will happen anyway in time - by accident! ICU has a class called UnicodeString, so D will get that once ICU is ported.
> >
> >But (if my memory serves me well) this is exactly what Walter wanted to prevent: a multitude of String classes, all doing the same thing but having slightly different interfaces. Should this be part of the language or of the Phobos library, don't you think?
>
> Well, ICU is not really anything to do with D. It was originally a Java API, then got ported to C and C++. We'll have it in D, too, eventually. It's not my fault if ICU defines a string class. But I don't think Walter will be complaining - the ICU class isn't a simple "replacement" or "alternative" to char[] - it provides full Unicode functionality, in a way that char[] doesn't.
>
> I don't think we'll be seeing "a multitude of String classes" either. To be honest, I don't think even ICU's UnicodeString class will ever become any kind of D "standard", because you won't be able to do implicit casting to/from it.

Is there a link to the String class API? I'm curious to see what the differences are from a function-based API. Is the basic difference that the String's encoding is determined at runtime? Maybe a struct would be better than a class:

struct ICUString {
    enum Encoding { UTF8, UTF16, UTF32 /*, ...*/ }
    Encoding encoding;
    uint len;
    void* data;
    // ... member functions like opIndex, etc ...
}

... plus functions like std.string with ICUString instead of char[] or wchar[] or dchar[] ...

[snip]


September 27, 2004
Thomas Kuehne wrote:
> Benjamin Herr <ben@0x539.de> schrieb:
> 
>>Arcane Jill wrote:
>>
>>>This will happen anyway in time - by accident! ICU has a class called
>>>UnicodeString, so D will get that once ICU is ported.
>>
>>So can we not just drop char and char[]s and define some standard string
>>class to be used for unicode strings (preferably one returning dchars
>>when prompted for individual characters)?
> 
> 
> I guess you didn't (yet) dive into Unicode?
> A "character" is something quite complicated.
I have only dealt with Unicode theoretically (so, no). I had no idea I was so far off, though.

> 1) it can consist of one codepoint like 0x41 "A"
Sounds easy so far.

> 2) two different codepoint sequences can be equal: 0xC1 "Á" and 0x41 0x301 "Á"
Is it not invalid, at least with UTF-8, to use anything but the shortest representation?

> [...]
>
> Above points out only some basics you'd have to implement in your string
> class.
I either have to implement it in my string class, or I have to do it `by hand' every time I need any of this functionality.
Which is why I suggested a class (or even just a struct, to keep the semantics closer to the standard arrays), after all.

-ben
September 27, 2004
Benjamin Herr schrieb:
> > 2) two different codepoint sequences can be equal: 0xC1 "Á" and 0x41 0x301 "Á"
> Is it not invalid, at least with UTF-8, to use anything but the shortest representation?

UTF-8/16/32 only deal with one codepoint at a time (except for some validity checking).
The codepoint sequence above would be U+00C1 "Á" versus U+0041 U+0301 "Á".
These are different Normalization Forms.
(http://www.unicode.org/reports/tr15/)
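[Editorial aside: to make the distinction concrete, the two forms are canonically equivalent yet compare unequal as plain arrays; normalization, not UTF validation, is what relates them. A minimal illustration:]

```d
// The precomposed and decomposed forms of "Á" are canonically equivalent,
// yet as plain dchar arrays they are neither equal nor even the same length.
void main()
{
    dchar[] nfc = ['\u00C1'];           // precomposed: A WITH ACUTE
    dchar[] nfd = ['\u0041', '\u0301']; // decomposed: "A" + COMBINING ACUTE
    assert(nfc != nfd);                 // array equality sees different data
    assert(nfc.length == 1 && nfd.length == 2);
}
```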

> > Above points out only some basics you'd have to implement in your string class.
> I either have to implement it in my string class, or I have to do it `by
> hand' every time I need any of this functionality.
> Which is why I suggested a class (or even just a struct, to keep the
> semantics closer to the standard arrays), after all.

If you ensure that the input only contains Latin (French/German...) / Greek / Cyrillic, fully in NFC/NFKC, you can assume for most cases that 1 dchar == 1 character.

If you really need full string handling, I suggest you assist Arcane Jill with porting the ICU.

Thomas




September 28, 2004
David,

I've examined your wstring library, and noticed that the case (islower, isupper) family of functions cannot handle languages other than plain Latin ASCII. Am I right in this?
What is needed, I guess, is for the user to supply a conversion table (are the functions in Phobos suitable?). I don't know enough about locale support in OSs, but if it is not available there we'd have to code it into the lib.

I'll do some probing about how to code it first, and if you wish I can provide you with the one for Vietnamese.

Regards, Jaap

"David L. Davis" <SpottedTiger@yahoo.com> wrote in message news:cj7aih$mq5$1@digitaldaemon.com...
> In article <opsexriepv2saxk9@krd8833t>, Jaap Geurts says...
> >
> >On Sat, 25 Sep 2004 10:50:41 -0400, Ben Hinkle <bhinkle4@juno.com> wrote:
> >
> >> I don't have that wstring.d handy but hopefully it covers these. If not please let the author know so they can add them (and/or contribute them yourself). Your help in improving the library support for wchar and dchar would most likely be very much appreciated.
> >
> >If someone is reading this and knows where the wstring.d is, can you please point me to it?
> >
> >Thanks, Jaap
> >
> >---
> >D programming from Vietnam
>
> Jaap Geurts: Yes, stringw.d (v0.3 beta) is one of my pet projects and you can find here: http://spottedtiger.tripod.com/D_Language/D_Support_Projects_XP.html
>
> Please, let me know if there's any missing std.string.d function(s) that you need, and I'll work on getting them in as soon as possible.
>
> David L.
>
> -------------------------------------------------------------------
> "Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"


September 28, 2004
In article <cjal85$1oia$1@digitaldaemon.com>, Jaap Geurts says...
>
>What is needed I guess is for the user to supply a conversion table (are the functions in phobos suitable?).

Sorry to leap into the middle of your conversation with David, but that is not so. What you need to do is go to www.dsource.org and look for a project called Deimos. Therein, you will find a library called etc.unicode, in source code form. Development of this library has been halted, in favor of ICU, but etc.unicode /does/ do simple casing. (And don't be fooled by the word "simple" - it only means that the function works on characters, not strings (so it can't uppercase "ß" to "SS"), and that it doesn't know that Turkish, Azeri and Lithuanian have non-standard casing rules. It is "simple casing" as opposed to "full casing", that's all.)

The relevant prototypes are:

#    dchar getSimpleUppercaseMapping(dchar c);
#    dchar getSimpleLowercaseMapping(dchar c);
#    dchar getSimpleTitlecaseMapping(dchar c);

You do not need to specify a locale, because, if the locale is anything other than Turkish, Azeri or Lithuanian, the casing will be done correctly.
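[Editorial aside: building string-level casing on those per-character prototypes is mechanical. A sketch; mapChars is my own illustrative helper, and the etc.unicode import path is whatever the Deimos project documents.]

```d
// Apply a per-character mapping, such as etc.unicode's
// getSimpleUppercaseMapping (prototype quoted above), to a whole string.
dchar[] mapChars(dchar[] s, dchar function(dchar) f)
{
    dchar[] r = s.dup;
    foreach (i, c; s)
        r[i] = f(c);  // one independent lookup per character
    return r;
}

// Intended use, assuming etc.unicode is on the import path:
//     dchar[] upper = mapChars(text, &getSimpleUppercaseMapping);
```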


>I don't know enough about locale support in
>OS's but if it is not available there we'd have to code it into the lib.

It is a common misconception that casing is locale sensitive. In Unicode, in general, it is not. Okay, so (as mentioned above) Turkish, Azeri and Lithuanian are different, but that is a small enough number that I prefer to think of it as being "locale-independent with three exceptions".

I think the misconception arises because the C functions toupper(), tolower() etc. are dependent on something /called/ locale, but which is in fact more closely related to encoding scheme. These ctype functions need to do this because C's chars are only eight bits wide. This logic does not apply to Unicode, and certainly not to the functions in etc.unicode and the forthcoming ICU port.



>I'll do some probing about how to code it first and if you wish I can provide you the one for Vietnamese.

The Unicode standard does not regard Vietnamese as an exception to the standard lookups, so etc.unicode is all you need.

Arcane Jill


September 28, 2004
In article <cj97fh$t7f$1@digitaldaemon.com>, Thomas Kuehne says...
>
>Arcane Jill schrieb:
>> But there is method in this madness. The Unicode characters from U+10000
>> upwards are not characters from living languages. By and large, it is
>> generally considered "harmless" to regard such characters as if they were
>> two characters. For example, consider the character U+1D11E (musical
>> symbol G clef) - does it really /matter/ if your application perceives
>> instead U+D834 followed by U+DD1E? It won't affect casing, sorting or
>> anything like that because the character isn't part of any living
>> language script.
>
>Guess you missed the extended CJK part. There are names of living persons that can only be encoded using post-U+FFFF stuff. As a consequence it does affect sorting and "character"/"glyph"/"grapheme"/"codepoint"/"what-have-you" count algorithms.
>
>Thomas

I will freely admit that I don't speak Chinese and don't know the intricacies of CJK. But that isn't really what I was trying to get at. Yes, obviously, if an app wants to be general, it must use proper character access via library functions. All I really meant was that if you pretend UTF-16 fragments are characters then your application will /usually/ behave sensibly. That's all.

Me, I'm all in favor of proper character iteration. It's just that a lot of apps are going to want a quick-and-dirty shortcut that works more often than not, and UTF-16 is exactly that.

So, there are characters in the >U+FFFF range which are used in proper names? I didn't know that. But how badly does that change things? Does it affect casing? I suppose the answer to that depends on whether or not CJK characters /have/ case. Do they? Does it affect sorting? Not in general, since collation is a function of the /user's preferences/, not the script (that is, if an English user sorts Czechoslovakian text, they will expect to see it in "English order", not "Czechoslovakian order"), so only applications which are (a) fully internationalized, or (b) written for CJK users specifically, will need to care. For the rest of the world, two "unknown character" glyphs is not that much worse than one.

So I'd summarize as:

*) If you want to write a fully internationalized app, you need to be using a proper Unicode library, but

*) If you just want basic Unicode support which works in all but exceptional circumstances, you can make do with UTF-16, and the pretence that characters are 16-bits wide.

In other words, yes, you're right. But we can usually cheat.

Anyway, this sort of conversation goes on all the time on the Unicode public forum. If you want to talk about this in depth, I suggest we move the discussion there.

Arcane Jill