UTF-8 char[] consistency

Hi all,

I'm testing and programming in D using UTF-8 under linux to encode the Vietnamese character set.
I have some trouble with the way D handles the char[].length property.

If I make a string as follows
char[] s = "câu này có những chữ cái tiếng việt";

Then the length property (s.length) reports the number of bytes, not the number of characters as I would expect. Reporting the number of bytes is what I would expect from a byte[].
Therefore I still need a strlen-style function to determine the correct string length.
One of the implications is that most *string* handling functions in the Phobos library depend on the length property and thus fail.
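
To make the difference concrete, here is a minimal sketch. It assumes a current D compiler, where string literals have type string (immutable char[]) rather than the mutable char[] used above, and it relies on std.utf.count from today's Phobos, which is not the library version discussed in this thread. The helper names are made up for illustration. The word is written with an escape so the byte count does not depend on how the source file is normalized, and the checks sit in a unittest block (build with -unittest).

```d
import std.utf : count;

// Number of UTF-8 code units (bytes): this is what .length reports.
size_t byteLength(string s)
{
    return s.length;
}

// Number of Unicode code points: what the post calls "characters".
size_t codePointLength(string s)
{
    return count(s);
}

unittest
{
    string s = "vi\u1EC7t"; // "việt": U+1EC7 takes three bytes in UTF-8
    assert(byteLength(s) == 6);
    assert(codePointLength(s) == 4);
}
```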

There are some solutions to this that do not require modifying the language:

1. use special functions to do the work.
2. make a string class.
3. convert everything internally to UTF-16, convert it back to UTF-8 before output.

1. The special functions would work but are troublesome because the Phobos functions cannot be used (i.e. they have to be rewritten).

2. The string class doesn't work well because the opAssign function cannot be overridden, and thus the following cannot be done:

# String s = new String(); s = "hello";

I know that it can be done slightly differently (String s = new String("hello");), but I'd like it to be as seamless as possible. However, the Phobos functions still don't work and have to be included in the class. Wasn't Walter against a String class?? ;)

3. Converting everything is not very efficient, and it requires non-transparent extra work.

I'd suggest the following:

1. The char[] needs to be treated by the D compiler as a string array not as a byte array, or
2. Implement a special String datatype (has been discussed earlier and Walter is against it.)

Also, a lot of Phobos functions are missing for wide (wchar) and double-wide (dchar) character operations. E.g. wchar[] ljustify(wchar[], int width); is not available, and many more are missing for the larger character types.
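
One way to avoid writing such a function three times is a template over the character type. The following is only a sketch under assumptions: it targets a current D compiler, padRight is a hypothetical name and not a Phobos function, and it pads by code point count, which is only right when every code point occupies one column. The checks sit in a unittest block (build with -unittest).

```d
import std.utf : count;

// Hypothetical helper (not part of Phobos): left-justify text in a field of
// `width` code points by appending spaces. Works for char, wchar and dchar.
immutable(C)[] padRight(C)(immutable(C)[] s, size_t width)
{
    size_t n = count(s);          // code points, not code units
    auto result = s;
    while (n < width)
    {
        result ~= cast(C) ' ';    // appending builds a new array; s is untouched
        ++n;
    }
    return result;
}

unittest
{
    assert(padRight("vi\u1EC7t", 6) == "vi\u1EC7t  ");    // 4 code points + 2 spaces
    assert(padRight("vi\u1EC7t"w, 6) == "vi\u1EC7t  "w);
    assert(padRight("vi\u1EC7t"d, 3) == "vi\u1EC7t"d);    // already wide enough
}
```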

Regards, Jaap


---
D programming from Vietnam

September 25, 2004

Re: UTF-8 char[] consistency

Posted by Ben Hinkle
in reply to Jaap Geurts

Jaap Geurts wrote:

> Hi all,
> 
> I'm testing and programming in D using UTF-8 under linux to encode the Vietnamese character set. I have some trouble with the way D handles the char[].length property.

If this isn't in some FAQ it should be.

> If I make a string as follows
> char[] s = "câu này có những chữ cái tiếng việt";
> 
> Then the length property (s.length) reports the number of bytes, not the number of characters as I would expect. Reporting the number of bytes is what I would expect from a byte[]. Therefore I still need a strlen-style function to determine the correct string length. One of the implications is that most *string* handling functions in the Phobos library depend on the length property and thus fail.

Which string functions specifically? What do you mean by "fail"?

> There are some solutions to this that do not require modifying the language:
> 
> 1. use special functions to do the work.
> 2. make a string class.
> 3. convert everything internally to UTF-16, convert it back to UTF-8 before output.

4. use dchar[] (or possibly wchar[] if you know the Unicode codepoints in your string will fit in a wchar).
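
A minimal sketch of option 4, assuming a current D compiler and Phobos (std.conv.to and the string/dstring aliases are today's names, not necessarily what was available when this was posted). The helper name is made up for illustration; the checks sit in a unittest block (build with -unittest).

```d
import std.conv : to;

// Convert UTF-8 text to UTF-32 so that each array element is one code point.
dstring toCodePoints(string s)
{
    return s.to!dstring;
}

unittest
{
    dstring d = toCodePoints("vi\u1EC7t");  // "việt"
    assert(d.length == 4);                  // one element per code point
    assert(d[2] == '\u1EC7');               // indexing yields a whole code point
    assert(d.to!string.length == 6);        // converted back to UTF-8: six bytes
}
```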

> 1. The special functions would work but are troublesome because the Phobos functions cannot be used (i.e. they have to be rewritten).
> 
> 2. The string class doesn't work well because the opAssign function cannot be overridden, and thus the following cannot be done:
> 
> # String s = new String(); s = "hello";
> 
> I know that it can be done slightly differently (String s = new String("hello");), but I'd like it to be as seamless as possible. However, the Phobos functions still don't work and have to be included in the class. Wasn't Walter against a String class?? ;)
> 
> 3. Converting everything is not very efficient, and it requires non-transparent extra work.
> 
> I'd suggest the following:
> 
> 1. The char[] needs to be treated by the D compiler as a string array not as a byte array, or
> 2. Implement a special String datatype (has been discussed earlier and Walter is against it.)

Have you tried using dchar[] or wchar[] in your app? Someone has made wstring.d, which is the wchar equivalent of std.string (maybe it works for dchar too, I don't remember exactly). And AJ and some others are working on expanding the Unicode support; see www.dsource.org.

> Also, a lot of Phobos functions are missing for wide (wchar) and double-wide (dchar) character operations. E.g. wchar[] ljustify(wchar[], int width); is not available, and many more are missing for the larger character types.

I don't have that wstring.d handy but hopefully it covers these. If not please let the author know so they can add them (and/or contribute them yourself). Your help in improving the library support for wchar and dchar would most likely be very much appreciated.

> Regards, Jaap
> 
> 
> ---
> D programming from Vietnam

September 26, 2004

Re: UTF-8 char[] consistency

Posted by Jaap Geurts
in reply to Ben Hinkle

On Sat, 25 Sep 2004 10:50:41 -0400, Ben Hinkle <bhinkle4@juno.com> wrote:

>> If I make a string as follows
>> char[] s = "câu này có những chữ cái tiếng việt";
>>
>> Then the length property (s.length) reports the number of bytes not the
>> number of characters as I would expect to happen. The length property
>> would return the number of bytes for the byte[]. Therefore I still need to
>> use a strlen function to determine the correct string length. One of the
>> implications is that most *string* handling functions in the phobos
>> library depend on the length property and thus fail.
>
> which string functions specifically? What do you mean by "fail"?

They report the incorrect length: the byte count, not the actual character count that I would expect from an array of char. If I'm right, then for an array char[] s; requesting s.length should report a wcslen(s) of some sort. But the current implementation doesn't.

>   4. use dchar[] (or possibly wchar[] if you know the unicode codepoints in
> your string will fit in a wchar).

I tried wchar[] and dchar[], and they work just fine. But because I program under Linux, it would be nice if I could keep all my internal data in one consistent format, which is UTF-8 on Unix-based systems. It seems a little odd to have to convert to UTF-16 each time I need to know the length of a string. Of course, the occasional conversion is unavoidable: to insert a character held as a wchar into a char[], one has to encode it as UTF-8 first. I realize that.
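
For the length case specifically, a conversion can be avoided: a foreach loop with a dchar loop variable decodes a char array on the fly. The sketch below assumes a current D compiler; the function names are made up for illustration, and the checks sit in a unittest block (build with -unittest).

```d
// Count code points in UTF-8 text without building a wchar[] or dchar[] copy.
// foreach with a dchar loop variable decodes the char array as it goes.
size_t countInPlace(const(char)[] s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

// Byte offset at which each code point starts, in order.
size_t[] startOffsets(const(char)[] s)
{
    size_t[] offsets;
    foreach (size_t i, dchar c; s)
        offsets ~= i;
    return offsets;
}

unittest
{
    assert(countInPlace("vi\u1EC7t") == 4);   // "việt"
    size_t[] expected = [0, 1, 2, 5];         // U+1EC7 occupies bytes 2..4
    assert(startOffsets("vi\u1EC7t") == expected);
}
```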

> I don't have that wstring.d handy but hopefully it covers these. If not
> please let the author know so they can add them (and/or contribute them
> yourself). Your help in improving the library support for wchar and dchar
> would most likely be very much appreciated.

If someone reading this knows where wstring.d is, can you please point me to it?

Thanks, Jaap

---
D programming from Vietnam

September 26, 2004

Re: UTF-8 char[] consistency

Posted by Arcane Jill
in reply to Jaap Geurts

In article <opsevonsdl2saxk9@krd8833t>, Jaap Geurts says...
>
>Hi all,

Hi.


>I'm testing and programming in D using UTF-8 under linux to encode the Vietnamese character set.

Cool.


>I have some trouble with the way D handles the char[].length property.

length does what it does. What you need is a character count, which is something different.


>Therefore I still need to use a strlen function to determine the correct string length.

Okay, here's one:

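#    // Count characters (code points) in a UTF-8 string.
#    uint strlen(char[] s)
#    {
#        uint n = 0;
#        foreach (char c; s)
#        {
#            // Skip continuation bytes (0x80..0xBF); every other byte starts a character.
#            if (c<0x80 || c>=0xC0) ++n;
#        }
#        return n;
#    }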

And some overloads to complete the set:

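#    // Count characters (code points) in a UTF-16 string.
#    uint strlen(wchar[] s)
#    {
#        uint n = 0;
#        foreach (wchar c; s)
#        {
#            // Skip high surrogates (0xD800..0xDBFF) so each surrogate pair counts once.
#            if (c<0xD800 || c>=0xDC00) ++n;
#        }
#        return n;
#    }
#
#    // In UTF-32 every element is one character.
#    uint strlen(dchar[] s)
#    {
#        return s.length;
#    }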



>One of the implications is that most *string* handling functions in the phobos library depend on the length property and thus fail.

Phobos is not really geared up for Unicode yet. The string handling functions are defined to work only for ASCII.

What you need is Unicode string handling. D doesn't have that yet. There is a third party Unicode library called ICU (Internationalization Components for Unicode) which I'm trying to port to D, but it's slow work, partly because I've got too much else on at the moment.



>There are some solutions to this that do not require modifying the language:
>
>1. use special functions to do the work.
>2. make a string class.
>3. convert everything internally to UTF-16, convert it back to UTF-8 before output.

Option 3 won't work in general. In general, you'll need to convert everything internally to UTF-32, not UTF-16. Of course, if it's just for Vietnamese, UTF-16 will be fine.
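
The reason is surrogate pairs: a code point above U+FFFF needs two wchars, so wchar[].length stops matching the character count. A small sketch, assuming a current D compiler and std.conv.to from today's Phobos; the helper names are hypothetical, and the checks sit in a unittest block (build with -unittest).

```d
import std.conv : to;

// Number of UTF-16 code units needed for the text.
size_t utf16Units(string s)
{
    return s.to!wstring.length;
}

// Number of UTF-32 code units, which equals the number of code points.
size_t utf32Units(string s)
{
    return s.to!dstring.length;
}

unittest
{
    // Vietnamese letters sit in the Basic Multilingual Plane: one wchar each.
    assert(utf16Units("vi\u1EC7t") == 4);
    assert(utf32Units("vi\u1EC7t") == 4);

    // U+1D11E (musical symbol G clef) lies above U+FFFF: a surrogate pair.
    assert(utf16Units("\U0001D11E") == 2);
    assert(utf32Units("\U0001D11E") == 1);
}
```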



>1. The special functions would work but are troublesome because the Phobos functions cannot be used (i.e. they have to be rewritten).

True.


>2. The string class doesn't work well because the opAssign function cannot be overridden, and thus the following cannot be done:
>
># String s = new String(); s = "hello";
>
>I know that it can be done slightly differently (String s = new String("hello");), but I'd like it to be as seamless as possible. However, the Phobos functions still don't work and have to be included in the class. Wasn't Walter against a String class?? ;)

I've had exactly the same problem with a completely different class. I would very much like to see implicit constructors in D, so we could do:

#    String s = "hello"; // what you want
#    Int n = 42; // what I want

But this sort of thing is down to Walter, and he doesn't consider it a priority.



>3. Converting everything is not very efficient, and it requires non-transparent extra work.
>
>I'd suggest the following:
>
>1. The char[] needs to be treated by the D compiler as a string array not as a byte array,

That's just not possible. A char is a UTF-8 fragment, not a Unicode character. They're just not the same.
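
To see what "fragment" means in practice, the sketch below looks at the individual chars of a short Vietnamese word. It assumes a current D compiler; isContinuation is a hypothetical helper name, and the checks sit in a unittest block (build with -unittest).

```d
// True if the byte at index i is a UTF-8 continuation byte (bit pattern
// 10xxxxxx), i.e. the middle or end of a multi-byte sequence, not its start.
bool isContinuation(const(char)[] s, size_t i)
{
    return (s[i] & 0xC0) == 0x80;
}

unittest
{
    string s = "vi\u1EC7t";          // "việt", bytes: 76 69 E1 BB 87 74
    assert(s[2] == 0xE1);            // s[2] is only the lead byte of U+1EC7
    assert(!isContinuation(s, 2));
    assert(isContinuation(s, 3));    // 0xBB
    assert(isContinuation(s, 4));    // 0x87
    assert(s[5] == 't');             // the next character starts at byte 5
}
```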


>or
>2. Implement a special String datatype (has been discussed earlier and Walter is against it.)

This will happen anyway in time - by accident! ICU has a class called UnicodeString, so D will get that once ICU is ported.



>Also, a lot of Phobos functions are missing for wide (wchar) and double-wide (dchar) character operations. E.g. wchar[] ljustify(wchar[], int width); is not available, and many more are missing for the larger character types.

Again, ICU will fill in these gaps. I wish I could bring you better news, but at least these things are on their way and will get here eventually.

Arcane Jill

September 26, 2004

Re: UTF-8 char[] consistency

Posted by David L. Davis
in reply to Jaap Geurts

In article <opsexriepv2saxk9@krd8833t>, Jaap Geurts says...
>
>On Sat, 25 Sep 2004 10:50:41 -0400, Ben Hinkle <bhinkle4@juno.com> wrote:
>
>> I don't have that wstring.d handy but hopefully it covers these. If not please let the author know so they can add them (and/or contribute them yourself). Your help in improving the library support for wchar and dchar would most likely be very much appreciated.
>
>If someone reading this knows where wstring.d is, can you please point me to it?
>
>Thanks, Jaap
>
>---
>D programming from Vietnam

Jaap Geurts: Yes, stringw.d (v0.3 beta) is one of my pet projects, and you can find it here: http://spottedtiger.tripod.com/D_Language/D_Support_Projects_XP.html

Please let me know if there's any missing std.string.d function(s) that you need, and I'll work on getting them in as soon as possible.

David L.

-------------------------------------------------------------------
"Dare to reach for the Stars...Dare to Dream, Build, and Achieve!"

September 26, 2004

Re: UTF-8 char[] consistency

Posted by Ben Hinkle
in reply to Jaap Geurts

"Jaap Geurts" <jaapsen@hotmail.com> wrote in message news:opsexriepv2saxk9@krd8833t...
> On Sat, 25 Sep 2004 10:50:41 -0400, Ben Hinkle <bhinkle4@juno.com> wrote:
>
> >> If I make a string as follows
> >> char[] s = "câu này có những chữ cái tiếng việt";
> >>
> >> Then the length property (s.length) reports the number of bytes, not the number of characters as I would expect. Reporting the number of bytes is what I would expect from a byte[]. Therefore I still need a strlen-style function to determine the correct string length. One of the implications is that most *string* handling functions in the Phobos library depend on the length property and thus fail.
> >
> > which string functions specifically? What do you mean by "fail"?
>
> They report the incorrect length: the byte count, not the actual character count that I would expect from an array of char. If I'm right, then for an array char[] s; requesting s.length should report a wcslen(s) of some sort. But the current implementation doesn't.

That is by design. Out of curiosity, what are you doing with your strings that requires the number of characters? Usually one just deals with string fragments, and it doesn't matter how long they are (either in characters or in bytes). In a perfect world, a one-to-one mapping between array indexing and character indexing would clearly be nice to have. But the current design is (in Walter's opinion, and I agree with him) the best we can do given the imperfect world we find ourselves in and given D's design goals.

September 26, 2004

Re: UTF-8 char[] consistency

Posted by Benjamin Herr
in reply to Arcane Jill

Arcane Jill wrote:
> This will happen anyway in time - by accident! ICU has a class called
> UnicodeString, so D will get that once ICU is ported.

So can we not just drop char and char[]s and define some standard string class to be used for unicode strings (preferably one returning dchars when prompted for individual characters)?

I mean, strings via easy-to-use arrays were one of those nifty ideas that attracts me to D. No freaky libraries to remember, just intuitive things that work the same for all kinds of arrays.
But having strings implemented as character arrays is cool only as long as I can actually use that char[]-string like an array and get characters out of it by using the [] operator.
Beyond that, it just is an annoying inconsistent analogy. Also it appears confusing to me that some string operations are supposed to be done with array operations, while others are defined in std.string.
Now it seems far easier-to-use to have a string class that wraps all this.

I apologise if my uneducated ranting is far below the average level of insight that is to be available here, and I apologise for the slight offtopicness, and I apologise for bringing this up long after the case to ditch char.

-ben

September 27, 2004

Re: UTF-8 char[] consistency

Posted by Thomas Kuehne
in reply to Benjamin Herr

Benjamin Herr <ben@0x539.de> wrote:
> Arcane Jill wrote:
> > This will happen anyway in time - by accident! ICU has a class called UnicodeString, so D will get that once ICU is ported.
>
> So can we not just drop char and char[]s and define some standard string class to be used for unicode strings (preferably one returning dchars when prompted for individual characters)?

I guess you haven't (yet) dived into Unicode?
A "character" is something quite complicated.

1) it can consist of one codepoint, like 0x41 "A"
2) two different codepoint sequences can be equal: 0xC1 "Á" and 0x41 0x301 "Á"
3) especially in Hangul/Korean, a "character" might be a sequence of 1 up to 4 codepoints.
4) upper/lowercase conversion depends on the language used: Up1 -> Down1, Down2

The points above are only some of the basics you'd have to implement in your string class.

Thomas

September 27, 2004

Re: UTF-8 char[] consistency

Posted by Arcane Jill
in reply to Thomas Kuehne

In article <cj8kb0$22n5$1@digitaldaemon.com>, Thomas Kuehne says...

>A "character" is something quite complicated.

True enough. The best definition of "character" I have ever encountered is this: A "character" is anything the Unicode Consortium say is a character!

More official definitions such as "the smallest unit of information having semantic meaning" just don't hold up under close examination, as it's too easy to find counterexamples. The problem arises because Unicode started its life as the union of many existing legacy "character sets", each of which had their own different idea of what a "character" was.


>1) it can consist of one codepoint, like 0x41 "A"
>2) two different codepoint sequences can be equal: 0xC1 "Á" and 0x41 0x301 "Á"
>3) especially in Hangul/Korean, a "character" might be a sequence of 1 up to 4 codepoints.

Actually, you're talking about graphemes and/or glyphs, not characters. There is, in fact, a precise one-to-one correspondence between codepoints and characters.

A grapheme, on the other hand, may consist of one or more characters combined together (for example 'A' + combining-acute-accent = 'Á', as per your example); a glyph may consist of one or more graphemes ligated together (for example 'a' + zero-width-joiner + 'e' = 'æ').

And just to be even more pedantic, your statement "two different codepoint sequences can be /equal/" should really read "two different codepoint sequences can be /canonically equivalent/". Equal means equal.
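
The distinction can be checked mechanically. The sketch below assumes a current D compiler and today's std.uni (normalize and byGrapheme), which did not exist when this was written; the two wrapper names are hypothetical. The checks sit in a unittest block (build with -unittest).

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

// Canonical equivalence: compare two strings after normalizing both to NFC.
bool canonicallyEquivalent(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

// Number of graphemes (user-perceived characters) in the text.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

unittest
{
    string precomposed = "\u00C1";  // LATIN CAPITAL LETTER A WITH ACUTE
    string combining = "A\u0301";   // 'A' followed by COMBINING ACUTE ACCENT
    assert(precomposed != combining);                       // not equal ...
    assert(canonicallyEquivalent(precomposed, combining));  // ... but equivalent
    assert(graphemeCount(combining) == 1);  // two code points, one grapheme
}
```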


>4) upper/lowercase conversion depends on the language used: Up1 -> Down1, Down2

..though currently only Turkish, Lithuanian and Azeri are non-standard. As far as casing is concerned, locale is /almost/ ignorable. The functions getSimpleUppercaseMapping() and getSimpleLowercaseMapping() in etc.unicode will work fine for all languages apart from these few non-standard exceptions listed above. A bigger problem with casing is that (for example) uppercase "ß" is "SS" - that is, strings can get longer when you case-convert them. Even etc.unicode doesn't deal with that (because it got aborted in favor of ICU before full casing was implemented).

You're probably thinking of collation (sort order), which varies /greatly/ from language to language.


>Above points out only some basics you'd have to implement in your string class.

I think the original poster was only talking about character counting, and the related problem of locating character boundaries in a UTF array. That's relatively easy, and can be hand-coded without too much trouble. The more complex stuff like casing, collation, equivalence, grapheme boundary identification, etc., is probably best left to an external library.
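
As an illustration of the hand-coded part, here is a sketch that locates the byte offset of the n-th character in a UTF-8 array. It assumes a current D compiler and well-formed UTF-8 input; offsetOfCodePoint is a hypothetical name, and the checks sit in a unittest block (build with -unittest).

```d
// Byte offset where the n-th code point (0-based) of valid UTF-8 text starts.
// Returns s.length when n is at least the number of code points in s.
size_t offsetOfCodePoint(const(char)[] s, size_t n)
{
    size_t seen = 0;
    foreach (size_t i, char c; s)   // char loop variable: no decoding, raw bytes
    {
        if ((c & 0xC0) != 0x80)     // not a continuation byte: a sequence starts here
        {
            if (seen == n)
                return i;
            ++seen;
        }
    }
    return s.length;
}

unittest
{
    string s = "vi\u1EC7t";                 // "việt"
    assert(offsetOfCodePoint(s, 2) == 2);   // U+1EC7 starts at byte 2
    assert(offsetOfCodePoint(s, 3) == 5);   // 't' starts after the 3-byte sequence
    assert(s[offsetOfCodePoint(s, 2) .. offsetOfCodePoint(s, 3)] == "\u1EC7");
    assert(offsetOfCodePoint(s, 4) == s.length);
}
```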

Arcane Jill

September 27, 2004

Re: UTF-8 char[] consistency

Posted by Jaap Geurts
in reply to David L. Davis

"David L. Davis" <SpottedTiger@yahoo.com> wrote in message news:cj7aih$mq5$1@digitaldaemon.com...

> Jaap Geurts: Yes, stringw.d (v0.3 beta) is one of my pet projects and you can find here: http://spottedtiger.tripod.com/D_Language/D_Support_Projects_XP.html
>
> Please, let me know if there's any missing std.string.d function(s) that you need, and I'll work on getting them in as soon as possible.
>
If I find bugs or I need other functions, I'll submit my ideas to you. Thanks, David.
