string with encoding (suggestion)
December 02, 2003
Hello. I've read the D language specification and "D Strings vs C++ Strings", and it seems to me that D strings are not international strings. Strings should be independent of encoding, but D strings are arrays of char and resemble C strings. Arrays and strings are different concepts, so the encoding should be specified explicitly when a string is built from an array of char, and again when an array of char is taken out of a string. We should not rely on an implicit, assumed character encoding.

If the string class is internationalized, then even a programmer who does not know the encoding of a foreign language cannot easily introduce encoding bugs.

As a practical matter, the string class will probably have to use an existing encoding such as Unicode. For example, the string classes of Java and Objective-C (NSString) both use Unicode internally.


December 02, 2003
Keisuke UEDA wrote:

> Hello. I've read the D language specification and "D Strings vs C++ Strings", and it seems to me that D strings are not international strings. Strings should be independent of encoding, but D strings are arrays of char and resemble C strings. Arrays and strings are different concepts, so the encoding should be specified explicitly when a string is built from an array of char, and again when an array of char is taken out of a string. We should not rely on an implicit, assumed character encoding.
> 
> If the string class is internationalized, then even a programmer who does not know the encoding of a foreign language cannot easily introduce encoding bugs.
> 
> As a practical matter, the string class will probably have to use an existing encoding such as Unicode. For example, the string classes of Java and Objective-C (NSString) both use Unicode internally.

I agree almost completely with this.

Also, the three different char types of D scared me a bit. It should come as no surprise when we see developers use char arrays exclusively, and even though the documentation states that these arrays are supposed to use UTF-8 encoding, we will see a lot of people doing stuff like:

    char foo = bar[0]; // bar is an array of char

The above is useless at best, or quite possibly invalid, depending on the contents of the char array. What's worse, the bug will probably only manifest itself when people try to use international characters in the array. Just think about what would happen if bar contained the string "€100".
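
To make this concrete, here is a minimal sketch (the byte values follow directly from the UTF-8 definition; nothing here is specific to D's library):

    char[] bar = "€100";  // stored as UTF-8 bytes: E2 82 AC 31 30 30
    char   foo = bar[0];  // foo == 0xE2 -- a lone UTF-8 lead byte, not '€'
    // bar.length is 6, even though the string contains only 4 characters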

In fact, I feel that the char and wchar types are useless in that they serve no practical purpose. The documentation says "wchar - unsigned 8 bit UTF-8". The only UTF-8 encoded characters that fit inside a char are those with a code point below 128, i.e. ASCII characters.

This makes me wonder what possible use there can be for the char and wchar types? When manipulating individual characters you absolutely need a data type that is able to hold any character.

Either the char and wchar types should be dropped, or the documentation should be clear that a char is ASCII only.

I have worked for a very long time with internationalisation issues, and anyone who ever tried to fix a C program to do full Unicode everywhere knows how painful that can be. Actually, even writing new Unicode-aware code in C can be real difficult. The Unicode support in D seems not to be very well thought through, and I feel that it needs to be fixed before it's too late.

I would very much like to do what I can to help out. At the very least share my experiences and knowledge on the subject.

Regards

Elias

December 02, 2003
In article <bqhkrg$11s8$1@digitaldaemon.com>, Keisuke UEDA says...
>
>Hello. I've read the D language specification and "D Strings vs C++ Strings", and it seems to me that D strings are not international strings.

D strings are Unicode UTF-8. That is enough for internationalised exchange within a program, but for processing we need a cursor struct that would allow extracting Unicode UTF-32 characters, and its counterpart for building strings.
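
Purely as an illustration (this is not an existing D API, just a sketch of what such a cursor might look like), it could walk a UTF-8 char[] and hand back one decoded UTF-32 character (dchar) per step:

    struct Utf8Cursor
    {
        char[] data;   // the UTF-8 encoded string
        uint   pos;    // current byte offset into data

        // True while there are more characters to decode.
        bool more() { return pos < data.length; }

        // Decode the character at pos and advance past its bytes.
        // (Error handling for malformed sequences is omitted.)
        dchar next()
        {
            uint c = data[pos];
            if (c < 0x80) { pos++; return cast(dchar) c; }   // single byte (ASCII)
            uint extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
            c &= 0x3F >> extra;                              // payload bits of the lead byte
            for (uint i = 0; i < extra; i++)                 // fold in continuation bytes
                c = (c << 6) | (data[++pos] & 0x3F);
            pos++;
            return cast(dchar) c;
        }
    }

The counterpart would do the reverse: append the one-to-four byte UTF-8 encoding of each dchar to a char[].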

-eye


December 02, 2003
In article <bqhnd6$15f8$1@digitaldaemon.com>, Elias Martenson says...

>In fact, I feel that the char and wchar types are useless in that they serve no practical purpose. The documentation says "wchar - unsigned 8 bit UTF-8". The only UTF-8 encoded characters that fit inside a char are those with a code point below 128, i.e. ASCII characters.

It must be a bug in the documentation. Under Windows wchar = UTF-16 and under Linux wchar = UTF-32.

>This makes me wonder what possible use there can be for the char and wchar types? When manipulating individual characters you absolutely need a data type that is able to hold any character.

wchar can hold any character that is allowed in the operating system. See above.

>I have worked for a very long time with internationalisation issues, and anyone who ever tried to fix a C program to do full Unicode everywhere knows how painful that can be. Actually, even writing new Unicode-aware code in C can be real difficult. The Unicode support in D seems not to be very well thought through, and I feel that it needs to be fixed before it's too late.

We have standard conversions, don't we? Yet much to be done though.

>I would very much like to do what I can to help out. At the very least share my experiences and knowledge on the subject.

Write a library, and let the standard library workgroup (currently defunct) take
it in.

I propose that streams and strings are somehow unified, which would allow both to format strings and to iterate through them.

-eye


December 02, 2003
Ilya Minkov wrote:

> In article <bqhnd6$15f8$1@digitaldaemon.com>, Elias Martenson says...
> 
>>In fact, I feel that the char and wchar types are useless in that they serve no practical purpose. The documentation says "wchar - unsigned 8 bit UTF-8". The only UTF-8 encoded characters that fit inside a char are those with a code point below 128, i.e. ASCII characters.
> 
> It must be a bug in the documentation. Under Windows wchar = UTF-16 and under Linux
> wchar = UTF-32.

Well, if that's the case then it's even worse. In C and C++, Windows uses a 16-bit entity for wchar_t, which can cause a lot of grief since it requires you to deal with surrogate pairs more or less manually. Java has the same problem. They tried to deal with it in JDK 1.5, but individual manipulation of characters is still very painful.

I fail to see any good arguments for having char be anything else than a 32-bit type. The two arguments that do exist are:

    1) Storage. A 32-bit char is 4 times as large as an 8-bit char.

       Counter argument: Individual chars should be 32 bits. You could
                         have two string types, one UTF-32 and one
                         UTF-8 version. Both of which could have
                         identical interfaces. One would be fast,
                         the other would be small.

    2) Interoperability with legacy APIs (i.e. linking with C and C++)

       Counter argument: I can sort of agree with this one.
                         Interoperability is necessary, but it
                         should not dictate the implementation.
                         Perhaps a type called legacy_char or
                         something like that. At least it would
                         prevent programmers from writing new code
                         that uses this type unless they really
                         need it?

>>This makes me wonder what possible use there can be for the char and wchar types? When manipulating individual characters you absolutely need a data type that is able to hold any character.
> 
> wchar can hold any character that is allowed in the operating system. See above.

Windows allows full Unicode 3.1 without problems. You do have to jump through some hoops to use code points above 64K though, but that's caused by the legacy APIs, which were designed in an era when all Unicode values fit in 16 bits. This is (since 3.1) no longer the case, and sticking with that convention because of argument 2 above is not a good thing.

>>I have worked for a very long time with internationalisation issues, and anyone who ever tried to fix a C program to do full Unicode everywhere knows how painful that can be. Actually, even writing new Unicode-aware code in C can be real difficult. The Unicode support in D seems not to be very well thought through, and I feel that it needs to be fixed before it's too late.
> 
> We have standard conversions, don't we?

We sure do. But in D, many of those conventions are (fortunately) not set in stone yet, and can be fixed.

> Yet much to be done though.

I can agree with this one.

>>I would very much like to do what I can to help out. At the very least share my experiences and knowledge on the subject.
> 
> Write a library, and let the standard library workgroup (currently defunct) take
> it in.

Thanks for trusting me. I'd love to help out with exactly that. However, a single person can not create the perfect Unicode-aware string library. This can be proved by looking at the mountain of mistakes made when designing Java, a language that can still pride itself on being one of the best in terms of Unicode-awareness. They had to fix a lot of things along the way though, and the standard library is still riddled with legacy bugs that can't be fixed because of backward compatibility issues.

> I propose that streams and strings are somehow unified, which would allow both
> to format strings and to iterate through them.

I sort of agree with you, although there should be a distinction along the lines of what Java did (which I believe is what the original poster requested):

    - A string should be a sequence of Unicode characters. That's it.
      Nothing more, nothing less.

    - A stream should provide an externalisation interface for
      strings. It should be the responsibility of the stream to
      provide encoding and decoding of the external encoding to
      and from Unicode.

This is what Java has done, but they still managed to mess it up originally. I think one of the reasons for this is that the people who designed it didn't fully understand Unicode and encodings.

Originally I wasn't interested in this at all, and made a lot of mistakes. I'm Swedish, and while I do have the need for non-ASCII characters, I didn't understand the requirements of Unicode until I started studying Mandarin Chinese, and then Russian. Since my girlfriend is Russian, I now see a lot of the problems caused by badly written code, and I see D as an opportunity to lobby for a use of Unicode in a way that minimises the opportunities to write code that only works with English.

Regards

Elias

December 02, 2003
Ilya Minkov wrote:

> In article <bqhkrg$11s8$1@digitaldaemon.com>, Keisuke UEDA says...
> 
>>Hello. I've read the D language specification and "D Strings vs C++ Strings", and it seems to me that D strings are not international strings.
> 
> D strings are Unicode UTF-8. That is enough for internationalised exchange within
> a program, but for processing we need a cursor struct that would allow extracting
> Unicode UTF-32 characters, and its counterpart for building strings.

That is all fine, but that struct needs to be the natural way of dealing with characters. Arrays of 8-bit entities are simply too appealing: people who speak only Western languages will reach for them for manual string manipulation. I know, I see it all the time.

Regards

Elias

December 02, 2003
Thank you for replying.

I think that UTF-8 should not be treated directly.

UTF-8 is complicated enough that users can easily make mistakes. ASCII characters (0x00 to 0x7F) are encoded as a single byte, so programmers who use only ASCII characters do not need to distinguish ASCII from UTF-8. But characters from two-byte character sets (such as Japanese Shift JIS and Chinese Big5) become multi-byte sequences in UTF-8; a single character can take several bytes. Unicode strings will be corrupted if they are not handled correctly, and many programmers do not know the circumstances of foreign languages well. I think that encoded text data should be wrapped and that programmers should use it only indirectly.
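
For example (the byte counts follow from the UTF-8 definition; the code is only a sketch): the three-character Japanese word "日本語" takes nine bytes in UTF-8, so byte-wise lengths and indices no longer line up with characters.

    char[]  a = "日本語";    // a.length == 9: three characters, three bytes each
    dchar[] b = "日本語"d;   // b.length == 3: one dchar per character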

I agree with Mr. Elias Martenson's opinion; he clearly understands many languages. But I have no idea which encoding would be best for the string class to use internally.


With best kind regards.


December 02, 2003
Elias Martenson wrote:
> I fail to see any good arguments for having char be anything else than a 32-bit type. The two arguments that do exist are:

Well, I think the problem is that many (western) programmers have an ASCII bias. Most realize that Unicode is the best way to be able to write international code, but they don't want to change their existing code base. So UTF-8 seems like a nice solution - you have Unicode support on paper, but you can still treat everything as ASCII.

The problem, however, is that UTF-8 gets really complicated when you have non-ASCII strings. You cannot index the string directly, you need to decode multiple code points in a loop to get a single character, you have to deal with invalid code point sequences... the list goes on.

UTF-16 is a little better if you know you won't ever have any surrogate pairs in there, but in my opinion that's a short-sighted view. These kinds of assumptions have a tendency to be proven wrong just when you are in a position where changing the encoding is not possible anymore.

In my opinion, memory strings should simply be UTF-32. Easy indexing and manipulation for everyone, no "discrimination" of multibyte languages.

But, I realise that some people do not share this view. And of course there's the problem of interacting with legacy code (e.g. printf and all the other C stuff, or the UTF-16 Windows API).

Which really only leaves one solution: we need an abstract string class with implementations for UTF-8, UTF-16 and UTF-32. That way you can choose the best encoding, depending on your needs. And such classes could also take care of the hassles of UTF-8 and UTF-16 decoding/encoding/manipulation.

An abstract class would even allow users to add their own encoding, which is necessary if legacy code is not ASCII, but one of the other few dozen codepages that are popular around the world.
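
To make the shape of this concrete, here is a minimal sketch (class and method names are purely illustrative, not an existing library):

    // One interface, several encodings behind it.
    abstract class String
    {
        abstract uint  length();          // number of characters, not bytes
        abstract dchar charAt(uint i);    // decoded character at position i
    }

    // UTF-32 backed: one dchar per character, so both operations are trivial.
    class Utf32String : String
    {
        private dchar[] data;
        this(dchar[] s) { data = s; }
        override uint  length()       { return cast(uint) data.length; }
        override dchar charAt(uint i) { return data[i]; }
    }

    // A Utf8String would keep a char[] internally and decode code unit sequences
    // on demand; a user-defined subclass could do the same for Shift JIS, Big5, etc.

User code would only ever see whole characters; which encoding sits behind the interface becomes an implementation detail.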

And last, but not least, I think the D character type should always be 32 bit. Then it would be a real, decoded Unicode character, not a code point. Since the decoding is done internally by the string classes, there is really no need to have different character sizes.

Hauke

December 02, 2003
Elias Martenson wrote:

> Ilya Minkov wrote:
> 
>> In article <bqhnd6$15f8$1@digitaldaemon.com>, Elias Martenson says...
>>
...
>>> I would very much like to do what I can to help out. At the very least share my experiences and knowledge on the subject.
>>
>>
>> Write a library, and let the standard library workgroup (currently defunct) take
>> it in.
> 
> 
> Thanks for trusting me. I'd love to help out with exactly that. However, a single person can not create the perfect Unicode-aware string library.

I think everyone would agree that the task would be a large one. Since Walter is only one person, you might judge the string-handling functions he developed to be simplistic. (For my purposes they're fine, but I've never worked with Unicode.) If someone (or a group of people) offered to supply some more comprehensive functions/classes, I think he'd accept donations.

Personally, I know next to nothing about Unicode, so your discussion is way over my head.  I've noted similar criticisms before and I suspect D's library is somewhat lacking in this area.

I don't think the fundamental (C-inspired) types need to get any more complicated, but I think a fancy (Java-like) String class could help handle most of the messy things in the background.

Justin

> This can be proved by looking at the mountain of mistakes made when designing Java, a language that can still pride itself on being one of the best in terms of Unicode-awareness. They had to fix a lot of things along the way though, and the standard library is still riddled with legacy bugs that can't be fixed because of backward compatibility issues.
> 
>> I propose that streams and strings are somehow unified, which would allow both
>> to format strings and to iterate through them.
> 
> 
> I sort of agree with you, although there should be a distinction along the lines of what Java did (which I believe is what the original poster requested):
> 
>     - A string should be a sequence of Unicode characters. That's it.
>       Nothing more, nothing less.
> 
>     - A stream should provide an externalisation interface for
>       strings. It should be the responsibility of the stream to
>       provide encoding and decoding of the external encoding to
>       and from Unicode.
> 
> This is what Java has done, but they still managed to mess it up originally. I think one of the reasons for this is that the people who designed it didn't fully understand Unicode and encodings.
> 
> Originally I wasn't interested in this at all, and made a lot of mistakes. I'm Swedish, and while I do have the need for non-ASCII characters, I didn't understand the requirements of Unicode until I started studying Mandarin Chinese, and then Russian. Since my girlfriend is Russian, I now see a lot of the problems caused by badly written code, and I see D as an opportunity to lobby for a use of Unicode in a way that minimises the opportunities to write code that only works with English.
> 
> Regards
> 
> Elias
> 

December 03, 2003
Hauke Duden wrote:
> Elias Martenson wrote:
>
>    [ a lot of very good reasoning snipped for space ]
>
> Which really only leaves one solution: we need an abstract string class with implementations for UTF-8, UTF-16 and UTF-32. That way you can choose the best encoding, depending on your needs. And such classes could also take care of the hassles of UTF-8 and UTF-16 decoding/encoding/manipulation.
> 
> An abstract class would even allow users to add their own encoding, which is necessary if legacy code is not ASCII, but one of the other few dozen codepages that are popular around the world.

Agreed. This is a very good suggestion, and it overlaps to a large degree with my ideas.

Taking your reasoning a little further, this means we have a need for:

    - An interface that represents a string (called "String"?)

    - Three concrete implementations of said class:
      UTF8String, UTF16String and UTF32String
      (or perhaps String8 etc...)

    - Yet another implementation called NativeString that implicitly
      uses the encoding of the environment that the program is
      running in. In Unix this would look at the environment
      variable LC_CTYPE.

    - A comprehensive set of string manipulation classes and methods
      that work with the String interface.

    - Making sure the external interfaces of all std-classes use
      String instead of char arrays.

    - Removing char and wchar, and renaming dchar to char.
      The old "char" was all wrong anyway since UTF-8 is defined
      as being a byte sequence, so we already have the types
      byte and short.

All this is needed. wchar and dchar arrays are useless today anyway, since, from what I can tell, external interfaces seem to be using char[] for strings. If you decide you want to work with proper chars (i.e. dchar) you have to do UTF-32 <-> UTF-8 conversions on every method call that involves strings. Not a good thing, and an effective way of preventing any proper use of Unicode.

Besides, UTF-8 is highly inefficient for many operations. Its only advantage is the small size of mostly-ASCII data and compatibility with ASCII. Internally, 32-bit strings should be used.

> And last, but not least, I think the D character type should always be 32 bit. Then it would be a real, decoded Unicode character, not a code point. Since the decoding is done internally by the string classes, there is really no need to have different character sizes.

I think I agree with you, but I'm not sure what you mean by "real, decoded Unicode character, not a code point"? If you are referring to the bytes that make up a UTF-8 character, then I agree with you (but that's not called a code point).

A code point is an individual character "position" as defined by Unicode. Are you saying that the "char" type should be able to hold a complete composite character, including combining diacritical marks? In that case I don't agree with you, and no other language even attempts this.

Regards

Elias