December 03, 2003 Re: string with encoding( suggestion )
Posted in reply to Keisuke UEDA

Keisuke UEDA wrote:
> Thank you for replying.
>
> I think that UTF-8 should not be treated directly.
>
> UTF-8 is so complicated that users may make mistakes. ASCII characters (from 0x00 to 0x7f) are encoded as 1 byte, so programmers who use only ASCII characters do not need to distinguish ASCII from UTF-8. But 2-byte character sets (such as Japanese Shift JIS and Chinese Big5) are encoded as multiple bytes, and certain characters are encoded as several bytes. Unicode strings will be destroyed if they are not treated correctly. Many programmers do not know the circumstances of foreign languages well. I think that encoded text data should be wrapped and programmers should use it indirectly.

Yes, agreed. Looking at much of the C code today, we see that even though these issues are well known, very few people bother doing it right unless the language makes it natural to do so.

I am currently working on a combined C++ and Java project. The product will be exported to various countries using different languages. Still, the developers who were doing the bulk of the work before me completely disregarded the problem. Trying to fix the code now has proven to be more or less impossible and is one of the reasons we have to rewrite a lot of it. The attitude of the developers was "we'll fix the language issues when we do the translation". Suffice it to say that it wasn't that easy.

If we let the 8-bit char[] type be the "official" string type in D, then the exact same mistakes that were made in C and C++ will be made again, and I really don't want to see that happen. I really have high hopes for D; it does so many things just "right", but this is a potential dealbreaker.

> I agree with Mr. Elias Martenson's opinion. I think that he has understood many languages. But I have no idea which is the best encoding for the string class to use.

Yes, I have basic skills in a lot of languages, and I have real needs to use multiple languages. However, when I read what you had to say, I realised you also have good skills in this subject.

Regards

Elias
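To make the quoted failure mode concrete, here is a minimal sketch in present-day D (the string type, std.stdio, and foreach decoding used here postdate this 2003 thread, so treat them as assumptions rather than what was available at the time). Indexing a UTF-8 string by element yields encoding fragments, while decoding by code point yields whole characters:

    import std.stdio;

    void main()
    {
        string s = "Mårtenson";        // stored as UTF-8
        writeln(s.length);             // 10: bytes, not characters

        // 'å' occupies two bytes; neither one is a character on its own.
        writefln("%x %x", cast(ubyte) s[1], cast(ubyte) s[2]);   // c3 a5

        // Iterating by code point decodes correctly:
        foreach (dchar c; s)
            write(c, ' ');             // M å r t e n s o n
        writeln();
    }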
December 03, 2003 Re: string with encoding( suggestion )
Posted in reply to J C Calvarese

J C Calvarese wrote:
> I think everyone would agree that the task would be a large one. Since Walter is only one person, you might judge the string-handling functions he developed to be simplistic.

Yes, they are. However, like I said, no single person can do this right. A lot of very bright people (don't ask me to compare them to Walter, I don't know him, nor them :-) ) worked on Java, and they made dozens of very bad mistakes in 1.0. Not until 1.5 (which isn't released yet) are they catching up.

> (For my purposes they're fine, but I've never worked with Unicode.)

Exactly. Not to offend anyone by asking about nationality, but I assume you are an English-only speaker? This is a common situation. One uses the most obvious tools at hand (which in both C and D are char arrays) and everything works fine. Later, when it's time to localise the application, you realise you should have used wchar_t (dchar) instead. Congratulations on the task of changing all that code.

In fact, you don't even have to localise your app to end up with problems. My last name is actually Mårtenson, and in UTF-8 the second character encodes into two bytes. Guess what happens when a naïve implementation runs the following code to verify my name? Try to count the number of errors:

    for (int c = 0 ; c < strlen(name) ; c++) {
        char ch = name[c];
        // make uppercase chars into lowercase
        if (isupper(ch)) {
            name[c] += 'a' - 'A';
        }
    }

    // print the name with a series of stars below it
    printf ("%s\n", name);
    for (int c = 0 ; c < strlen(name) ; c++) {
        putchar ('*');
    }
    putchar ('\n');

Code like the one above is not particularly uncommon to see. Hell, I even wrote code like it myself. Not all of the above problems would be fixed by making char a 32-bit entity (at least one would remain), but most of the work would be done already.

> If someone (or a group of people) offered to supply some more comprehensive functions/classes, I think he'll accept donations.

I would like to help out in such a group, but I certainly cannot do it myself. For one, I'm not skilled enough in D to even use correct "D-style" everywhere.

> Personally, I know next to nothing about Unicode, so your discussion is way over my head. I've noted similar criticisms before and I suspect D's library is somewhat lacking in this area.

Indeed it is. Unicode has to be in from the start. Again and again we have seen languages struggle when trying to tack on Unicode after the fact: C, C++, Perl and PHP are just a few examples.

> I don't think the fundamental (C-inspired) types need to get any more complicated, but I think a fancy (Java-like) String class could help handle most of the messy things in the background.

Shouldn't be more complicated syntax-wise or implementation-wise? Syntax-wise they are already extremely convoluted, at least if your intention is to write a program that works properly with Unicode: you need to manage the UTF-8 encoding yourself, which is actually next to impossible to get right. Implementation-wise, I can conceive of an implementation where you still declare a string as a char[] (char being 32-bit, of course) and the compiler has special knowledge of this type, so that it can implement it differently behind the scenes. An extremely poor example:

    char[] utf32string;
    char[] #UTF16 utf16string;
    char[] #UTF8 utf8string;
    char[] #NATIVE nativeString;

The four strings above would behave identically and it should be possible to use them interchangeably. In all cases s[0] would return a 32-bit char.

This cannot be stressed enough: being able to dereference individual components of a UTF-8 or UTF-16 string is a recipe for failure. There are hardly any situations where anyone needs this. The only time would be a UTF-8 en/decoder, but that functionality should be built in anyway.

Regards

Elias
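For contrast, here is a hedged sketch of the same lowercasing task written against a 32-bit character array, using present-day D and its std.uni module (which did not exist when this was posted, so it stands in for whatever Unicode-aware library D would ship). With one array element per code point, the byte-level traps of the C version disappear:

    import std.stdio;
    import std.uni : isUpper, toLower;

    void main()
    {
        dchar[] name = "Mårtenson"d.dup;   // one element per code point

        foreach (ref ch; name)
        {
            if (isUpper(ch))
                ch = toLower(ch);          // Unicode-aware case mapping
        }

        // print the name with a row of stars below it
        writeln(name);
        foreach (i; 0 .. name.length)
            write('*');
        writeln();
    }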
December 03, 2003 Re: string with encoding( suggestion )
Posted in reply to Elias Martenson

Elias Martenson wrote:
> If we let the 8-bit char[] type be the "official" string type in D, then the exact same mistakes that were made in C and C++ will be made again, and I really don't want to see that happen. I really have high hopes for D; it does so many things just "right", but this is a potential dealbreaker.
There is an implicit assumption in these arguments that making char 32 bits wide will magically make programs easier to internationalise. I don't buy this. Even 32-bit characters are not "characters" in the old-fashioned ASCII sense. What about combining diacritical marks, etc.?
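To illustrate the combining-marks problem Chris raises: a minimal sketch, assuming present-day D and its std.uni normalisation support (neither existed when this was written). The same user-visible "å" can be one code point or two, so even comparing 32-bit characters one by one is not enough without normalisation:

    import std.stdio;
    import std.uni : normalize;

    void main()
    {
        dstring precomposed = "\u00E5";    // 'å' as a single code point
        dstring combining   = "a\u030A";   // 'a' + COMBINING RING ABOVE

        writeln(precomposed == combining);                       // false
        writeln(normalize(precomposed) == normalize(combining)); // true (NFC)
    }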
What is really needed is a debate about which of the C Unicode libraries out there is the most appropriate for inclusion in the D library (via a wrapper to support the D string functionality, I assume).
Does anyone out there have experience with these libraries? I don't :-(.
Chris.
December 03, 2003 Re: string with encoding( suggestion )
Posted in reply to Chris Paulson-Ellis

Chris Paulson-Ellis wrote:
> Elias Martenson wrote:
>> If we let the 8-bit char[] type be the "official" string type in D, then the exact same mistakes that were made in C and C++ will be made again, and I really don't want to see that happen. I really have high hopes for D; it does so many things just "right", but this is a potential dealbreaker.
>
> There is an implicit assumption in these arguments that making char 32 bits wide will magically make programs easier to internationalise. I don't buy this. Even 32-bit characters are not "characters" in the old-fashioned ASCII sense. What about combining diacritical marks, etc.?

You are right, of course. There are a number of issues: not only the diacritics, but also upper/lowercasing, equivalence (diacritic or combined char), etc. However, just because a 32-bit char doesn't solve everything doesn't mean that we should give up and use 8-bit chars for the basic string type.

Unicode works with code points. The concept of a "character" is well defined by Unicode as "the basic unit of encoding for the Unicode character encoding". It's only natural for the char type to deal with these. Don't you think?

An example:

    char[] myString = ...;

    for (int c = 0 ; c < myString.size ; c++) {
        if (myString[c] == '€') {
            // found the euro sign
        }
    }

This is perfectly reasonable code that doesn't work in D today, but would work if chars were 32-bit. Code like this should work, in my opinion.

> What is really needed is a debate about which of the C Unicode libraries out there is the most appropriate for inclusion in the D library (via a wrapper to support the D string functionality, I assume).

The functionality is needed. Basing it on existing libraries is a good thing; just grabbing them and including them in D is not. The functionality provided by the string management functions in the D standard library should be natural and helpful. In particular, it should make the need to individually access characters as small as possible.

> Does anyone out there have experience with these libraries? I don't :-(.

As in "have I written real code using them"? No. The IBM libraries are good, that much I know. Might be worth taking a deeper look at: http://oss.software.ibm.com/icu/index.html

We have to keep in mind, though, that even if all these features are included in the standard library, if they aren't easy and natural to use we will still see a lot of applications doing the wrong thing. In my mind, Java is very good at making Unicode natural to use. But it can be better. That's why I'm soapboxing on this forum.

Regards

Elias
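As a concrete illustration of "make the need to individually access characters as small as possible": a hedged sketch using today's Phobos (std.algorithm postdates this discussion, so it stands in for whatever high-level search D's library would offer). The search is expressed without any index arithmetic, so the encoding never leaks out:

    import std.stdio;
    import std.algorithm.searching : canFind;

    void main()
    {
        string price = "total: €10";
        // The library decodes the UTF-8 internally; no byte indexing needed.
        if (price.canFind('€'))
            writeln("found the euro sign");
    }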
December 03, 2003 Re: string with encoding( suggestion )
Posted in reply to Elias Martenson

"Elias Martenson" <elias-m@algonet.se> wrote in message news:bqkc8e$233i$1@digitaldaemon.com...

[snip]

> This cannot be stressed enough: being able to dereference individual components of a UTF-8 or UTF-16 string is a recipe for failure. There are hardly any situations where anyone needs this. The only time would be a UTF-8 en/decoder, but that functionality should be built in anyway.

I will just add moral support to all that E.M. said above. Even in languages written with chars fitting into 8 bits, the current situation is a mess in ALL computer languages, but Java is getting there (slowly).

We have several groups of "customers" here:

- Native English-speaking developers (with a large industrialized "home" market)
- "8 bits is enough" developers: most of the rest of the industrialized world (still has problems with different codepages, though)
- "16 bits is enough" developers: accounts for most of the rest of developers (as of today), but not all
- "32 bits rules": groups of all of the above with experience and insight, plus a few not fitting within the 16-bit limitations

If D is to be seen as a first-choice language around the world, it has to *enforce* UNICODE for strings. If we all agree on that (not today maybe, but what about 5 years from now?), a design should be selected that caters for ALL, before the language hits 1.0. If not, it will just be "one more language", no matter what other nice features it has. This can really set it apart.

32-bit UNICODE chars resolve all these problems. The only "problem" they generate is more memory used. With the RAM prices (still declining) we have today, I refuse to seriously consider it a problem.

How about making this THE focus point of D? "D 1st international language". "D 1st language with real international focus". Certainly an easy-to-communicate advantage, as all the marketdroids I ever worked with seem to want.

For those who know next to nothing about these issues, I would recommend http://www.cl.cam.ac.uk/~mgk25/unicode.html as an introduction to some of the issues (but I disagree with their reason for basing efforts around UTF-8).

Those who decide to take this on should:
1. Grab the String interface from Java 1.5.
2. Learn from the other Java-based String implementation performance improvers.
3. Use as much as possible from the IBM lib for UNICODE.

So much to do, never enough time... This feature alone has the potential to make D a language with very widespread use, and should be a very nice "hook" for getting people to write about it.

Roald
December 03, 2003 Re: string with encoding( suggestion )
Posted in reply to Elias Martenson

Elias Martenson wrote:
> Hauke Duden wrote:
>> And last, but not least, I think the D character type should always be 32 bit. Then it would be a real, decoded Unicode character, not a code point. Since the decoding is done internally by the string classes, there is really no need to have different character sizes.
>
> I think I agree with you, but I'm not sure what you mean by "real, decoded Unicode character, not a code point"? If you are referring to the bytes that make up a UTF-8 character, then I agree with you (but that's not called a code point).

Never write such a thing when you're tired! ;) Where I wrote "code point", I meant "encoding elements", i.e. bytes for UTF-8 and 16-bit ints for UTF-16.

> A code point is an individual character "position" as defined by Unicode. Are you saying that the "char" type should be able to hold a completed composite character, including combining diacritical marks? In that case I don't agree with you, and no other language even attempts this.

God, no. Combining marks into what the end user thinks of as a "character" needs to be done on another layer, probably just before or while they are printed.

Hauke
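The three layers Hauke distinguishes (encoding elements, code points, user-perceived characters) can be made concrete with a hedged sketch in present-day D; std.uni.byGrapheme is today's library name, long after this thread, so take it as an assumption. The same short text has a different length at each layer:

    import std.stdio;
    import std.range : walkLength;
    import std.uni : byGrapheme;

    void main()
    {
        string s = "a\u030A";               // 'a' + COMBINING RING ABOVE = "å"
        writeln(s.length);                  // 3: UTF-8 encoding elements (bytes)
        writeln(s.walkLength);              // 2: decoded code points
        writeln(s.byGrapheme.walkLength);   // 1: user-perceived character
    }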
December 03, 2003 Re: string with encoding( suggestion )
Posted in reply to Roald Ribe

On Wed, 03 Dec 2003 17:11:54 +0100, Roald Ribe wrote:
> For those who know next to nothing about these issues, I would recommend http://www.cl.cam.ac.uk/~mgk25/unicode.html as an introduction to some of the issues (but I disagree with their reason for basing efforts around UTF-8).

Well, the efforts he refers to are about external data representation in the operating system. There, UTF-8 is the only reasonable encoding to use, or everything would break. This is because Unix is so focused around ASCII files. UTF-8 is a very neat way to go Unicode without breaking anything.

> Those who decide to take this on should:
> 1. Grab the String interface from Java 1.5.
> 2. Learn from the other Java-based String implementation performance improvers.
> 3. Use as much as possible from the IBM lib for UNICODE.

Agreed. This is very important. However, there are some fundamental changes that need to be made to the language, some of which are going to break a lot of existing code (although some people would argue that that code is already broken, since it doesn't handle Unicode). Does Walter have anything to say on this subject?

> So much to do, never enough time...

Again, agreed. However, I believe it needs to be done now, because fixing it later will be next to impossible.

> This feature alone has the potential to make D a language with very widespread use, and should be a very nice "hook" for getting people to write about it.

Indeed. But as some other person already mentioned, there is a lot more to be done than just getting 32-bit chars. It's the first step, though.

Regards

Elias
January 03, 2004 Re: string with encoding( suggestion )
Posted in reply to I

"I" <I_member@pathlink.com> wrote in message news:bqhohh$1781$1@digitaldaemon.com...
> In article <bqhnd6$15f8$1@digitaldaemon.com>, Elias Martenson says...
>> In fact, I feel that the char and wchar types are useless in that they serve no practical purpose. The documentation says "wchar - unsigned 8 bit UTF-8". The only UTF-8 encoded characters that fit inside a char are ones with a code point less than 128, i.e. ASCII characters.
>
> It must be a bug in the documentation. Under Windows wchar = UTF-16 and under Linux wchar = UTF-32.

This is not correct. A D wchar is always UTF-16. A D dchar is always UTF-32.
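A small sketch of the invariant Walter states, in present-day D syntax (static assert as used here may postdate this exchange, but the sizes themselves follow from the language definition):

    // D's character types have fixed, platform-independent sizes:
    static assert(char.sizeof  == 1);   // UTF-8 code unit
    static assert(wchar.sizeof == 2);   // UTF-16 code unit
    static assert(dchar.sizeof == 4);   // UTF-32 code unit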
January 03, 2004 Re: string with encoding( suggestion )
Posted in reply to Roald Ribe

"Roald Ribe" <rr.no@spam.teikom.no> wrote in message news:bql1dn$30vm$1@digitaldaemon.com...
> 32-bit UNICODE chars resolve all these problems. The only "problem" they generate is more memory used. With the RAM prices (still declining) we have today, I refuse to seriously consider it a problem.

I agree with you that internationalization and support for Unicode are of very large importance for the success of D. D's current support for it is weak, but I think that is more a library issue than a language-definition one.

I've written a large server app that was internally 100% UTF-32. It used all available memory and then went into swap. The 4x memory consumption from UTF-32 cost plenty in performance; if I'd done it with UTF-8, I'd have had a happier customer.

The reality is that D must offer the programmer a choice of UTF-8, UTF-16, and UTF-32, and make it easy to use each, as the optimal representation for one app will be suboptimal for the next. Since D is a systems programming language, coming off poorly on benchmarks is not affordable <g>.
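To put a rough number on the overhead Walter describes, a hedged sketch in present-day D (the string/wstring/dstring aliases are assumptions from today's language, not the 2004 one). For mostly-Latin text, UTF-32 costs roughly four times what UTF-8 does:

    import std.stdio;

    void main()
    {
        string  s8  = "Mårtenson";    // UTF-8
        wstring s16 = "Mårtenson"w;   // UTF-16
        dstring s32 = "Mårtenson"d;   // UTF-32

        writeln(s8.length  * char.sizeof);    // 10 bytes
        writeln(s16.length * wchar.sizeof);   // 18 bytes
        writeln(s32.length * dchar.sizeof);   // 36 bytes
    }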
January 03, 2004 Re: string with encoding( suggestion )
Posted in reply to Elias Martenson

"Elias Martenson" <elias-m@algonet.se> wrote in message news:bqkd9l$24fv$1@digitaldaemon.com...
> An example:
>
>     char[] myString = ...;
>
>     for (int c = 0 ; c < myString.size ; c++) {
>         if (myString[c] == '€') {
>             // found the euro sign
>         }
>     }
>
> This is perfectly reasonable code that doesn't work in D today, but would work if chars were 32-bit. Code like this should work, in my opinion.

It will work as written if you define myString as dchar[] rather than char[].
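For completeness, a hedged sketch of Walter's suggestion in compilable form. The string value is a hypothetical filler, and I've used the .length property where the original wrote .size; both are my assumptions, not part of the original exchange:

    import std.stdio;

    void main()
    {
        dchar[] myString = "you owe me €20"d.dup;

        for (int c = 0; c < myString.length; c++)
        {
            if (myString[c] == '€')
                writeln("found the euro sign at index ", c);
        }
    }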