December 16, 2003
Ben Hinkle wrote:

> I think Walter once said char had been called 'ascii'. That doesn't sound
> all that bad to me. Perhaps we should have the primitive types
> 'ascii','utf8','utf16' and 'utf32' and remove char, wchar and dchar. Insane,
> I know, but at least then you never will mistake an ascii[] for a utf32[]
> (or a utf8[], for that matter).

No. This would be extremely bad. The (unfortunately) very large number of English-only programmers will use "ascii" exclusively, and we'll end up with yet another English/Latin-only language.

ASCII really has no place in modern computing environments anymore. All operating systems and languages have migrated, or are in the process of migrating, to Unicode.

Regards

Elias Mårtenson
December 16, 2003
uwem wrote:

> You mean icu?!
> 
> http://oss.software.ibm.com/icu/

Yes that's it! No wonder I didn't find it, I was searching for "classes for unicode".

Regards

Elias Mårtenson
December 16, 2003
"Elias Martenson" <elias-m@algonet.se> wrote in message news:brn3tp$t93$1@digitaldaemon.com...
> Ben Hinkle wrote:
>
> > I think Walter once said char had been called 'ascii'. That doesn't sound
> > all that bad to me. Perhaps we should have the primitive types
> > 'ascii','utf8','utf16' and 'utf32' and remove char, wchar and dchar. Insane,
> > I know, but at least then you never will mistake an ascii[] for a utf32[]
> > (or a utf8[], for that matter).
>
> No. This would be extremely bad. The (unfortunately) very large number of English-only programmers will use "ascii" exclusively, and we'll end up with yet another English/Latin-only language.
>
> ASCII really has no place in modern computing environments anymore. All operating systems and languages have migrated, or are in the process of migrating, to Unicode.

But ASCII has a place in a practical programming language designed to work with legacy systems and code. If you pass a utf-8 or utf-32 format string that isn't ASCII to printf, it probably won't print out what you want. That's life.
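
A minimal sketch of that boundary, using today's D spelling (the printf prototype is declared by hand here so the snippet stands alone):

    extern (C) int printf(const(char)* format, ...);

    void main()
    {
        // D string literals are UTF-8 and carry an implicit '\0' terminator,
        // so their bytes can be handed to printf as-is.
        string s = "naïve";
        printf("%s\n", s.ptr);   // the UTF-8 bytes pass through unchanged;
                                 // a Latin-1 console shows them as garbage
    }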

In terms of encouraging a healthy, happy future... the only thing the D language definition can do is choose what type to use for string literals. ie, given the declarations

 void foo(ascii[]);
 void foo(utf8[]);
 void foo(utf32[]);

what function does

 foo("bar")

call? Right now it would call foo(utf8[]). You are arguing it should call
foo(utf32[]). I am on the fence about what it should call.
Phobos should have routines to handle any encoding - ascii (or just rely on
std.c for these), utf8, utf16 and utf32.

-Ben


December 16, 2003
Walter wrote:
>>The overloading issue is interesting, but may I suggest that char and wchar
>>are at least renamed to something more appropriate? Maybe utf8byte and
>>utf16byte? I feel it's important to point out that they aren't characters.
>
> I see your point, but I just can't see making utf8byte into a keyword <g>.
> The world has already gotten used to multibyte 'char' in C and the funky
> 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, so I don't see much
> of an issue here.

This is simply not true, Walter. The world has not gotten used to multibyte chars in C at all. A lot of English-speaking programmers simply treat chars as ASCII characters, even if there's some comment somewhere stating that the data should be UTF-8.

I agree with Elias that the "char" type should be 32 bit, so that people who simply use a char array as a string, as they have done for years in other languages, will actually get the behaviour they expect, without losing the Unicode support.

Btw: this could also be used to solve the "oops, I forgot to make the string null-terminated" problem when interacting with C functions. If the D char is a different type than the old C char (which could be called char_c or charz instead) then people will automatically be reminded that they need to convert them.
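
A small sketch of what that explicit conversion step looks like with Phobos as it stands (assuming std.string.toStringz):

    import std.string : toStringz;

    extern (C) int puts(const(char)* s);

    void main()
    {
        string msg = "hello from D";
        // The conversion appends the '\0' terminator that C expects; with
        // distinct D and C character types, forgetting this step would become
        // a compile-time error rather than a silent bug.
        puts(toStringz(msg));
    }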

So how about the following proposal:

- char is a 32 bit Unicode character
- wcharz (or wchar_c? c_wchar?) is a C wide char character of either 16 or 32 bits (depending on the system), provided for interoperability with C functions
- charz (or char_c? c_char?) is a normal 8 bit C character, also provided for interoperability with C functions

UTF-8 and UTF-16 strings could simply use ubyte and ushort types. This would at the same time remind users that the elements are NOT characters but simply a bunch of binary data. I don't see the need to define a new type for these - there are a lot of encodings out there, so why treat UTF-8 and UTF-16 specially?

With this system it would be instantly obvious that D strings are Unicode. Interacting with legacy C code is still possible, and accidentally passing a wrong (e.g. UTF-8) string to a C function that expects ASCII or Latin-1 is impossible. Also, pure D code will automatically be UTF-32, which is exactly what you need if you want to make the lives of newbies easier. Otherwise people WILL end up using ASCII strings when they start out.
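
To make the proposal concrete, here is a rough sketch of how the names might map onto existing types (charz, wcharz and unichar are hypothetical; aliases to today's types stand in for them so the snippet compiles):

    // Hypothetical mapping of the proposed names onto existing D types.
    alias unichar = dchar;   // the proposal's 32-bit "char"
    alias charz   = char;    // 8-bit C char, for C interoperability only
    alias wcharz  = wchar;   // C's wchar_t (16 bits here; 32 on some systems)

    void main()
    {
        unichar[] text = "Mårtenson"d.dup;  // one code point per element
        ubyte[]   utf8Data;                 // UTF-8 carried as plain bytes
        ushort[]  utf16Data;                // UTF-16 carried as 16-bit units
    }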

Hauke
December 16, 2003
Hauke Duden wrote:

>> I see your point, but I just can't see making utf8byte into a keyword <g>.
>> The world has already gotten used to multibyte 'char' in C and the funky
>> 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, so I don't see much
>> of an issue here.
> 
> This is simply not true, Walter. The world has not gotten used to multibyte chars in C at all. A lot of English-speaking programmers simply treat chars as ASCII characters, even if there's some comment somewhere stating that the data should be UTF-8.

I agree. You are better at explaining these things than I am. :-)

> I agree with Elias that the "char" type should be 32 bit, so that people who simply use a char array as a string, as they have done for years in other languages, will actually get the behaviour they expect, without losing the Unicode support.

Indeed. In many cases existing code would actually continue working, since char[] would still declare a string. It wouldn't work when using legacy libraries, though, but those don't work as-is anyway because of the zero-termination issue.

> Btw: this could also be used to solve the "oops, I forgot to make the string null-terminated" problem when interacting with C functions. If the D char is a different type than the old C char (which could be called char_c or charz instead) then people will automatically be reminded that they need to convert them.

Exactly.

> So how about the following proposal:
> 
> - char is a 32 bit Unicode character
> - wcharz (or wchar_c? c_wchar?) is a C wide char character of either 16 or 32 bits (depending on the system), provided for interoperability with C functions
> - charz (or char_c? c_char?) is a normal 8 bit C character, also provided for interoperability with C functions
> 
> UTF-8 and UTF-16 strings could simply use ubyte and ushort types. This would at the same time remind users that the elements are NOT characters but simply a bunch of binary data. I don't see the need to define a new type for these - there are a lot of encodings out there, so why treat UTF-8 and UTF-16 specially?
> 
> With this system it would be instantly obvious that D strings are Unicode. Interacting with legacy C code is still possible, and accidentally passing a wrong (e.g. UTF-8) string to a C function that expects ASCII or Latin-1 is impossible. Also, pure D code will automatically be UTF-32, which is exactly what you need if you want to make the lives of newbies easier. Otherwise people WILL end up using ASCII strings when they start out.

We have to keep in mind that in most cases, when you call a legacy C function accepting (char *), the correct thing is to pass in a UTF-8 encoded string. The number of functions that actually fail when doing so is quite small.

What I'm saying here is that there are actually few "C functions that expect ASCII or Latin-1". Most of them expect a (char *) and work on it as if it were a byte array. Compare this to my (and your) suggestion of using byte[] (or ubyte[]) for UTF-8 strings.

Regards

Elias Mårtenson
December 16, 2003
Ben Hinkle wrote:

> But ASCII has a place in a practical programming language designed to work
> with legacy systems and code. If you pass a utf-8 or utf-32 format string
> that isn't ASCII to printf, it probably won't print out what you want. That's
> life.

For legacy code, you should have to take an extra step to make it work. However, it should certainly be possible. Allow me to compare to how Java does it:

    String str = "this is a unicode string";

    byte[] asciiString = str.getBytes("ASCII");

You can also convert it to UTF-8 if you like:

    byte[] utf8String = str.getBytes("UTF-8");
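
For comparison, a rough D analogue of that explicit step (a sketch assuming Phobos' std.utf; an ASCII/Latin-1 conversion would need a separate transcoding routine, which I leave out):

    import std.utf : toUTF8;

    void main()
    {
        dstring str = "this is a unicode string"d; // one code point per element
        string utf8 = toUTF8(str);                  // the re-encoding is explicit
    }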

If the "default" string type in D is a simple ASCII string, do you honestly think that programmers who only speak english will even bother to do the right thing? Do you think they will even know that they are writing effectively broken code?

I am suffering from these kinds of bugs every day (I speak Swedish natively, but also need to work with Cyrillic), and let me tell you: 99% of all problems I have are caused by bugs similar to this.

Also, I don't think it's a good idea to design a language around a legacy character set (ASCII) which will hopefully be gone in a few years (for newly written programs that is).

> In terms of encouraging a healthy, happy future... the only thing the D
> language definition can do is choose what type to use for string literals.
> ie, given the declarations
> 
>  void foo(ascii[]);
>  void foo(utf8[]);
>  void foo(utf32[]);
> 
> what function does
> 
>  foo("bar")
> 
> call? Right now it would call foo(utf8[]). You are arguing it should call
> foo(utf32[]). I am on the fence about what it should call.

Yes, with Walter's previous posting in mind, I argue that if foo() is overloaded with all three string types, it would call the dchar[] version. If one is not available, it would fall back to the wchar[] version, and lastly the char[] version.

Then again, I also argue that there should be a way of using the supertype, "string", to avoid having to mess with the overloading and transparent string conversions.

> Phobos should have routines to handle any encoding - ascii (or just rely on
> std.c for these), utf8, utf16 and utf32.

The C standard library has largely migrated away from pure ASCII. The old ASCII-only functions are there for backwards-compatibility reasons, but people still tend to use them; that's not the language's fault, though, but rather the developers'.

Regards

Elias Mårtenson
December 16, 2003
Elias Martenson wrote:
> We have to keep in mind that in most cases, when you call a legacy C function accepting (char *), the correct thing is to pass in a UTF-8 encoded string. The number of functions that actually fail when doing so is quite small.

They are not quite as few as one may think. For example, if you pass a UTF-8 string to fopen then it will only work correctly if the filename is made up of ASCII characters only. printf will print garbage if you pass it a UTF-8 character. If you use scanf to read a string from stdin then the returned string will not be UTF-8, so you have to deal with that. The is-functions (isalpha, etc.) will not work correctly for all characters. toupper, tolower, etc. are not able to work with non-ASCII characters. The list goes on...
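
To illustrate the toupper/isalpha point, a small sketch (reaching the C binding through core.stdc.ctype is my assumption here):

    import core.stdc.ctype : toupper;

    void main()
    {
        string s = "å";           // two UTF-8 code units: 0xC3, 0xA5
        // C's toupper sees one byte at a time, never the character 'å';
        // in the default "C" locale the byte simply comes back unchanged.
        auto b = toupper(s[0]);
    }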

Pretty much the only things I can think of that will work correctly under all circumstances are simple C functions that pass strings through unmodified (if they modify them they might slice them in the middle of a UTF-8 sequence).

IMHO, the safest way to call C functions is to pass them strings encoded using the current system code page, because that's what the CRT expects a char array to be. Since the code page is different from system to system this makes a runtime conversion pretty much inevitable, but there's no way around that if you want Unicode support.

Hauke

December 16, 2003
> If the "default" string type in D is a simple ASCII string, do you honestly think that programmers who only speak english will even bother to do the right thing? Do you think they will even know that they are writing effectively broken code?

I didn't say the default type should be ASCII. I just said it should be explicit when it is ASCII. For example, I think printf should be declared as accepting an ascii* format string, not a char* as it is currently declared (same for fopen etc etc). I said I didn't know what the default type should be, though I'm leaning towards UTF-8 so that casting to ascii[] doesn't have to reallocate anything.
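
A sketch of what such a declaration could look like (the ascii type is hypothetical; an alias to the existing 8-bit char stands in for it so the line compiles):

    alias ascii = char;   // stand-in for the proposed type
    extern (C) int printf(const(ascii)* format, ...);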

-Ben


December 16, 2003
Hauke Duden wrote:

> They are not quite as few as one may think. For example, if you pass a UTF-8 string to fopen then it will only work correctly if the filename is made up of ASCII characters only.

Depends on the OS. Unix handles it perfectly.

> printf will print garbage if
> you pass it a UTF-8 character. If you use scanf to read a string from stdin then the returned string will not be UTF-8, so you have to deal with that. The is-functions (isalpha, etc.) will not work correctly for all characters. toupper, tolower, etc. are not able to work with non-ASCII characters. The list goes on...

Exactly. But the number of functions that do these things is still pretty small, compared to the total number of functions accepting strings. Take a look at your own code and try to classify the functions as UTF-8 safe or not. I think you'll be surprised.

> Pretty much the only things I can think of that will work correctly under all circumstances are simple C functions that pass strings through unmodified (if they modify them they might slice them in the middle of a UTF-8 sequence).

And, believe it or not, those make up the major part of all such functions.

But, the discussion is really irrelevant since we both agree that it is inherently unsafe.

Regards

Elias Mårtenson
December 16, 2003
Ben Hinkle wrote:

>>If the "default" string type in D is a simple ASCII string, do you
>>honestly think that programmers who only speak English will even bother
>>to do the right thing? Do you think they will even know that they are
>>writing effectively broken code?
> 
> I didn't say the default type should be ASCII. I just said it should be
> explicit when it is ASCII.

But for all intents and purposes, ASCII does not exist anymore. It's a legacy character set, and it should certainly not be the "natural" way of dealing with strings.

Believe it or not, there are a lot of programmers out there who still believe that ASCII ought to be enough for anybody.

> For example, I think printf should be declared as
> accepting an ascii* format string, not a char* as it is currently declared
> (same for fopen etc etc).

But printf() works very well with UTF-8 in most cases.

> I said I didn't know what the default type should be, though I'm leaning
> towards UTF-8 so that casting to ascii[] doesn't have to reallocate anything.

True. But then again, isn't the intent to try to avoid legacy calls as much as possible? Is it a good idea to set the default in order to accommodate a legacy character set?

You have to remember that UTF-8 is very inefficient for random access. Suppose you have a 10,000-character string, and you want to retrieve the 9000th character from that string. If the string is UTF-32, that means a single memory lookup. With UTF-8 it could mean anywhere between 9000 and 54,000 memory lookups. Now imagine if the string is ten or one hundred times as long...

Now, with the current design, what many people are going to do is:

    char c = str[9000];
    // now play happily(?) with the char "c" that probably isn't the
    // 9000th character and may have been part of a UTF-8 multi-byte
    // sequence

Again, this is a huge problem. The bug will not be evident until some other person (me, for example) tries to use non-ASCII characters. The above broken code may have run through every single test that the developer wrote, simply because he didn't think of putting a non-ASCII character in the string.
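
For contrast, a sketch of code that actually fetches the 9000th code point, paying the cost described above (assumes Phobos' std.utf; the helper names are mine):

    import std.utf : toUTF32;

    // Decode the whole string once, then index in O(1).
    dchar nthCodePoint(string str, size_t n)
    {
        auto decoded = toUTF32(str);  // walks and allocates once
        return decoded[n];
    }

    // Or walk the UTF-8 sequence, decoding as you go: O(n) per lookup.
    dchar nthCodePointLazy(string str, size_t n)
    {
        size_t i = 0;
        foreach (dchar ch; str)       // foreach over char[] decodes code points
        {
            if (i++ == n)
                return ch;
        }
        assert(0, "index out of range");
    }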

This is a real problem, and it desperately needs to be solved. Several solutions have already been presented; the question is just which one of them Walter will support. He already explained his position in the previous post, but it seems that there are still some things to be said.

Regards

Elias Mårtenson