June 08, 2010
Please use the "Reply" button.

On 08.06.2010 08:50, Ruslan Nikolaev wrote:
>>
>> Maybe "lousy" is too strong a word, but aside from
>> compatibility with other
>> libs/software that use it (which I'll address separately),
>> UTF-16 is not
>> particularly useful compared to UTF-8 and UTF-32:
> ...
>>
>
> I tried to avoid commenting on this because I am afraid we'll stray from the main point (which is not a discussion about which Unicode encoding is better). But in short I would say: "Not quite right". UTF-16, as already mentioned, is generally faster for non-Latin letters (reading 2 bytes of aligned data takes the same time as reading 1 byte). Although I am not familiar with Asian languages, I believe that UTF-16 requires just 2 bytes instead of 3 for most of their symbols. That is one of the reasons they don't like UTF-8. UTF-32 doesn't have any advantage except being fixed length. It has a lot of unnecessary memory, cache, etc. overhead (the worst case for both UTF-8 and UTF-16) which is not justified for any language.
>
>>
>> First of all, it's not exactly unheard of for big projects
>> to make a
>> sub-optimal decision.
>
> I would say the decision was quite optimal for many reasons, including that "lousy programming" will not cause as many problems as in the case of UTF-8.
>
>>
>> Secondly, Java and Windows adopted 16-bit encodings back
>> when many people
>> were still under the mistaken impression that it would allow
>> them to hold any
>> character in one code-unit. If that had been true, then it
>
> I doubt that it was the only reason. UTF-8 was already available before Windows NT was released. It would have been much easier to use UTF-8 instead of ANSI than to create a parallel API. Nonetheless, UTF-16 was chosen. In addition, C# was released after UTF-16 had already become variable-length. I doubt that conversion overhead (which is small compared to the VM overhead) was the main reason to preserve UTF-16.
>
>
> Concerning why I say that it's good to have conversion to UTF-32 (you asked somewhere):
>
> I think you misunderstood what I meant. It is a very common practice, and in fact required, to convert from both UTF-8 and UTF-16 to UTF-32 when you need to do character analysis (e.g. mbtowc() in C). In fact, it is the only place where UTF-32 is commonly used and useful.
>
>
>
>

June 08, 2010
"Ruslan Nikolaev" <nruslan_devel@yahoo.com> wrote in message news:mailman.128.1275979841.24349.digitalmars-d@puremagic.com...
>>
>> Secondly, Java and Windows adopted 16-bit encodings back
>> when many people
>> were still under the mistaken impression that it would allow
>> them to hold any
>> character in one code-unit. If that had been true, then it
>
> I doubt that it was the only reason. UTF-8 was already available before Windows NT was released. It would have been much easier to use UTF-8 instead of ANSI than to create a parallel API. Nonetheless, UTF-16 was chosen.
>

I didn't say that was the only reason. Also, you've misunderstood my point:

Their reasoning at the time:
    8-bit: Multiple code-units for some characters
    16-bit: One code-unit per character
    Therefore, use 16-bit.

Reality:
    8-bit: Multiple code-units for some characters
    16-bit: Multiple code-units for some characters
    Therefore, old reasoning not necessarily still applicable.

> In addition, C# was released after UTF-16 had already become variable-length.

Right, like I said, C#/.NET use UTF-16 because that's what MS had already standardized on.

> I doubt that conversion overhead (which is small compared to the VM overhead) was the main reason to preserve UTF-16.

I never said anything about conversion overhead being a reason to preserve UTF-16.

>
> Concerning why I say that it's good to have conversion to UTF-32 (you asked somewhere):
>
> I think you misunderstood what I meant. It is a very common practice, and in fact required, to convert from both UTF-8 and UTF-16 to UTF-32 when you need to do character analysis (e.g. mbtowc() in C). In fact, it is the only place where UTF-32 is commonly used and useful.
>

I'm well aware why UTF-32 is useful. Earlier, you had started out saying that there should only be one string type, the OS-native type. Now you're changing your tune and saying that we do need multiple types.


June 08, 2010
"Nick Sabalausky" <a@a.a> wrote in message news:huktq1$8tr$1@digitalmars.com...
> "Ruslan Nikolaev" <nruslan_devel@yahoo.com> wrote in message news:mailman.128.1275979841.24349.digitalmars-d@puremagic.com...
>> In addition, C# was released after UTF-16 had already become variable-length.
>
> Right, like I said, C#/.NET use UTF-16 because that's what MS had already standardized on.
>

s/UTF-16/16-bit/  It's getting late and I'm starting to mix terminology...


June 08, 2010
> 
> I'm well aware why UTF-32 is useful. Earlier, you had
> started out saying
> that there should only be one string type, the OS-native
> type. Now you're
> changing your tune and saying that we do need multiple
> types.
> 

No. From the very beginning I said "it would also be nice to have some builtin function for conversion to dchar". That means it would be nice to have a function that converts from tchar (regardless of its width) to UTF-32. The reason was always clear - you normally don't need UTF-32 chars/strings, but for some character analysis you might need them.



June 08, 2010
On 06/08/2010 03:12 AM, Nick Sabalausky wrote:
> "Nick Sabalausky"<a@a.a>  wrote in message
> news:huktq1$8tr$1@digitalmars.com...
>> "Ruslan Nikolaev"<nruslan_devel@yahoo.com>  wrote in message
>> news:mailman.128.1275979841.24349.digitalmars-d@puremagic.com...
>>> In addition, C# was released after UTF-16 had already become
>>> variable-length.
>>
>> Right, like I said, C#/.NET use UTF-16 because that's what MS had already
>> standardized on.
>>
>
> s/UTF-16/16-bit/  It's getting late and I'm starting to mix terminology...

s/16-bit/UCS-2/

The story is that Windows standardized on UCS-2, which is the uniform 16-bit-per-character encoding that predates UTF-16. When UCS-2 turned out to be insufficient, it was extended to the variable-length UTF-16. As has been discussed, that has been quite unpleasant because a lot of code out there handles strings as if they were UCS-2.
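A small illustration of where the UCS-2 assumption breaks (U+1D11E, a musical symbol, is just an arbitrary non-BMP code point):

    unittest {
        // U+1D11E lies outside the BMP, so even UTF-16 needs two code
        // units (a surrogate pair) for it, while UTF-32 needs only one.
        // Code that assumes UCS-2 sees two "characters" here.
        wstring w = "\U0001D11E";
        dstring d = "\U0001D11E";
        assert(w.length == 2); // two UTF-16 code units
        assert(d.length == 1); // one code point
    }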

Andrei
June 08, 2010
On 2010-06-08 04:15:50 -0400, Ruslan Nikolaev <nruslan_devel@yahoo.com> said:

> No. From the very beginning I said "it would also be nice to have some builtin function for conversion to dchar". That means it would be nice to have a function that converts from tchar (regardless of its width) to UTF-32. The reason was always clear - you normally don't need UTF-32 chars/strings, but for some character analysis you might need them.

Is this what you want?

	version (utf16)
		alias wchar tchar;
	else
		alias char tchar;

	alias immutable(tchar)[] tstring;

	import std.utf;

	unittest {
		tstring tstr = "hello";
		dstring dstr = toUTF32(tstr);
	}


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

June 08, 2010
> 
> Is this what you want?
> 
>     version (utf16)
>         alias wchar tchar;
>     else
>         alias char tchar;
> 
>     alias immutable(tchar)[] tstring;
> 
>     import std.utf;
> 
>     unittest {
>         tstring tstr = "hello";
>         dstring dstr = toUTF32(tstr);
>     }
> 

Yes, I think something like this, but standardized by the language. It would also be nice to have, for interoperability (as I also mentioned in the beginning), toUTF16, toUTF8, fromUTF16, fromUTF8 and fromUTF32, since tchar can be anything. If tchar is UTF-16 and you call toUTF16, it won't do an actual conversion; it will just use the input string instead. Something like this.
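For example, a rough sketch of the idea (asUTF16 is just a placeholder name to avoid clashing with std.utf.toUTF16, which is assumed to do the real transcoding):

    import std.utf;

    version (utf16)
        alias wchar tchar;
    else
        alias char tchar;

    alias immutable(tchar)[] tstring;

    // Sketch only: return the input unchanged when tchar is already
    // UTF-16, otherwise transcode it with std.utf.
    wstring asUTF16(tstring s) {
        static if (is(tchar == wchar))
            return s;          // no conversion, no copy
        else
            return toUTF16(s);
    }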

The other point of argument is whether to use this kind of type as the main character type. My point was that having this kind of type used in dynamic libraries would be nice, since you don't need to provide instances for every other character type and, at the same time, you use the native character encoding available on the system. Of course that does not mean you should be deprived of other types. If you need a specific type to do something specific, you can always use it.



June 08, 2010
On 2010-06-08 09:22:02 -0400, Ruslan Nikolaev <nruslan_devel@yahoo.com> said:

> you don't need to provide instances for every other character type and, at the same time, you use the native character encoding available on the system.

My opinion is that thinking this will work is a fallacy. Here's why...

Generally Linux systems use UTF-8 so I guess the "system encoding" there will be UTF-8. But then if you start to use QT you have to use UTF-16, but you might have to intermix UTF-8 to work with other libraries in the backend (libraries which are not necessarily D libraries, nor system libraries). So you may have a UTF-8 backend (such as the MySQL library), UTF-8 "system encoding" glue code, and UTF-16 GUI code (QT). That might be a good or a bad choice, depending on various factors, such as whether the glue code sends more strings to the backend or the GUI.

Now try to port the thing to Windows where you define the "system encoding" as UTF-16. Now you still have the same UTF-8 backend, and the same UTF-16 GUI code, but for some reason you're changing the glue code in the middle to UTF-16? Sure, it can be made to work, but all the string conversions will start to happen elsewhere, which may change the performance characteristics and add some potential for bugs, and all this for no real reason.

The problem is that what you call "system encoding" is only the encoding used by the system frameworks. It is relevant when working with the system frameworks, but when you're working with any other API, you'll probably want to use the same character type as that API does, not necessarily the "system encoding". Not all programs are based on extensive use of the system frameworks. In some situations you'll want to use UTF-16 on Linux, or UTF-8 on Windows, because you're dealing with libraries that expect that (QT, MySQL).

A compiler switch is a poor choice there, because you can't mix libraries compiled with different compiler switches when that switch changes the default character type.

In most cases, it's much better in my opinion if the programmer just uses the same character type as one of the libraries he uses, sticks to that, and is aware of what he's doing. If someone really wants to deal with the complexity of supporting both character types depending on the environment it runs on, it's easy to create a "tchar" and "tstring" alias that depends on whether it's Windows or Linux, or on a custom version flag from a compiler switch, but that'll be his choice and his responsibility to make everything work. But I think in this case a better option might be to abstract all those 'strings' under a single type that works with all UTF encodings (something like [mtext]).
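For completeness, the tchar/tstring alias I mentioned could be as small as this (a sketch keyed off the OS rather than a compiler switch, reusing the names from my earlier example):

    version (Windows)
        alias wchar tchar; // UTF-16, what the Win32 "W" APIs expect
    else
        alias char tchar;  // UTF-8 on Linux and most other systems

    alias immutable(tchar)[] tstring;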

[mtext]: http://www.dprogramming.com/mtext.php

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

June 08, 2010
> 
> Generally Linux systems use UTF-8 so I guess the "system encoding" there will be UTF-8. But then if you start to use QT you have to use UTF-16, but you might have to intermix UTF-8 to work with other libraries in the backend (libraries which are not necessarily D libraries, nor system libraries). So you may have a UTF-8 backend (such as the MySQL library), UTF-8 "system encoding" glue code, and UTF-16 GUI code (QT). That might be a good or a bad choice, depending on various factors, such as whether the glue code sends more strings to the backend or the GUI.
> 
> Now try to port the thing to Windows where you define the "system encoding" as UTF-16. Now you still have the same UTF-8 backend, and the same UTF-16 GUI code, but for some reason you're changing the glue code in the middle to UTF-16? Sure, it can be made to work, but all the string conversions will start to happen elsewhere, which may change the performance characteristics and add some potential for bugs, and all this for no real reason.
> 
> The problem is that what you call "system encoding" is only the encoding used by the system frameworks. It is relevant when working with the system frameworks, but when you're working with any other API, you'll probably want to use the same character type as that API does, not necessarily the "system encoding". Not all programs are based on extensive use of the system frameworks. In some situations you'll want to use UTF-16 on Linux, or UTF-8 on Windows, because you're dealing with libraries that expect that (QT, MySQL).
> 

Agreed. True, the system encoding is not always that clear. Yet usually UTF-8 is the common choice on Linux (consider also Gtk, wxWidgets, system calls, etc.). At the same time, UTF-16 is more common on Windows (consider win32api, DFL, system calls, etc.). Some programs written in C even tend to have their own 'tchar' so that they can be compiled differently depending on the platform.

> A compiler switch is a poor choice there, because you can't mix libraries compiled with different compiler switches when that switch changes the default character type.

The compiler switch is only necessary for the system programmer. For instance, gcc also has '-fshort-wchar', which changes the width of wchar_t to 16 bits. It also DOES break code, because libraries are normally compiled for a 32-bit wchar_t. Again, it's generally not for the application programmer.

> 
> In most cases, it's much better in my opinion if the programmer just uses the same character type as one of the libraries he uses, sticks to that, and is aware of what he's doing. If someone really wants to deal with the complexity of

The programmer generally should not need to know what encoding he works with. For both UTF-8 and UTF-16, it's easy to determine the number of bytes (words) in a multibyte (multi-word) sequence just by looking at the first code unit. This could also be a builtin function (e.g. numberOfChars(tchar firstChar)). The size of each element can easily be determined with sizeof. Conversion to UTF-32 and back can be done very transparently.
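As a rough sketch of what such a builtin could look like (numberOfChars is just the hypothetical name from above; std.utf's stride is in the same spirit):

    version (utf16)
        alias wchar tchar;
    else
        alias char tchar;

    // Sketch: how many code units the sequence starting with `first`
    // occupies, assuming `first` is a valid lead unit.
    uint numberOfChars(tchar first) {
        static if (is(tchar == wchar)) {
            // UTF-16: a high surrogate starts a two-unit pair.
            return (first >= 0xD800 && first <= 0xDBFF) ? 2 : 1;
        } else {
            // UTF-8: the lead byte encodes the sequence length.
            if (first < 0x80)           return 1;
            if ((first & 0xE0) == 0xC0) return 2;
            if ((first & 0xF0) == 0xE0) return 3;
            return 4; // lead bytes 0xF0 .. 0xF4
        }
    }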

The only problem it might cause is bindings with other libraries (but in this case you can just use fromUTFxx and toUTFxx; you do this conversion anyway). Also, transferring data over the network - again, you can just stick to a particular encoding (for network and files, UTF-8 is better since it's byte-order free).

> supporting both character types depending on the environment it runs on, it's easy to create a "tchar" and "tstring" alias that depends on whether it's Windows or Linux, or on a custom version flag from a compiler switch, but that'll be his choice and his responsibility to make everything work.

If it's the programmer's choice, then almost all advantages of tchar are lost. It's like the garbage collector - if it is used by everybody, you can expect the advantages of using it. However, if it's optional, everybody will write libraries assuming no GC is available, and thus almost all performance advantages are lost.

And after all, one of the goals of D (if I am not wrong) is to be flexible, so that performance gains will be available for particular configurations if they can be achieved (it's a fully compiled language). It does not stick to something particular and say 'you must use UTF-8' or 'you must use UTF-16'.

> michel.fortin@michelf.com
> http://michelf.com/
> 
> 



June 08, 2010
Please stop top-posting - just click on the post you want to reply to and then click Reply - you're flooding the newsgroup root with replies ...

On 08.06.2010 17:11, Ruslan Nikolaev wrote:
>>
>>  Generally Linux systems use UTF-8 so I guess the "system
>>  encoding" there will be UTF-8. But then if you start to use
>>  QT you have to use UTF-16, but you might have to intermix
>>  UTF-8 to work with other libraries in the backend (libraries
>>  which are not necessarily D libraries, nor system
>>  libraries). So you may have a UTF-8 backend (such as the
>>  MySQL library), UTF-8 "system encoding" glue code, and
>>  UTF-16 GUI code (QT). That might be a good or a bad choice,
>>  depending on various factors, such as whether the glue code
>>  sends more strings to the backend or the GUI.
>>
>>  Now try to port the thing to Windows where you define the
>>  "system encoding" as UTF-16. Now you still have the same
>>  UTF-8 backend, and the same UTF-16 GUI code, but for some
>>  reason you're changing the glue code in the middle to
>>  UTF-16? Sure, it can be made to work, but all the string
>>  conversions will start to happen elsewhere, which may change
>>  the performance characteristics and add some potential for
>>  bugs, and all this for no real reason.
>>
>>  The problem is that what you call "system encoding" is only
>>  the encoding used by the system frameworks. It is relevant
>>  when working with the system frameworks, but when you're
>>  working with any other API, you'll probably want to use the
>>  same character type as that API does, not necessarily the
>>  "system encoding". Not all programs are based on extensive
>>  use of the system frameworks. In some situations you'll want
>>  to use UTF-16 on Linux, or UTF-8 on Windows, because you're
>>  dealing with libraries that expect that (QT, MySQL).
>>
>
> Agreed. True, the system encoding is not always that clear. Yet usually UTF-8 is the common choice on Linux (consider also Gtk, wxWidgets, system calls, etc.). At the same time, UTF-16 is more common on Windows (consider win32api, DFL, system calls, etc.). Some programs written in C even tend to have their own 'tchar' so that they can be compiled differently depending on the platform.
>
>>  A compiler switch is a poor choice there, because you can't
>>  mix libraries compiled with different compiler switches
>>  when that switch changes the default character type.
>
> The compiler switch is only necessary for the system programmer. For instance, gcc also has '-fshort-wchar', which changes the width of wchar_t to 16 bits. It also DOES break code, because libraries are normally compiled for a 32-bit wchar_t. Again, it's generally not for the application programmer.
>
>>
>>  In most cases, it's much better in my opinion if the
>>  programmer just uses the same character type as one of the
>>  libraries he uses, sticks to that, and is aware of what he's
>>  doing. If someone really wants to deal with the complexity of
>
> The programmer generally should not need to know what encoding he works with. For both UTF-8 and UTF-16, it's easy to determine the number of bytes (words) in a multibyte (multi-word) sequence just by looking at the first code unit. This could also be a builtin function (e.g. numberOfChars(tchar firstChar)). The size of each element can easily be determined with sizeof. Conversion to UTF-32 and back can be done very transparently.
>
> The only problem it might cause is bindings with other libraries (but in this case you can just use fromUTFxx and toUTFxx; you do this conversion anyway). Also, transferring data over the network - again, you can just stick to a particular encoding (for network and files, UTF-8 is better since it's byte-order free).
>
>>  supporting both character types depending on the environment
>>  it runs on, it's easy to create a "tchar" and "tstring"
>>  alias that depends on whether it's Windows or Linux, or on a
>>  custom version flag from a compiler switch, but that'll be
>>  his choice and his responsibility to make everything work.
>
> If it's the programmer's choice, then almost all advantages of tchar are lost. It's like the garbage collector - if it is used by everybody, you can expect the advantages of using it. However, if it's optional, everybody will write libraries assuming no GC is available, and thus almost all performance advantages are lost.
>
> And after all, one of the goals of D (if I am not wrong) is to be flexible, so that performance gains will be available for particular configurations if they can be achieved (it's a fully compiled language). It does not stick to something particular and say 'you must use UTF-8' or 'you must use UTF-16'.
>
>>  michel.fortin@michelf.com
>>  http://michelf.com/
>>
>>
>
>
>