June 08, 2010
> It only generates code for the types that are actually needed. If, for
> instance, your program never uses anything except UTF-8, then only one
> version of the function will be made - the UTF-8 version. If you don't use
> every char type, then it doesn't generate it for every char type - just the
> ones you choose to use.

Not quite right. If we create system dynamic libraries, or dynamic libraries that are commonly used, we will have to compile every instance unless we want to burden the user with this. Otherwise, the same code will be duplicated in users' programs over and over again.
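
As an illustration of what I mean, here is a minimal D sketch (the function and alias names are hypothetical) of how a library could force all three instantiations of a templated text function into its own object code, so that client programs link against them instead of regenerating the template:

// Hypothetical library function, templated over the character type.
size_t countLines(Char)(const(Char)[] text)
{
    size_t n;
    foreach (c; text)      // iterates code units; '\n' is a single unit in every encoding
        if (c == '\n')
            ++n;
    return n;
}

// Referencing the instantiations forces them into the library's object code.
alias countLines!char  countLines8;
alias countLines!wchar countLines16;
alias countLines!dchar countLines32;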

> That's not good. First of all, UTF-16 is a lousy encoding, it combines the
> worst of both UTF-8 and UTF-32: It's multibyte and non-word-aligned like
> UTF-8, but it still wastes a lot of space like UTF-32. So even if your OS
> uses it natively, it's still best to do most internal processing in either
> UTF-8 or UTF-32. (And with templated string functions, if the programmer
> actually does want to use the native type in the *rare* cases where he's
> making enough OS calls that it would actually matter, he can still do so.)
>

First of all, UTF-16 is not a lousy encoding. It requires 2 bytes for most characters (not such a big waste, especially if you consider other languages). Only for REALLY rare characters do you need 4 bytes, whereas UTF-8 requires 1 to 3 bytes for the same common characters, and also 4 bytes for the REALLY rare ones. In UTF-16 a surrogate pair is the exception, whereas in UTF-8 multi-byte sequences are the rule (when something is an exception, it won't affect performance in most cases; when something is a rule, it will).
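
To make the size comparison concrete, here is a small D sketch (the sample strings are just illustrative) that prints the storage cost of the same text as char (UTF-8) and wchar (UTF-16) code units:

import std.stdio;

void main()
{
    // .length counts code units: 1 byte for char, 2 bytes for wchar.
    writeln("hello".length, " vs ", "hello"w.length * 2);   // ASCII:    5 vs 10 bytes
    writeln("привет".length, " vs ", "привет"w.length * 2); // Cyrillic: 12 vs 12 bytes
    // A character outside the BMP costs 4 bytes in both encodings:
    writeln("\U0001D11E".length, " vs ", "\U0001D11E"w.length * 2); // 4 vs 4 bytes
}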

Finally, UTF-16 is used by a variety of systems/tools: Windows, Java, C#, Qt and many others. The developers of these systems chose UTF-16 even though some of them (e.g. Java, C#, Qt) were developed in the era of UTF-8.


> Secondly, the programmer *should* be able to use whatever type he decides is
> appropriate. If he wants to stick with native, he can do

Why? He/she can just convert to UTF-32 (dchar) whenever a better understanding of the characters is needed. At least, that's what should be done anyway.
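
For example, here is a minimal sketch of that pattern, assuming Phobos' std.utf.toUTF32: keep the text in its narrow encoding and convert only for the character-level work.

import std.stdio;
import std.utf;   // toUTF32

void main()
{
    string s = "naïve";        // stored as UTF-8 (6 bytes for 5 characters)
    auto d = toUTF32(s);       // convert once, when per-character work is needed
    writeln(d.length);         // 5 - one dchar per character
    writeln(d[2]);             // 'ï' - indexing by character is now safe
}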

> 
> You can have that easily:
> 
> version(Windows)
>     alias wstring tstring;
> else
>     alias string tstring;
> 

See, that's my point. Nobody is going to do this unless the above is standardized by the language. Everybody will stick to something particular (either char or wchar).
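
Just to spell out what I have in mind, here is a sketch of how such a standardized alias might be used; tstring comes from the quoted code above, but it is not part of the language or Phobos today, and the greet function is made up for the example:

// The alias from above, as it might look if standardized:
version(Windows)
    alias wstring tstring;   // UTF-16 on Windows
else
    alias string tstring;    // UTF-8 elsewhere

// Code written against the alias compiles to the native width on each OS.
tstring greet(tstring name)
{
    tstring prefix = "hello, ";   // string literals convert to either width
    return prefix ~ name;
}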


> 
> With templated text functions, there is very little benefit to be gained
> from having a unified char. Just wouldn't serve any real

See my comment above about templates and dynamic libraries.

Ruslan



June 08, 2010
Steven Schveighoffer wrote:
> a function that takes a char[] can also take a dchar[] if it is sent through a converter (i.e. toUtf8 on Tango I think).

In Phobos, there are text, wtext, and dtext in std.conv:

/**
   Convenience functions for converting any number and types of
   arguments into _text (the three character widths).

   Example:
   ----
   assert(text(42, ' ', 1.5, ": xyz") == "42 1.5: xyz");
   assert(wtext(42, ' ', 1.5, ": xyz") == "42 1.5: xyz"w);
   assert(dtext(42, ' ', 1.5, ": xyz") == "42 1.5: xyz"d);
   ----
*/

Ali
June 08, 2010
On Mon, 07 Jun 2010 19:26:02 -0700, Ruslan Nikolaev wrote:

>> It only generates code for the types that are actually needed. If, for
>> instance, your program never uses anything except UTF-8, then only one
>> version of the function will be made - the UTF-8 version. If you don't use
>> every char type, then it doesn't generate it for every char type - just the
>> ones you choose to use.
> 
> Not quite right. If we create system dynamic libraries, or dynamic libraries that are commonly used, we will have to compile every instance unless we want to burden the user with this. Otherwise, the same code will be duplicated in users' programs over and over again.
> 

I think you really need to look more into what templates are and do.

There is also going to be very little performance gain from using the "system type" for strings, considering that most of the work is not likely going to be in the system commands you mentioned, but within D itself.
June 08, 2010
--- On Tue, 6/8/10, Jesse Phillips <jessekphillips+D@gmail.com> wrote:

> I think you really need to look more into what templates are and do.
> 

Excuse me? Unless templates are something different in D (I can't be 100% sure since I am new to D), that should be the case; at least in C++ it would be. As I said, for libraries you need to compile every commonly used instance so that the user will not be burdened with this overhead.

http://www.digitalmars.com/d/2.0/template.html

> There is also going to be very little performance gain from using the
> "system type" for strings, considering that most of the work is not
> likely going to be in the system commands you mentioned, but within D itself.
> 

It depends. For instance, if you work with files, write to the console, use system functions, the Win32 API, or DFL, there can be overhead.
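
As a concrete example of the kind of boundary I mean, here is a minimal sketch assuming Phobos' std.utf.toUTF16z; the Win32 prototype is declared by hand just for the sketch (it normally comes from the Windows API bindings, and needs user32 at link time), and show() is a made-up helper:

import std.utf;   // toUTF16z

extern (Windows) int MessageBoxW(void* hwnd, const(wchar)* text,
                                 const(wchar)* caption, uint type);

void show(string msg)
{
    // If the program keeps its text in UTF-8, every call across the W API
    // boundary pays for a UTF-8 -> UTF-16 conversion like these two:
    MessageBoxW(null, toUTF16z(msg), toUTF16z("note"), 0);
}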




June 08, 2010
Hello Ruslan,

> --- On Tue, 6/8/10, Jesse Phillips <jessekphillips+D@gmail.com> wrote:
> 
>> I think you really need to look more into what templates are and do.
>> 
> As I said, for libraries you need to compile every commonly used instance
> so that the user will not be burdened with this overhead.

You only need to do that where you are shipping closed source and for that, it should be trivial to get the compiler to generate all three versions. 

>> There is also going to be very little performance gain from using the
>> "system type" for strings, considering that most of the work is not
>> likely going to be in the system commands you mentioned, but within D itself.
>
> It depends. For instance, if you work with files, write on the console
> output, use system functions, use Win32 api, DFL, there can be
> overhead.

You're right: it depends. In the few cases I can think of where more of the D code will be interacting with non-D code than just processing the text, you could almost use void[] as your type. Where would you care about the encoding but not do much with it?

Also, unless you have large amounts of text, you are going to have to work hard to get performance problems. If you do have large amounts of text, you are going to be I/O bound (cache misses etc.), and at that point the cost of any operation is its I/O. From that, reading in some data, doing a single pass of processing on it, and writing it back out would only take about 2/3 longer with translations on both sides.

-- 
... <IXOYE><



June 08, 2010
> You only need to do that where you are shipping closed source and for that, it should be trivial to get the compiler to generate all three versions.

You will also need to do it in open source projects if you want the generated template code to go into the dynamic library rather than into the user's program (read: an unnecessary space "burden" where the same code is repeated over and over again across user programs).

But yes, closed-source programs are a good example. True, you can compile all 3 versions. But the whole argument was about the additional generated code, which someone claimed would not happen.


> 
> You're right: it depends. In the few cases I can think of where more of the D code will be interacting with non-D code than just processing the text, you could almost use void[] as your type. Where would you care about the encoding but not do much with it?
> 
> Also, unless you have large amounts of text, you are going to have to work hard to get performance problems. If you do have large amounts of text, you are going to be I/O bound (cache misses etc.), and at that point the cost of any operation is its I/O. From that, reading in some data, doing a single pass of processing on it, and writing it back out would only take about 2/3 longer with translations on both sides.
> 

True. But even simple string handling is faster for UTF-16. The time required to read 2 bytes from a UTF-16 string is the same as reading 1 byte from a UTF-8 string. Generally, we have to read one code point after another (not more than that), since the data is guaranteed to be aligned on a 2-byte boundary for wchar and a 1-byte boundary for char. Not to mention that converting 2 code points takes less time in UTF-16. And why not use this opportunity if the system already natively supports it?
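
To show what "one code point after another" looks like in practice, here is a minimal sketch assuming Phobos' std.utf.decode, which works the same way over char[] and wchar[]; only the code-unit width differs (the helper name is made up):

import std.utf;   // decode

// decode() reads one code point and advances the index past its code units.
size_t countCodePoints(S)(S s)
{
    size_t i, n;
    while (i < s.length)
    {
        decode(s, i);   // advances i by 1-4 chars, or 1-2 wchars
        ++n;
    }
    return n;
}

// countCodePoints("привет") and countCodePoints("привет"w) both return 6.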


In addition, I want to mention that reading/writing a file in text mode is very transparent. For instance, on Windows, the conversion from multibyte to Unicode happens automatically for open, fopen, etc. when text mode is specified. In general, this is good practice, since 1-byte char text is not necessarily UTF-8 anyway and can be ANSI as well.

Also, some other OSes use 2-byte UTF-16 natively, so it's not just Windows. If I am not mistaken, Symbian is one such example.



June 08, 2010
"Ruslan Nikolaev" <nruslan_devel@yahoo.com> wrote in message news:mailman.124.1275963971.24349.digitalmars-d@puremagic.com...

Nick wrote:
> It only generates code for the types that are actually needed. If, for
> instance, your program never uses anything except UTF-8, then only one
> version of the function will be made - the UTF-8 version. If you don't use
> every char type, then it doesn't generate it for every char type - just the
> ones you choose to use.

>Not quite right. If we create system dynamic libraries, or dynamic libraries that are commonly used, we will have to compile every instance unless we want to burden the user with this. Otherwise, the same code will be duplicated in users' programs over and over again.<

That's a rather minor issue. I think you're overestimating the amount of bloat that occurs from having three string types versus one. The absolute worst-case scenario would be a library that contains nothing but text-processing functions. That would triple in size, but what's the biggest such lib you've ever seen anyway? And for most libs, only a fraction is going to be taken up by text processing, so the difference won't be particularly large. In fact, the difference would likely be dwarfed by the bloat incurred from all the other templated code (i.e. code which would be largely unaffected by the number of string types), and yes, *that* can get to be a problem, but it's an entirely separate one.

> That's not good. First of all, UTF-16 is a lousy encoding, it combines the
> worst of both UTF-8 and UTF-32: It's multibyte and non-word-aligned like
> UTF-8, but it still wastes a lot of space like UTF-32. So even if your OS
> uses it natively, it's still best to do most internal processing in either
> UTF-8 or UTF-32. (And with templated string functions, if the programmer
> actually does want to use the native type in the *rare* cases where he's
> making enough OS calls that it would actually matter, he can still do so.)
>
>First of all, UTF-16 is not a lousy encoding. It requires 2 bytes for most characters (not such a big waste, especially if you consider other languages). Only for REALLY rare characters do you need 4 bytes, whereas UTF-8 requires 1 to 3 bytes for the same common characters, and also 4 bytes for the REALLY rare ones. In UTF-16 a surrogate pair is the exception, whereas in UTF-8 multi-byte sequences are the rule (when something is an exception, it won't affect performance in most cases; when something is a rule, it will).<

Maybe "lousy" is too strong a word, but aside from compatibility with other libs/software that use it (which I'll address separately), UTF-16 is not particularly useful compared to UTF-8 and UTF-32:

Non-latin-alphabet language: UTF-8 vs UTF-16:

The real-world difference in sizes is minimal. But UTF-8 has some advantages: The nature of the encoding makes backwards-scanning cheaper and easier. Also, as Walter said, bugs in the handling of multi-code-unit characters become fairly obvious. Advantages of UTF-16: None.
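
To illustrate the backwards-scanning point, here is a minimal sketch (the helper name is made up; it assumes well-formed UTF-8 and that the starting index sits on a code-point boundary): continuation bytes always have the form 10xxxxxx, so the start of the previous code point is only ever a few bytes back.

// Step back from index i to the first code unit of the previous code point.
size_t backUpOne(const(char)[] s, size_t i)
{
    do
        --i;
    while (i > 0 && (s[i] & 0xC0) == 0x80);   // skip 10xxxxxx continuation bytes
    return i;
}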

Latin-alphabet language: UTF-8 vs UTF-16:

All the same UTF-8 advantages for non-latin-alphabet languages still apply, plus there's a space savings: Under UTF-8, *most* characters are going to be 1 byte. Yes, there will be the occasional 2+ byte character, but they're so much less common that the overhead compared to ASCII (I'm only using ASCII as a baseline here, for the sake of comparisons) would only be around 0% to 15% depending on the language. UTF-16, however, has a consistent 100% overhead (slightly more when you count surrogate pairs, but I'll just leave it at 100%). So, depending on language, UTF-16 would be around 70%-100% larger than UTF-8. That's not insignificant.

Any language: UTF-32 vs UTF-16:

Using UTF-32 takes up extra space, but when that matters, UTF-8 already has the advantage over UTF-16 anyway regardless of whether or not UTF-8 is providing a space savings (see above), so the question of UTF-32 vs UTF-16 becomes useless. The rest of the time, UTF-32 has these advantages: Guaranteed one code-unit per character. And, the code-unit size is faster on typical CPUs (typical CPUs generally handle 32-bits faster than they handle 8- or 16-bits). Advantages of UTF-16: None.

So compatibility with certain tools/libs is really the only reason ever to choose UTF-16.

>Finally, UTF-16 is used by a variety of systems/tools: Windows, Java, C#, Qt and many others. Developers of these systems chose to use UTF-16 even though some of them (e.g. Java, C#, Qt) were developed in the era of UTF-8<

First of all, it's not exactly unheard of for big projects to make a sub-optimal decision.

Secondly, Java and Windows adopted 16-bit encodings back when many people were still under the mistaken impression that it would allow them to hold any character in one code-unit. If that had been true, then it would indeed have had at least certain advantages over UTF-8. But by the time the programming world at large knew better, it was too late for Java or Windows to re-evaluate the decision; they'd already jumped in with both feet. C# and .NET use UTF-16 because Windows does. I don't know about Qt, but judging by how long Wikipedia says it's been around, I'd say it's probably the same story.

As for choosing to use UTF-16 because of interfacing with other tools and libs that use it: That's certainly a good reason to use UTF-16. But it's about the only reason. And it's a big mistake to just assume that the overhead of converting to/from UTF-16 when crossing those API borders is always going to outweigh all other concerns:

For instance, if you're writing an app that does a large amount of text-processing on relatively small amounts of text and only deals a little bit with a UTF-16 API, then the overhead of operating on 16-bits at a time can easily outweigh the overhead from the UTF-16 <-> UTF-32 conversions.

Or, maybe the app you're writing is more memory-limited than speed-limited.

There are perfectly legitimate reasons to want to use an encoding other than the OS-native. Why force those people to circumvent the type system to do it? Especially in a language that's intended to be usable as a systems language. Just to potentially save a couple megs on some .dll or .so?

> Secondly, the programmer *should* be able to use whatever type he decides is
> appropriate. If he wants to stick with native, he can do
>Why? He/she can just convert to UTF-32 (dchar) whenever a better understanding of the characters is needed. At least, that's what should be done anyway.<

Weren't you saying that the main point of having just one string type (the OS-native string) was to avoid unnecessary conversions? But now you're arguing that it's fine to do unnecessary conversions and to have multiple string types?

>
> You can have that easily:
>
> version(Windows)
>     alias wstring tstring;
> else
>     alias string tstring;
>
>See, that's my point. Nobody is going to do this unless the above is standardized by the language. Everybody will stick to something particular (either char or wchar).<

True enough. I don't have anything against having something like that in the std library as long as the others are still available too. Could be useful in a few cases. I do think having it *instead* of the three types is far too presumptuous, though.


June 08, 2010
"Ruslan Nikolaev" <nruslan_devel@yahoo.com> wrote in message news:mailman.127.1275974825.24349.digitalmars-d@puremagic.com...
>
> True. But even simple string handling is faster for UTF-16. The time required to read 2 bytes from a UTF-16 string is the same as reading 1 byte from a UTF-8 string. Generally, we have to read one code point after another (not more than that), since the data is guaranteed to be aligned on a 2-byte boundary for wchar and a 1-byte boundary for char. Not to mention that converting 2 code points takes less time in UTF-16. And why not use this opportunity if the system already natively supports it?
>

Why do you say that UTF-16 is faster than UTF-8?

>In general, this is good practice, since 1-byte char text is not necessarily UTF-8 anyway and can be ANSI as well.<
>

That's what the BOM is for.
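
For reference, here is a minimal sketch of the kind of BOM check I have in mind (the function name is made up for the example; it only looks at the first few bytes of a raw buffer):

// Common Unicode byte-order marks: UTF-8 = EF BB BF, UTF-16LE = FF FE, UTF-16BE = FE FF.
string guessEncoding(const(ubyte)[] data)
{
    if (data.length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return "UTF-8";
    if (data.length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return "UTF-16LE";
    if (data.length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return "UTF-16BE";
    return "unknown (possibly ANSI or BOM-less UTF-8)";
}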


June 08, 2010
Walter Bright Wrote:

> The problem with wchar's is that everyone forgets about surrogate pairs. Most UTF-16 programs in the wild, including nearly all Java programs, are broken with regard to surrogate pairs.

I'm afraid it will be pretty hard to show the bug. I don't know whether Java is particularly nasty here, but for C code it will be hard.
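
For what it's worth, here is a small D sketch of the kind of breakage Walter is describing: any code that treats one wchar as one character goes wrong as soon as a character outside the BMP shows up.

import std.stdio;

void main()
{
    // U+1D11E (MUSICAL SYMBOL G CLEF) is outside the BMP, so UTF-16 needs a
    // surrogate pair for it.
    wstring s = "a\U0001D11E!"w;

    writeln(s.length);                    // 4 code units, but only 3 characters
    writefln("%04x", cast(uint) s[1]);    // d834 - indexing lands on a lone high surrogate

    foreach (dchar c; s)                  // decoding while iterating handles the pair
        writef("U+%04X ", cast(uint) c);
    writeln();
}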
June 08, 2010
> 
> Maybe "lousy" is too strong a word, but aside from
> compatibility with other
> libs/software that use it (which I'll address separately),
> UTF-16 is not
> particularly useful compared to UTF-8 and UTF-32:
...
> 

I tried to avoid commenting on this because I am afraid we'll stray away from the main point (which is not a discussion about which Unicode encoding is better). But in short I would say: "Not quite right." UTF-16, as already mentioned, is generally faster for non-Latin letters (reading 2 bytes of aligned data takes the same time as reading 1 byte). Although I am not familiar with Asian languages, I believe that UTF-16 requires just 2 bytes instead of 3 for most of their symbols. That is one of the reasons they don't like UTF-8. UTF-32 doesn't have any advantage except for being fixed length. It has a lot of unnecessary memory, cache, etc. overhead (the worst-case scenario for both UTF-8 and UTF-16), which is not justified for any language.
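
A quick D sketch of the CJK case (the sample string is just "Japanese" written in Japanese; I'm assuming typical BMP characters here):

import std.stdio;

void main()
{
    // Typical CJK text: 3 bytes per character in UTF-8, 2 bytes in UTF-16.
    writeln("日本語".length);                   // 9 bytes as UTF-8
    writeln("日本語"w.length * wchar.sizeof);   // 6 bytes as UTF-16
}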

> 
> First of all, it's not exactly unheard of for big projects to make a
> sub-optimal decision.

I would say the decision was quite optimal, for many reasons, including that "lousy programming" will not cause as many problems as in the case of UTF-8.

> 
> Secondly, Java and Windows adopted 16-bit encodings back when many people
> were still under the mistaken impression that it would allow them to hold
> any character in one code-unit. If that had been true, then it

I doubt that was the only reason. UTF-8 was already available before Windows NT was released. It would have been much easier to use UTF-8 instead of ANSI, as opposed to creating a parallel API. Nonetheless, UTF-16 was chosen. In addition, C# was released after UTF-16 had already become a variable-length encoding. I doubt that the conversion overhead (which is small compared to the VM) was the main reason to keep UTF-16.


Concerning why I say that it's good to have conversion to UTF-32 (you asked somewhere):

I think you did not understand correctly what I meant. It is a very common practice, and in fact required, to convert from both UTF-8 and UTF-16 to UTF-32 when you need to do character analysis (e.g. mbtowc() in C). In fact, it is the only place where UTF-32 is commonly used and useful.
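
The D analogue of the mbtowc() idiom, as a minimal sketch (the helper name is made up): keep the text in its narrow encoding and decode to dchar only at the point where a character-level decision has to be made; foreach over a narrow string with a dchar loop variable does the decoding.

import std.stdio;

size_t countNonAscii(string s)
{
    size_t n;
    foreach (dchar c; s)   // decodes one code point per iteration
        if (c > 0x7F)
            ++n;
    return n;
}

void main()
{
    writeln(countNonAscii("naïve résumé"));   // 3
}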