June 08, 2010
Re: Wide characters support in D
> It only generates code for the types that are actually needed. If, for
> instance, your program never uses anything except UTF-8, then only one
> version of the function will be made - the UTF-8 version. If you don't
> use every char type, then it doesn't generate it for every char type -
> just the ones you choose to use.

Not quite right. If we create system dynamic libraries, or dynamic libraries that are commonly used, we will have to compile every instance unless we want to burden the user with this. Otherwise, the same code will be duplicated in users' programs over and over again.
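
For illustration, a minimal library-side sketch (countSpaces is a made-up example function; the point is just that referencing each instantiation forces the compiler to emit all three versions into the library's object code):

module textlib;

// Hypothetical templated string function; any text routine would do.
size_t countSpaces(Char)(const(Char)[] s)
{
    size_t n;
    foreach (c; s)
        if (c == ' ')
            ++n;
    return n;
}

// Referencing each instantiation here makes the compiler emit all three
// versions into this module's object file, so user programs link to them
// instead of regenerating the code.
alias countSpaces!char  countSpacesUtf8;
alias countSpaces!wchar countSpacesUtf16;
alias countSpaces!dchar countSpacesUtf32;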

> That's not good. First of all, UTF-16 is a lousy encoding, it combines
> the worst of both UTF-8 and UTF-32: It's multibyte and non-word-aligned
> like UTF-8, but it still wastes a lot of space like UTF-32. So even if
> your OS uses it natively, it's still best to do most internal processing
> in either UTF-8 or UTF-32. (And with templated string functions, if the
> programmer actually does want to use the native type in the *rare* cases
> where he's making enough OS calls that it would actually matter, he can
> still do so.)
>

First of all, UTF-16 is not a lousy encoding. It requires 2 bytes for most characters (not much waste, especially if you consider other languages). Only for REALLY rare characters do you need 4 bytes. UTF-8, by contrast, requires 1 to 3 bytes for the same common characters, and also 4 bytes for the REALLY rare ones. In UTF-16 the surrogate pair is an exception, whereas in UTF-8 the multi-byte sequence is the rule (when something is an exception, it won't affect performance in most cases; when something is the rule, it will).

Finally, UTF-16 is used by a variety of systems/tools: Windows, Java, C#, Qt and many others. The developers of these systems chose UTF-16 even though some of them (e.g. Java, C#, Qt) were developed in the era of UTF-8.


> Secondly, the programmer *should* be able to use whatever type he
> decides is appropriate. If he wants to stick with native, he can do

Why? He/she can just convert to UTF-32 (dchar) whenever a better understanding of the characters is needed. At least, that's what should be done anyway.
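
In D this conversion is effectively built in - a minimal sketch of the idea (countCodePoints is a made-up name):

size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s)  // each UTF-8 sequence is decoded to UTF-32 here
        ++n;
    return n;
}

// countCodePoints("héllo") == 5, although "héllo".length == 6 (bytes)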

> 
> You can have that easily:
> 
> version(Windows)
>     alias wstring tstring;
> else
>     alias string tstring;
> 

See, that's my point. Nobody is going to do this unless it is standardized by the language. Everybody will stick to something particular (either char or wchar).


> 
> With templated text functions, there is very little benefit to be
> gained from having a unified char. Just wouldn't serve any real

See my comment above about templates and dynamic libraries.

Ruslan
June 08, 2010
Re: Wide characters support in D
Steven Schveighoffer wrote:
> a function that takes a char[] can also take a dchar[] if it is sent
> through a converter (i.e. toUtf8 in Tango I think).

In Phobos, there are text, wtext, and dtext in std.conv:

/**
   Convenience functions for converting any number and types of
   arguments into _text (the three character widths).

   Example:
   ----
   assert(text(42, ' ', 1.5, ": xyz") == "42 1.5: xyz");
   assert(wtext(42, ' ', 1.5, ": xyz") == "42 1.5: xyz"w);
   assert(dtext(42, ' ', 1.5, ": xyz") == "42 1.5: xyz"d);
   ----
*/
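
(And std.conv.to transcodes between the three widths as well - a small usage sketch to complement the above:)

import std.conv : to;

void main()
{
    string  s = "42 1.5: xyz";
    wstring w = to!wstring(s); // UTF-8 -> UTF-16
    dstring d = to!dstring(s); // UTF-8 -> UTF-32
    assert(to!string(w) == s); // and back again
}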

Ali
June 08, 2010
Re: Wide characters support in D
On Mon, 07 Jun 2010 19:26:02 -0700, Ruslan Nikolaev wrote:

>> It only generates code for the types that are actually needed. If, for
>> instance, your program never uses anything except UTF-8, then only one
>> version of the function will be made - the UTF-8 version. If you don't
>> use every char type, then it doesn't generate it for every char type -
>> just the ones you choose to use.
> 
> Not quite right. If we create system dynamic libraries, or dynamic
> libraries that are commonly used, we will have to compile every
> instance unless we want to burden the user with this. Otherwise, the
> same code will be duplicated in users' programs over and over again.
> 

I think you really need to look more into what templates are and do.

There is also going to be very little performance gain from using the 
"system type" for strings, considering that most of the work is not 
likely to be in the system calls you mentioned, but within D itself.
June 08, 2010
Re: Wide characters support in D
--- On Tue, 6/8/10, Jesse Phillips <jessekphillips+D@gmail.com> wrote:

> I think you really need to look more into what templates are and do.
> 

Excuse me? Unless templates are something different in D (I can't be 100% sure since I am new to D), that should be the case. At least in C++, it would be. As I said, for libraries you need to compile every commonly used instance, so that the user is not burdened with this overhead.

http://www.digitalmars.com/d/2.0/template.html

> There is also going to be very little performance gain from using the
> "system type" for strings, considering that most of the work is not
> likely to be in the system calls you mentioned, but within D itself.
> 

It depends. For instance, if you work with files, write to the console, use system functions, use the Win32 API or DFL, there can be overhead.
June 08, 2010
Re: Wide characters support in D
Hello Ruslan,

> --- On Tue, 6/8/10, Jesse Phillips <jessekphillips+D@gmail.com> wrote:
> 
>> I think you really need to look more into what templates are and do.
>> 
> As I said, for libraries you need to compile every commonly used
> instance, so that the user is not burdened with this overhead.

You only need to do that where you are shipping closed source and for that, 
it should be trivial to get the compiler to generate all three versions. 

>> There is also going to be very little performance gain from using the
>> "system type" for strings, considering that most of the work is not
>> likely to be in the system calls you mentioned, but within D itself.
>
> It depends. For instance, if you work with files, write to the console,
> use system functions, use the Win32 API or DFL, there can be overhead.

You're right: it depends. In the few cases I can think of where more of the 
D code will be interacting with non-D code than just processing the text, 
you could almost use void[] as your type. Where would you care about the 
encoding but not do much with it?

Also, unless you have large amounts of text, you are going to have to work 
hard to get perf problems. If you do have large amounts of text, you are 
going to be I/O bound (cache misses etc.), and at that point the cost of 
any operation is its I/O. From that, reading in some data, doing a single 
pass of processing on it, and writing it back out would only take about 
2/3 longer with translations on both sides.

-- 
... <IXOYE><
June 08, 2010
Re: Wide characters support in D
> You only need to do that where you are shipping closed source and for
> that, it should be trivial to get the compiler to generate all three
> versions.

You will also need to do it in open source projects if you want the generated template code to go into the dynamic library rather than the user's program (read: an unnecessary space "burden" where code is repeated over and over across user programs).

But, yes, closed-source programs are a good particular example. True, you can compile all 3 versions. But the whole argument was about additional generated code, which someone claimed would not happen.


> 
> You're right: it depends. In the few cases I can think of where more of
> the D code will be interacting with non-D code than just processing the
> text, you could almost use void[] as your type. Where would you care
> about the encoding but not do much with it?
>
> Also, unless you have large amounts of text, you are going to have to
> work hard to get perf problems. If you do have large amounts of text,
> you are going to be I/O bound (cache misses etc.), and at that point
> the cost of any operation is its I/O. From that, reading in some data,
> doing a single pass of processing on it, and writing it back out would
> only take about 2/3 longer with translations on both sides.
> 

True. But even simple string handling is faster for UTF-16. The time required to read 2 bytes from a UTF-16 string is the same as for 1 byte from UTF-8. Generally, we have to read one code unit at a time (no more than that), since the data is guaranteed to be aligned on a 2-byte boundary for wchar and a 1-byte boundary for char. Not to mention that converting code units to code points takes less time in UTF-16. And why not use this opportunity if the system already natively supports it?
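
To make the "exception vs. rule" argument concrete, here is a rough decoding sketch (simplified and unvalidated; the UTF-8 side only shows the 2-byte case):

// UTF-16: one load plus one rarely-taken branch per code point.
dchar decodeUtf16(const(wchar)[] s, ref size_t i)
{
    wchar w = s[i++];
    if (w < 0xD800 || w > 0xDBFF)  // common case: not a lead surrogate
        return w;
    wchar lo = s[i++];             // rare case: surrogate pair
    return 0x10000 + ((w - 0xD800) << 10) + (lo - 0xDC00);
}

// UTF-8: branches on every non-ASCII lead byte.
dchar decodeUtf8(const(char)[] s, ref size_t i)
{
    char b = s[i++];
    if (b < 0x80)                  // 1 byte: ASCII only
        return b;
    if ((b & 0xE0) == 0xC0)        // 2 bytes (é, ß, Cyrillic, ...)
        return ((b & 0x1F) << 6) | (s[i++] & 0x3F);
    assert(0, "3- and 4-byte cases omitted in this sketch");
}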


In addition, I want to mention that reading/writing a file in text mode is very transparent. For instance, on Windows, the conversion from multibyte to Unicode will happen automatically for open, fopen, etc. when text mode is specified. In general, this is good practice, since 1-byte-char text is not necessarily UTF-8 anyway and can be ANSI as well.

Also, some other OSes use 2-byte UTF-16 natively, so it's not just Windows. If I am not wrong, Symbian is one such example.
June 08, 2010
Re: Wide characters support in D
"Ruslan Nikolaev" <nruslan_devel@yahoo.com> wrote in message 
news:mailman.124.1275963971.24349.digitalmars-d@puremagic.com...

Nick wrote:
> It only generates code for the types that are actually needed. If, for
> instance, your program never uses anything except UTF-8, then only one
> version of the function will be made - the UTF-8 version. If you don't
> use every char type, then it doesn't generate it for every char type -
> just the ones you choose to use.

>Not quite right. If we create system dynamic libraries, or dynamic 
>libraries that are commonly used, we will have to compile every instance 
>unless we want to burden the user with this. Otherwise, the same code 
>will be duplicated in users' programs over and over again.<

That's a rather minor issue. I think you're overestimating the amount of 
bloat that occurs from having one string type versus three string types. 
The absolute worst-case scenario would be a library that contains nothing 
but text-processing functions. That would triple in size, but what's the 
biggest such lib you've ever seen anyway? And for most libs, only a 
fraction is going to be taken up by text processing, so the difference 
won't be particularly large. In fact, the difference would likely be 
dwarfed by the bloat incurred from all the other templated code (i.e. code 
largely unaffected by the number of string types), and yes, *that* can get 
to be a problem, but it's an entirely separate one.

> That's not good. First of all, UTF-16 is a lousy encoding, it combines
> the worst of both UTF-8 and UTF-32: It's multibyte and non-word-aligned
> like UTF-8, but it still wastes a lot of space like UTF-32. So even if
> your OS uses it natively, it's still best to do most internal processing
> in either UTF-8 or UTF-32. (And with templated string functions, if the
> programmer actually does want to use the native type in the *rare* cases
> where he's making enough OS calls that it would actually matter, he can
> still do so.)
>
>First of all, UTF-16 is not a lousy encoding. It requires 2 bytes for 
>most characters (not much waste, especially if you consider other 
>languages). Only for REALLY rare characters do you need 4 bytes. UTF-8, 
>by contrast, requires 1 to 3 bytes for the same common characters, and 
>also 4 bytes for the REALLY rare ones. In UTF-16 the surrogate pair is an 
>exception, whereas in UTF-8 the multi-byte sequence is the rule (when 
>something is an exception, it won't affect performance in most cases; 
>when something is the rule, it will).<

Maybe "lousy" is too strong a word, but aside from compatibility with other 
libs/software that use it (which I'll address separately), UTF-16 is not 
particularly useful compared to UTF-8 and UTF-32:

Non-latin-alphabet language: UTF-8 vs UTF-16:

The real-world difference in sizes is minimal. But UTF-8 has some advantages: 
The nature of the encoding makes backwards-scanning cheaper and easier. 
Also, as Walter said, bugs in the handling of multi-code-unit characters 
become fairly obvious. Advantages of UTF-16: None.

Latin-alphabet language: UTF-8 vs UTF-16:

All the same UTF-8 advantages for non-latin-alphabet languages still apply, 
plus there's a space savings: Under UTF-8, *most* characters are going to be 
1 byte. Yes, there will be the occasional 2+ byte character, but they're so 
much less common that the overhead compared to ASCII (I'm only using ASCII 
as a baseline here, for the sake of comparisons) would only be around 0% to 
15% depending on the language. UTF-16, however, has a consistent 100% 
overhead (slightly more when you count surrogate pairs, but I'll just leave 
it at 100%). So, depending on language, UTF-16 would be around 70%-100% 
larger than UTF-8. That's not insignificant.
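
(To put rough numbers on that, one can just measure the encoded sizes in D - a tiny example with a made-up sample string:)

import std.stdio;

void main()
{
    // A mostly-latin sample with a couple of accented characters:
    writefln("UTF-8:  %s bytes", "naïve café".length);                 // 12
    writefln("UTF-16: %s bytes", "naïve café"w.length * wchar.sizeof); // 20
    writefln("UTF-32: %s bytes", "naïve café"d.length * dchar.sizeof); // 40
}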

Any language: UTF-32 vs UTF-16:

Using UTF-32 takes up extra space, but when that matters, UTF-8 already has 
the advantage over UTF-16 anyway, regardless of whether or not UTF-8 is 
providing a space savings (see above), so the question of UTF-32 vs UTF-16 
becomes moot. The rest of the time, UTF-32 has these advantages: Guaranteed 
one code unit per character. And the code-unit size is faster on typical 
CPUs (which generally handle 32-bit values faster than 8- or 16-bit ones). 
Advantages of UTF-16: None.

So compatibility with certain tools/libs is really the only reason ever to 
choose UTF-16.

>Finally, UTF-16 is used by a variety of systems/tools: Windows, Java, C#, 
>Qt and many others. The developers of these systems chose UTF-16 even 
>though some of them (e.g. Java, C#, Qt) were developed in the era of 
>UTF-8.<

First of all, it's not exactly unheard of for big projects to make a 
sub-optimal decision.

Secondly, Java and Windows adopted 16-bit encodings back when many people 
were still under the mistaken impression that it would allow them to hold 
any character in one code unit. If that had been true, then it would indeed 
have had at least certain advantages over UTF-8. But by the time the 
programming world at large knew better, it was too late for Java or Windows 
to re-evaluate the decision; they'd already jumped in with both feet. C# and 
.NET use UTF-16 because Windows does. I don't know about Qt, but judging by 
how long Wikipedia says it's been around, I'd guess it's probably the same 
story.

As for choosing to use UTF-16 because of interfacing with other tools and 
libs that use it: That's certainly a good reason to use UTF-16. But it's 
about the only reason. And it's a big mistake to just assume that the 
overhead of converting to/from UTF-16 when crossing those API borders is 
always going to outweigh all other concerns:

For instance, if you're writing an app that does a large amount of 
text-processing on relatively small amounts of text and only deals a little 
bit with a UTF-16 API, then the overhead of operating on 16 bits at a time 
can easily outweigh the overhead from the UTF-16 <-> UTF-32 conversions.

Or, maybe the app you're writing is more memory-limited than speed-limited.

There are perfectly legitimate reasons to want to use an encoding other than 
the OS-native. Why force those people to circumvent the type system to do 
it? Especially in a language that's intended to be usable as a systems 
language. Just to potentially save a couple megs on some .dll or .so?
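
(A sketch of what paying for conversion only at the border can look like, assuming the Phobos helper std.utf.toUTF16z and the Win32 MessageBoxW binding; notify is a made-up example function:)

version (Windows)
{
    import std.utf : toUTF16z;
    import std.c.windows.windows; // Win32 bindings shipped with DMD

    // msg stays UTF-8 everywhere else in the program; the UTF-16
    // conversion happens only at this one OS call.
    void notify(string msg)
    {
        MessageBoxW(null, toUTF16z(msg), "Notice", MB_OK);
    }
}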

> Secondly, the programmer *should* be able to use whatever type he
> decides is appropriate. If he wants to stick with native, he can do
>Why? He/she can just convert to UTF-32 (dchar) whenever a better 
>understanding of the characters is needed. At least, that's what should be 
>done anyway.<

Weren't you saying that the main point of having just one string type (the 
OS-native string) was to avoid unnecessary conversions? But now you're 
arguing that it's fine to do unnecessary conversions and to have multiple 
string types?

>
> You can have that easily:
>
> version(Windows)
>     alias wstring tstring;
> else
>     alias string tstring;
>
>See, that's my point. Nobody is going to do this unless it is standardized 
>by the language. Everybody will stick to something particular (either char 
>or wchar).<

True enough. I don't have anything against having something like that in the 
std library as long as the others are still available too. Could be useful 
in a few cases. I do think having it *instead* of the three types is far too 
presumptuous, though.
June 08, 2010
Re: Wide characters support in D
"Ruslan Nikolaev" <nruslan_devel@yahoo.com> wrote in message 
news:mailman.127.1275974825.24349.digitalmars-d@puremagic.com...
>
> True. But even simple string handling is faster for UTF-16. The time
> required to read 2 bytes from a UTF-16 string is the same as for 1 byte
> from UTF-8. Generally, we have to read one code unit at a time (no more
> than that), since the data is guaranteed to be aligned on a 2-byte
> boundary for wchar and a 1-byte boundary for char. Not to mention that
> converting code units to code points takes less time in UTF-16. And why
> not use this opportunity if the system already natively supports it?
>

Why do you say that UTF-16 is faster than UTF-8?

>In general, this is good practice, since 1-byte-char text is not 
>necessarily UTF-8 anyway and can be ANSI as well.
>

That's what the BOM is for.
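
(For what it's worth, a minimal BOM-sniffing sketch; note a BOM is optional, so its absence proves nothing and such files still need out-of-band knowledge:)

enum Encoding { unknown, utf8, utf16le, utf16be }

Encoding sniffBom(const(ubyte)[] data)
{
    if (data.length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return Encoding.utf8;
    if (data.length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return Encoding.utf16le;
    if (data.length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return Encoding.utf16be;
    return Encoding.unknown; // could be UTF-8 without BOM, ANSI, ...
}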
June 08, 2010
Re: Wide characters support in D
Walter Bright Wrote:

> The problem with wchar's is that everyone forgets about surrogate
> pairs. Most UTF-16 programs in the wild, including nearly all Java
> programs, are broken with regard to surrogate pairs.

I'm afraid it will be pretty hard to show the bug. I don't know whether Java is particularly nasty here, but for C code it will be hard.
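
For reference, the kind of breakage Walter means is easy to sketch in D itself (a small example using one astral-plane character):

void main()
{
    wstring s = "G clef: \U0001D11E"w; // U+1D11E MUSICAL SYMBOL G CLEF

    size_t units, chars;
    foreach (wchar w; s)  // the buggy pattern: code unit != character
        ++units;
    foreach (dchar c; s)  // correct: surrogate pairs decode to one dchar
        ++chars;

    assert(units == 10);  // the clef occupies two UTF-16 code units...
    assert(chars == 9);   // ...but is only one code point
}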
June 08, 2010
Re: Wide characters support in D
> 
> Maybe "lousy" is too strong a word, but aside from
> compatibility with other 
> libs/software that use it (which I'll address separately),
> UTF-16 is not 
> particularly useful compared to UTF-8 and UTF-32:
...
> 

I tried to avoid commenting on this because I am afraid we'll stray from the main point (which is not a discussion about which Unicode encoding is better). But in short I would say: "Not quite right." UTF-16, as already mentioned, is generally faster for non-Latin letters (reading 2 bytes of aligned data takes the same time as reading 1 byte). Although I am not familiar with Asian languages, I believe that UTF-16 requires just 2 bytes instead of 3 for most of their symbols. That is one of the reasons they don't like UTF-8. UTF-32 doesn't have any advantage except for being fixed-length. It has a lot of unnecessary memory, cache, etc. overhead (the worst-case scenario of both UTF-8 and UTF-16), which is not justified for any language.

> 
> First of all, it's not exactly unheard of for big projects
> to make a 
> sub-optimal decision.

I would say the decision was quite optimal for many reasons, including that "lousy programming" will not cause as many problems as in the case of UTF-8.

> 
> Secondly, Java and Windows adopted 16-bit encodings back when many
> people were still under the mistaken impression that it would allow
> them to hold any character in one code unit. If that had been true,
> then it

I doubt that was the only reason. UTF-8 was already available before Windows NT was released. It would have been much easier to use UTF-8 instead of ANSI, as opposed to creating a parallel API. Nonetheless, UTF-16 was chosen. In addition, C# was released after UTF-16 had already become a variable-length encoding. I doubt that conversion overhead (which is small compared to the VM) was the main reason to preserve UTF-16.


Concerning why I say that it's good to have conversion to UTF-32 (you asked somewhere):

I think you did not understand correctly what I meant. It is a very common practice, and in fact required, to convert from both UTF-8 and UTF-16 to UTF-32 when you need to do character analysis (e.g. mbtowc() in C). In fact, that is the only place where UTF-32 is commonly used and useful.
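
(The D analogue of that mbtowc() pattern, as a minimal sketch using std.utf.decode:)

import std.utf : decode;

void analyze(string s) // overloads exist for wstring too
{
    size_t i = 0;
    while (i < s.length)
    {
        dchar c = decode(s, i); // i advances past 1 to 4 code units
        // ... character-level analysis on c goes here ...
    }
}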