Selectable encodings
April 06, 2006
I know of three ways to support a user-selected char encoding in a library, but each has its drawbacks.

1) Method overloading
Introduces conflicts with string literals (forcing a c/w/d suffix to be
used) and you can't overload by return type.
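For example (the function name is just for illustration):

void put(char[] s) {}
void put(wchar[] s) {}

void main() {
    // put("hello");  // error: ambiguous - the literal matches both overloads
    put("hello"c);    // the c suffix selects the char[] overload
    put("hello"w);    // the w suffix selects the wchar[] overload
}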

2) Parameterising all types that use strings
Making every class a template just to get this functionality seems over the
top.
class SomeClassT(TChar) {
    TChar[] getSomeString() { return null; }
}
alias SomeClassT!(char) SomeClass; // in library module
alias SomeClassT!(wchar) SomeClass; // in user module

3) A compiler version condition with aliases.
The version condition approach is the most attractive to me, but some people
aren't fond of it.
version (utf8) alias char mlchar;
else version (utf16) alias wchar mlchar;
else version (utf32) alias dchar mlchar;

There's a fourth way, encoding conversion, but it has a runtime cost.
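In D that would mean converting at the API boundary with std.utf, roughly like this (getSomeString stands in for any library call that returns the library's internal encoding):

import std.utf;

char[] getSomeString() { return "hello"; }  // library's internal encoding

void main() {
    // each call pays an O(n) conversion, but any encoding can be requested
    wchar[] w = toUTF16(getSomeString());
    dchar[] d = toUTF32(getSomeString());
}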

So does anyone use an alternative way to enable users to select which char encoding they want to use at compile time?


April 06, 2006
In article <e12j34$2gi2$1@digitaldaemon.com>, John C says...
>
>version (utf8) alias char mlchar;

Apologies for going off at a tangent to your question, but I've never quite understood what D thinks it's doing here. If char[] is an array of characters, then it can't be a UTF-8 string, because UTF-8 is a variable-length encoding. So is char[] an array of characters from some other charset (e.g. the subset of UTF-8 representable in one byte), or is it an array of bytes encoding a UTF-8 string (in which case I suspect quite a lot of string-handling code is badly broken)?

cheers
Mike


April 06, 2006
Mike Capp wrote:
> In article <e12j34$2gi2$1@digitaldaemon.com>, John C says...
>> version (utf8) alias char mlchar;
> 
> Apologies for going off at a tangent to your question, but I've never quite
> understood what D thinks it's doing here. If char[] is an array of characters,
> then it can't be a UTF-8 string, because UTF-8 is a variable-length encoding. So
> is char[] an array of characters from some other charset (e.g. the subset of
> UTF-8 representable in one byte), or is it an array of bytes encoding a UTF-8
> string (in which case I suspect quite a lot of string-handling code is badly
> broken)?

It is the latter. But I don't think much of the string handling code is broken because of that.

/Oskar
April 06, 2006
Oskar Linde wrote:
> Mike Capp skrev:
> 
>> In article <e12j34$2gi2$1@digitaldaemon.com>, John C says...
>>
>>> version (utf8) alias char mlchar;
>>
>>
>> Apologies for going off at a tangent to your question, but I've never quite
>> understood what D thinks it's doing here. If char[] is an array of characters,
>> then it can't be a UTF-8 string, because UTF-8 is a variable-length encoding. So
>> is char[] an array of characters from some other charset (e.g. the subset of
>> UTF-8 representable in one byte), or is it an array of bytes encoding a UTF-8
>> string (in which case I suspect quite a lot of string-handling code is badly
>> broken)?
> 
> 
> It is the latter. But I don't think much of the string handling code is broken because of that.
> 
> /Oskar

The char type is really a misnomer for dealing with UTF-8 encoded strings.  It should be named closer to "code-unit for UTF-8 encoding". For my own research language I've chosen what I believe to be a nice type naming system:

    char            - 32-bit Unicode code point

    u8cu            - UTF-8 code unit
    u16cu           - UTF-16 code unit
    u32cu           - UTF-32 code unit

I could be wrong (and I bet I am) on the terminology used to describe char, but I really mean it to just store a full Unicode character such that strings of chars can safely assume character index == array index.

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/MU/S d-pu s:+ a-->? C++++$ UL+++ P--- L+++ !E W-- N++ o? K? w--- O M--@ V? PS PE Y+ PGP- t+ 5 X+ !R tv-->!tv b- DI++(+) D++ G e++>e h>--->++ r+++ y+++
------END GEEK CODE BLOCK------

James Dunne
April 06, 2006
James Dunne wrote:

> The char type is really a misnomer for dealing with UTF-8 encoded strings.  It should be named closer to "code-unit for UTF-8 encoding". 

Yeah, but it does hold an *ASCII* character?

Usually the D code handles char[] with dchar,
but with a "short path" for ASCII characters...
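For instance, foreach with a dchar loop variable does that decoding on the fly (a minimal sketch):

void main()
{
    char[] s = "blåbär";  // mixed ASCII and multi-byte characters
    foreach (dchar c; s)
    {
        // c is a whole code point here, never a UTF-8 fragment;
        // pure-ASCII bytes take the cheap single-byte path
    }
}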

> I could be wrong (and I bet I am) on the terminology used to describe
> char, but I really mean it to just store a full Unicode character
> such that strings of chars can safely assume character index == array
> index.

For the general case, UTF-32 is a pretty wasteful
Unicode encoding just to have that privilege?

See http://www.unicode.org/faq/utf_bom.html#12

--anders
April 06, 2006
(Changing subject line since we seem to have rudely hijacked the OP's topic)

In article <e13b56$is0$1@digitaldaemon.com>, Anders F Björklund says...
>
>James Dunne wrote:
>
>> The char type is really a misnomer for dealing with UTF-8 encoded strings.  It should be named closer to "code-unit for UTF-8 encoding".

(I fully agree with this statement, by the way.)

>Yeah, but it does hold an *ASCII* character ?

I don't find that very helpful - seeing a char[] in code doesn't tell me anything about whether it's byte-per-character ASCII or possibly-multibyte UTF-8.

>For the general case, UTF-32 is a pretty wasteful
>Unicode encoding just to have that privilege?

I'm not sure there is a "general case", so it's hard to say. Some programmers have to deal with MBCS every day; others can go for years without ever having to worry about anything but vanilla ASCII.

"Wasteful" is also relative. UTF-32 is certainly wasteful of memory space, but UTF-8 is potentially far more wasteful of CPU cycles and memory bandwidth. Finding the millionth character in a UTF-8 string means looping through at least a million bytes, and executing some conditional logic for each one. Finding the millionth character in a UTF-32 string is a simple pointer offset and one-word fetch.

At the risk of repeating James, I do think that spelling "string" as "char[]"/"wchar[]" is grossly misleading, particularly to people coming from any other C-family language. If I was doing any serious string-handling work in D I'd almost certainly write an opaque String class that overloaded opIndex (returning dchar) to do the right thing, and optimised the underlying storage to suit the app's requirements.
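Something like this, as a minimal sketch (storage kept as UTF-8 here, so opIndex is O(n); a real version would pick the storage to suit the app):

import std.utf;

class String
{
    private char[] data;  // underlying UTF-8 storage

    this(char[] s) { data = s; }

    // index by character (code point), not by byte
    dchar opIndex(size_t n)
    {
        size_t i = 0;
        while (n--)
            i += stride(data, i);  // step over whole UTF-8 sequences
        return decode(data, i);    // decode the code point at i
    }
}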

cheers
Mike


April 06, 2006
Mike Capp wrote:
> (Changing subject line since we seem to have rudely hijacked the OP's
> topic)
> 
> In article <e13b56$is0$1@digitaldaemon.com>, Anders F Björklund says...
> 
>> James Dunne wrote:
>> 
>>> The char type is really a misnomer for dealing with UTF-8 encoded
>>> strings.  It should be named closer to "code-unit for UTF-8
>>> encoding".
> 
> (I fully agree with this statement, by the way.)

Yes. And it's a _gross_ misnomer.

And we who are used to D can't even _begin_ to appreciate the [unnecessary!] extra work and effort it takes someone new to D to gradually come to understand it "our way".

>> Yeah, but it does hold an *ASCII* character ?
> 
> I don't find that very helpful - seeing a char[] in code doesn't tell
> me anything about whether it's byte-per-character ASCII or
> possibly-multibyte UTF-8.

(( A dumb idea: the input stream has a flag that gets set as soon as the first non-ASCII character is found. ))

>> For the general case, UTF-32 is a pretty wasteful Unicode encoding
>> just to have that privilege?
> 
> I'm not sure there is a "general case", so it's hard to say. Some
> programmers have to deal with MBCS every day; others can go for years
> without ever having to worry about anything but vanilla ASCII.

True!! Folks in Boise, Idaho, vs. folks in the non-British Europe or the Far East.

> "Wasteful" is also relative. UTF-32 is certainly wasteful of memory
> space, but UTF-8 is potentially far more wasteful of CPU cycles and
> memory bandwidth.

It sure looks like it. Then again, studying the UTF-8 spec, and "why we did it this way" (sorry, no URL here. Anybody?), shows that it actually is _amazingly_ light on CPU cycles! Really.

(( I sure wish there was somebody in this NG who could write a Scientifically Valid test to compare the time needed to find the millionth character in UTF-8 vs. UTF-8 first converted to UTF-32. ))

> Finding the millionth character in a UTF-8 string
> means looping through at least a million bytes, and executing some
> conditional logic for each one. Finding the millionth character in a
> UTF-32 string is a simple pointer offset and one-word fetch.

True. And even if we'd exclude any "character width logic" in the search, we still end up with sequential lookup O(n) vs. O(1).

Then again, when's the last time anyone here had to find the millionth character of anything?  :-)

So of course this appears most relevant for library writers, but for real-world programming tasks I think profiling would show the time wasted to be minor in practice.

(Ah, and of course, turning a UTF-8 input into UTF-32 and then straight shooting the millionth character, is way more expensive (both in time and size) than just a loop through the UTF-8 as such. Not to mention the losses if one were, instead, to have a million-character file on hard disk in UTF-32 (i.e. a 4MB file) to avoid the look-through. Probably the time reading in the file gets so much longer that this in itself defeats the "gain".)
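Not a Scientifically Valid test, but a rough sketch of the comparison wished for above could look like this (timed with std.date's getUTCtime, which only has millisecond resolution):

import std.date;
import std.stdio;
import std.utf;

void main()
{
    char[] s;
    s.length = 1_000_000;
    s[] = 'a';  // pure ASCII: the most compact case for UTF-8

    d_time t0 = getUTCtime();

    // walk to the millionth character directly in the UTF-8 buffer
    size_t i = 0;
    for (size_t n = 0; n < 999_999; n++)
        i += stride(s, i);
    dchar direct = decode(s, i);

    d_time t1 = getUTCtime();

    // convert the whole string first, then index in O(1)
    dchar[] u = toUTF32(s);
    dchar converted = u[999_999];

    d_time t2 = getUTCtime();

    writefln("walk UTF-8: %s ms, convert then index: %s ms",
             (t1 - t0) * 1000 / TicksPerSecond,
             (t2 - t1) * 1000 / TicksPerSecond);
}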

> At the risk of repeating James, I do think that spelling "string" as "char[]"/"wchar[]" is grossly misleading, particularly to people
> coming from any other C-family language.

No argument here. :-)

In the midst of The Great Character Width Brouhaha (about November last year), I tried to convince Walter on this particular issue.
April 06, 2006
Mike Capp wrote:
> (Changing subject line since we seem to have rudely hijacked the OP's topic)
> 
> In article <e13b56$is0$1@digitaldaemon.com>,
Anders F Björklund says...
>> James Dunne wrote:
>>
>>> The char type is really a misnomer for dealing with UTF-8 encoded strings.  It should be named closer to "code-unit for UTF-8 encoding". 
> 
> (I fully agree with this statement, by the way.)
> 
>> Yeah, but it does hold an *ASCII* character ?
> 
> I don't find that very helpful - seeing a char[] in code doesn't tell me
> anything about whether it's byte-per-character ASCII or possibly-multibyte
> UTF-8.

Since UTF-8 is compatible with ASCII, might it not be reasonable to assume char strings are always UTF-8?  I'll admit this suggests many of the D string functions are broken, but they can certainly be fixed. I've been considering rewriting find and rfind to support multibyte strings.  Fixing find is pretty straightforward, though rfind might be a tad messy.  As a related question, can anyone verify whether std.utf.stride will return a correct result for evaluating an arbitrary offset in all potential input strings?
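A fixed find might look roughly like this (a sketch, not the actual Phobos code; rfind is messier because it has to step backwards over trailing bytes of the form 0b10xxxxxx):

import std.utf;

// return the byte index of the first occurrence of the code point c
// in the UTF-8 string s, or -1 if not found
int find(char[] s, dchar c)
{
    size_t i = 0;
    while (i < s.length)
    {
        size_t start = i;
        if (decode(s, i) == c)  // decode advances i past one code point
            return start;
    }
    return -1;
}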

>> For the general case, UTF-32 is a pretty wasteful
>> Unicode encoding just to have that privilege?
> 
> I'm not sure there is a "general case", so it's hard to say. Some programmers
> have to deal with MBCS every day; others can go for years without ever having to
> worry about anything but vanilla ASCII.
> 
> "Wasteful" is also relative. UTF-32 is certainly wasteful of memory space, but
> UTF-8 is potentially far more wasteful of CPU cycles and memory bandwidth.
> Finding the millionth character in a UTF-8 string means looping through at least
> a million bytes, and executing some conditional logic for each one. Finding the
> millionth character in a UTF-32 string is a simple pointer offset and one-word
> fetch.

For what it's worth, I believe the correct behavior for string/array operations is to provide overloads for char[] and wchar[] that require input to be valid UTF-8 and UTF-16, respectively.  If the user knows their data is pure ASCII or they otherwise want to process it as a fixed-width string they can cast to ubyte[] or ushort[].  This is what I'm planning for std.array in Ares.
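A sketch of the shape that takes (charCount is a made-up name for illustration):

import std.utf;

// counts characters (code points) by stepping over whole UTF-8 sequences
size_t charCount(char[] s)
{
    size_t n, i;
    while (i < s.length) { i += stride(s, i); n++; }
    return n;
}

// same idea for UTF-16, where a surrogate pair counts as one character
size_t charCount(wchar[] s)
{
    size_t n, i;
    while (i < s.length) { i += stride(s, i); n++; }
    return n;
}

// the fixed-width escape hatch: one array element per "character"
size_t charCount(ubyte[] s) { return s.length; }

A caller who knows the data is pure ASCII then just writes charCount(cast(ubyte[]) s).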


Sean
April 06, 2006
Mike Capp wrote:

> At the risk of repeating James, I do think that spelling "string" as
> "char[]"/"wchar[]" is grossly misleading, particularly to people coming from any
> other C-family language. If I was doing any serious string-handling work in D
> I'd almost certainly write a opaque String class that overloaded opIndex
> (returning dchar) to do the right thing, and optimised the underlying storage to
> suit the app's requirements.

I'm not sure that C guys would miss a string class (after all, char[]
is a lot better than the raw "undefined" char* they used to be using...)
but I do see how having an easy String class around is useful sometimes.

I even wrote a simple one myself, based on something Java-like:
http://www.algonet.se/~afb/d/dcaf/html/class_string.html
http://www.algonet.se/~afb/d/dcaf/html/class_string_buffer.html


But for wxD we use a simple char[] alias for strings; it works just fine...
If the backend uses UTF-16, it will convert them at runtime when needed.
(wxWidgets can be built in an "ASCII"/UTF-8 mode or in a "Unicode"/UTF-16 mode)

Then again it only does the occasional window title or dialog string etc

--anders
April 06, 2006
Georg Wrede wrote:
> (( I sure wish there was somebody in this NG who could write a
> Scientifically Valid test to compare the time needed to find the
> millionth character in UTF-8 vs. UTF-8 first converted to UTF-32. ))

It's O(n) vs O(n). :) You have to go through all the bytes in both
cases. I guess the conversion has a higher coefficient.

> So, of course for library writers, this appears as most relevant, but for real world programming tasks, I think after profiling, the time wasted may be minor, in practice.

Why not use the same encoding throughout the whole program and its libraries? No need to convert anywhere.

> (Ah, and of course, turning a UTF-8 input into UTF-32 and then straight shooting the millionth character, is way more expensive (both in time and size) than just a loop through the UTF-8 as such. Not to mention the losses if one were, instead, to have a million-character file on hard disk in UTF-32 (i.e. a 4MB file) to avoid the look-through. Probably the time reading in the file gets so much longer that this in itself defeats the "gain".)

That's very true. A "normal" hard drive reads 60 MB/s, so reading a 4 MB file takes at least 66 ms, while a 1 MB UTF-8 file (ASCII characters only) is read in 17 ms (well, I'm a bit optimistic here :). A modern processor executes 3 000 000 000 operations in a second; going through the UTF-8 stream takes 1 000 000 * 10 (perhaps?) operations and thus costs 3 ms. So reading UTF-8 is actually faster overall.
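Summing up those estimates:

    UTF-32: read 4 MB at 60 MB/s               ~ 67 ms
    UTF-8:  read 1 MB at 60 MB/s               ~ 17 ms
            decode 10^7 ops at 3*10^9 ops/s    ~  3 ms
            total                              ~ 20 ms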

-- 
Jari-Matti