April 06, 2006
Selectable encodings
I know of three ways to support a user-selected char encoding in a library, 
but each has its drawbacks.

1) Method overloading
Introduces conflicts with string literals (forcing a c/w/d suffix to be 
used) and you can't overload by return type.

2) Parameterising all types that use strings
Making every class a template just to get this functionality seems over the 
top.
class SomeClassT(TChar) {
    TChar[] getSomeString() { return null; }
}
alias SomeClassT!(char) SomeClass; // in library module
alias SomeClassT!(wchar) SomeClass; // in user module

3) A compiler version condition with aliases.
The version condition approach is the most attractive to me, but some people 
aren't fond of it.
version (utf8) alias char mlchar;
else version (utf16) alias wchar mlchar;
else version (utf32) alias dchar mlchar;
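For example (a sketch only -- the module name and -version identifiers are made up), the library is written once against the alias and the user picks the encoding on the compiler command line:

```d
// mylib.d -- hypothetical library module
version (utf8)       alias char  mlchar;
else version (utf16) alias wchar mlchar;
else version (utf32) alias dchar mlchar;

// Library code mentions only the alias
mlchar[] greeting() {
    return "hello";   // literals infer the aliased type, no c/w/d suffix
}
```

Then building with `dmd -version=utf16 app.d mylib.d` makes every string in the library wchar[], with no changes to user code.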

There's a fourth way - encoding conversion, but there's a runtime cost.

So does anyone use an alternative way to enable users to select which char 
encoding they want to use at compile time?
April 06, 2006
Re: Selectable encodings
In article <e12j34$2gi2$1@digitaldaemon.com>, John C says...
>
>version (utf8) alias char mlchar;

Apologies for going off at a tangent to your question, but I've never quite
understood what D thinks it's doing here. If char[] is an array of characters,
then it can't be a UTF-8 string, because UTF-8 is a variable-length encoding. So
is char[] an array of characters from some other charset (e.g. the subset of
UTF-8 representable in one byte), or is it an array of bytes encoding a UTF-8
string (in which case I suspect quite a lot of string-handling code is badly
broken)?

cheers
Mike
April 06, 2006
Re: Selectable encodings
Mike Capp skrev:
> In article <e12j34$2gi2$1@digitaldaemon.com>, John C says...
>> version (utf8) alias char mlchar;
> 
> Apologies for going off at a tangent to your question, but I've never quite
> understood what D thinks it's doing here. If char[] is an array of characters,
> then it can't be a UTF-8 string, because UTF-8 is a variable-length encoding. So
> is char[] an array of characters from some other charset (e.g. the subset of
> UTF-8 representable in one byte), or is it an array of bytes encoding a UTF-8
> string (in which case I suspect quite a lot of string-handling code is badly
> broken)?

It is the latter. But I don't think much of the string handling code is 
broken because of that.
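It's easy to see with a non-ASCII literal (a sketch in 2006-era D1 syntax): .length counts UTF-8 code units, not characters:

```d
import std.utf;

void main() {
    char[] s = "å";           // U+00E5 encodes as two UTF-8 bytes
    assert(s.length == 2);    // .length counts code units (bytes)
    dchar[] d = toUTF32(s);
    assert(d.length == 1);    // one code point
}
```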

/Oskar
April 06, 2006
Re: Selectable encodings
Oskar Linde wrote:
> Mike Capp skrev:
> 
>> In article <e12j34$2gi2$1@digitaldaemon.com>, John C says...
>>
>>> version (utf8) alias char mlchar;
>>
>>
>> Apologies for going off at a tangent to your question, but I've never 
>> quite
>> understood what D thinks it's doing here. If char[] is an array of 
>> characters,
>> then it can't be a UTF-8 string, because UTF-8 is a variable-length 
>> encoding. So
>> is char[] an array of characters from some other charset (e.g. the 
>> subset of
>> UTF-8 representable in one byte), or is it an array of bytes encoding 
>> a UTF-8
>> string (in which case I suspect quite a lot of string-handling code is 
>> badly
>> broken)?
> 
> 
> It is the latter. But I don't think much of the string handling code is 
> broken because of that.
> 
> /Oskar

The char type is really a misnomer for dealing with UTF-8 encoded 
strings.  It should be named closer to "code-unit for UTF-8 encoding". 
For my own research language I've chosen what I believe to be a nice 
type naming system:

    char            - 32-bit Unicode code point

    u8cu            - UTF-8 code unit
    u16cu           - UTF-16 code unit
    u32cu           - UTF-32 code unit

I could be wrong (and I bet I am) on the terminology used to describe 
char, but I really mean it to just store a full Unicode character such 
that strings of chars can safely assume character index == array index.
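In D terms that char is dchar, and a dchar[] already gives you the index property -- for code points, at least; combining characters can still make one visible character span several of them. A sketch in D1 syntax:

```d
void main() {
    dchar[] s = "naïve";   // one array element per code point
    assert(s.length == 5);
    assert(s[2] == 'ï');   // character index == array index
}
```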

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/MU/S d-pu s:+ a-->? C++++$ UL+++ P--- L+++ !E W-- N++ o? K? w--- O 
M--@ V? PS PE Y+ PGP- t+ 5 X+ !R tv-->!tv b- DI++(+) D++ G e++>e 
h>--->++ r+++ y+++
------END GEEK CODE BLOCK------

James Dunne
April 06, 2006
Re: Selectable encodings
James Dunne wrote:

> The char type is really a misnomer for dealing with UTF-8 encoded 
> strings.  It should be named closer to "code-unit for UTF-8 encoding". 

Yeah, but it does hold an *ASCII* character ?

Usually the D code handles char[] with dchar,
but with a "short path" for ASCII characters...
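A sketch of that pattern (untested, and assuming std.utf.decode's advance-the-index signature): bytes below 0x80 are whole characters on their own, so only non-ASCII input pays for decoding:

```d
import std.utf;

// Iterate a char[] as dchars, with a short path for ASCII bytes
void eachChar(char[] s, void delegate(dchar c) dg) {
    size_t i = 0;
    while (i < s.length) {
        if (s[i] < 0x80) {
            dg(s[i]);       // ASCII: the byte *is* the code point
            i++;
        } else {
            dg(std.utf.decode(s, i));  // decode advances i past the sequence
        }
    }
}
```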

> I could be wrong (and I bet I am) on the terminology used to describe
> char, but I really mean it to just store a full Unicode character
> such that strings of chars can safely assume character index == array
> index.

For the general case, UTF-32 is a pretty wasteful
Unicode encoding just to have that privilege ?

See http://www.unicode.org/faq/utf_bom.html#12

--anders
April 06, 2006
Ceci n'est pas une char (was: Re: Selectable encodings)
(Changing subject line since we seem to have rudely hijacked the OP's topic)

In article <e13b56$is0$1@digitaldaemon.com>,
Anders F Björklund says...
>
>James Dunne wrote:
>
>> The char type is really a misnomer for dealing with UTF-8 encoded 
>> strings.  It should be named closer to "code-unit for UTF-8 encoding". 

(I fully agree with this statement, by the way.)

>Yeah, but it does hold an *ASCII* character ?

I don't find that very helpful - seeing a char[] in code doesn't tell me
anything about whether it's byte-per-character ASCII or possibly-multibyte
UTF-8.

>For the general case, UTF-32 is a pretty wasteful
>Unicode encoding just to have that privilege ?

I'm not sure there is a "general case", so it's hard to say. Some programmers
have to deal with MBCS every day; others can go for years without ever having to
worry about anything but vanilla ASCII.

"Wasteful" is also relative. UTF-32 is certainly wasteful of memory space, but
UTF-8 is potentially far more wasteful of CPU cycles and memory bandwidth.
Finding the millionth character in a UTF-8 string means looping through at least
a million bytes, and executing some conditional logic for each one. Finding the
millionth character in a UTF-32 string is a simple pointer offset and one-word
fetch.

At the risk of repeating James, I do think that spelling "string" as
"char[]"/"wchar[]" is grossly misleading, particularly to people coming from any
other C-family language. If I was doing any serious string-handling work in D
I'd almost certainly write an opaque String class that overloaded opIndex
(returning dchar) to do the right thing, and optimised the underlying storage to
suit the app's requirements.
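A minimal sketch of that String idea (D1 syntax, untested; a real class would pick its storage per application, and would probably cache the last index to avoid rescanning from the start every time):

```d
class String {
    private char[] data;    // storage choice is an implementation detail

    this(char[] s) { data = s; }

    // opIndex always hands back a full code point
    dchar opIndex(size_t i) {
        size_t k = 0;
        foreach (dchar c; data) {
            if (k++ == i)
                return c;
        }
        throw new Exception("String index out of range");
    }
}
```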

cheers
Mike
April 06, 2006
Re: Ceci n'est pas une char
Mike Capp wrote:
> (Changing subject line since we seem to have rudely hijacked the OP's
> topic)
> 
> In article <e13b56$is0$1@digitaldaemon.com>, 
> Anders F Björklund says...
> 
>> James Dunne wrote:
>> 
>>> The char type is really a misnomer for dealing with UTF-8 encoded
>>> strings.  It should be named closer to "code-unit for UTF-8
>>> encoding".
> 
> (I fully agree with this statement, by the way.)

Yes. And it's a _gross_ misnomer.

And we who are used to D can't even _begin_ to appreciate the 
[unnecessary!] extra work and effort needed to gradually come to 
understand it "our way", for those new to D.

>> Yeah, but it does hold an *ASCII* character ?
> 
> I don't find that very helpful - seeing a char[] in code doesn't tell
> me anything about whether it's byte-per-character ASCII or
> possibly-multibyte UTF-8.

(( A dumb idea: the input stream has a flag that gets set as soon as the 
first non-ASCII character is found. ))

>> For the general case, UTF-32 is a pretty wasteful Unicode encoding
>> just to have that privilege ?
> 
> I'm not sure there is a "general case", so it's hard to say. Some
> programmers have to deal with MBCS every day; others can go for years
> without ever having to worry about anything but vanilla ASCII.

True!! Folks in Boise, Idaho, vs. folks in the non-British Europe or the 
Far East.

> "Wasteful" is also relative. UTF-32 is certainly wasteful of memory
> space, but UTF-8 is potentially far more wasteful of CPU cycles and
> memory bandwidth.

It sure looks like it. Then again, studying the UTF-8 spec, and "why we 
did it this way" (sorry, no URL here. Anybody?), shows that it actually 
is _amazingly_ light on CPU cycles! Really.

(( I sure wish there was somebody in this NG who could write a 
Scientifically Valid test to compare the time needed to find the 
millionth character in UTF-8 vs. UTF-8 first converted to UTF-32. ))
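The shape such a test might take (a rough sketch, nowhere near Scientifically Valid -- timing via std.c.time.clock, on a made-up mixed ASCII/non-ASCII string):

```d
import std.c.time;
import std.utf;

// n-th code point by scanning the UTF-8 directly
dchar nthByScan(char[] s, size_t n) {
    foreach (dchar c; s) {
        if (n-- == 0)
            return c;
    }
    assert(0);
    return 0; // not reached
}

void main() {
    char[] s;
    for (int i = 0; i < 1_000_000; i++)
        s ~= "aå";              // mix of 1- and 2-byte sequences

    clock_t t0 = clock();
    dchar a = nthByScan(s, 1_000_000);
    clock_t t1 = clock();

    dchar[] d = toUTF32(s);     // the conversion also visits every byte
    dchar b = d[1_000_000];
    clock_t t2 = clock();
    // Compare (t1 - t0) with (t2 - t1): both are O(n), but the
    // conversion also pays for an allocation and a 4x-sized write.
}
```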

> Finding the millionth character in a UTF-8 string
> means looping through at least a million bytes, and executing some
> conditional logic for each one. Finding the millionth character in a
> UTF-32 string is a simple pointer offset and one-word fetch.

True. And even if we'd exclude any "character width logic" in the 
search, we still end up with sequential lookup O(n) vs. O(1).

Then again, when's the last time anyone here had to find the millionth 
character of anything?  :-)

So, of course for library writers, this appears as most relevant, but 
for real world programming tasks, I think after profiling, the time 
wasted may be minor, in practice.

(Ah, and of course, turning a UTF-8 input into UTF-32 and then straight 
shooting the millionth character, is way more expensive (both in time 
and size) than just a loop through the UTF-8 as such. Not to mention the 
losses if one were, instead, to have a million-character file on hard 
disk in UTF-32 (i.e. a 4MB file) to avoid the look-through. Probably the 
time reading in the file gets so much longer that this in itself defeats 
the "gain".)

> At the risk of repeating James, I do think that spelling "string" as 
> "char[]"/"wchar[]" is grossly misleading, particularly to people
> coming from any other C-family language.

No argument here. :-)

In the midst of The Great Character Width Brouhaha (about November last 
year), I tried to convince Walter on this particular issue.
April 06, 2006
Re: Ceci n'est pas une char
Mike Capp wrote:
> (Changing subject line since we seem to have rudely hijacked the OP's topic)
> 
> In article <e13b56$is0$1@digitaldaemon.com>,
> Anders F Björklund says...
>> James Dunne wrote:
>>
>>> The char type is really a misnomer for dealing with UTF-8 encoded 
>>> strings.  It should be named closer to "code-unit for UTF-8 encoding". 
> 
> (I fully agree with this statement, by the way.)
> 
>> Yeah, but it does hold an *ASCII* character ?
> 
> I don't find that very helpful - seeing a char[] in code doesn't tell me
> anything about whether it's byte-per-character ASCII or possibly-multibyte
> UTF-8.

Since UTF-8 is compatible with ASCII, might it not be reasonable to 
assume char strings are always UTF-8?  I'll admit this suggests many of 
the D string functions are broken, but they can certainly be fixed. 
I've been considering rewriting find and rfind to support multibyte 
strings.  Fixing find is pretty straightforward, though rfind might be a 
tad messy.  As a related question, can anyone verify whether 
std.utf.stride will return a correct result for evaluating an arbitrary 
offset in all potential input strings?
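For the find/rfind work, a stride-based walk might look like this (untested sketch). As far as I can tell, stride only inspects the lead byte at the given offset, so it should answer correctly for any valid UTF-8 as long as it's called at a sequence boundary -- handed an offset in the middle of a multibyte sequence, it has no way to give a correct result:

```d
import std.utf;

// Byte offset of the n-th code point in a valid UTF-8 string
size_t byteOffsetOf(char[] s, size_t n) {
    size_t i = 0;
    while (n--)
        i += std.utf.stride(s, i);  // length of the sequence starting at i
    return i;
}
```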

>> For the general case, UTF-32 is a pretty wasteful
>> Unicode encoding just to have that privilege ?
> 
> I'm not sure there is a "general case", so it's hard to say. Some programmers
> have to deal with MBCS every day; others can go for years without ever having to
> worry about anything but vanilla ASCII.
> 
> "Wasteful" is also relative. UTF-32 is certainly wasteful of memory space, but
> UTF-8 is potentially far more wasteful of CPU cycles and memory bandwidth.
> Finding the millionth character in a UTF-8 string means looping through at least
> a million bytes, and executing some conditional logic for each one. Finding the
> millionth character in a UTF-32 string is a simple pointer offset and one-word
> fetch.

For what it's worth, I believe the correct behavior for string/array 
operations is to provide overloads for char[] and wchar[] that require 
input to be valid UTF-8 and UTF-16, respectively.  If the user knows 
their data is pure ASCII or they otherwise want to process it as a 
fixed-width string they can cast to ubyte[] or ushort[].  This is what 
I'm planning for std.array in Ares.
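In other words, something like this (a sketch; countChars is a made-up name):

```d
// UTF-aware overload: input must be valid UTF-8
size_t countChars(char[] s) {
    size_t n = 0;
    foreach (dchar c; s)
        n++;
    return n;
}

// Fixed-width escape hatch: one element per character
size_t countChars(ubyte[] s) {
    return s.length;
}

void main() {
    char[] s = "hëllo";
    size_t a = countChars(s);                // decodes: 5 characters
    size_t b = countChars(cast(ubyte[]) s);  // raw bytes: 6
}
```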


Sean
April 06, 2006
Re: Ceci n'est pas une char
Mike Capp wrote:

> At the risk of repeating James, I do think that spelling "string" as
> "char[]"/"wchar[]" is grossly misleading, particularly to people coming from any
> other C-family language. If I was doing any serious string-handling work in D
> I'd almost certainly write an opaque String class that overloaded opIndex
> (returning dchar) to do the right thing, and optimised the underlying storage to
> suit the app's requirements.

I'm not sure that C guys would miss a string class (after all, char[]
is a lot better than the raw "undefined" char* they used to be using...)
but I do see how having an easy String class around is useful sometimes.

I even wrote a simple one myself, based on something Java-like:
http://www.algonet.se/~afb/d/dcaf/html/class_string.html
http://www.algonet.se/~afb/d/dcaf/html/class_string_buffer.html


But for wxD we use a simple char[] alias for strings, works just fine...
If the backend uses UTF-16, it will convert them at runtime when needed.
(wxWidgets can be built in an "ASCII"/UTF-8 mode, or a "Unicode"/UTF-16 mode)
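So the conversion lives only at the API edge, something like (sketch; the backend function name is made up):

```d
import std.utf;

alias char[] wxString;   // the app works in UTF-8 throughout

void setTitleNative(wchar[] title) { /* native UTF-16 call */ }

void setTitle(wxString title) {
    setTitleNative(toUTF16(title));  // convert only when crossing over
}
```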

Then again it only does the occasional window title or dialog string etc

--anders
April 06, 2006
Re: Ceci n'est pas une char
Georg Wrede wrote:
> (( I sure wish there was somebody in this NG who could write a
> Scientifically Valid test to compare the time needed to find the
> millionth character in UTF-8 vs. UTF-8 first converted to UTF-32. ))

It's O(n) vs O(n). :) You have to go through all the bytes in both
cases. I guess the conversion has a higher coefficient.

> So, of course for library writers, this appears as most relevant, but
> for real world programming tasks, I think after profiling, the time
> wasted may be minor, in practice.

Why not use the same encoding throughout the whole program and its
libraries? No need to convert anywhere.

> (Ah, and of course, turning a UTF-8 input into UTF-32 and then straight
> shooting the millionth character, is way more expensive (both in time
> and size) than just a loop through the UTF-8 as such. Not to mention the
> losses if one were, instead, to have a million-character file on hard
> disk in UTF-32 (i.e. a 4MB file) to avoid the look-through. Probably the
> time reading in the file gets so much longer that this in itself defeats
> the "gain".)

That's very true. A "normal" hard drive reads 60 MB/s. So, reading a 4
MB file takes at least 66 ms and a 1 MB UTF-8-file (only
ASCII-characters) is read in 17 ms (well, I'm a bit optimistic here :).
A modern processor executes 3 000 000 000 operations in a second. Going
through the UTF-8 stream takes 1 000 000 * 10 (perhaps?) operations and
thus costs 3 ms. So it's actually faster to read UTF-8.

-- 
Jari-Matti