November 24, 2005
YAUST v1.0
YAUST - Yet Another Unified String Theory :)

Well, here's my proposal for cleaning up strings. I tried to

- be as practical as possible
- leave full control over encoding when one wants to have it
- remove any possible confusion as to what each type is
- allow efficiency where possible, without excessive effort

First, the proposed changes are listed, followed by rationale.

==============================

I. drop char and wchar

--

II.

create cchar (1-byte unsigned character of platform-specific encoding, 
C-equivalent)
create utf8  (1 byte of UTF8)
create utf16 (2 bytes of UTF16)
leave dchar as is
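
For illustration only, here's a rough approximation in today's D (plain 
aliases obviously can't give the distinct overloading the proposal needs, 
so this is just to show the intended meaning):

alias ubyte cchar;  // 1 byte in the platform/legacy encoding, C-compatible
alias char  utf8;   // 1 UTF-8 code unit
alias wchar utf16;  // 1 UTF-16 code unit
// dchar stays as it is: 1 UTF-32 code unit, i.e. a whole code point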

--

III.

version(Windows) {
	alias utf16[] string;
} else
version(Unix/Linux) {
	alias utf8[] string;
}

add suffix ""s for explicitly specifying platform-specific encoding 
(i.e. the string type), and make auto type inference default to that 
same type (this applies to the auto keyword, not undecorated strings). 
Add docs explaining that string is just a platform-dependent alias.
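
Hypothetical usage (neither the string alias nor the ""s suffix exist 
today, of course):

auto greeting = "hello"s;  // inferred as utf16[] on Windows, utf8[] on Unix/Linux
string title  = "YAUST";   // the same platform-dependent alias, spelled out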

--

IV.

add the following implicit casts for interoperability

from: cchar[], utf8[], utf16[], dchar[]
to  : cchar*, utf8*, utf16*, dchar*

all of them ensure 0-termination. If cchar is converted to any other 
form, it becomes the appropriate Unicode char. In the reverse direction, 
all unrepresentable characters become '?'. When runtime transcoding 
and/or reallocation is required, make them produce a warning.
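
For example, interfacing to C might hypothetically look like this (puts 
being the usual C function that expects the legacy 1-byte encoding):

extern (C) int puts(cchar* s);

void greet(utf16[] msg)
{
    puts(msg);  // implicit utf16[] -> cchar*: transcodes, replaces
                // unrepresentable chars with '?', appends the 0 (and warns)
}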

--

V.

add the following implicit (transcoding) casts

from: cchar[], utf8[], utf16[], dchar[]
to  : cchar[], utf8[], utf16[], dchar[]

when runtime transcoding is required, make them produce a warning (i.e. 
always, except when casting from T to T).
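
Hypothetically, in code (each line that actually transcodes would draw 
the warning):

void example()
{
    utf8[]  a = "abc";  // stored as UTF-8
    utf16[] b = a;      // implicit UTF-8 -> UTF-16 transcode (warning)
    dchar[] c = b;      // implicit UTF-16 -> UTF-32 transcode (warning)
    utf8[]  d = a;      // same type, no transcode, no warning
}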

--

VI.

modify explicit casts between all the array and pointer types to
- transcode rather than paint
- use '?' for unrepresentable characters (applies to encoding into 
cchar*/cchar[] only)
- not produce the warnings from above
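
Hypothetically:

void example()
{
    utf8[]  src = "abcd";
    utf16[] con = cast(utf16[]) src;               // transcodes, no warning
    utf16[] pun = cast(utf16[]) cast(void[]) src;  // a pure paint stays possible via void[]
}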

--

VII.

create compatibility kit:

module std.srccompatibility.oldchartypes;
// yes, it should be big and ugly

alias utf8 char;
alias utf16 wchar;

--

VIII.

add the following methods to all 4 array types

 utf8[] .asUTF8
utf16[] .asUTF16
dchar[] .asUTF32
cchar[] .asCchars

ubyte[] .asUTF8   (bool dummy) // I think there's no UTF-8 BOM
ubyte[] .asUTF16LE(bool includeBOM)
ubyte[] .asUTF16BE(bool includeBOM)
ubyte[] .asUTF32LE(bool includeBOM)
ubyte[] .asUTF32BE(bool includeBOM)
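
For example (hypothetical, like the rest of this):

dchar[] title = "abc";
ubyte[] wire  = title.asUTF16LE(true);   // UTF-16LE bytes with a BOM prepended
ubyte[] plain = title.asUTF8(false);     // plain UTF-8 bytes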

--

IX.

modify the ~ operator between the 4 types to work as follows:

a) infer the result type from context, as with undecorated strings
b) if calling a function and there are multiple overloads
b.1) if both operand types are known, use that type
b.2) if one is known and the other is an undecorated literal, use the known type
b.3) if neither is known or both are known, but different, bork
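
A hypothetical illustration of those rules:

void show(utf8[] s);
void show(utf16[] s);

void example(utf8[] a, utf16[] b)
{
    show(a ~ "baz");  // b.2: one side is known (utf8[]), the literal adapts -> show(utf8[])
    show(a ~ b);      // b.3: both sides known but different -> error
}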

--

X.

Disallow utf8 and utf16 as a stand-alone var type, only arrays and 
pointers allowed

========================

Point I. removes the confusion of "char" and "wchar" not actually 
representing characters.

Point II. explicitly states that the strings are either UTF-encoded, 
complete characters* or C-compatible characters.

Point III. makes the code

string abc="abc";
someOSFunc(abc);
someOtherOSFunc("qwe"s); // s only necessary if there is more than one 
option

least likely to produce any transcoding.

Point IV. makes it nearly impossible to do the wrong thing and doesn't 
require explicit casts when interfacing to C code, assuming the C 
functions are declared properly (i.e. the correct one of the two 1-byte 
types is used). When used with literals, the 0 can be appended at 
compile time, like it is now.

Point V. makes it easier to use different types without explicit 
casting, but will still produce warnings when transcoding happens. In 
most cases it will be obvious anyway.

Point VI. breaks behavior of other array casts (which only paint), but 
strings are getting special behavior anyway, and you can still paint via 
void[], and even more importantly, if you need to paint between 
UTF8/UTF16/UTF32/cchar, either the source or destination type is wrong 
in the first place.
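
To illustrate with today's D (std.utf already handles the UTF-to-UTF 
conversions):

import std.utf;

void demo()
{
    char[]  u8        = "abcd";            // 4 bytes of UTF-8
    wchar[] painted   = cast(wchar[]) u8;  // today: a paint - 2 "UTF-16" units of mumbo jumbo
    wchar[] converted = toUTF16(u8);       // an actual conversion
}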

Point VII. will make it somewhat easier to make the transition.

Point VIII. provides an alternative to casting and allows specifying 
endianness when writing to network and/or files. The methods should be 
compile-time resolvable when possible, so this would be both valid and 
evaluated at compile time:

ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);

Point IX. allows concatenation of strings in different encodings without 
significantly increasing the complexity of overloading rules, while also 
not requiring an inefficient toUTFxx followed by concatenation (which 
copies the result again).

Point X. prevents some invalid code:
- treating a UTF-8 code unit as a character
- treating a UTF-16 code unit as a character
- iterating over code units instead of characters

Note that it is still possible to iterate over the string using a cchar 
and dchar, which actually do represent characters. Also note that for 
I/O purposes - about the only thing one should be doing with code 
units - you can still paint the string as void[] or byte[] (or even 
better, call one of the methods above), but then you give up the view 
that it is a string and lose language support/special treatment.
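
For example, in today's terms, per-character iteration looks like this 
(foreach decodes the code units into dchars on the fly):

import std.stdio;

void main()
{
    char[] s = "mennä";     // UTF-8 in the source
    foreach (dchar c; s)    // one iteration per code point, not per byte
        writefln("U+%04X", cast(uint) c);
}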



So, what do you guys/gals think? :)


xs0

* note that even dchar[] still doesn't necessarily contain complete 
characters, at least as seen by the user. For example, the letter 
LATIN_C_WITH_CARON can also be written as LATIN_C + COMBINING_CARON, and 
they are in fact equivalent as far as Unicode is concerned (afaik). 
Splitting the string between the two will thus produce a "wrong" result, but I 
don't think D should include any kind of full Unicode processing, as 
it's actually needed quite rarely, so that problem is ignored...
November 24, 2005
Re: YAUST v1.0
xs0 wrote:
<snip>
> III.
> 
> version(Windows) {
>     alias utf16[] string;
> } else
> version(Unix/Linux) {
>     alias utf8[] string;
> }
> 
> add suffix ""s for explicitly specifying platform-specific encoding 
> (i.e. the string type), and make auto type inference default to that 
> same type (this applies to the auto keyword, not undecorated strings). 
> Add docs explaining that string is just a platform-dependant alias.
> 
The idea (platform independence) here is correct. :) The only thing is 
that you _don't_ need to know which UTF implementation the current 
compiler is using. If you are using Unicode to communicate with the user 
and/or native D libraries, you don't need to do any string conversions - 
they all use the same string representation, for god's sake.

> IV.
> 
> add the following implicit casts for interoperability
> 
> from: cchar[], utf8[], utf16[], dchar[]
> to  : cchar*, utf8*, utf16*, dchar*
> 
> all of them ensure 0-termination. If cchar is converted to any other 
> form, it becomes the appropriate Unicode char. In the reverse direction, 
> all unrepresentable characters become '?'. when runtime transcoding 
> and/or reallocation is required, make them produce a warning.

You mean C/C++ interoperability?

Replacing all non-ASCII characters with '?'s means that we don't 
actually want to support all the legacy systems out there. So it would 
be impossible to write Unicode-compliant portable programs that 
supported 'ä' on the Windows 9x/NT/XP command line without version() {} 
logic?

> V.
> 
> add the following implicit (transcoding) casts
> 
> from: cchar[], utf8[], utf16[], dchar[]
> to  : cchar[], utf8[], utf16[], dchar[]
> 
> when runtime transcoding is required, make them produce a warning (i.e. 
> always, except when casting from T to T).

Again, the main reason for Unicode is that you don't need to transcode 
between several representations all the time.

> VI.
> 
> modify explicit casts between all the array and pointer types to
> - transcode rather than paint
> - use '?' for unrepresentable characters (applies to encoding into 
> cchar*/cchar[] only)
> - not produce the warnings from above
> 
> -- 
> 
> VII.
> 
> create compatibility kit:
> 
> module std.srccompatibility.oldchartypes;
> // yes, it should be big and ugly
> 
> alias utf8 char;
> alias utf16 wchar;
> 

You know, sweeping the problem under the carpet doesn't help us much. 
char/wchar won't get any better by calling them by a different name. A 
single char still won't be able to store more than the first 128 Unicode 
code points.

> VIII.
> 
> add the following methods to all 4 array types
> 
>  utf8[] .asUTF8
> utf16[] .asUTF16
> dchar[] .asUTF32
> cchar[] .asCchars

Why, section V. already allows you to transcode these implicitly.

> ubyte[] .asUTF8   (bool dummy) // I think there's no UTF-8 BOM
> ubyte[] .asUTF16LE(bool includeBOM)
> ubyte[] .asUTF16BE(bool includeBOM)
> ubyte[] .asUTF32LE(bool includeBOM)
> ubyte[] .asUTF32BE(bool includeBOM)
> 

This looks pretty familiar. My own proposal does this on a library level 
for a reason. You see, conversions from Unicode to ISO-8859-x/KOI8-R/... 
should be allowed. It's easier to maintain the conversion table in a 
separate library. This also saves Walter from a lot of unnecessary work.

UTF-8 _does_ have a BOM.

> IX.
> 
> modify the ~ operator between the 4 types to work as follows:
> 
> a) infer the result type from context, as with undecorated strings
> b) if calling a function and there are multiple overloads
> b.1) if both operand types are known, use that type
> b.2) if one us known and another is undecorated literal, use the known type
> b.3) if neither is known or both are known, but different, bork
> 

If we didn't have several types of strings, this all would be much easier.

> X.
> 
> Disallow utf8 and utf16 as a stand-alone var type, only arrays and 
> pointers allowed
> 

Yes, this is a 'working' solution. Although I would like to be able to 
slice strings and do things like:

char[] s = "Älyttömämmäksi voinee mennä?"
s[15..21] = "ei voi"
writefln(s) // outputs: Älyttömämmäksi ei voi mennä?

Of course you can do this all using library functions, but tell me one 
thing: why should I do simple string slicing using library calls and 
much more complex Unicode conversion using language structures?

> Point I. removes the confusion of "char" and "wchar" not actually 
> representing characters.
> 
True.

> Point II. explicitly states that the strings are either UTF-encoded, 
> complete characters* or C-compatible characters.
True.

> Point III. makes the code
> 
> string abc="abc";
> someOSFunc(abc);
> someOtherOSFunc("qwe"s); // s only neccessary if there is more than one 
> option
> 
> least likely to produce any transcoding.

Of course you need to do transcoding, if the OS function expects 
ISO-8859-x and your string is UTF-8/16.

> Point IV. makes it nearly impossible to do the wrong thing and doesn't 
> require explicit casts when interfacing to C code, assuming the C 
> functions are declared properly (i.e. the correct of the two 1-byte 
> types is declared). When used with literals, the 0 can be appended 
> compile-time, like it is now.
Why do you have to output Unicode strings using legacy non-Unicode 
C-APIs? AFAIK DUI / standard I/O and other libraries use standard 
Unicode, right? At least QT / GTK+ / Win32API / Linux console do support 
Unicode.

> Point V. makes it easier to use different types without explicit 
> casting, but will still produce warnings when transcoding happens. In 
> most cases it will be obvious anyway.
It would be easier with only a single Unicode-compliant string type. Ask 
the Java guys.

> Point VI. breaks behavior of other array casts (which only paint), but 
> strings are getting special behavior anyway, and you can still paint via 
> void[], and even more importantly, if you need to paint between 
> UTF8/UTF16/UTF32/cchar, either the source or destination type is wrong 
> in the first place.
?

> Point VII. will make it somewhat easier to make the transition.
?

> Point VIII. provides an alternative to casting and allows specifying 
> endianness when writing to network and/or files.
Partly true. Still, I think it would be much better if we had these as a 
std.stream.UnicodeStream class. Again, Java does this well.

> The methods should be 
> compile-time resolvable when possible, so this would be both valid and 
> evaluated in compile time:
> 
> ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);
Why? Converting a 14-character string doesn't take much time. Besides, 
if all our strings and I/O were UTF-8, there wouldn't be any 
conversions, right?

> Point IX. allows concatenation of strings in different encodings without 
> significantly increasing the complexity of overloading rules, while also 
> not requiring an inefficient toUTFxx followed by concatenation (which 
> copies the result again).
True, but as I previously said, I don't believe we need to do a great 
amount of conversions at the runtime level. All conversions should be 
near the network/file interfaces, thus using Stream classes, right?

> Point X. prevents some invalid code:

> Note that it is still possible to iterate over the string using a cchar 
> and dchar, which actually do represent characters. Also note that for 
> I/O purposes, which are the only thing one should be doing with code 
> units, you can still paint the string as void[] or byte[] (or even 
> better, call one of the methods above), but then you give up the view 
> that it is a string and lose language support/special treatment.
True.

> Splitting the string inbetween will thus produce a "wrong" result, but I 
> don't think D should include any kind of full Unicode processing, as 
> it's actually needed quite rarely, so that problem is ignored...

Sigh. Maybe you're not doing full Unicode processing every day. What 
about the Chinese? And what is full Unicode processing?
November 24, 2005
Re: YAUST v1.0
Before anything else: while I agree that a (really well-thought out) 
string class would probably be a good solution, the D spec would seem to 
suggest an array-based approach is preferred, and Walter isn't one to 
change his mind easily :)
Besides, any kind of string class has its share of problems (one size 
never fits all), and with the array-based approach it's easy to add 
pseudo-methods doing all kinds of funky things, while a language-defined 
class makes it impossible.


Jari-Matti Mäkelä wrote:
>> version(Windows) {
>>     alias utf16[] string;
>> } else
>> version(Unix/Linux) {
>>     alias utf8[] string;
>> }
>>
>> add suffix ""s for explicitly specifying platform-specific encoding 
>> (i.e. the string type), and make auto type inference default to that 
>> same type (this applies to the auto keyword, not undecorated strings). 
>> Add docs explaining that string is just a platform-dependant alias.
>>
> The idea (platform-independence) here is correct. :) The only thing is 
> that you _don't_ need to know, which utf-implementation the current 
> compiler is using. 

Well, sometimes you do and most times you don't (and it is often the 
case that at least some part of any app does need to know). I don't 
think it's wise to force anything down anyone's throat, so I tried to 
give options - you can use a specific UTF encoding, the native encoding 
for legacy OSes, or leave it to the compiler to choose the "best" one 
for you, where I believe best is what the underlying OS is using.


> If you are using Unicode to communicate with the user 
> and/or native D libraries, you don't need to do any string conversions - 
> they all use the same string representation, for god's sake.

Well, flexibility will definitely require some bloat in libraries, but 
for communicating with the user, you definitely need conversions, if 
you're not using the OS-native type (which, again, you do have the 
option of using by being explicit about it).


>> add the following implicit casts for interoperability
>>
>> from: cchar[], utf8[], utf16[], dchar[]
>> to  : cchar*, utf8*, utf16*, dchar*
>>
>> all of them ensure 0-termination. If cchar is converted to any other 
>> form, it becomes the appropriate Unicode char. In the reverse 
>> direction, all unrepresentable characters become '?'. when runtime 
>> transcoding and/or reallocation is required, make them produce a warning.
> 
> You mean C/C++ -interoperability?

Yup.

> Replacing all non-ASCII characters with '?'s means that we don't 
> actually want to support all the legacy systems out there. So it would 
> be impossible to write Unicode-compliant portable programs that 
> supported 'ä' on the Windows 9x/NT/XP command line without version() {} 
> -logic?

No, who mentioned ASCII? On windows, cchar would be exactly the legacy 
encoding each non-unicode app uses, and conversions between app's 
internal UTF-x and cchar[] would transcode into that charset. So, for 
example, a word processor on a non-unicode windows version could still 
use unicode internally, while automatically talking to the OS using all 
the characters its charset provides.
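
On Windows, that conversion would presumably boil down to something like 
WideCharToMultiByte with the active code page and '?' as the replacement 
character; a rough sketch (prototype declared inline just for 
illustration):

extern (Windows) int WideCharToMultiByte(
    uint codePage, uint flags, wchar* src, int srcLen,
    char* dst, int dstLen, char* defaultChar, int* usedDefault);

const uint CP_ACP = 0;  // the system's active ("ANSI") code page

char[] toLegacy(wchar[] s)  // the result holds legacy-encoded bytes, i.e. what I call cchar[]
{
    char defChar = '?';
    int needed = WideCharToMultiByte(CP_ACP, 0, s.ptr, cast(int) s.length,
                                     null, 0, &defChar, null);
    char[] buf = new char[needed];
    WideCharToMultiByte(CP_ACP, 0, s.ptr, cast(int) s.length,
                        buf.ptr, needed, &defChar, null);
    return buf;
}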


>> add the following implicit (transcoding) casts
>>
>> from: cchar[], utf8[], utf16[], dchar[]
>> to  : cchar[], utf8[], utf16[], dchar[]
>>
>> when runtime transcoding is required, make them produce a warning 
>> (i.e. always, except when casting from T to T).
> 
> Again, the main reason for Unicode is that you don't need to transcode 
> between several representations all the time.

Again, sometimes you do and most times you don't. But anyhow, painting 
casts between UTF types make no sense, and I don't think explicit casts 
are necessary, as there can't be any loss (ok, except to cchar[]).


>> create compatibility kit:
>>
>> module std.srccompatibility.oldchartypes;
>> // yes, it should be big and ugly
>>
>> alias utf8 char;
>> alias utf16 wchar;
>>
> 
> You know, sweeping the problem under the carpet doesn't help us much. 
> char/wchar won't get any better by calling them with a different name. 
> Still char won't be able to store more than the first 127 Unicode symbols.

I'm not sure if you're referring to those aliases or not, but in YAUST, 
there is no single char(utf8) anymore, and I think there's quite a 
difference between "char[]" and "utf8[]", especially in a C-influenced 
world like ours :)


>> add the following methods to all 4 array types
>>
>>  utf8[] .asUTF8
>> utf16[] .asUTF16
>> dchar[] .asUTF32
>> cchar[] .asCchars
> 
> Why, section V. already allows you to transcode these implicitely.

Yup, but with warnings; using one of these shows that you've thought 
about what you're doing, so the compiler is free to shut up :)


>> ubyte[] .asUTF8   (bool dummy) // I think there's no UTF-8 BOM
>> ubyte[] .asUTF16LE(bool includeBOM)
>> ubyte[] .asUTF16BE(bool includeBOM)
>> ubyte[] .asUTF32LE(bool includeBOM)
>> ubyte[] .asUTF32BE(bool includeBOM)
>>
> 
> This looks pretty familiar. My own proposal does this on a library level 
> for a reason. You see, conversions from Unicode to ISO-8859-x/KOI8-R/... 
> should be allowed. 

Sure they should be allowed, but D is supposed to be Unicode, so a D app 
should generally only deal with that, and other charsets should 
generally only exist in byte[] buffers before input or after output.


> It's easier to maintain the conversion table in a 
> separate library. This also saves Walter from a lot of unnecessary work.

Well, conversions between UTFs are done already, so the only thing 
remaining would be from/to cchar[], which shouldn't be too hard. Others 
definitely belong in some library, as they mostly won't be needed, I guess..
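
For reference, the UTF-to-UTF conversions that exist today are in std.utf:

import std.utf;

void demo()
{
    char[]  u8   = "abc";
    wchar[] u16  = toUTF16(u8);
    dchar[] u32  = toUTF32(u16);
    char[]  back = toUTF8(u32);
}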


> UTF-8 _does_ have a BOM.

It does? What is it? I thought that single bytes have no Byte Order, so 
why would you need a Mark?


>> modify the ~ operator between the 4 types to work as follows:
>>
>> a) infer the result type from context, as with undecorated strings
>> b) if calling a function and there are multiple overloads
>> b.1) if both operand types are known, use that type
>> b.2) if one us known and another is undecorated literal, use the known 
>> type
>> b.3) if neither is known or both are known, but different, bork
> 
> If we didn't have several types of strings, this all would be much easier.

Agreed, but we do have several types of strings :)


>> Disallow utf8 and utf16 as a stand-alone var type, only arrays and 
>> pointers allowed
>>
> 
> Yes, this is a 'working' solution. Although I would like to be able to 
> slice strings and do things like:
> 
> char[] s = "Älyttömämmäksi voinee mennä?"
> s[15..21] = "ei voi"
> writefln(s) // outputs: Älyttömämmäksi ei voi mennä?
> 
> Of course you can do this all using library functions, but tell me one 
> thing: why should I do simple string slicing using library calls and 
> much more complex Unicode conversion using language structures.

Because it's actually the opposite - Unicode conversions are simple, 
while slicing is hard (at least slicing on character boundaries). Even 
in the simple example you give, I have no idea whether the first Ä is 
one character or two, as both cases look the same.
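
To make that concrete, both of these display as the same Ä, yet they 
differ in length even as dchar[]:

void demo()
{
    dchar[] precomposed = "\u00C4";   // LATIN CAPITAL LETTER A WITH DIAERESIS
    dchar[] decomposed  = "A\u0308";  // 'A' followed by COMBINING DIAERESIS
    // precomposed.length == 1, decomposed.length == 2
}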


>> Point III. makes the code
>>
>> string abc="abc";
>> someOSFunc(abc);
>> someOtherOSFunc("qwe"s); // s only neccessary if there is more than 
>> one option
>>
>> least likely to produce any transcoding.
> 
> Of course you need to do transcoding, if the OS-function expects 
> ISO-8859-x and you're string has utf8/16.

True, I just said "least likely". But at least you can use the same 
(non-transcoding) code for both UTF-8 OSes and UTF-16 OSes.


>> Point IV. makes it nearly impossible to do the wrong thing and doesn't 
>> require explicit casts when interfacing to C code, assuming the C 
>> functions are declared properly (i.e. the correct of the two 1-byte 
>> types is declared). When used with literals, the 0 can be appended 
>> compile-time, like it is now.
> 
> Why do you have to output Unicode strings using legacy non-Unicode 
> C-APIs? AFAIK DUI / stardard I/O and other libraries use standard 
> Unicode, right? At least QT / GTK+ / Win32API / Linux console do support 
> Unicode.

Well, your point is moot, because if there's no such function to call, 
then there is no problem. But when there is such a function, you would 
hope that the language/library does something sensible by default, 
wouldn't you?


>> Point V. makes it easier to use different types without explicit 
>> casting, but will still produce warnings when transcoding happens. In 
>> most cases it will be obvious anyway.
> 
> It would easier with only a single Unicode-compliant string-type. Ask 
> the Java guys.

Well, I am one of the Java guys, and java.lang.String leaves a lot to be 
desired. Because it's language-defined the way it is, it's
1) immutable, which sucks if it's forced down your throat 100% of the time
2) UTF-16 forever and ever, which sucks if you want it to either take 
less memory or not have to worry about surrogates; just look at all 
the crappy functions they had to add in Java 5 to support the entire 
Unicode charset :)


>> Point VI. breaks behavior of other array casts (which only paint), but 
>> strings are getting special behavior anyway, and you can still paint 
>> via void[], and even more importantly, if you need to paint between 
>> UTF8/UTF16/UTF32/cchar, either the source or destination type is wrong 
>> in the first place.
> 
> ?

Well, a sequence of bytes can be either cchar[], UTF-8, UTF-16 or 
UTF-32, but not more than one at the same time (OK, unless it's ASCII 
only, which fits both the first two). So, for example, if you cast 
utf8[] to utf16[], either the data is UTF-8 and you don't get a UTF-16 
string (but some mumbo jumbo), or it's UTF-16 and was never valid UTF-8 
in the first place.


>> Point VII. will make it somewhat easier to make the transition.
> 
> ?

?

>> Point VIII. provides an alternative to casting and allows specifying 
>> endianness when writing to network and/or files.
> 
> Partly true. Still, I think it would be much better if we had these as a 
> std.stream.UnicodeStream class. Again, Java does this well.

Why should you be forced to use a stream for something so simple? What 
if you want to use two encodings on the same stream? It's not even so 
far-fetched - the first line of an HTTP request can only contain UTF-8, 
but you may want to send the POST contents in UTF-16, for example. Etc. etc.


>> The methods should be compile-time resolvable when possible, so this 
>> would be both valid and evaluated in compile time:
>>
>> ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);
> 
> Why? Converting a 14 character string doesn't take much time. 

Why would it not evaluate at compile time? Do you see any benefit in 
that? And while it doesn't take much time once, it does take some, and 
more importantly, allocates new memory each time. If you're trying to do 
more than one request (as in thousands), I'm sure it adds up..

> Besides, 
> if all our strings and i/o were utf-8, there wouldn't be any 
> conversions, right?

Except every time you'd call a Win32 function, which is what's on most 
computers?


>> Point IX. allows concatenation of strings in different encodings 
>> without significantly increasing the complexity of overloading rules, 
>> while also not requiring an inefficient toUTFxx followed by 
>> concatenation (which copies the result again).
> 
> True, but as I previously said, I don't believe we need to do great 
> amount of conversions in the runtime-level. All conversions should be 
> near network/file-interfaces, thus using Stream-classes, right?

I agree decent stream classes can solve many problems, but not all of them.


>> Splitting the string inbetween will thus produce a "wrong" result, but 
>> I don't think D should include any kind of full Unicode processing, as 
>> it's actually needed quite rarely, so that problem is ignored...
> 
> Sigh. Maybe you're not doing full Unicode processing every day. What 
> about the Chinese? And what is full Unicode processing?

Unicode is much more than a really large character set. There are UTFs, 
collation, bidirectionality, combining characters, locales, etc. etc., see
http://www.unicode.org/reports/index.html

So, if you want to create a decent text editor according to the Unicode 
specs, you'll have to implement "full Unicode processing", but the large 
majority of other apps just need to be able to interface with the OS and 
libraries to get and display the text, usually without even caring 
what's inside, so I see no point in including all that in D, not even as 
a standard library (or perhaps only after many other things are implemented 
first).


xs0
November 24, 2005
Re: YAUST v1.0
xs0 wrote:
> Before anything else: while I agree that a (really well-thought out) 
> string class would probably be a good solution, the D spec would seem to 
> suggest an array-based approach is preferred, and Walter isn't one to 
> change his mind easily :)

I believe we can achieve quite a lot with just a simple array-like syntax.

> Besides, any kind of string class has it's share of problems (one size 
> never fits all), and with the array based approach it's easy to add 
> pseudo-methods doing all kinds of funky things, while a language-defined 
> class makes it impossible.

Although D is able to support some hard-coded properties too.

>> The idea (platform-independence) here is correct. :) The only thing is 
>> that you _don't_ need to know, which utf-implementation the current 
>> compiler is using. 
> 
> Well, sometimes you do and most times you don't (and it is often the 
> case that at least some part of any app does need to know). I don't 
> think it's wise to force anything down anyone's throat, so I tries to 
> give options - you can use a specific UTF encoding, the native encoding 
> for legacy OSes, or leave it to the compiler to choose the "best" one 
> for you, where I believe best is what the underlying OS is using.

I'd give my vote for the "let compiler choose" option.

>> If you are using Unicode to communicate with the user and/or native D 
>> libraries, you don't need to do any string conversions - they all use 
>> the same string representation, for god's sake.
> 
> Well, flexibility will definitely require some bloat in libraries, but 
> for communicating with the user, you definitely need conversions, if 
> you're not using the OS-native type (which, again, you do have the 
> option of using with being explicit about it).

But if you let the compiler vendor decide the encoding, there's a 
high probability that you don't need any explicit transcoding.

>>> add the following implicit casts for interoperability
>>>
>>> from: cchar[], utf8[], utf16[], dchar[]
>>> to  : cchar*, utf8*, utf16*, dchar*
>>>
>>> all of them ensure 0-termination. If cchar is converted to any other 
>>> form, it becomes the appropriate Unicode char. In the reverse 
>>> direction, all unrepresentable characters become '?'. when runtime 
>>> transcoding and/or reallocation is required, make them produce a 
>>> warning.
>>
>> You mean C/C++ -interoperability?
> Yup.

I was just thinking that once D has complete wrappers for all necessary 
stuff, you don't need these anymore. Library (wrapper) writers should be 
patient enough to use explicit conversion rules.

>> Replacing all non-ASCII characters with '?'s means that we don't 
>> actually want to support all the legacy systems out there. So it would 
>> be impossible to write Unicode-compliant portable programs that 
>> supported 'ä' on the Windows 9x/NT/XP command line without version() 
>> {} -logic?
> 
> 
> No, who mentioned ASCII? On windows, cchar would be exactly the legacy 
> encoding each non-unicode app uses, and conversions between app's 
> internal UTF-x and cchar[] would transcode into that charset. So, for 
> example, a word processor on a non-unicode windows version could still 
> use unicode internally, while automatically talking to the OS using all 
> the characters its charset provides.
> 

You said
"In the reverse direction, all unrepresentable characters become '?'."

The thing is that the D compiler doesn't know anything about your system 
character encoding. You can even change it on the fly, if your system is 
capable of doing that. Therefore this transcoding must fall back to the 
lowest common denominator, which is probably 7-bit ASCII.

>>> add the following implicit (transcoding) casts
>>>
>>> from: cchar[], utf8[], utf16[], dchar[]
>>> to  : cchar[], utf8[], utf16[], dchar[]
>>>
>>> when runtime transcoding is required, make them produce a warning 
>>> (i.e. always, except when casting from T to T).
>>
>> Again, the main reason for Unicode is that you don't need to transcode 
>> between several representations all the time.
> 
> Again, sometimes you do and most times you don't. But anyhow, painting 
> casts between UTF types make no sense, and I don't think explicit casts 
> are neccessary, as there can't be any loss (ok, except to cchar[]).

You don't need to convert inside your own code unless you're really 
creating a program that is supposed to convert stuff. I mean you need 
the transcoding only when interfacing with foreign code / i/o.

>>> add the following methods to all 4 array types
>>>
>>>  utf8[] .asUTF8
>>> utf16[] .asUTF16
>>> dchar[] .asUTF32
>>> cchar[] .asCchars
>>
>> Why, section V. already allows you to transcode these implicitely.
> 
> Yup, but with warnings; using one of these shows that you've thought 
> about what you're doing, so the compiler is free to shut up :)

Yes, now you're right. The programmer should _always_ explicitly 
declare all conversions.

>>> ubyte[] .asUTF8   (bool dummy) // I think there's no UTF-8 BOM
>>> ubyte[] .asUTF16LE(bool includeBOM)
>>> ubyte[] .asUTF16BE(bool includeBOM)
>>> ubyte[] .asUTF32LE(bool includeBOM)
>>> ubyte[] .asUTF32BE(bool includeBOM)
>>>
>> This looks pretty familiar. My own proposal does this on a library 
>> level for a reason. You see, conversions from Unicode to 
>> ISO-8859-x/KOI8-R/... should be allowed. 
> 
> Sure they should be allowed, but D is supposed to be Unicode, so a D app 
> should generally only deal with that, and other charsets should 
> generally only exist in byte[] buffers before input or after output.

Then tell me, how do I fill these buffers with your new functions? I 
would definitely want to explicitly define the character encoding. IMHO 
this is much better done using static classes (std.utf.e[n/de]code) than 
variable properties.

>> It's easier to maintain the conversion table in a separate library. 
>> This also saves Walter from a lot of unnecessary work.
> 
> Well, conversions between UTFs are done already, so the only thing 
> remaining would be from/to cchar[], which shouldn't be too hard.

Yes, between UTFs, but conversions between legacy charsets and UTFs are 
not! They aren't that hard, but as you might know, there are maybe 
hundreds of possible encodings.

> Others
> definitely belong in some library, as they mostly won't be needed, I
> guess..

This isn't a very consistent approach. Some functions belong in some 
library, others should be implemented in the language...wtf?

>> UTF-8 _does_ have a BOM.
> 
> It does? What is it? I thought that single bytes have no Byte Order, so 
> why would you need a Mark?

0xEF 0xBB 0xBF

http://www.unicode.org/faq/utf_bom.html#25

See also

http://www.unicode.org/faq/utf_bom.html#29

>> If we didn't have several types of strings, this all would be much 
>> easier.
> 
> Agreed, but we do have several types of strings :)

I'm trying to say we don't need several types of strings :)

>>> Disallow utf8 and utf16 as a stand-alone var type, only arrays and 
>>> pointers allowed
>>>
>>
>> Yes, this is a 'working' solution. Although I would like to be able to 
>> slice strings and do things like:
>>
>> char[] s = "Älyttömämmäksi voinee mennä?"
>> s[15..21] = "ei voi"
>> writefln(s) // outputs: Älyttömämmäksi ei voi mennä?
>>
>> Of course you can do this all using library functions, but tell me one 
>> thing: why should I do simple string slicing using library calls and 
>> much more complex Unicode conversion using language structures.
> 
> 
> Because it's actually the opposite - Unicode conversions are simple, 
> while slicing is hard (at least slicing on character boundaries). Even 
> in the simple example you give, I have no idea whether the first Ä is 
> one character or two, as both cases look the same.

It's not really that hard. One downside is that you have to parse 
through the string (unless the compiler uses UTF-16/32 as the internal string 
type).

Slicing the string on the code unit level doesn't make any sense, now 
does it? Because char should be treated as a special type by the 
compiler, I see no other use for slicing than this. Like you said, the 
alternative slicing can be achieved by casting the string to void[] (for 
i/o data buffering, etc).

>>> Point III. makes the code
>>>
>>> string abc="abc";
>>> someOSFunc(abc);
>>> someOtherOSFunc("qwe"s); // s only neccessary if there is more than 
>>> one option
>>>
>>> least likely to produce any transcoding.
>>
>>
>> Of course you need to do transcoding, if the OS-function expects 
>> ISO-8859-x and you're string has utf8/16.
> 
> 
> True, I just said "least likely". But at least you can use the same 
> (non-transcoding) code for both UTF-8 OSes and UTF-16 OSes.

Again, neither the compiler nor the compiled binary knows anything about 
the OS standard encoding. Even some Linux systems still use ISO-8859-x. 
If you're running Windows programs through VMware or Wine on Linux, you 
can't tell if it's always faster to use UTF-16 instead of UTF-8.

>>> Point IV. makes it nearly impossible to do the wrong thing and 
>>> doesn't require explicit casts when interfacing to C code, assuming 
>>> the C functions are declared properly (i.e. the correct of the two 
>>> 1-byte types is declared). When used with literals, the 0 can be 
>>> appended compile-time, like it is now.
>>
>>
>> Why do you have to output Unicode strings using legacy non-Unicode 
>> C-APIs? AFAIK DUI / stardard I/O and other libraries use standard 
>> Unicode, right? At least QT / GTK+ / Win32API / Linux console do 
>> support Unicode.
> 
> 
> Well, your point is moot, because if there's no such function to call, 
> then there is no problem. But when there is such a function, you would 
> hope that the language/library does something sensible by default, 
> wouldn't you?

No, this brilliant invention of yours causes problems even if we didn't 
have any 'legacy'-systems/APIs. You see, Library-writer 1 might use 
UTF-16 for his library because he uses Windows and thinks it's the 
fastest charset. Now Library-writer 2 has done his work using UTF-8 as 
an internal format. If you make a client program that links with both of 
these, you (may) have to do unnecessary conversions just because one 
guy decided to create his own standards.

>>> Point V. makes it easier to use different types without explicit 
>>> casting, but will still produce warnings when transcoding happens. In 
>>> most cases it will be obvious anyway.
>>
>>
>> It would easier with only a single Unicode-compliant string-type. Ask 
>> the Java guys.
> 
> 
> Well, I am one of the Java guys, and java.lang.String leaves a lot to be 
> desired. Because it's language defined in the way it is, it's
> 1) immutable, which sucks if it's forced down your throat 100% of time

I agree.

> 2) UTF-16 for ever and ever, which sucks if you want it to either take 
> less memory or don't want to worry about surrogates; just look at all 
> the crappy functions they had to add in Java 5 to support the entire 
> Unicode charset :)

Partly true. What I meant was that most Java programmers use only one 
kind of string class (because they don't have/need other types).

>>> Point VI. breaks behavior of other array casts (which only paint), 
>>> but strings are getting special behavior anyway, and you can still 
>>> paint via void[], and even more importantly, if you need to paint 
>>> between UTF8/UTF16/UTF32/cchar, either the source or destination type 
>>> is wrong in the first place.
>>
>> ?
> 
> Well, a sequence of bytes can be either cchar[], UTF-8, UTF-16 or 
> UTF-32, but not more than one at the same time (OK, unless it's ASCII 
> only, which fits both the first two). So, for example, if you cast 
> utf8[] to utf16[], either the data is UTF-8 and you don't get a UTF-16 
> string (but some mumbo jumbo), or it's UTF-16 and was never valid UTF-8 
> in the first place.

Ok. But I thought you said utf8[] is implicitly converted to utf16[]. 
Then it's always valid whatever-type-it-is.

>>> Point VII. will make it somewhat easier to make the transition.

How? I don't believe so.

>>> Point VIII. provides an alternative to casting and allows specifying 
>>> endianness when writing to network and/or files.
>>
>>
>> Partly true. Still, I think it would be much better if we had these as 
>> a std.stream.UnicodeStream class. Again, Java does this well.
> 
> 
> Why should you be forced to use a stream for something so simple?

So simple? Ahem, std.stream.File _is_ a stream. Here's my version:

  File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF8);
  f.writeLine("valid unicode text åäöü");
  f.close;

  File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF16LE);
  f.writeLine("valid unicode text åäöü");
  f.close;

Advantages:
-supports BOM values
-easy to use, right?

> What
> if you want to use two encodings on the same stream (it's not even so 
> far fetched - the first line in a HTTP request can only contain UTF-8, 
> but you may want to send POST contents in UTF-16, for example). Etc. etc.

Simple, just implement a method for changing the stream type:

Stream s = new UnicodeSocketStream(socket, mode, encoding);

s.changeEncoding(encoding2);

If you want high-performance streams, you can convert the strings in a 
separate thread before you use them, right?

>>> The methods should be compile-time resolvable when possible, so this 
>>> would be both valid and evaluated in compile time:
>>>
>>> ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);
>>
>>
>> Why? Converting a 14 character string doesn't take much time. 
> 
> 
> Why would it not evaluate at compile time? Do you see any benefit in 
> that? And while it doesn't take much time once, it does take some, and 
> more importantly, allocates new memory each time. If you're trying to do 
> more than one request (as in thousands), I'm sure it adds up..

You only need to convert once.

>> Besides, if all our strings and i/o were utf-8, there wouldn't be any 
>> conversions, right?
> 
> Except every time you'd call a Win32 function, which is what's on most 
> computers?

My mistake, let's forget the utf-8 for a while. Actually I meant that if 
all strings were in the native OS format (let the compiler decide), 
there would be no need to convert.

>>> Point IX. allows concatenation of strings in different encodings 

Why do you want to do that?

>>> without significantly increasing the complexity of overloading rules, 
>>> while also not requiring an inefficient toUTFxx followed by 
>>> concatenation (which copies the result again).
>>
>>
>> True, but as I previously said, I don't believe we need to do great 
>> amount of conversions in the runtime-level. All conversions should be 
>> near network/file-interfaces, thus using Stream-classes, right?
> 
> 
> I agree decent stream classes can solve many problems, but not all of them.

"Many, but not all of them." That's why we should have 
std.utf.encode/decode-functions.

>>> Splitting the string inbetween will thus produce a "wrong" result, 
>>> but I don't think D should include any kind of full Unicode 
>>> processing, as it's actually needed quite rarely, so that problem is 
>>> ignored...
> 
> So, if you want to create a decent text editor according to Unicode 
> specs, you'll have to implement "full Unicode processing", but a large 
> majority of other apps just needs to be able to interface to OS and 
> libraries to get and display the text, usually without even caring 
> what's inside, so I see no point to include all that in D, not even as a 
> standard library (or perhaps after many other things are implemented first)

Ok, now I see your point. I thought you didn't want full Unicode 
processing even as an add-on library. I agree, you don't need these 
'advanced' algorithms in the core language, but rather in a separate 
library. Time will tell, maybe someday when we haven't got anything else 
to do, Phobos will finally include some cool Unicode tricks.


Jari-Matti
November 25, 2005
Re: YAUST v1.0
>> Well, flexibility will definitely require some bloat in libraries, but 
>> for communicating with the user, you definitely need conversions, if 
>> you're not using the OS-native type (which, again, you do have the 
>> option of using with being explicit about it).
> 
> But if you let the compiler vendor to decide the encoding, there's a 
> high probability that you don't need any explicit transcoding.

Sure you may need transcoding - you may use 15 different libraries, each 
expecting its own thing. The one thing that can be done is to not 
require transcoding at least when talking to the OS, which all apps have to 
do at some point. But even then, you should have the option to choose 
otherwise - if you have a UTF-8 library that you use in 99% of 
string-related calls, it's still faster to use UTF-8 internally and transcode when 
talking to the OS.


>>> You mean C/C++ -interoperability?
>>
>> Yup.
> 
> I was just thinking that once D has complete wrappers for all necessary 
> stuff, you don't need these anymore. Library (wrapper) writers should be 
> patient enough to use explicit conversion rules.

But why should one have to create wrappers in the first place? With my 
proposal, you can directly link to many libraries and the compiler will 
do the conversions for you.


>> No, who mentioned ASCII? On windows, cchar would be exactly the legacy 
>> encoding each non-unicode app uses, and conversions between app's 
>> internal UTF-x and cchar[] would transcode into that charset. So, for 
>> example, a word processor on a non-unicode windows version could still 
>> use unicode internally, while automatically talking to the OS using 
>> all the characters its charset provides.
> 
> You said
> "In the reverse direction, all unrepresentable characters become '?'."
> 
> The thing is that D compiler doesn't know anything about your system 
> character encoding. You can even change it on the fly, if your system is 
> capable of doing that. Therefore this transcoding must use the greatest 
> common divisor which is probably 7-bit ASCII.

While the compiler may not, I'm sure it's possible to figure it out at 
runtime. For example, many old apps use a different language based on 
your settings, browsers send different Accept-Language, etc. So, it is 
possible, I think.


>> Again, sometimes you do and most times you don't. But anyhow, painting 
>> casts between UTF types make no sense, and I don't think explicit 
>> casts are neccessary, as there can't be any loss (ok, except to cchar[]).
> 
> You don't need to convert inside your own code unless you're really 
> creating a program that is supposed to convert stuff. I mean you need 
> the transcoding only when interfacing with foreign code / i/o.

If you don't need to convert, fine. If you do need to convert, I see no 
reason it shouldn't be as easy/convenient as possible.

>> Yup, but with warnings; using one of these shows that you've thought 
>> about what you're doing, so the compiler is free to shut up :)
> 
> Yes, now you're right. The programmer should _always_ explicitely 
> declare all conversions.

Why?


>>>> ubyte[] .asUTF8   (bool dummy) // I think there's no UTF-8 BOM
>>>> ubyte[] .asUTF16LE(bool includeBOM)
>>>> ubyte[] .asUTF16BE(bool includeBOM)
>>>> ubyte[] .asUTF32LE(bool includeBOM)
>>>> ubyte[] .asUTF32BE(bool includeBOM)
>>>>
>>> This looks pretty familiar. My own proposal does this on a library 
>>> level for a reason. You see, conversions from Unicode to 
>>> ISO-8859-x/KOI8-R/... should be allowed. 
>>
>>
>> Sure they should be allowed, but D is supposed to be Unicode, so a D 
>> app should generally only deal with that, and other charsets should 
>> generally only exist in byte[] buffers before input or after output.
> 
> Then tell me, how do I fill these buffers with your new functions? 

You don't. Only UTFs and one OS-native encoding are supported in the 
language, the latter for obvious convenience. Others have to be done 
with a library. Note that the compiler is free to use the same library, 
it's not like anything would have to be done twice.


>>> UTF-8 _does_ have a BOM.
>>
>> It does? What is it? I thought that single bytes have no Byte Order, 
>> so why would you need a Mark?
> 
> 0xEF 0xBB 0xBF

OK, then it's not a dummy parameter :)


>>> If we didn't have several types of strings, this all would be much 
>>> easier.
>>
>> Agreed, but we do have several types of strings :)
> 
> I'm trying to say we don't need several types of strings :)

Why? I think if it's done properly, there are benefits from having a 
choice, while not complicating matters when one doesn't care.


>> Because it's actually the opposite - Unicode conversions are simple, 
>> while slicing is hard (at least slicing on character boundaries). Even 
>> in the simple example you give, I have no idea whether the first Ä is 
>> one character or two, as both cases look the same.
> 
> It's not really that hard. One downside is that you have to parse 
> through the string (unless compiler uses UTF-16/32 as an internal string 
> type).

It is "hard" - if you want to get the first character, as in the first 
character that the user sees, it can actually be from 1 to x characters, 
where x can be at least 5 (that case is actually in the unicode 
standard) and possibly more (and I don't mean code units, but characters).


> Slicing the string on the code unit level doesn't make any sense, now 
> does it? Because char should be treated as a special type by the 
> compiler, I see no other use for slicing than this. Like you said, the 
> alternative slicing can be achieved by casting the string to void[] (for 
> i/o data buffering, etc).

Well, I sure don't have anything against making string slicing work on 
character boundaries... Although that complicates matters - which length 
should .length then return? It will surely bork all kinds of templates, 
so perhaps it should be done with a different operator, like {a..b} 
instead of [a..b], and length-in-characters should be .strlen.


>>> Why do you have to output Unicode strings using legacy non-Unicode 
>>> C-APIs? AFAIK DUI / stardard I/O and other libraries use standard 
>>> Unicode, right? At least QT / GTK+ / Win32API / Linux console do 
>>> support Unicode.
>>
>> Well, your point is moot, because if there's no such function to call, 
>> then there is no problem. But when there is such a function, you would 
>> hope that the language/library does something sensible by default, 
>> wouldn't you?
> 
> No, this brilliant invention of yours causes problems even if we didn't 
> have any 'legacy'-systems/APIs. You see, Library-writer 1 might use 
> UTF-16 for his library because he uses Windows and thinks it's the 
> fastest charset. Now Library-writer 2 has done his work using UTF-8 as 
> an internal format. If you make a client program that links with these 
> both, you (may) have to create unnecessary conversions just because one 
> guy decided to create his own standards.

Please don't get personal, as I and many others don't consider it polite.

Anyhow, even if all D libraries use the same encoding, D is still 
directly linkable to C libraries and it's obvious one doesn't have 
control over what encoding they're using, so I fail to see what is wrong 
with supporting different ones, and I also fail to see how it will help 
to decree one of them The One and ignore all others.


>> 2) UTF-16 for ever and ever, which sucks if you want it to either take 
>> less memory or don't want to worry about surrogates; just look at all 
>> the crappy functions they had to add in Java 5 to support the entire 
>> Unicode charset :)
> 
> Partly true. What I meant was that most Java programmers use only one 
> kind of string class (because they don't have/need other types).

Well, writing something high-performance string-related in Java 
definitely takes a lot of code, because the built-in String class is 
often useless. I see no need to repeat that in D.


>> Well, a sequence of bytes can be either cchar[], UTF-8, UTF-16 or 
>> UTF-32, but not more than one at the same time (OK, unless it's ASCII 
>> only, which fits both the first two). So, for example, if you cast 
>> utf8[] to utf16[], either the data is UTF-8 and you don't get a UTF-16 
>> string (but some mumbo jumbo), or it's UTF-16 and was never valid 
>> UTF-8 in the first place.
> 
> Ok. But I thought you said utf8[] is implicitely converted to utf16[]. 
> Then it's always valid whatever-type-it-is.

Yes I did, and that has nothing to do with the above paragraph, as it's 
referring to the current situation, where casts between char types 
actually don't transcode.


>> Why should you be forced to use a stream for something so simple?
> 
> So simple? Ahem, std.stream.File _is_ a stream. Here's my version:
> 
>   File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF8);
>   f.writeLine("valid unicode text åäöü");
>   f.close;
> 
>   File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF16LE);
>   f.writeLine("valid unicode text åäöü");
>   f.close;
> 
> Advantages:
> -supports BOM values
> -easy to use, right?

Well, I sure don't think so :P Why do I need a special class just to be 
able to output strings? Where is the BOM placed? Does every string 
include a BOM, or only the beginning of the file? How can I change that? 
If the writeLine is 2000 lines away from the stream declaration, how can 
I tell what it will do?

I'd certainly prefer

File f=new File("foo", FileMode.Out);
f.write("valid whatever".asUTF16LE);
f.close;

Less typing, too :)


> If you want high-performance streams, you can convert the strings in a 
> separate thread before you use them, right?

I don't know why you need a thread, but in any case, is that the easiest 
solution (to code) you can think of?


>>>> The methods should be compile-time resolvable when possible, so this 
>>>> would be both valid and evaluated in compile time:
>>>>
>>>> ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);
>>>
>>> Why? Converting a 14 character string doesn't take much time. 
>>
>> Why would it not evaluate at compile time? Do you see any benefit in 
>> that? And while it doesn't take much time once, it does take some, and 
>> more importantly, allocates new memory each time. If you're trying to 
>> do more than one request (as in thousands), I'm sure it adds up..
> 
> You only need to convert once.

Again, why would it not evaluate at compile time? Do you see any benefit 
in that?


>>>> Point IX. allows concatenation of strings in different encodings 
> 
> Why do you want to do that?

I don't, I want the whole world to use dchar[]s. But it doesn't, so 
using multiple encodings should be as easy as possible.


xs0
November 25, 2005
Re: YAUST v1.0
xs0 wrote:

>> I was just thinking that once D has complete wrappers for all 
>> necessary stuff, you don't need these anymore. Library (wrapper) 
>> writers should be patient enough to use explicit conversion rules.
> 
> 
> But why should one have to create wrappers in the first place? With my 
> proposal, you can directly link to many libraries and the compiler will 
> do the conversions for you.
In case you haven't noticed, most things in Java are made of wrappers. 
Even D uses wrappers because they're easier to work with. If you think 
that wrappers might be slow, the D spec allows the compiler to inline 
wrapper functions.

>>> No, who mentioned ASCII? On windows, cchar would be exactly the 
>>> legacy encoding each non-unicode app uses, and conversions between 
>>> app's internal UTF-x and cchar[] would transcode into that charset. 
>>> So, for example, a word processor on a non-unicode windows version 
>>> could still use unicode internally, while automatically talking to 
>>> the OS using all the characters its charset provides.
>>
>>
>> You said
>> "In the reverse direction, all unrepresentable characters become '?'."
>>
>> The thing is that D compiler doesn't know anything about your system 
>> character encoding. You can even change it on the fly, if your system 
>> is capable of doing that. Therefore this transcoding must use the 
>> greatest common divisor which is probably 7-bit ASCII.
> 
> 
> While the compiler may not, I'm sure it's possible to figure it out in 
> runtime. For example, many old apps use a different language based on 
> your settings, browsers send different Accept-Language, etc. So, it is 
> possible, I think.

You can't be serious. Of course browsers use several encodings, but 
they also let the users choose them. You cannot achieve such 
functionality with a statically chosen cchar type. If you're going to 
change the cchar type on the fly, characters 128-255 become corrupted 
sooner than you think. That's why I would use conversion libraries.

>> You don't need to convert inside your own code unless you're really 
>> creating a program that is supposed to convert stuff. I mean you need 
>> the transcoding only when interfacing with foreign code / i/o.
> 
> 
> If you don't need to convert, fine. If you do need to convert, I see no 
> point in it being as easy/convenient as possible.

But you don't need to convert inside your own code:

utf8[]  foo(utf16[] param) { return param.asUTF8;  }
dchar[] bar(utf8[]  param) { return param.asUTF32; }
utf16[] zoo(dchar[] param) { return param.asUTF16; }

void main() {
  utf16[] s = "something";
  writefln( zoo( bar( foo(s) ) ) );
}

Doesn't look very useful to me, at least :)
It's the same thing with implicit conversions. You don't need them in 
your 'own' code.

>> Yes, now you're right. The programmer should _always_ explicitely 
>> declare all conversions.
> Why?

Because it will remove all 'hidden' (string) conversions.

>>>> If we didn't have several types of strings, this all would be much 
>>>> easier.
>>> Agreed, but we do have several types of strings :)
>> I'm trying to say we don't need several types of strings :)
> Why? I think if it's done properly, there are benefits from having a 
> choice, while not complicating matters when one doesn't care.

Of course there's always a benefit, but it makes things more complex. 
Are you really saying that having 4 string types is easier than having 
just one? With only one type you don't need casting rules or so many 
cumbersome keywords, etc. You always have to make a tradeoff somewhere. 
I'm not suggesting my own proposal just because I'm stubborn or 
something; I just know that you _can_ write Unicode-aware programs with 
just one string type and it doesn't cost much (in runtime 
performance/memory footprint). If you don't believe me, please try to 
simulate these proposals using custom string classes (see the sketch 
below).
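
(Something along these lines - a rough sketch of what I mean by 
simulating it, stored internally as dchar[]; the UString name and its 
methods are made up for the example:)

import std.utf;

// one encoding-independent string type; everything else is an explicit view of it
struct UString
{
    private dchar[] data;

    static UString from(char[] s) { UString u; u.data = toUTF32(s); return u; }

    char[]  utf8()  { return toUTF8(data); }
    wchar[] utf16() { return toUTF16(data); }

    size_t length() { return data.length; }             // counts code points
    UString slice(size_t a, size_t b)
    {
        UString u; u.data = data[a .. b]; return u;      // slices on code-point boundaries
    }
}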

>>> Because it's actually the opposite - Unicode conversions are simple, 
>>> while slicing is hard (at least slicing on character boundaries). 
>>> Even in the simple example you give, I have no idea whether the first 
>>> Ä is one character or two, as both cases look the same.
>>
>>
>> It's not really that hard. One downside is that you have to parse 
>> through the string (unless compiler uses UTF-16/32 as an internal 
>> string type).
> 
> 
> It is "hard" - if you want to get the first character, as in the first 
> character that the user sees, it can actually be from 1 to x characters, 
> where x can be at least 5

Oh, I thought that a UTF-16 character is always encoded using 16 bits, 
UTF-32 using 32 bits and UTF-8 using 8-32 bits? Am I wrong?

> (that case is actually in the unicode 
> standard) and possibly more (and I don't mean code units, but characters).

Slicing and indexing with UTF-16/32 is straightforward: just multiply 
the index by 2/4. UTF-8 is only a bit harder - you need to iterate 
through the string, but it's not that hard, and it's usually much 
faster than a full O(n) scan.
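
(The iteration really is short - a sketch only, not what the compiler 
or Phobos actually does; codePointOffset is a made-up name, and it maps 
a code-point index to its byte offset in a UTF-8 array:)

// bytes of the form 10xxxxxx are continuation bytes, everything else starts a code point
size_t codePointOffset(char[] s, size_t n)
{
    size_t i = 0;
    while (n-- > 0 && i < s.length)
    {
        do {
            i++;
        } while (i < s.length && (s[i] & 0xC0) == 0x80);
    }
    return i;
}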

>> Slicing the string on the code unit level doesn't make any sense, now 
>> does it? Because char should be treated as a special type by the 
>> compiler, I see no other use for slicing than this. Like you said, the 
>> alternative slicing can be achieved by casting the string to void[] 
>> (for i/o data buffering, etc).
> 
> 
> Well, I sure don't have anything against making slicing strings slice on 
> character boundaries... Although that complicates matters - which length 
> should .length then return? It will surely bork all kinds of templates, 
> so perhaps it should be done with a different operator, like {a..b} 
> instead of [a..b], and length-in-characters should be .strlen.

Yes, that's true. My solution is a bit inconsistent, but it doesn't hurt 
anyone: it uses character boundaries inside the []-syntax (and .length 
could be the character version inside the brackets too), but the code 
unit version elsewhere. I think D should use an internal counter for 
the data type's length and provide an intelligent (data-type-specific) 
.length for the programmer. {a..b} doesn't look good to me.

>>> Well, your point is moot, because if there's no such function to 
>>> call, then there is no problem. But when there is such a function, 
>>> you would hope that the language/library does something sensible by 
>>> default, wouldn't you?
>>
>> No, this brilliant invention of yours causes problems even if we 
>> didn't have any 'legacy'-systems/APIs. You see, Library-writer 1 might 
>> use UTF-16 for his library because he uses Windows and thinks it's the 
>> fastest charset. Now Library-writer 2 has done his work using UTF-8 as 
>> an internal format. If you make a client program that links with these 
>> both, you (may) have to create unnecessary conversions just because 
>> one guy decided to create his own standards.
> 
> 
> Please don't get personal, as I and many others don't consider it polite.

Sorry, trying to calm down a bit ;) You know, this thing is important to 
me as I write most of my programs using Unicode I/O.

> 
> Anyhow, even if all D libraries use the same encoding, D is still 
> directly linkable to C libraries and it's obvious one doesn't have 
> control over what encoding they're using,

That's true.

> so I fail to see what is wrong 
> with supporting different ones, and I also fail to see how it will help 
> to decree one of them The One and ignore all others.

Surely you agree that all transcoding is bad for performance. 
Minimizing the need to transcode inside D code (by eliminating the 
unnecessary string types) maximizes performance, right?

> Well, writing something high-performance string-related in Java 
> definitely takes a lot of code, because the built-in String class is 
> often useless. I see no need to repeat that in D.

IMHO forcing regular programmers to use high-performance strings 
everywhere as the only option is bad. Not all strings need to be that 
fast. It would look pretty funny if you really needed to choose a 
proper encoding just to create a valid 'Hello world!' example.

>>   File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF8);
>>   f.writeLine("valid unicode text åäöü");
>>   f.close;
>>
>>   File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF16LE);
>>   f.writeLine("valid unicode text åäöü");
>>   f.close;
>>
>> Advantages:
>> -supports BOM values
>> -easy to use, right?
> 
> Well, I sure don't think so :P Why do I need a special class just to be 
> able to output strings?  Where is the BOM placed? Does every string 
> include a BOM or just the file at the beginning? How can I change that? 
> If the writeLine is 2000 lines away from the stream declaration, how can 
> I tell what it will do?
> 
> I'd certainly prefer
> 
> File f=new File("foo", FileMode.Out);
> f.write("valid whatever".asUTF16LE);
> f.close;
> 
> Less typing, too :)

Less typing? No, you're wrong. Your approach requires the programmer to 
remember the correct encoding every time (s)he writes to that file. In 
case you didn't know, valid UTF-x files use a BOM only at the beginning 
of the file. My UnicodeFile class knows this. Your solution writes the 
BOM every time you write a string (test it, if you don't believe me). 
In addition, changing the BOM in the middle of a valid UTF-x stream is 
illegal. If you want to create a data file that serializes 'objects', 
you can use regular files just like you did here.
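
(Roughly what I have in mind - a bare sketch only, hard-wired to 
UTF-16LE, assuming the same File/FileMode API as in your snippet plus 
the proposal's hypothetical .asUTF16LE(bool includeBOM) method:)

class UnicodeFile
{
    private File f;

    this(char[] name, FileMode mode)
    {
        f = new File(name, mode);
        f.write(cast(ubyte[]) x"FF FE");            // UTF-16LE BOM, written exactly once
    }

    void writeLine(wchar[] s)
    {
        f.write((s ~ "\r\n"w).asUTF16LE(false));    // no BOM on subsequent writes
    }

    void close() { f.close; }
}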

>> If you want high-performance streams, you can convert the strings in a 
>> separate thread before you use them, right?
> 
> 
> I don't know why you need a thread, but in any case, is that the easiest 
> solution (to code) you can think of?

No, not the easiest. AFAIK in real life a high-performance web server 
uses separate threads for data processing. If you're writing a 
single-threaded application, you can precompute the string in the 
_same_ thread.

>>>>> The methods should be compile-time resolvable when possible, so 
>>>>> this would be both valid and evaluated in compile time:
>>>>>
>>>>> ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);
>>>>
>>>>
>>>> Why? Converting a 14 character string doesn't take much time. 
>>>
>>>
>>> Why would it not evaluate at compile time? Do you see any benefit in 
>>> that? And while it doesn't take much time once, it does take some, 
>>> and more importantly, allocates new memory each time. If you're 
>>> trying to do more than one request (as in thousands), I'm sure it 
>>> adds up..
>>
>> You only need to convert once.
> 
> Again, why would it not evaluate at compile time? Do you see any benefit 
> in that?

I think I already said that you really don't know what the best 
encoding to use would be at compile time. You're saying (by having 
several types) that the programmer should decide this, but then 
building portable multiplatform programs isn't that simple. Your 
approach requires you to define several version {} blocks for different 
platforms, so it isn't that simple anymore. You need the version blocks 
because if you decided to use UTF-8, it would be fast on *nixes and 
slow on Windows, and if you used UTF-16, the opposite would happen.

>>>>> Point IX. allows concatenation of strings in different encodings 
>>
>> Why do you want to do that?
> 
> I don't, I want the whole world to use dchar[]s. But it doesn't, so 
> using multiple encodings should be as easy as possible.

But I'm saying here that we don't need several string types.

Jari-Matti

P.S. I won't be reading the NG for the next couple of days. I'll try to 
answer your (potential) future posts as soon as I get back.
November 25, 2005
Re: YAUST v1.0
xs0 wrote:
> 
> I'd certainly prefer
> 
> File f=new File("foo", FileMode.Out);
> f.write("valid whatever".asUTF16LE);
> f.close;
> 
> Less typing, too :)

I'd have hoped you'd prefer

File f = new File("foo", FileMode.Out.UTF16LE);
f.print("Just doit! Nike");
f.close;

Saves even more ink, in case you print more than once to the file, too.

And it's smarter overall, right?
November 25, 2005
Re: YAUST v1.0
On Fri, 25 Nov 2005 15:50:13 +0200, Jari-Matti Mäkelä wrote:


[snip]


> Oh, I thought that a UTF-16 character is always encoded using 16 bits, 
> UTF-32 using 32 bits and UTF-8 using 8-32 bits? Am I wrong?

Wrong, I'm afraid. Some characters use 32 bits in UTF16.

UTF8:  1, 2, 3, and 4 byte characters.
UTF16: 2 and 4 byte characters.
UTF32: 4 byte characters (only)

-- 
Derek Parnell
Melbourne, Australia
26/11/2005 8:37:13 AM
November 26, 2005
Re: YAUST v1.0
Derek Parnell wrote:
> On Fri, 25 Nov 2005 15:50:13 +0200, Jari-Matti Mäkelä wrote:
> 
> 
> [snip]
> 
> 
> 
>>Oh, I thought that a UTF-16 character is always encoded using 16 bits, 
>>UTF-32 using 32 bits and UTF-8 using 8-32 bits? Am I wrong?
> 
> 
> Wrong, I'm afraid. Some characters use 32 bits in UTF16.
> 
> UTF8:  1, 2, 3, and 4 byte characters.
> UTF16: 2 and 4 byte characters.
> UTF32: 4 byte characters (only)

Furthermore, a single visible character can be encoded using more than 
one Unicode character (for example, a C with a caron can be both a 
single character and two characters, C + combining caron). Since there's 
no limit to how many combining characters a single "normal" char can 
have, slicing on char boundaries is not solved merely by finding UTF 
boundaries, which was my initial point.
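
(In code-point terms, that's the difference between these two dchar[] 
literals, which render identically:)

dchar[] precomposed = "\u010C"d;    // U+010C LATIN CAPITAL LETTER C WITH CARON - one code point
dchar[] decomposed  = "C\u030C"d;   // 'C' followed by U+030C COMBINING CARON - two code points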


xs0
November 28, 2005
Re: YAUST v1.0
xs0 wrote:
>>> Oh, I thought that a UTF-16 character is always encoded using 16 bits, 
>>> UTF-32 using 32 bits and UTF-8 using 8-32 bits? Am I wrong?
>>
>> Wrong, I'm afraid. Some characters use 32 bits in UTF16.
>>
>> UTF8:  1, 2, 3, and 4 byte characters.
>> UTF16: 2 and 4 byte characters.
>> UTF32: 4 byte characters (only)
> 
> Furthermore, a single visible character can be encoded using more than 
> one Unicode character (for example, a C with a caron can be both a 
> single character and two characters, C + combining caron). Since there's 
> no limit to how many combining characters a single "normal" char can 
> have, slicing on char boundaries is not solved merely by finding UTF 
> boundaries, which was my initial point.

Thanks, I wasn't aware of this before.

It seems that I have underestimated the performance issues (web 
servers, etc.) of having only one Unicode text type. I have to admit 
the current types in D are a suitable compromise. They're not always 
the "easiest" way to do things, but they have no major weaknesses 
either.

I guess the only thing I was trying to say is that it really _is_ 
possible to write all programs with only a single encoding-independent 
Unicode type. But this approach has a few big downsides in some 
performance-critical applications and therefore shouldn't be the 
default behavior for a systems programming language like D. In a 
scripting language it would be a killer feature, though.

---

* IMO support for indexing & slicing on Unicode character boundaries is 
not strictly necessary at the language syntax level, but it would be 
nice to have this functionality somewhere (see the sketch at the end of 
this list). :) At least there's little use for [d,w]char slicing now.

* I wish Walter could fix this [1] bug: (I know why it produces 
compile-time errors, but I don't know why DMD allows you to do that)

[1] http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/30566

I wish it worked like this:

char foo = '\u0000';  // ok (C-string compatibility)
char foo = '\u0041';  // ok for any value in '\u0001' .. '\u007f'
char foo = '\u00e4';  // compile error for any value in '\u0080' .. '\uffff'

* A fully Unicode-aware stream system [2] would also be a nice feature 
(currently there's no convenient way to create valid UTF-encoded text 
files with a BOM).

[2] http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.bugs/5636

That would (perhaps) require Walter/us to reconsider the Phobos stream 
class hierarchy.
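
(On the first point above: iteration on code-point boundaries at least 
already works - a small sketch; the dumpCodePoints name is mine, and 
nothing here goes beyond foreach's built-in UTF decoding and std.stdio:)

import std.stdio;

void dumpCodePoints(char[] s)
{
    // foreach over a char[] with a dchar loop variable decodes UTF-8 on the fly;
    // the index is the byte offset where each code point starts
    foreach (int i, dchar c; s)
        writefln("byte %d: U+%04X", i, cast(uint) c);
}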