Thread overview
YAUST v1.0
Nov 24, 2005
xs0
Nov 24, 2005
xs0
Nov 25, 2005
xs0
Nov 25, 2005
Derek Parnell
Nov 26, 2005
xs0
Nov 25, 2005
Georg Wrede
November 24, 2005
YAUST - Yet Another Unified String Theory :)

Well, here's my proposal for cleaning up strings. I tried to

- be as practical as possible
- leave full control over encoding when one wants to have it
- remove any possible confusion as to what each type is
- allow efficiency where possible, without excessive effort

First, the proposed changes are listed, followed by rationale.

==============================

I. drop char and wchar

--

II.

create cchar (1-byte unsigned character of platform-specific encoding, C-equivalent)
create utf8  (1 byte of UTF8)
create utf16 (2 bytes of UTF16)
leave dchar as is

--

III.

version(Windows) {
	alias utf16[] string;
} else
version(Unix/Linux) {
	alias utf8[] string;
}

add suffix ""s for explicitly specifying platform-specific encoding (i.e. the string type), and make auto type inference default to that same type (this applies to the auto keyword, not undecorated strings). Add docs explaining that string is just a platform-dependant alias.

--

IV.

add the following implicit casts for interoperability

from: cchar[], utf8[], utf16[], dchar[]
to  : cchar*, utf8*, utf16*, dchar*

all of them ensure 0-termination. If cchar is converted to any other form, it becomes the appropriate Unicode char. In the reverse direction, all unrepresentable characters become '?'. When runtime transcoding and/or reallocation is required, make them produce a warning.
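
For illustration, here's roughly how that would read (cchar/utf8 and the implicit casts are of course only the proposed semantics, and the extern(C) declaration is just an example):

// hypothetical sketch of the proposed behavior, not current D
extern (C) int puts(cchar* s);   // C function expecting the platform charset

void show(utf8[] msg)
{
    puts(msg);     // implicit utf8[] -> cchar*: transcodes, uses '?' for unrepresentable
                   // characters, appends the '\0', and produces a warning (runtime work)
    puts("abc");   // literal: the '\0' (and any transcoding) can be done at compile time
}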

--

V.

add the following implicit (transcoding) casts

from: cchar[], utf8[], utf16[], dchar[]
to  : cchar[], utf8[], utf16[], dchar[]

When runtime transcoding is required, make them produce a warning (i.e. always, except when casting from T to T).
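
For example (again, the types and the warning behavior are the proposed ones, not current D):

utf16[] title  = "Hello";   // undecorated literal takes the declared type
utf8[]  narrow = title;     // implicit UTF-16 -> UTF-8 transcoding, compiler warns
dchar[] wide   = narrow;    // implicit UTF-8 -> UTF-32 transcoding, compiler warns
utf8[]  again  = narrow;    // T to T: plain assignment, no transcoding, no warning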

--

VI.

modify explicit casts between all the array and pointer types to
- transcode rather than paint
- use '?' for unrepresentable characters (applies to encoding into cchar*/cchar[] only)
- not produce the warnings from above

--

VII.

create compatibility kit:

module std.srccompatibility.oldchartypes;
// yes, it should be big and ugly

alias utf8 char;
alias utf16 wchar;

--

VIII.

add the following methods to all 4 array types

 utf8[] .asUTF8
utf16[] .asUTF16
dchar[] .asUTF32
cchar[] .asCchars

ubyte[] .asUTF8   (bool dummy) // I think there's no UTF-8 BOM
ubyte[] .asUTF16LE(bool includeBOM)
ubyte[] .asUTF16BE(bool includeBOM)
ubyte[] .asUTF32LE(bool includeBOM)
ubyte[] .asUTF32BE(bool includeBOM)
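
A small usage sketch (the .asUTF16LE method and the ""s suffix are of course only proposed; std.stream.File is today's Phobos):

import std.stream;

void dump()
{
    ubyte[] bytes = "Hello, world"s.asUTF16LE(true);   // UTF-16LE code units, BOM in front
    File f = new File("out.txt", FileMode.Out);
    f.write(bytes);    // raw byte output, no hidden transcoding
    f.close();
}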

--

IX.

modify the ~ operator between the 4 types to work as follows:

a) infer the result type from context, as with undecorated strings
b) if calling a function and there are multiple overloads
b.1) if both operand types are known, use that type
b.2) if one is known and the other is an undecorated literal, use the known type
b.3) if neither is known, or both are known but different, bork (see the sketch below)
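
A short sketch of the rules above (hypothetical, using the proposed types):

utf8[]  a = "abc";      // undecorated literals take the declared type
utf16[] b = "def";

void f(utf8[] s)  { }
void f(utf16[] s) { }

void test()
{
    f(a ~ a);        // b.1: both operands are utf8[] -> calls f(utf8[])
    f(a ~ "tail");   // b.2: one operand's type is known, the literal adapts -> f(utf8[])
    f(a ~ b);        // b.3: both known but different -> compile-time error ("bork")
}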

--

X.

Disallow utf8 and utf16 as a stand-alone var type, only arrays and pointers allowed

========================

Point I. removes the confusion of "char" and "wchar" not actually representing characters.

Point II. explicitly states that the strings are either UTF-encoded, complete characters* or C-compatible characters.

Point III. makes the code

string abc="abc";
someOSFunc(abc);
someOtherOSFunc("qwe"s); // s only neccessary if there is more than one option

least likely to produce any transcoding.

Point IV. makes it nearly impossible to do the wrong thing and doesn't require explicit casts when interfacing to C code, assuming the C functions are declared properly (i.e. the correct one of the two 1-byte types is declared). When used with literals, the 0 can be appended at compile time, like it is now.

Point V. makes it easier to use different types without explicit casting, but will still produce warnings when transcoding happens. In most cases it will be obvious anyway.

Point VI. breaks behavior of other array casts (which only paint), but strings are getting special behavior anyway, and you can still paint via void[], and even more importantly, if you need to paint between UTF8/UTF16/UTF32/cchar, either the source or destination type is wrong in the first place.
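
For example (proposed semantics):

utf16[] s = "data";
void[]  raw = cast(void[]) s;    // painting the code units is still possible via void[]
utf8[]  u   = cast(utf8[]) s;    // but this explicit cast now transcodes instead of painting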

Point VII. will make it somewhat easier to make the transition.

Point VIII. provides an alternative to casting and allows specifying endianness when writing to network and/or files. The methods should be compile-time resolvable when possible, so this would be both valid and evaluated at compile time:

ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);

Point IX. allows concatenation of strings in different encodings without significantly increasing the complexity of overloading rules, while also not requiring an inefficient toUTFxx followed by concatenation (which copies the result again).

Point X. prevents some invalid code:
- treating a UTF-8 code unit as a character
- treating a UTF-16 code unit as a character
- iterating over code units instead of characters

Note that it is still possible to iterate over the string using a cchar and dchar, which actually do represent characters. Also note that for I/O purposes, which are the only thing one should be doing with code units, you can still paint the string as void[] or byte[] (or even better, call one of the methods above), but then you give up the view that it is a string and lose language support/special treatment.
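
(If I'm not mistaken, character-wise iteration already works today via foreach, and with the new names it would look the same:)

import std.stdio;

void countCodePoints(char[] s)   // char[] today, utf8[] under this proposal
{
    int n = 0;
    foreach (dchar c; s)         // foreach decodes the UTF-8 into whole code points
        n++;
    writefln("%d code points", n);
}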



So, what do you guys/gals think? :)


xs0

* note that even dchar[] still doesn't necessarily contain complete characters, at least as seen by the user. For example, the letter LATIN_C_WITH_CARON can also be written as LATIN_C + COMBINING_CARON, and they are in fact equivalent as far as Unicode is concerned (afaik). Splitting the string in between will thus produce a "wrong" result, but I don't think D should include any kind of full Unicode processing, as it's actually needed quite rarely, so that problem is ignored...
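
For instance (plain current-D code):

dchar[] a = "\u010D";    // LATIN SMALL LETTER C WITH CARON as a single code point
dchar[] b = "c\u030C";   // 'c' followed by COMBINING CARON, two code points
// canonically equivalent as far as Unicode is concerned, yet a.length == 1 and b.length == 2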
November 24, 2005
xs0 wrote:
<snip>
> III.
> 
> version(Windows) {
>     alias utf16[] string;
> } else
> version(Unix/Linux) {
>     alias utf8[] string;
> }
> 
> add suffix ""s for explicitly specifying platform-specific encoding (i.e. the string type), and make auto type inference default to that same type (this applies to the auto keyword, not undecorated strings). Add docs explaining that string is just a platform-dependant alias.
> 
The idea (platform-independence) here is correct. :) The only thing is that you _don't_ need to know which UTF implementation the current compiler is using. If you are using Unicode to communicate with the user and/or native D libraries, you don't need to do any string conversions - they all use the same string representation, for god's sake.

> IV.
> 
> add the following implicit casts for interoperability
> 
> from: cchar[], utf8[], utf16[], dchar[]
> to  : cchar*, utf8*, utf16*, dchar*
> 
> all of them ensure 0-termination. If cchar is converted to any other form, it becomes the appropriate Unicode char. In the reverse direction, all unrepresentable characters become '?'. when runtime transcoding and/or reallocation is required, make them produce a warning.

You mean C/C++ -interoperability?

Replacing all non-ASCII characters with '?'s means that we don't actually want to support all the legacy systems out there. So it would be impossible to write Unicode-compliant portable programs that supported 'ä' on the Windows 9x/NT/XP command line without version() {} -logic?

> V.
> 
> add the following implicit (transcoding) casts
> 
> from: cchar[], utf8[], utf16[], dchar[]
> to  : cchar[], utf8[], utf16[], dchar[]
> 
> when runtime transcoding is required, make them produce a warning (i.e. always, except when casting from T to T).

Again, the main reason for Unicode is that you don't need to transcode between several representations all the time.

> VI.
> 
> modify explicit casts between all the array and pointer types to
> - transcode rather than paint
> - use '?' for unrepresentable characters (applies to encoding into cchar*/cchar[] only)
> - not produce the warnings from above
> 
> -- 
> 
> VII.
> 
> create compatibility kit:
> 
> module std.srccompatibility.oldchartypes;
> // yes, it should be big and ugly
> 
> alias utf8 char;
> alias utf16 wchar;
> 

You know, sweeping the problem under the carpet doesn't help us much. char/wchar won't get any better by calling them by a different name. Still, char won't be able to store more than the first 128 Unicode symbols.

> VIII.
> 
> add the following methods to all 4 array types
> 
>  utf8[] .asUTF8
> utf16[] .asUTF16
> dchar[] .asUTF32
> cchar[] .asCchars

Why, section V. already allows you to transcode these implicitly.

> ubyte[] .asUTF8   (bool dummy) // I think there's no UTF-8 BOM
> ubyte[] .asUTF16LE(bool includeBOM)
> ubyte[] .asUTF16BE(bool includeBOM)
> ubyte[] .asUTF32LE(bool includeBOM)
> ubyte[] .asUTF32BE(bool includeBOM)
> 

This looks pretty familiar. My own proposal does this on a library level for a reason. You see, conversions from Unicode to ISO-8859-x/KOI8-R/... should be allowed. It's easier to maintain the conversion table in a separate library. This also saves Walter from a lot of unnecessary work.

UTF-8 _does_ have a BOM.

> IX.
> 
> modify the ~ operator between the 4 types to work as follows:
> 
> a) infer the result type from context, as with undecorated strings
> b) if calling a function and there are multiple overloads
> b.1) if both operand types are known, use that type
> b.2) if one is known and the other is an undecorated literal, use the known type
> b.3) if neither is known or both are known, but different, bork
> 

If we didn't have several types of strings, this all would be much easier.

> X.
> 
> Disallow utf8 and utf16 as a stand-alone var type, only arrays and pointers allowed
> 

Yes, this is a 'working' solution. Although I would like to be able to slice strings and do things like:

char[] s = "Älyttömämmäksi voinee mennä?"
s[15..21] = "ei voi"
writefln(s) // outputs: Älyttömämmäksi ei voi mennä?

Of course you can do this all using library functions, but tell me one thing: why should I do simple string slicing using library calls and much more complex Unicode conversion using language structures.

> Point I. removes the confusion of "char" and "wchar" not actually representing characters.
> 
True.

> Point II. explicitly states that the strings are either UTF-encoded, complete characters* or C-compatible characters.
True.

> Point III. makes the code
> 
> string abc="abc";
> someOSFunc(abc);
> someOtherOSFunc("qwe"s); // s only neccessary if there is more than one option
> 
> least likely to produce any transcoding.

Of course you need to do transcoding, if the OS-function expects ISO-8859-x and your string has utf8/16.

> Point IV. makes it nearly impossible to do the wrong thing and doesn't require explicit casts when interfacing to C code, assuming the C functions are declared properly (i.e. the correct of the two 1-byte types is declared). When used with literals, the 0 can be appended compile-time, like it is now.
Why do you have to output Unicode strings using legacy non-Unicode C-APIs? AFAIK DUI / standard I/O and other libraries use standard Unicode, right? At least QT / GTK+ / Win32API / Linux console do support Unicode.

> Point V. makes it easier to use different types without explicit casting, but will still produce warnings when transcoding happens. In most cases it will be obvious anyway.
It would be easier with only a single Unicode-compliant string-type. Ask the Java guys.

> Point VI. breaks behavior of other array casts (which only paint), but strings are getting special behavior anyway, and you can still paint via void[], and even more importantly, if you need to paint between UTF8/UTF16/UTF32/cchar, either the source or destination type is wrong in the first place.
?

> Point VII. will make it somewhat easier to make the transition.
?

> Point VIII. provides an alternative to casting and allows specifying endianness when writing to network and/or files.
Partly true. Still, I think it would be much better if we had these as a std.stream.UnicodeStream class. Again, Java does this well.

> The methods should be compile-time resolvable when possible, so this would be both valid and evaluated in compile time:
> 
> ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);
Why? Converting a 14 character string doesn't take much time. Besides, if all our strings and i/o were utf-8, there wouldn't be any conversions, right?

> Point IX. allows concatenation of strings in different encodings without significantly increasing the complexity of overloading rules, while also not requiring an inefficient toUTFxx followed by concatenation (which copies the result again).
True, but as I previously said, I don't believe we need to do a great amount of conversions at the runtime level. All conversions should be near network/file-interfaces, thus using Stream-classes, right?

> Point X. prevents some invalid code:

> Note that it is still possible to iterate over the string using a cchar and dchar, which actually do represent characters. Also note that for I/O purposes, which are the only thing one should be doing with code units, you can still paint the string as void[] or byte[] (or even better, call one of the methods above), but then you give up the view that it is a string and lose language support/special treatment.
True.

> Splitting the string inbetween will thus produce a "wrong" result, but I don't think D should include any kind of full Unicode processing, as it's actually needed quite rarely, so that problem is ignored...

Sigh. Maybe you're not doing full Unicode processing every day. What about the Chinese? And what is full Unicode processing?
November 24, 2005
Before anything else: while I agree that a (really well-thought out) string class would probably be a good solution, the D spec would seem to suggest an array-based approach is preferred, and Walter isn't one to change his mind easily :)
Besides, any kind of string class has its share of problems (one size never fits all), and with the array-based approach it's easy to add pseudo-methods doing all kinds of funky things, while a language-defined class makes it impossible.


Jari-Matti Mäkelä wrote:
>> version(Windows) {
>>     alias utf16[] string;
>> } else
>> version(Unix/Linux) {
>>     alias utf8[] string;
>> }
>>
>> add suffix ""s for explicitly specifying platform-specific encoding (i.e. the string type), and make auto type inference default to that same type (this applies to the auto keyword, not undecorated strings). Add docs explaining that string is just a platform-dependant alias.
>>
> The idea (platform-independence) here is correct. :) The only thing is that you _don't_ need to know, which utf-implementation the current compiler is using. 

Well, sometimes you do and most times you don't (and it is often the case that at least some part of any app does need to know). I don't think it's wise to force anything down anyone's throat, so I tried to give options - you can use a specific UTF encoding, the native encoding for legacy OSes, or leave it to the compiler to choose the "best" one for you, where I believe best is what the underlying OS is using.


> If you are using Unicode to communicate with the user and/or native D libraries, you don't need to do any string conversions - they all use the same string representation, for god's sake.

Well, flexibility will definitely require some bloat in libraries, but for communicating with the user, you definitely need conversions, if you're not using the OS-native type (which, again, you do have the option of using by being explicit about it).


>> add the following implicit casts for interoperability
>>
>> from: cchar[], utf8[], utf16[], dchar[]
>> to  : cchar*, utf8*, utf16*, dchar*
>>
>> all of them ensure 0-termination. If cchar is converted to any other form, it becomes the appropriate Unicode char. In the reverse direction, all unrepresentable characters become '?'. when runtime transcoding and/or reallocation is required, make them produce a warning.
> 
> You mean C/C++ -interoperability?

Yup.

> Replacing all non-ASCII characters with '?'s means that we don't actually want to support all the legacy systems out there. So it would be impossible to write Unicode-compliant portable programs that supported 'ä' on the Windows 9x/NT/XP command line without version() {} -logic?

No, who mentioned ASCII? On Windows, cchar would be exactly the legacy encoding each non-Unicode app uses, and conversions between the app's internal UTF-x and cchar[] would transcode into that charset. So, for example, a word processor on a non-Unicode Windows version could still use Unicode internally, while automatically talking to the OS using all the characters its charset provides.
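
Conceptually, the conversion itself is one call to the (real) Win32 API - the wrapper below is only a sketch of what the runtime could do:

extern (Windows) int WideCharToMultiByte(uint codePage, uint flags,
    wchar* src, int srcLen, char* dst, int dstLen,
    char* defaultChar, int* usedDefaultChar);

char[] toLegacy(wchar[] s)   // UTF-16 -> whatever the current ANSI code page is
{
    char[] buf = new char[s.length * 2 + 1];   // enough even for double-byte code pages
    int n = WideCharToMultiByte(0 /* CP_ACP */, 0,
        cast(wchar*) s, s.length, cast(char*) buf, buf.length, "?", null);
    return buf[0 .. n];
}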


>> add the following implicit (transcoding) casts
>>
>> from: cchar[], utf8[], utf16[], dchar[]
>> to  : cchar[], utf8[], utf16[], dchar[]
>>
>> when runtime transcoding is required, make them produce a warning (i.e. always, except when casting from T to T).
> 
> Again, the main reason for Unicode is that you don't need to transcode between several representations all the time.

Again, sometimes you do and most times you don't. But anyhow, painting casts between UTF types make no sense, and I don't think explicit casts are necessary, as there can't be any loss (ok, except to cchar[]).


>> create compatibility kit:
>>
>> module std.srccompatibility.oldchartypes;
>> // yes, it should be big and ugly
>>
>> alias utf8 char;
>> alias utf16 wchar;
>>
> 
> You know, sweeping the problem under the carpet doesn't help us much. char/wchar won't get any better by calling them with a different name. Still char won't be able to store more than the first 127 Unicode symbols.

I'm not sure if you're referring to those aliases or not, but in YAUST, there is no single char (utf8) anymore, and I think there's quite a difference between "char[]" and "utf8[]", especially in the C-influenced world we live in :)


>> add the following methods to all 4 array types
>>
>>  utf8[] .asUTF8
>> utf16[] .asUTF16
>> dchar[] .asUTF32
>> cchar[] .asCchars
> 
> Why, section V. already allows you to transcode these implicitly.

Yup, but with warnings; using one of these shows that you've thought about what you're doing, so the compiler is free to shut up :)


>> ubyte[] .asUTF8   (bool dummy) // I think there's no UTF-8 BOM
>> ubyte[] .asUTF16LE(bool includeBOM)
>> ubyte[] .asUTF16BE(bool includeBOM)
>> ubyte[] .asUTF32LE(bool includeBOM)
>> ubyte[] .asUTF32BE(bool includeBOM)
>>
> 
> This looks pretty familiar. My own proposal does this on a library level for a reason. You see, conversions from Unicode to ISO-8859-x/KOI8-R/... should be allowed. 

Sure they should be allowed, but D is supposed to be Unicode, so a D app should generally only deal with that, and other charsets should generally only exist in byte[] buffers before input or after output.


> It's easier to maintain the conversion table in a separate library. This also saves Walter from a lot of unnecessary work.

Well, conversions between UTFs are done already, so the only thing remaining would be from/to cchar[], which shouldn't be too hard. Others definitely belong in some library, as they mostly won't be needed, I guess..


> UTF-8 _does_ have a BOM.

It does? What is it? I thought that single bytes have no Byte Order, so why would you need a Mark?


>> modify the ~ operator between the 4 types to work as follows:
>>
>> a) infer the result type from context, as with undecorated strings
>> b) if calling a function and there are multiple overloads
>> b.1) if both operand types are known, use that type
>> b.2) if one is known and the other is an undecorated literal, use the known type
>> b.3) if neither is known or both are known, but different, bork
> 
> If we didn't have several types of strings, this all would be much easier.

Agreed, but we do have several types of strings :)


>> Disallow utf8 and utf16 as a stand-alone var type, only arrays and pointers allowed
>>
> 
> Yes, this is a 'working' solution. Although I would like to be able to slice strings and do things like:
> 
> char[] s = "Älyttömämmäksi voinee mennä?"
> s[15..21] = "ei voi"
> writefln(s) // outputs: Älyttömämmäksi ei voi mennä?
> 
> Of course you can do this all using library functions, but tell me one thing: why should I do simple string slicing using library calls and much more complex Unicode conversion using language structures.

Because it's actually the opposite - Unicode conversions are simple, while slicing is hard (at least slicing on character boundaries). Even in the simple example you give, I have no idea whether the first Ä is one character or two, as both cases look the same.


>> Point III. makes the code
>>
>> string abc="abc";
>> someOSFunc(abc);
>> someOtherOSFunc("qwe"s); // s only neccessary if there is more than one option
>>
>> least likely to produce any transcoding.
> 
> Of course you need to do transcoding, if the OS-function expects ISO-8859-x and your string has utf8/16.

True, I just said "least likely". But at least you can use the same (non-transcoding) code for both UTF-8 OSes and UTF-16 OSes.


>> Point IV. makes it nearly impossible to do the wrong thing and doesn't require explicit casts when interfacing to C code, assuming the C functions are declared properly (i.e. the correct of the two 1-byte types is declared). When used with literals, the 0 can be appended compile-time, like it is now.
> 
> Why do you have to output Unicode strings using legacy non-Unicode C-APIs? AFAIK DUI / standard I/O and other libraries use standard Unicode, right? At least QT / GTK+ / Win32API / Linux console do support Unicode.

Well, your point is moot, because if there's no such function to call, then there is no problem. But when there is such a function, you would hope that the language/library does something sensible by default, wouldn't you?


>> Point V. makes it easier to use different types without explicit casting, but will still produce warnings when transcoding happens. In most cases it will be obvious anyway.
> 
> It would be easier with only a single Unicode-compliant string-type. Ask the Java guys.

Well, I am one of the Java guys, and java.lang.String leaves a lot to be desired. Because it's language-defined the way it is, it's
1) immutable, which sucks if it's forced down your throat 100% of the time
2) UTF-16 forever and ever, which sucks if you either want it to take less memory or don't want to worry about surrogates; just look at all the crappy functions they had to add in Java 5 to support the entire Unicode charset :)


>> Point VI. breaks behavior of other array casts (which only paint), but strings are getting special behavior anyway, and you can still paint via void[], and even more importantly, if you need to paint between UTF8/UTF16/UTF32/cchar, either the source or destination type is wrong in the first place.
> 
> ?

Well, a sequence of bytes can be either cchar[], UTF-8, UTF-16 or UTF-32, but not more than one at the same time (OK, unless it's ASCII only, which fits both the first two). So, for example, if you cast utf8[] to utf16[], either the data is UTF-8 and you don't get a UTF-16 string (but some mumbo jumbo), or it's UTF-16 and was never valid UTF-8 in the first place.
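
To illustrate with today's types (this is plain current D):

import std.utf;

void demo()
{
    char[]  a = "héllo";             // 6 UTF-8 code units
    wchar[] b = cast(wchar[]) a;     // today this paints: three garbage "UTF-16" units
    wchar[] c = toUTF16(a);          // transcoding is what you nearly always want instead
}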


>> Point VII. will make it somewhat easier to make the transition.
> 
> ?

?

>> Point VIII. provides an alternative to casting and allows specifying endianness when writing to network and/or files.
> 
> Partly true. Still, I think it would be much better if we had these as a std.stream.UnicodeStream class. Again, Java does this well.

Why should you be forced to use a stream for something so simple? What if you want to use two encodings on the same stream (it's not even so far-fetched - the first line of an HTTP request can only contain UTF-8, but you may want to send POST contents in UTF-16, for example)? Etc. etc.
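
For instance, with the proposed methods from point VIII such a mixed request stays a couple of lines (everything here is hypothetical except the HTTP text):

void send()
{
    utf16[] content = "päyload text";   // example data
    ubyte[] request =
          "POST /submit HTTP/1.0\r\nContent-Type: text/plain; charset=UTF-16LE\r\n\r\n".asUTF8(false)
        ~ content.asUTF16LE(false);
    // then hand 'request' to the socket as raw bytes
}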


>> The methods should be compile-time resolvable when possible, so this would be both valid and evaluated in compile time:
>>
>> ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);
> 
> Why? Converting a 14 character string doesn't take much time. 

Why would it not evaluate at compile time? Do you see any benefit in that? And while it doesn't take much time once, it does take some, and more importantly, allocates new memory each time. If you're trying to do more than one request (as in thousands), I'm sure it adds up..

> Besides, if all our strings and i/o were utf-8, there wouldn't be any conversions, right?

Except every time you'd call a Win32 function, which is what's on most computers?


>> Point IX. allows concatenation of strings in different encodings without significantly increasing the complexity of overloading rules, while also not requiring an inefficient toUTFxx followed by concatenation (which copies the result again).
> 
> True, but as I previously said, I don't believe we need to do great amount of conversions in the runtime-level. All conversions should be near network/file-interfaces, thus using Stream-classes, right?

I agree decent stream classes can solve many problems, but not all of them.


>> Splitting the string inbetween will thus produce a "wrong" result, but I don't think D should include any kind of full Unicode processing, as it's actually needed quite rarely, so that problem is ignored...
> 
> Sigh. Maybe you're not doing full Unicode processing every day. What about the Chinese? And what is full Unicode processing?

Unicode is much more than a really large character set. There's UTFs, collation, bidirectionality, combining characters, locales, etc. etc., see
http://www.unicode.org/reports/index.html

So, if you want to create a decent text editor according to Unicode specs, you'll have to implement "full Unicode processing", but the large majority of other apps just need to be able to interface with the OS and libraries to get and display the text, usually without even caring what's inside, so I see no point in including all that in D, not even as a standard library (or perhaps only after many other things are implemented first).


xs0
November 24, 2005
xs0 wrote:
> Before anything else: while I agree that a (really well-thought out) string class would probably be a good solution, the D spec would seem to suggest an array-based approach is preferred, and Walter isn't one to change his mind easily :)

I believe we can achieve quite a lot with just a simple array-like syntax.

> Besides, any kind of string class has its share of problems (one size never fits all), and with the array-based approach it's easy to add pseudo-methods doing all kinds of funky things, while a language-defined class makes it impossible.

Although D is able to support some hard coded properties too.

>> The idea (platform-independence) here is correct. :) The only thing is that you _don't_ need to know, which utf-implementation the current compiler is using. 
> 
> Well, sometimes you do and most times you don't (and it is often the case that at least some part of any app does need to know). I don't think it's wise to force anything down anyone's throat, so I tried to give options - you can use a specific UTF encoding, the native encoding for legacy OSes, or leave it to the compiler to choose the "best" one for you, where I believe best is what the underlying OS is using.

I'd give my vote for the "let compiler choose" option.

>> If you are using Unicode to communicate with the user and/or native D libraries, you don't need to do any string conversions - they all use the same string representation, for god's sake.
> 
> Well, flexibility will definitely require some bloat in libraries, but for communicating with the user, you definitely need conversions, if you're not using the OS-native type (which, again, you do have the option of using with being explicit about it).

But if you let the compiler vendor decide the encoding, there's a high probability that you don't need any explicit transcoding.

>>> add the following implicit casts for interoperability
>>>
>>> from: cchar[], utf8[], utf16[], dchar[]
>>> to  : cchar*, utf8*, utf16*, dchar*
>>>
>>> all of them ensure 0-termination. If cchar is converted to any other form, it becomes the appropriate Unicode char. In the reverse direction, all unrepresentable characters become '?'. when runtime transcoding and/or reallocation is required, make them produce a warning.
>>
>> You mean C/C++ -interoperability?
> Yup.

I was just thinking that once D has complete wrappers for all necessary stuff, you don't need these anymore. Library (wrapper) writers should be patient enough to use explicit conversion rules.

>> Replacing all non-ASCII characters with '?'s means that we don't actually want to support all the legacy systems out there. So it would be impossible to write Unicode-compliant portable programs that supported 'ä' on the Windows 9x/NT/XP command line without version() {} -logic?
> 
> 
> No, who mentioned ASCII? On windows, cchar would be exactly the legacy encoding each non-unicode app uses, and conversions between app's internal UTF-x and cchar[] would transcode into that charset. So, for example, a word processor on a non-unicode windows version could still use unicode internally, while automatically talking to the OS using all the characters its charset provides.
> 

You said
"In the reverse direction, all unrepresentable characters become '?'."

The thing is that the D compiler doesn't know anything about your system character encoding. You can even change it on the fly, if your system is capable of doing that. Therefore this transcoding must use the lowest common denominator, which is probably 7-bit ASCII.

>>> add the following implicit (transcoding) casts
>>>
>>> from: cchar[], utf8[], utf16[], dchar[]
>>> to  : cchar[], utf8[], utf16[], dchar[]
>>>
>>> when runtime transcoding is required, make them produce a warning (i.e. always, except when casting from T to T).
>>
>> Again, the main reason for Unicode is that you don't need to transcode between several representations all the time.
> 
> Again, sometimes you do and most times you don't. But anyhow, painting casts between UTF types make no sense, and I don't think explicit casts are necessary, as there can't be any loss (ok, except to cchar[]).

You don't need to convert inside your own code unless you're really creating a program that is supposed to convert stuff. I mean you need the transcoding only when interfacing with foreign code / i/o.

>>> add the following methods to all 4 array types
>>>
>>>  utf8[] .asUTF8
>>> utf16[] .asUTF16
>>> dchar[] .asUTF32
>>> cchar[] .asCchars
>>
>> Why, section V. already allows you to transcode these implicitly.
> 
> Yup, but with warnings; using one of these shows that you've thought about what you're doing, so the compiler is free to shut up :)

Yes, now you're right. The programmer should _always_ explicitly declare all conversions.

>>> ubyte[] .asUTF8   (bool dummy) // I think there's no UTF-8 BOM
>>> ubyte[] .asUTF16LE(bool includeBOM)
>>> ubyte[] .asUTF16BE(bool includeBOM)
>>> ubyte[] .asUTF32LE(bool includeBOM)
>>> ubyte[] .asUTF32BE(bool includeBOM)
>>>
>> This looks pretty familiar. My own proposal does this on a library level for a reason. You see, conversions from Unicode to ISO-8859-x/KOI8-R/... should be allowed. 
> 
> Sure they should be allowed, but D is supposed to be Unicode, so a D app should generally only deal with that, and other charsets should generally only exist in byte[] buffers before input or after output.

Then tell me, how do I fill these buffers with your new functions? I would definitely want to explicitly define the character encoding. IMHO this is much better done using static classes (std.utf.e[n/de]code) than variable properties.

>> It's easier to maintain the conversion table in a separate library. This also saves Walter from a lot of unnecessary work.
> 
> Well, conversions between UTFs are done already, so the only thing remaining would be from/to cchar[], which shouldn't be too hard.

Yes, conversions between UTFs are, but conversions between legacy charsets and UTFs are not! They aren't that hard, but as you might know, there are maybe hundreds of possible encoding types.

> Others
> definitely belong in some library, as they mostly won't be needed, I
> guess..

This isn't a very consistent approach. Some functions belong in some library, others should be implemented in the language...wtf?

>> UTF-8 _does_ have a BOM.
> 
> It does? What is it? I thought that single bytes have no Byte Order, so why would you need a Mark?

0xEF 0xBB 0xBF

http://www.unicode.org/faq/utf_bom.html#25

See also

http://www.unicode.org/faq/utf_bom.html#29

>> If we didn't have several types of strings, this all would be much easier.
> 
> Agreed, but we do have several types of strings :)

I'm trying to say we don't need several types of strings :)

>>> Disallow utf8 and utf16 as a stand-alone var type, only arrays and pointers allowed
>>>
>>
>> Yes, this is a 'working' solution. Although I would like to be able to slice strings and do things like:
>>
>> char[] s = "Älyttömämmäksi voinee mennä?"
>> s[15..21] = "ei voi"
>> writefln(s) // outputs: Älyttömämmäksi ei voi mennä?
>>
>> Of course you can do this all using library functions, but tell me one thing: why should I do simple string slicing using library calls and much more complex Unicode conversion using language structures.
> 
> 
> Because it's actually the opposite - Unicode conversions are simple, while slicing is hard (at least slicing on character boundaries). Even in the simple example you give, I have no idea whether the first Ä is one character or two, as both cases look the same.

It's not really that hard. One downside is that you have to parse through the string (unless the compiler uses UTF-16/32 as an internal string type).

Slicing the string on the code unit level doesn't make any sense, now does it? Because char should be treated as a special type by the compiler, I see no other use for slicing than this. Like you said, the alternative slicing can be achieved by casting the string to void[] (for i/o data buffering, etc).

>>> Point III. makes the code
>>>
>>> string abc="abc";
>>> someOSFunc(abc);
>>> someOtherOSFunc("qwe"s); // s only neccessary if there is more than one option
>>>
>>> least likely to produce any transcoding.
>>
>>
>> Of course you need to do transcoding, if the OS-function expects ISO-8859-x and your string has utf8/16.
> 
> 
> True, I just said "least likely". But at least you can use the same (non-transcoding) code for both UTF-8 OSes and UTF-16 OSes.

Again, neither the compiler nor the compiled binary knows anything about the OS's standard encoding. Even some Linux systems still use ISO-8859-x. If you're running Windows programs through VMware or Wine on Linux, you can't tell whether it's always faster to use UTF-16 instead of UTF-8.

>>> Point IV. makes it nearly impossible to do the wrong thing and doesn't require explicit casts when interfacing to C code, assuming the C functions are declared properly (i.e. the correct of the two 1-byte types is declared). When used with literals, the 0 can be appended compile-time, like it is now.
>>
>>
>> Why do you have to output Unicode strings using legacy non-Unicode C-APIs? AFAIK DUI / standard I/O and other libraries use standard Unicode, right? At least QT / GTK+ / Win32API / Linux console do support Unicode.
> 
> 
> Well, your point is moot, because if there's no such function to call, then there is no problem. But when there is such a function, you would hope that the language/library does something sensible by default, wouldn't you?

No, this brilliant invention of yours causes problems even if we didn't have any 'legacy'-systems/APIs. You see, Library-writer 1 might use UTF-16 for his library because he uses Windows and thinks it's the fastest charset. Now Library-writer 2 has done his work using UTF-8 as an internal format. If you make a client program that links with these both, you (may) have to create unnecessary conversions just because one guy decided to create his own standards.

>>> Point V. makes it easier to use different types without explicit casting, but will still produce warnings when transcoding happens. In most cases it will be obvious anyway.
>>
>>
>> It would be easier with only a single Unicode-compliant string-type. Ask the Java guys.
> 
> 
> Well, I am one of the Java guys, and java.lang.String leaves a lot to be desired. Because it's language defined in the way it is, it's
> 1) immutable, which sucks if it's forced down your throat 100% of time

I agree.

> 2) UTF-16 for ever and ever, which sucks if you want it to either take less memory or don't want to worry about surrogates; just look at all the crappy functions they had to add in Java 5 to support the entire Unicode charset :)

Partly true. What I meant was that most Java programmers use only one kind of string class (because they don't have/need other types).

>>> Point VI. breaks behavior of other array casts (which only paint), but strings are getting special behavior anyway, and you can still paint via void[], and even more importantly, if you need to paint between UTF8/UTF16/UTF32/cchar, either the source or destination type is wrong in the first place.
>>
>> ?
> 
> Well, a sequence of bytes can be either cchar[], UTF-8, UTF-16 or UTF-32, but not more than one at the same time (OK, unless it's ASCII only, which fits both the first two). So, for example, if you cast utf8[] to utf16[], either the data is UTF-8 and you don't get a UTF-16 string (but some mumbo jumbo), or it's UTF-16 and was never valid UTF-8 in the first place.

Ok. But I thought you said utf8[] is implicitly converted to utf16[]. Then it's always valid whatever-type-it-is.

>>> Point VII. will make it somewhat easier to make the transition.

How? I don't believe it will.

>>> Point VIII. provides an alternative to casting and allows specifying endianness when writing to network and/or files.
>>
>>
>> Partly true. Still, I think it would be much better if we had these as a std.stream.UnicodeStream class. Again, Java does this well.
> 
> 
> Why should you be forced to use a stream for something so simple?

So simple? Ahem, std.stream.File _is_ a stream. Here's my version:

  File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF8);
  f.writeLine("valid unicode text åäöü");
  f.close;

  File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF16LE);
  f.writeLine("valid unicode text åäöü");
  f.close;

Advantages:
-supports BOM values
-easy to use, right?

> What
> if you want to use two encodings on the same stream (it's not even so far fetched - the first line in a HTTP request can only contain UTF-8, but you may want to send POST contents in UTF-16, for example). Etc. etc.

Simple, just implement a method for changing the stream type:

Stream s = new UnicodeSocketStream(socket, mode, encoding);

s.changeEncoding(encoding2);

If you want high-performance streams, you can convert the strings in a separate thread before you use them, right?

>>> The methods should be compile-time resolvable when possible, so this would be both valid and evaluated in compile time:
>>>
>>> ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);
>>
>>
>> Why? Converting a 14 character string doesn't take much time. 
> 
> 
> Why would it not evaluate at compile time? Do you see any benefit in that? And while it doesn't take much time once, it does take some, and more importantly, allocates new memory each time. If you're trying to do more than one request (as in thousands), I'm sure it adds up..

You only need to convert once.

>> Besides, if all our strings and i/o were utf-8, there wouldn't be any conversions, right?
> 
> Except every time you'd call a Win32 function, which is what's on most computers?

My mistake, let's forget the utf-8 for a while. Actually I meant that if all strings were in the native OS format (let the compiler decide), there would be no need to convert.

>>> Point IX. allows concatenation of strings in different encodings 

Why do you want to do that?

>>> without significantly increasing the complexity of overloading rules, while also not requiring an inefficient toUTFxx followed by concatenation (which copies the result again).
>>
>>
>> True, but as I previously said, I don't believe we need to do great amount of conversions in the runtime-level. All conversions should be near network/file-interfaces, thus using Stream-classes, right?
> 
> 
> I agree decent stream classes can solve many problems, but not all of them.

"Many, but not all of them." That's why we should have std.utf.encode/decode-functions.

>>> Splitting the string inbetween will thus produce a "wrong" result, but I don't think D should include any kind of full Unicode processing, as it's actually needed quite rarely, so that problem is ignored...
> 
> So, if you want to create a decent text editor according to Unicode specs, you'll have to implement "full Unicode processing", but a large majority of other apps just needs to be able to interface to OS and libraries to get and display the text, usually without even caring what's inside, so I see no point to include all that in D, not even as a standard library (or perhaps after many other things are implemented first)

Ok, now I see your point. I thought you didn't want full Unicode processing even as an add-on library. I agree, you don't need these 'advanced' algorithms in the core language; they belong in a separate library instead. Time will tell, maybe someday when we haven't got anything else to do, Phobos will finally include some cool Unicode tricks.


Jari-Matti
November 25, 2005

>> Well, flexibility will definitely require some bloat in libraries, but for communicating with the user, you definitely need conversions, if you're not using the OS-native type (which, again, you do have the option of using with being explicit about it).
> 
> But if you let the compiler vendor decide the encoding, there's a high probability that you don't need any explicit transcoding.

Sure you may need transcoding - you may use 15 different libraries, each expecting its own thing. The one thing that can be done is to not require transcoding at least when talking to the OS, which all apps have to do at some point. But even then, you should have the option to choose otherwise - if you have a UTF-8 library that you use in 99% of string-related calls, it's still faster to use UTF-8 internally and transcode when talking to the OS.


>>> You mean C/C++ -interoperability?
>>
>> Yup.
> 
> I was just thinking that once D has complete wrappers for all necessary stuff, you don't need these anymore. Library (wrapper) writers should be patient enough to use explicit conversion rules.

But why should one have to create wrappers in the first place? With my proposal, you can directly link to many libraries and the compiler will do the conversions for you.


>> No, who mentioned ASCII? On windows, cchar would be exactly the legacy encoding each non-unicode app uses, and conversions between app's internal UTF-x and cchar[] would transcode into that charset. So, for example, a word processor on a non-unicode windows version could still use unicode internally, while automatically talking to the OS using all the characters its charset provides.
> 
> You said
> "In the reverse direction, all unrepresentable characters become '?'."
> 
> The thing is that D compiler doesn't know anything about your system character encoding. You can even change it on the fly, if your system is capable of doing that. Therefore this transcoding must use the greatest common divisor which is probably 7-bit ASCII.

While the compiler may not, I'm sure it's possible to figure it out at runtime. For example, many old apps use a different language based on your settings, browsers send different Accept-Language headers, etc. So, it is possible, I think.
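
For example, on Windows the current code page really is just one (real) API call away; the wrapper below is only a sketch:

version (Windows)
{
    extern (Windows) uint GetACP();   // real Win32: returns the active ANSI code page
}

uint systemCodepage()
{
    version (Windows)
        return GetACP();              // e.g. 1252 on Western-European Windows
    else
        return 0;                     // on Unix one would look at LANG/LC_CTYPE or nl_langinfo(CODESET)
}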


>> Again, sometimes you do and most times you don't. But anyhow, painting casts between UTF types make no sense, and I don't think explicit casts are necessary, as there can't be any loss (ok, except to cchar[]).
> 
> You don't need to convert inside your own code unless you're really creating a program that is supposed to convert stuff. I mean you need the transcoding only when interfacing with foreign code / i/o.

If you don't need to convert, fine. If you do need to convert, I see no reason it shouldn't be as easy/convenient as possible.

>> Yup, but with warnings; using one of these shows that you've thought about what you're doing, so the compiler is free to shut up :)
> 
> Yes, now you're right. The programmer should _always_ explicitly declare all conversions.

Why?


>>>> ubyte[] .asUTF8   (bool dummy) // I think there's no UTF-8 BOM
>>>> ubyte[] .asUTF16LE(bool includeBOM)
>>>> ubyte[] .asUTF16BE(bool includeBOM)
>>>> ubyte[] .asUTF32LE(bool includeBOM)
>>>> ubyte[] .asUTF32BE(bool includeBOM)
>>>>
>>> This looks pretty familiar. My own proposal does this on a library level for a reason. You see, conversions from Unicode to ISO-8859-x/KOI8-R/... should be allowed. 
>>
>>
>> Sure they should be allowed, but D is supposed to be Unicode, so a D app should generally only deal with that, and other charsets should generally only exist in byte[] buffers before input or after output.
> 
> Then tell me, how do I fill these buffers with your new functions? 

You don't. Only UTFs and one OS-native encoding are supported in the language, the latter for obvious convenience. Others have to be done with a library. Note that the compiler is free to use the same library, it's not like anything would have to be done twice.


>>> UTF-8 _does_ have a BOM.
>>
>> It does? What is it? I thought that single bytes have no Byte Order, so why would you need a Mark?
> 
> 0xEF 0xBB 0xBF

OK, then it's not a dummy parameter :)


>>> If we didn't have several types of strings, this all would be much easier.
>>
>> Agreed, but we do have several types of strings :)
> 
> I'm trying to say we don't need several types of strings :)

Why? I think if it's done properly, there are benefits from having a choice, while not complicating matters when one doesn't care.


>> Because it's actually the opposite - Unicode conversions are simple, while slicing is hard (at least slicing on character boundaries). Even in the simple example you give, I have no idea whether the first Ä is one character or two, as both cases look the same.
> 
> It's not really that hard. One downside is that you have to parse through the string (unless compiler uses UTF-16/32 as an internal string type).

It is "hard" - if you want to get the first character, as in the first character that the user sees, it can actually be from 1 to x characters, where x can be at least 5 (that case is actually in the unicode standard) and possibly more (and I don't mean code units, but characters).


> Slicing the string on the code unit level doesn't make any sense, now does it? Because char should be treated as a special type by the compiler, I see no other use for slicing than this. Like you said, the alternative slicing can be achieved by casting the string to void[] (for i/o data buffering, etc).

Well, I sure don't have anything against making slicing strings slice on character boundaries... Although that complicates matters - which length should .length then return? It will surely bork all kinds of templates, so perhaps it should be done with a different operator, like {a..b} instead of [a..b], and length-in-characters should be .strlen.
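
(For what it's worth, code-point slicing is already doable as a pseudo-method with std.utf, if I remember the function names right:)

import std.utf;

// slice on code-point boundaries; toUTFindex maps a code-point index to a code-unit index
char[] charSlice(char[] s, size_t from, size_t to)
{
    return s[toUTFindex(s, from) .. toUTFindex(s, to)];
}

// length in code points - still not user-perceived characters, see the combining-caron caveat
size_t strlen(char[] s)
{
    return toUTF32(s).length;
}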


>>> Why do you have to output Unicode strings using legacy non-Unicode C-APIs? AFAIK DUI / standard I/O and other libraries use standard Unicode, right? At least QT / GTK+ / Win32API / Linux console do support Unicode.
>>
>> Well, your point is moot, because if there's no such function to call, then there is no problem. But when there is such a function, you would hope that the language/library does something sensible by default, wouldn't you?
> 
> No, this brilliant invention of yours causes problems even if we didn't have any 'legacy'-systems/APIs. You see, Library-writer 1 might use UTF-16 for his library because he uses Windows and thinks it's the fastest charset. Now Library-writer 2 has done his work using UTF-8 as an internal format. If you make a client program that links with these both, you (may) have to create unnecessary conversions just because one guy decided to create his own standards.

Please don't get personal, as I and many others don't consider it polite.

Anyhow, even if all D libraries use the same encoding, D is still directly linkable to C libraries and it's obvious one doesn't have control over what encoding they're using, so I fail to see what is wrong with supporting different ones, and I also fail to see how it will help to decree one of them The One and ignore all others.


>> 2) UTF-16 for ever and ever, which sucks if you want it to either take less memory or don't want to worry about surrogates; just look at all the crappy functions they had to add in Java 5 to support the entire Unicode charset :)
> 
> Partly true. What I meant was that most Java programmers use only one kind of string class (because they don't have/need other types).

Well, writing something high-performance string-related in Java definitely takes a lot of code, because the built-in String class is often useless. I see no need to repeat that in D.


>> Well, a sequence of bytes can be either cchar[], UTF-8, UTF-16 or UTF-32, but not more than one at the same time (OK, unless it's ASCII only, which fits both the first two). So, for example, if you cast utf8[] to utf16[], either the data is UTF-8 and you don't get a UTF-16 string (but some mumbo jumbo), or it's UTF-16 and was never valid UTF-8 in the first place.
> 
> Ok. But I thought you said utf8[] is implicitely converted to utf16[]. Then it's always valid whatever-type-it-is.

Yes I did, and that has nothing to do with the above paragraph, as it's referring to the current situation, where casts between char types actually don't transcode.


>> Why should you be forced to use a stream for something so simple?
> 
> So simple? Ahem, std.stream.File _is_ a stream. Here's my version:
> 
>   File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF8);
>   f.writeLine("valid unicode text åäöü");
>   f.close;
> 
>   File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF16LE);
>   f.writeLine("valid unicode text åäöü");
>   f.close;
> 
> Advantages:
> -supports BOM values
> -easy to use, right?

Well, I sure don't think so :P Why do I need a special class just to be able to output strings? Where is the BOM placed? Does every string include a BOM, or just the beginning of the file? How can I change that? If the writeLine is 2000 lines away from the stream declaration, how can I tell what it will do?

I'd certainly prefer

File f=new File("foo", FileMode.Out);
f.write("valid whatever".asUTF16LE);
f.close;

Less typing, too :)


> If you want high-performance streams, you can convert the strings in a separate thread before you use them, right?

I don't know why you need a thread, but in any case, is that the easiest solution (to code) you can think of?


>>>> The methods should be compile-time resolvable when possible, so this would be both valid and evaluated in compile time:
>>>>
>>>> ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);
>>>
>>> Why? Converting a 14 character string doesn't take much time. 
>>
>> Why would it not evaluate at compile time? Do you see any benefit in that? And while it doesn't take much time once, it does take some, and more importantly, allocates new memory each time. If you're trying to do more than one request (as in thousands), I'm sure it adds up..
> 
> You only need to convert once.

Again, why would it not evaluate at compile time? Do you see any benefit in that?


>>>> Point IX. allows concatenation of strings in different encodings 
> 
> Why do you want to do that?

I don't, I want the whole world to use dchar[]s. But it doesn't, so using multiple encodings should be as easy as possible.


xs0
November 25, 2005
xs0 wrote:

>> I was just thinking that once D has complete wrappers for all necessary stuff, you don't need these anymore. Library (wrapper) writers should be patient enough to use explicit conversion rules.
> 
> 
> But why should one have to create wrappers in the first place? With my proposal, you can directly link to many libraries and the compiler will do the conversions for you.
In case you haven't noticed, most things in Java are made of wrappers. Even D uses wrappers because they're easier to work with. If you think that a wrapper might be slow, the D spec allows the compiler to inline wrapper functions.

>>> No, who mentioned ASCII? On windows, cchar would be exactly the legacy encoding each non-unicode app uses, and conversions between app's internal UTF-x and cchar[] would transcode into that charset. So, for example, a word processor on a non-unicode windows version could still use unicode internally, while automatically talking to the OS using all the characters its charset provides.
>>
>>
>> You said
>> "In the reverse direction, all unrepresentable characters become '?'."
>>
>> The thing is that D compiler doesn't know anything about your system character encoding. You can even change it on the fly, if your system is capable of doing that. Therefore this transcoding must use the greatest common divisor which is probably 7-bit ASCII.
> 
> 
> While the compiler may not, I'm sure it's possible to figure it out in runtime. For example, many old apps use a different language based on your settings, browsers send different Accept-Language, etc. So, it is possible, I think.

You can't be serious. Of course browsers use several encodings, but they also let the users choose them. You cannot achieve such functionality with a statically chosen cchar type. If you're going to change the cchar type on the fly, characters 128-255 will become corrupted sooner than you think. That's why I would use conversion libraries.

>> You don't need to convert inside your own code unless you're really creating a program that is supposed to convert stuff. I mean you need the transcoding only when interfacing with foreign code / i/o.
> 
> 
> If you don't need to convert, fine. If you do need to convert, I see no reason it shouldn't be as easy/convenient as possible.

But you don't need to convert inside your own code:

utf8[]  foo(utf16[] param) { return param.asUTF8; }
dchar[] bar(utf8[]  param) { return param.asUTF32; }
utf16[] zoo(dchar[] param) { return param.asUTF16; }

void main() {
    utf16[] s = "something";
    writefln( zoo( bar( foo(s) ) ) );
}

Doesn't look pretty useful to me, at least :)
It's the same thing with implicit conversions. You don't need them in your 'own' code.

>> Yes, now you're right. The programmer should _always_ explicitly declare all conversions.
> Why?

Because it will remove all 'hidden' (string) conversions.

>>>> If we didn't have several types of strings, this all would be much easier.
>>> Agreed, but we do have several types of strings :)
>> I'm trying to say we don't need several types of strings :)
> Why? I think if it's done properly, there are benefits from having a choice, while not complicating matters when one doesn't care.

Of course there's always a benefit, but it makes things more complex. Are you really saying that having 4 string types is easier than having just one? With only one type you don't need casting rules nor so many encumbering keywords etc. You always have to make a tradeoff somewhere. I'm not suggesting my own proposal just because I'm stubborn or something, I just know that you _can_ write Unicode-aware programs with just one string type and it doesn't cost much (in runtime performance/memory footprint). If you don't believe, please try to simulate these proposals using custom string classes.

>>> Because it's actually the opposite - Unicode conversions are simple, while slicing is hard (at least slicing on character boundaries). Even in the simple example you give, I have no idea whether the first Ä is one character or two, as both cases look the same.
>>
>>
>> It's not really that hard. One downside is that you have to parse through the string (unless compiler uses UTF-16/32 as an internal string type).
> 
> 
> It is "hard" - if you want to get the first character, as in the first character that the user sees, it can actually be from 1 to x characters, where x can be at least 5

Oh, I thought that a UTF-16 character is always encoded using 16 bits, UTF-32 using 32 bits and UTF-8 using 8-32 bits? Am I wrong?

> (that case is actually in the unicode standard) and possibly more (and I don't mean code units, but characters).

Slicing&indexing with UTF-16/32 is straightforward. Just multiply the index by 2/4. UTF-8 is only a bit harder - you need to iterate through the string, but it's not that hard. It's usually much faster than O(n).

>> Slicing the string on the code unit level doesn't make any sense, now does it? Because char should be treated as a special type by the compiler, I see no other use for slicing than this. Like you said, the alternative slicing can be achieved by casting the string to void[] (for i/o data buffering, etc).
> 
> 
> Well, I sure don't have anything against making slicing strings slice on character boundaries... Although that complicates matters - which length should .length then return? It will surely bork all kinds of templates, so perhaps it should be done with a different operator, like {a..b} instead of [a..b], and length-in-characters should be .strlen.

Yes, that's true. My solution is a bit inconsistent, but it doesn't hurt anyone: it uses character boundaries inside the []-syntax (and .length might be the character version inside the brackets), but the code-unit version elsewhere. I think D should keep an internal counter for the data length and provide an intelligent (type-specific) .length for the programmer. {a..b} doesn't look good to me.
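
As a sketch of the .strlen idea, counting code points instead of code units (this relies on D decoding UTF-8 when a char[] is iterated with a dchar loop variable, and it makes no attempt to handle combining characters):

size_t strlen(char[] s)
{
    size_t count = 0;
    foreach (dchar c; s)    // the compiler decodes UTF-8 here
        count++;
    return count;
}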

>>> Well, your point is moot, because if there's no such function to call, then there is no problem. But when there is such a function, you would hope that the language/library does something sensible by default, wouldn't you?
>>
>> No, this brilliant invention of yours causes problems even if we didn't have any 'legacy' systems/APIs. You see, library writer 1 might use UTF-16 for his library because he uses Windows and thinks it's the fastest charset. Now library writer 2 has done his work using UTF-8 as the internal format. If you make a client program that links with both of these, you may have to do unnecessary conversions just because one guy decided to create his own standard.
> 
> 
> Please don't get personal, as I and many others don't consider it polite.

Sorry, trying to calm down a bit ;) You know, this thing is important to me as I write most of my programs using Unicode I/O.

> 
> Anyhow, even if all D libraries use the same encoding, D is still directly linkable to C libraries and it's obvious one doesn't have control over what encoding they're using,

That's true.

> so I fail to see what is wrong with supporting different ones, and I also fail to see how it will help to decree one of them The One and ignore all others.

Surely you agree that all transcoding is bad for performance. Minimizing the need to transcode inside D code (by eliminating the unnecessary string types) maximizes performance, right?

> Well, writing something high-performance string-related in Java definitely takes a lot of code, because the built-in String class is often useless. I see no need to repeat that in D.

IMHO forcing regular programmers to use high-performance strings everywhere as the only option is bad. Not all strings need to be that fast. It would look pretty funny if you really had to choose a proper encoding just to create a valid 'Hello world!' example.

>>   File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF8);
>>   f.writeLine("valid unicode text åäöü");
>>   f.close;
>>
>>   File f = new UnicodeFile("foo", FileMode.Out, FileEncoding.UTF16LE);
>>   f.writeLine("valid unicode text åäöü");
>>   f.close;
>>
>> Advantages:
>> -supports BOM values
>> -easy to use, right?
> 
> Well, I sure don't think so :P Why do I need a special class just to be able to output strings? Where is the BOM placed? Does every string include a BOM, or just the beginning of the file? How can I change that? If the writeLine is 2000 lines away from the stream declaration, how can I tell what it will do?
> 
> I'd certainly prefer
> 
> File f=new File("foo", FileMode.Out);
> f.write("valid whatever".asUTF16LE);
> f.close;
> 
> Less typing, too :)

Less typing? No, you're wrong. Your approach requires the programmer to remember the correct encoding every time (s)he writes to that file. In case you didn't know, valid UTF-x files use a BOM only at the beginning of the file. My UnicodeFile class knows this. Your solution writes the BOM every time you write a string (test it if you don't believe me). In addition, changing the BOM in the middle of a valid UTF-x stream is illegal. If you want to create a data file that serializes 'objects', you can use regular files just like you did here.
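
To make that concrete, here is roughly what I have in mind for UnicodeFile, built on today's std.stream (the UnicodeFile and FileEncoding names are from my earlier example, not Phobos; the UTF-16 branch assumes a little-endian host):

import std.stream;
import std.utf;

enum FileEncoding { UTF8, UTF16LE }

// The BOM is written exactly once, when the file is opened; every writeLine
// transcodes, so the caller never has to name the encoding again.
class UnicodeFile
{
    private File file;
    private FileEncoding enc;

    this(char[] name, FileMode mode, FileEncoding enc)
    {
        this.file = new File(name, mode);
        this.enc  = enc;
        if (enc == FileEncoding.UTF16LE)
            file.write(cast(ubyte[]) x"FF FE");   // UTF-16LE BOM, written once
        // (no BOM is written for UTF-8 here)
    }

    void writeLine(char[] text)
    {
        if (enc == FileEncoding.UTF16LE)
            // on a little-endian host the wchar[] bytes are already in LE order
            file.write(cast(ubyte[]) toUTF16(text ~ "\n"));
        else
            file.write(cast(ubyte[]) (text ~ "\n"));
    }

    void close() { file.close(); }
}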

>> If you want high-performance streams, you can convert the strings in a separate thread before you use them, right?
> 
> 
> I don't know why you need a thread, but in any case, is that the easiest solution (to code) you can think of?

No, not the easiest. AFAIK a real-life high-performance web server uses separate threads for data processing. If you're writing a single-threaded application, you can precompute the string in the _same_ thread.

>>>>> The methods should be compile-time resolvable when possible, so this would be both valid and evaluated in compile time:
>>>>>
>>>>> ubyte[] myRequest="GET / HTTP/1.0".asUTF8(false);
>>>>
>>>>
>>>> Why? Converting a 14 character string doesn't take much time. 
>>>
>>>
>>> Why would it not evaluate at compile time? Do you see any benefit in that? And while it doesn't take much time once, it does take some, and more importantly, allocates new memory each time. If you're trying to do more than one request (as in thousands), I'm sure it adds up..
>>
>> You only need to convert once.
> 
> Again, why would it not evaluate at compile time? Do you see any benefit in that?

I think I already said that you really don't know what the best encoding to use would be at compile time. You're saying (by having several types) that the programmer should decide this. But then building portable multiplatform programs isn't that simple. Your approach requires you to define several version {} blocks for different architectures, so it isn't that simple anymore. You need the version blocks because if you decided to use UTF-8, it would be fast on *nixes and slow on Windows, and if you used UTF-16, the opposite would happen.

>>>>> Point IX. allows concatenation of strings in different encodings 
>>
>> Why do you want to do that?
> 
> I don't, I want the whole world to use dchar[]s. But it doesn't, so using multiple encodings should be as easy as possible.

But I'm saying here that we don't need several string types.

Jari-Matti

P.S. I won't be reading the NG for the next couple of days. I'll try to answer your (potential) future posts as soon as I get back.
November 25, 2005
xs0 wrote:
> 
> I'd certainly prefer
> 
> File f=new File("foo", FileMode.Out);
> f.write("valid whatever".asUTF16LE);
> f.close;
> 
> Less typing, too :)

I'd have hoped you'd prefer

File f = new File("foo", FileMode.Out.UTF16LE);
f.print("Just doit! Nike");
f.close;

Saves even more ink if you print to the file more than once, too.

And it's smarter overall, right?
November 25, 2005
On Fri, 25 Nov 2005 15:50:13 +0200, Jari-Matti Mäkelä wrote:


[snip]


> Oh, I thought a UTF-16 character is always encoded using 16 bits, UTF-32 using 32 bits, and UTF-8 using 8-32 bits? Am I wrong?

Wrong, I'm afraid. Some characters use 32 bits in UTF16.

UTF8:  1, 2, 3, and 4 byte characters.
UTF16: 2 and 4 byte characters.
UTF32: 4 byte characters (only)
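
For example, anything outside the Basic Multilingual Plane needs a surrogate pair in UTF-16. A quick check with D's string literal suffixes (U+1D11E is the MUSICAL SYMBOL G CLEF):

dchar[] d = "\U0001D11E"d;   // 1 code unit,  4 bytes
wchar[] w = "\U0001D11E"w;   // 2 code units (a surrogate pair), 4 bytes
char[]  c = "\U0001D11E"c;   // 4 code units, 4 bytes
assert(d.length == 1 && w.length == 2 && c.length == 4);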

-- 
Derek Parnell
Melbourne, Australia
26/11/2005 8:37:13 AM
November 26, 2005
Derek Parnell wrote:
> On Fri, 25 Nov 2005 15:50:13 +0200, Jari-Matti Mäkelä wrote:
> 
> 
> [snip]
> 
> 
> 
>>Oh, I thought a UTF-16 character is always encoded using 16 bits, UTF-32 using 32 bits, and UTF-8 using 8-32 bits? Am I wrong?
> 
> 
> Wrong, I'm afraid. Some characters use 32 bits in UTF16.
> 
> UTF8:  1, 2, 3, and 4 byte characters.
> UTF16: 2 and 4 byte characters.
> UTF32: 4 byte characters (only)

Furthermore, a single visible character can be encoded using more than one Unicode character (for example, a C with a caron can be both a single character and two characters, C + combining caron). Since there's no limit to how many combining characters a single "normal" char can have, slicing on char boundaries is not solved merely by finding UTF boundaries, which was my initial point.
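
A small illustration (using today's dchar[] literals; both values are straight from the Unicode standard):

dchar[] precomposed = "\u010C"d;    // U+010C, LATIN CAPITAL LETTER C WITH CARON
dchar[] combined    = "C\u030C"d;   // 'C' followed by U+030C, COMBINING CARON
// both display as the same character, yet the lengths differ even in UTF-32
assert(precomposed.length == 1 && combined.length == 2);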


xs0

November 28, 2005
xs0 wrote:
>>> Oh, I thought a UTF-16 character is always encoded using 16 bits, UTF-32 using 32 bits, and UTF-8 using 8-32 bits? Am I wrong?
>>
>> Wrong, I'm afraid. Some characters use 32 bits in UTF16.
>>
>> UTF8:  1, 2, 3, and 4 byte characters.
>> UTF16: 2 and 4 byte characters.
>> UTF32: 4 byte characters (only)
> 
> Furthermore, a single visible character can be encoded using more than one Unicode character (for example, a C with a caron can be both a single character and two characters, C + combining caron). Since there's no limit to how many combining characters a single "normal" char can have, slicing on char boundaries is not solved merely by finding UTF boundaries, which was my initial point.

Thanks, I wasn't aware of this before.

It seems that I have underestimated the performance issues (web servers, etc.) of having only one Unicode text type. I have to admit the current types in D are a suitable compromise. They're not always the "easiest" way to do things, but they have no great weaknesses either.

I guess the only thing I tried to say was that it really _is_ possible to write all programs with only a single encoding-independent Unicode type. But this approach has a few big downsides in some performance-critical applications and therefore shouldn't be the default behavior for a systems programming language like D. In a scripting language it would be a killer feature, though.

---

* IMO support for indexing & slicing on Unicode character boundaries is not really obligatory at the language syntax level, but it would be nice to have this functionality somewhere. :) As it is, there's little use for [d,w]char slicing now.

* I wish Walter could fix this bug [1] (I know why it produces compile-time errors, but I don't know why DMD allows you to do that):

[1] http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/30566

I wish it worked like this:

char foo = '\u0000'            // ok (C-string compatibility)
char foo = '\u0001'-'\u007f'   // any literal in this range: ok
char foo = '\u0080'-'\uffff'   // any literal in this range: compile error
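
Under that rule, for example (my reading of it; 'ä' is U+00E4):

char a = '\u0041';   // 'A' fits in one UTF-8 code unit: accepted
char b = '\u00E4';   // 'ä' needs two UTF-8 code units: rejected at compile time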

* A fully Unicode-aware stream system [2] would also be a nice feature (currently there's no convenient way to create valid UTF-encoded text files with a BOM):

[2] http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.bugs/5636

That would (perhaps) require Walter/us to reconsider the Phobos stream class hierarchy.