November 24, 2005
On Thu, 24 Nov 2005 21:46:50 +1300, Regan Heath wrote:

> Ok, it appears I picked some really bad type names in my proposal and it is causing some confusion.

Regan,
the idea stinks. Sorry, but that *is* the nice response.

It is far more complicated than it needs to be. Maybe it's the name confusion, but I don't think so.

When dealing with strings, almost nobody needs to deal with partial-characters. We really only need to deal with characters except for some obscure functionality (maybe interfacing with an external system?).

So we don't need to deal with the individual bytes that make up the characters in the various UTF encodings. Sure, we will need to know how big a character is from time to time. For example, given a string (regardless of encoding format), we might need to know how many bytes the third character uses. The answer will depend on the UTF encoding *and* the code point value.
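To make that concrete: answering "how many bytes does the third character use" in today's D already takes a short decode walk. A rough sketch (std.utf signatures from memory; the string literal is only an example):

    import std.utf;
    import std.stdio;

    void main()
    {
        char[] s = "naïveté".dup;   // stored as UTF-8 here

        // walk past the first two characters; decode() advances the byte index
        size_t i = 0;
        decode(s, i);
        decode(s, i);

        // i is now the byte offset of the third character; stride() tells how
        // many bytes its encoding occupies, which depends on the code point
        writefln("3rd character starts at byte %d and occupies %d byte(s)",
                 i, stride(s, i));
    }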

Mostly we won't even need to know the encoding format. We might need to if that is an interfacing requirement, and we might want to in some circumstances to improve performance. But generally, we shouldn't care.

So how about we just have a string datatype called 'string'. The default encoding format in RAM is compiler-dependent, but we can, on a per-declaration basis, specify the internal encoding format for a string. Furthermore, we can access any of the three UTF encoding formats for a string as a property of the string; the compiler would generate the call to transcode if required. The string could also have array properties such that each element addresses an entire character.

If one ever really needed to get down to the byte level of a character, they could assign it to a new datatype called a 'unicode' (for example). That would have properties such as the encoding format and byte size, and the bytes in a unicode could be accessed using array syntax too.

  string Foo = "Some string";
  unicode C;

  C = Foo[4];
  if (C.encoding == unicode.utf8)
  {
     foreach (ubyte b; C)
     {
       . . .
     }
  }


We can then dismiss wchar, wchar[], dchar and dchar[] entirely, and make char and char[] have the C/C++ semantics.

If some function absolutely insisted on a utf16 string for example, ...

   SomeFunc(Foo.utf16);

would pass the utf16 version of the string to the function.

As for declarations ...

   utf16 { // force RAM encoding to be utf16
    string Foo;
    string Bar;
   }
   string Qwerty; // RAM encoding is compiler choice.


-- 
Derek Parnell
Melbourne, Australia
24/11/2005 7:54:01 PM
November 24, 2005
Regan Heath wrote:

> * add a new type/alias "utf", this would alias utf8, 16 or 32. It  represents the application specific native encoding. This allows efficient  code, like:
> 
> string s = "test";
> foreach(utf c; s) {
> }
> 
> regardless of the applications selected native encoding.

I will rewrite this with your changed names (cp*):

> * add a new type/alias "cpn", this would alias cp1, cp2 or cp4. It
> represents the application specific native encoding. This allows
> efficient  code, like:
>
> string s = "test";
> foreach(cpn c; s) {
> }
>
> regardless of the applications selected native encoding.

Say you instead have:

string s = "smörgåsbord";
foreach(cpn c; s) {
}

This code would then work on Win32 (with UTF-16 being the native encoding), but not on Linux (with UTF-8).

You have introduced platform dependence where there previously was none.

What do you gain by this?

As I see it, there are only two views you need on a unicode string:
a) The code units
b) The unicode characters

By your suggestion, there would be a third view:
c) The unicode characters that are encoded by a single code unit.

Why is this useful?
Should the "smörgåsbord"-example above throw an error?
Isn't what you want instead:
assert_only_contains_single_code_unit_characters_in_native_encoding(string)
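For reference, D already gives you both views on a char[]; roughly like this (a quick sketch, example string only):

    import std.stdio;

    void main()
    {
        char[] s = "smörgåsbord".dup;

        // view (a): the UTF-8 code units (13 of them here)
        foreach (char u; s)
            writefln("code unit:  %02X", cast(ubyte) u);

        // view (b): the Unicode characters (11 of them), decoded on the fly
        foreach (dchar c; s)
            writefln("character: U+%04X", cast(uint) c);
    }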

/Oskar
November 24, 2005
Regan Heath wrote:
> On Thu, 24 Nov 2005 09:23:21 +0100, Oskar Linde  <oskar.lindeREM@OVEgmail.com> wrote:
> 
>> Regan Heath wrote:
> 
>>> (b) Is this really ASCII or is it system dependant? i.e. Latin-1 or  similar. Is it ASCII values 127 or less perhaps? To be honest I'm not  sure.
>>
>>
>> ASCII is equal to the first 128 code points in Unicode.
>> Latin-1 is equal to the first 256 code points in Unicode.
> 
> 
> And which does a C function expect? Or is that defined by the C function?  Does strcmp care? Does strlen, strchr, ...?

This is not defined. strcmp doesn't care. strlen etc. only count bytes until '\0'. You can use Latin-1, UTF-8 or any 8-bit encoding.
This is why UTF-8 is so popular: you can just plug it in, and almost everything that used to assume Latin-1 or some other 8-bit encoding will just work without any changes.
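To illustrate (a rough sketch using the std.c.string binding of the time; the byte counts are for this particular example string):

    import std.c.string;   // C's strlen
    import std.stdio;

    void main()
    {
        char[] s = "smörgåsbord\0".dup;   // UTF-8 bytes plus a C-style terminator

        // strlen neither knows nor cares about the encoding: it just counts
        // bytes up to '\0'. "smörgåsbord" is 11 characters but 13 UTF-8 bytes,
        // because ö and å each take two.
        writefln("strlen  = %d", strlen(s.ptr));   // 13
        writefln(".length = %d", s.length - 1);    // 13 code units as well
    }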

Not even the OS cares very much. To the OS, things like file names, file contents, usernames, etc. are just a bunch of bytes. Different file systems may then define the encoding in which file names should be interpreted, but that only affects how the file name is presented to the user (transcoding to/from the terminal).

/Oskar
November 24, 2005
Derek, I must have done a terrible job explaining this, because you've completely misunderstood me; in fact, your counter-proposal is essentially what my proposal was intended to be.

More inline...

On Thu, 24 Nov 2005 20:15:56 +1100, Derek Parnell <derek@psych.ward> wrote:
> On Thu, 24 Nov 2005 21:46:50 +1300, Regan Heath wrote:
>
>> Ok, it appears I picked some really bad type names in my proposal and it
>> is causing some confusion.
>
> Regan,
> the idea stinks. Sorry, but that *is* the nice response.
>
> It is far more complicated than it needs to be. Maybe it's the name
> confusion, but I don't think so.
>
> When dealing with strings, almost nobody needs to deal with
> partial-characters.

I think you're confused. My proposal removes the need for dealing with partial characters completely, if you think otherwise then I've done a bad job explaining it.

> So we don't need to deal with the individual bytes that make up the
> characters in the various UTF encodings. Sure, we will need to know how big a character is from time to time. For example, given a string (regardless
> of encoding format), we might need to know how many bytes the third
> character uses. The answer will depend on the UTF encoding *and* the code
> point value.

Exactly my point, and the reason for the "cpn" alias.

> Mostly we won't even need to know the encoding format. We might, if that is an interfacing requirement, and we might in some circumstances to improve
> performance. But generally, we shouldn't care.

Yes, exactly.

> So how about we just have a string datatype called 'string'. The default
> encoding format in RAM is compiler dependant but we can on a declaration
> basis, define specific internal encoding format for a string. Furthermore, we can access any of the three UTF encoding formats for a string as a
> property of the string. The compiler would generate the call to transcode
> if required to. The string could also have array properties such that each element addressed an entire character.

That is exactly what I proposed.

> We can then dismiss wchar, wchar[], dchar, dchar[] entirely. And make char and char[] array have the C/C++ semantics.

I proposed exactly that, except char[] should not exist either.
char and char* are all that are required.

Regan
November 24, 2005
On Thu, 24 Nov 2005 10:18:20 +0100, Oskar Linde <oskar.lindeREM@OVEgmail.com> wrote:
> Regan Heath wrote:
>
>> * add a new type/alias "utf", this would alias utf8, 16 or 32. It  represents the application specific native encoding. This allows efficient  code, like:
>>  string s = "test";
>> foreach(utf c; s) {
>> }
>>  regardless of the applications selected native encoding.
>
> I will rewrite this with your changed names (cp*):
>
>  > * add a new type/alias "cpn", this would alias cp1, cp2 or cp4. It
>  > represents the application specific native encoding. This allows
>  > efficient  code, like:
>  >
>  > string s = "test";
>  > foreach(cpn c; s) {
>  > }
>  >
>  > regardless of the applications selected native encoding.
>
> Say you instead have:
>
> string s = "smörgåsbord";
> foreach(cpn c; s) {
> }
>
> This code would then work on Win32 (with UTF-16 being the native encoding), but not on Linux (with UTF-8).

No. "string" would be UTF-8 encoded internally on both platforms.

My proposal stated that "cpn" would thus be an alias for "cp1", but clearly that idea isn't going to work in this case, as (I'm assuming) it's impossible to represent some of those characters using a single byte. Java uses an int; maybe we should just do the same?

> You have introduced platform dependence where there previously was none.
> What do you gain by this?

No, there is no platform dependence. The choice of encoding is entirely up to the programmer: they choose a default encoding for each program they write, and it defaults to UTF-8.

> As I see it, there are only two views you need on a unicode string:
> a) The code units
> b) The unicode characters

(a) is seldom required. (b) is the common case and thus the goal view, IMO.

> By your suggestion, there would be a third view:
> c) The unicode characters that are encoded by a single code unit.

(c) was intended to be equal to (b). The intent of having 3 types was that ASCII programs would not be forced to use an int-sized variable for single character values. It seems we're stuck doing that.

> Why is this useful?

It's not, it's not what I intended.

> Should the "smörgåsbord"-example above throw an error?

No, certainly not.

> Isn't what you want instead:
> assert_only_contains_single_code_unit_characters_in_native_encoding(string)

I have no idea what you mean here.

Regan
November 24, 2005
Regan Heath wrote:
> On Thu, 24 Nov 2005 10:18:20 +0100, Oskar Linde  <oskar.lindeREM@OVEgmail.com> wrote:
> 
>> Regan Heath wrote:
>>
>>> * add a new type/alias "utf", this would alias utf8, 16 or 32. It   represents the application specific native encoding. This allows  efficient  code, like:
>>>  string s = "test";
>>> foreach(utf c; s) {
>>> }
>>>  regardless of the applications selected native encoding.
>>
>>
>> I will rewrite this with your changed names (cp*):
>>
>>  > * add a new type/alias "cpn", this would alias cp1, cp2 or cp4. It
>>  > represents the application specific native encoding. This allows
>>  > efficient  code, like:
>>  >
>>  > string s = "test";
>>  > foreach(cpn c; s) {
>>  > }
>>  >
>>  > regardless of the applications selected native encoding.
>>
>> Say you instead have:
>>
>> string s = "smörgåsbord";
>> foreach(cpn c; s) {
>> }
>>
>> This code would then work on Win32 (with UTF-16 being the native  encoding), but not on Linux (with UTF-8).
> 
> 
> No. "string" would be UTF-8 encoded internally on both platforms.

> My proposal stated that "cpn" would thus be an alias for "cp1" but 

Ok. I assumed cpn would be the platform native (preferred) encoding.

> clearly  that idea isn't going to work in this case as (I'm assuming) it's  impossible to represent some of those characters using a single byte. Java  uses an int, maybe we should just do the same?

D uses dchar. Maybe it would be better to rename it to char (or perhaps character), giving:

utf8  (today's char)
utf16 (today's wchar)
char  (today's dchar)
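To be concrete, dchar already gives the "Java int" behaviour Regan mentions: any code point fits in one value, including ones outside the BMP. A trivial sketch (example characters only):

    import std.stdio;

    void main()
    {
        dchar c1 = 'ć';              // U+0107, two code units in UTF-8
        dchar c2 = '\U00010330';     // GOTHIC LETTER AHSA, outside the BMP
        writefln("U+%04X and U+%05X both fit", cast(uint) c1, cast(uint) c2);
    }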

>> As I see it, there are only two views you need on a unicode string:
>> a) The code units
>> b) The unicode characters
> 
> (a) is seldom required. (b) is the common and thus goal view IMO.

Actually, I think it is the other way around: (b) is seldom required.
You can search, split, trim, parse, etc. D's char[] without any regard for the encoding. This is the beauty of UTF-8, and the reason D strings all work on code units rather than characters.

When would you actually need character based indexing?
I believe the answer is less often than you think.
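As a small illustration: a plain code-unit search already does the right thing, because a valid UTF-8 substring can only match at character boundaries. A sketch using find() from the Phobos of the time (indexOf in later versions; example string only):

    import std.string;   // find()
    import std.stdio;

    void main()
    {
        char[] s = "smörgåsbord".dup;

        // byte-level search, no decoding anywhere
        int i = find(s, "gås");
        writefln("found at code unit %d: %s", i, s[i .. i + "gås".length]);
    }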

/Oskar
November 24, 2005

Congrats, Regan! Great job!

And the thread subject is simply a Killer!



If I understand you correctly, then the following would work:

string st = "aaa\U41bbb\UC4ccc\U0107ddd";   // aaaAbbbÄcccćddd
cp1 s3 = st[3];   // A
cp1 s7 = st[7];   // Ä
cp1 s11 = st[11]; // error, too narrow
cp2 s11 = st[11]; // ć

assert( s3 == 0x41 && s7 == 0xC4 && s11 == 0x107 );

So, s3 would contain "A", which the old system would store as utf8 with no problem. s3 is 8 bits.

s7 would contain "Ä", which the old system shouldn't have stored in an 8-bit char because it is too big, but with your proposal it would be ok, since the _code_point_ (i.e. the "value" of the character in Unicode) does fit in 8 bits. And _we_are_storing_ the code point, not its UTF-8 code units here, right?

s11 would error, since even the Unicode value is too big for 8 bits.

The second s11 assignment would be ok, since the Unicode value of ć fits in 16 bits.

And st itself would be "regular" UTF-8 on Linux, and (probably) UTF-16 on Windows.

Yes?
November 24, 2005
Regan, your proposal is simply too complex. I don't get it, and I really don't like it. D is supposed to be a _simple_ language. Here's an alternative proposal:

-Allowed text string types:

char, char[] (we don't need silly aliases nor wchar/dchar)

-Text string implementation:

char - a Unicode code unit (UTF-8; it's up to the compiler vendor to decide between 1-4 bytes and an int)

char[] - array of char-types, thus a valid Unicode string encoded in UTF-8, no BOM is needed because all char[]s are UTF-8.

-Text string operations:

char a = 'ä', b = 'å';
char[] s = "åäö", t;

t ~= a;		// t == [ 'ä' ] == "ä"
t ~= b;		// t == [ 'ä', 'å' ] == "äå"

s[1..3] == "äö"

foreach(char c; s) writefln(c);	// outputs: å \n ä \n ö \n

-I/O:

writef/writefln - does implicit conversion (utf-8 -> terminal encoding)
puts/gets -

File I/O - through UnicodeStream() (handles encoding issues)

-Conversion:

std.utf - two functions needed:

byte[] encode(char[] string, EncodingType et)
char[] decode(byte[] stream, EncodingType et)

-Compatibility:

This new char[] is fully compatible with the C-language char* when only ASCII values 0-127 and a trailing zero are used.

Access to Windows/Unix-API available (std.utf.[en/de]code)
Access to Unicode files available (std.stream.UnicodeStream)

-Advantages:

OS/compiler vendor independent
Easy to use

-Disadvantages:

Hard to implement (or is it? Walter seems to have problems with UTF-8 - OTOH this proposal doesn't force you to implement strings using UTF-8; you can also use "fixed-width" UTF-16/32)
It's not super high performance (you need to convert a lot on Windows and legacy systems)
Indexing problem (as UTF-8 streams are variable-length, it's hard to tell the exact position of a single character; this affects all string operations except concatenation)
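To make the indexing cost concrete, here is a sketch of what "give me the n-th character" looks like over UTF-8 today (std.utf names from memory; example string only):

    import std.utf;
    import std.stdio;

    void main()
    {
        char[] s = "åäö-text".dup;

        // getting the n-th character means scanning from the start:
        size_t n = 2;                    // we want the third character
        size_t i = toUTFindex(s, n);     // O(n) walk over the code units
        dchar  c = decode(s, i);         // decode it (i is advanced past it)
        writefln("character %d is U+%04X", n, cast(uint) c);
    }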

---

Please stop whining about the slowness of utf-conversions. If it's really so slow, I would certainly want to see some real world benchmarks.
November 24, 2005
Regan Heath wrote:
> 
> * add a new type/alias "cpn", this alias will be cp1, cp2 or cp4
> depending on the native encoding chosen. This allows efficient
> code, like:
> 
> string s = "test";
> foreach(cpn c; s) {
> }
> 
> * slicing string gives another string
> 
> * indexing a string gives a cp1, cp2 or cp4

I hope you are not implying that indexing would choose between cp1..4 based on content? And if not, then the cpX would be either some "default", or programmer chosen? Now, that leads to Americans choosing cp1 all over the place, right?

(Ah, upon proofreading before posting, I only now noticed the cpn sentence at the top. I'll remark on it at the very end.)

---

While we are now intricately submerged in UTF and char-width issues, one day, when D is a household word, programmers won't even have to know about UTF and stuff.

Just like last summer, when none of us European D folk knew anything about UTF, and just wrote stuff like

    foo = "Hôtel California";

and such, without further ado.

This also means that (when D is perfect) no normal programmer ever touches string contents directly. They only use library routines for searching and all the other usual string operations.

That would make it unusual to use single "characters", although not exceptional.

If this is true, then we might consider blatantly skipping cp1 and cp2, and only having cp4 (possibly also renaming it utfchar).

Then it would be a lot simpler for the programmer, right? He'd have even less need to start researching in this UTF swamp. And everything would "just work".

This would make it possible for us to fully automate the extraction and insertion of single "characters" into our new strings.

    string foo = "gagaga";
    utfchar bar = '\UFE9D'; // you don't want to know the name :-)
    utfchar baf = 'a';
    foo ~= bar ~ baf;

(I admit the last line probably doesn't work currently, but it should, IMHO.) Anyhow, the point being that if the utfchar type is 32 bits, then it doesn't hurt anybody, and also doesn't lead to gratuitous incompatibility with foreign characters -- which is D's aim all along.

For completeness, we could have the painting casts (as opposed to converting casts). They'd be for the (seldom) situations where the programmer _does_ want to do serious tinkering on our strings.

    ubyte[] myarr1 = cast(ubyte[])foo;
    ushort[] myarr2 = cast(ushort[]) foo;
    uint[] myarr3 = cast(uint[]) foo;

These give raw arrays, like exact images of the string. The burden of COW would lie on the programmer.

To get a sane array of utfchar you'd have to write

    utfchar[] myarr4 = cast(utfchar[])toUTF32(foo);

(This, of course could be nice to have as a library call. :-) )

While this might look tedious (for the library writer and for the library user), I think it is consistent, clear, and both easy to use and understand for the programmer not familiar with this D News Group.

---

The cpn remark: I think D programs should be (as much as possible) UTF clean, even if the programmer didn't come to think about it. This has the advantage that his programs won't break embarrassingly when a guy in China suddenly uses them.

It would also be quite nice if the programmer didn't have to think about such issues at all. Just code his stuff.

Having cpn be anything other than 32 bits would prevent this dream.
(Heh, and only having single chars as 32 bits would make writing the libraries so much easier, too, I think.)
November 24, 2005
> The "utf8", "utf16" and "utf32" types I refer to are essentially byte, short and int. They cannot contain any code point, only those that fit (I thought I said that?)

In that case I don't like your idea : )

It makes far more sense to have only one _character_ type that holds any Unicode character. Whether it comes from a utf8, utf16 or utf32 string shouldn't matter:

string s="Whatever";    //imagine it with a small circle on the a, comma
under the t
foreach(uchar u; s) {}

Read "uchar" as "unicode char", essentially dchar, could in fact still be named dchar, I just didn't want to mix old/new terminology. The underlying type of "string" would be determined at compile time, but still convertable using properties (that part I liked very much).

D's "char" should go back to C's char, signed even. Many decissions in D where made to ease the porting of C code, so why this "char" got overriden beats me. char[] should then behave no differently from byte[] (except maybe the element being signed).

L.