November 24, 2005
Re: Unified String Theory [READ THIS FIRST]
On Thu, 24 Nov 2005 21:46:50 +1300, Regan Heath wrote:

> Ok, it appears I picked some really bad type names in my proposal and it  
> is causing some confusion.

Regan,
the idea stinks. Sorry, but that *is* the nice response.

It is far more complicated than it needs to be. Maybe it's the name
confusion, but I don't think so. 

When dealing with strings, almost nobody needs to deal with
partial-characters. We really only need to deal with characters except for
some obscure functionality (maybe interfacing with an external system?).

So we don't need to deal with the individual bytes that make up the
characters in the various UTF encodings. Sure, we will need to know how big
a character is from time to time. For example, given a string (regardless
of encoding format), we might need to know how many bytes the third
character uses. The answer will depend on the UTF encoding *and* the code
point value.
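Derek's point can be checked directly. Here is a quick sketch in Python (not D, just to make the byte counts concrete): how many bytes one character occupies depends on both the encoding and the code point value.

```python
# Byte size of a single character varies with both the UTF encoding
# and the code point value of the character.
for ch in ("A", "Ä", "ć", "€"):
    print(ch,
          len(ch.encode("utf-8")),      # 1, 2, 2, 3 bytes respectively
          len(ch.encode("utf-16-le")),  # 2 bytes each (all are in the BMP)
          len(ch.encode("utf-32-le")))  # always 4 bytes
```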

Mostly we won't even need to know the encoding format. We might need to,
if that is an interfacing requirement, and we might want to in some
circumstances to improve performance. But generally, we shouldn't care.

So how about we just have a string datatype called 'string'. The default
encoding format in RAM is compiler dependent, but we can, on a
per-declaration basis, define a specific internal encoding format for a
string. Furthermore,
we can access any of the three UTF encoding formats for a string as a
property of the string. The compiler would generate the call to transcode
if required to. The string could also have array properties such that each
element addressed an entire character. If one ever really needed to get
down to the byte level of a character they could assign it to a new
datatype called a 'unicode' (for example) and that would have properties
such as the encoding format and byte size, and the bytes in a unicode could
be accessed using array syntax too.

 string Foo = "Some string";
 unicode C;

 C = Foo[4];
 if (C.encoding == unicode.utf8)
 {
    foreach (ubyte b; C)
    {
      . . . 
    }
  }


We can then dismiss wchar, wchar[], dchar, dchar[] entirely. And make char
and char[] have the C/C++ semantics.

If some function absolutely insisted on a utf16 string for example, ...

  SomeFunc(Foo.utf16);

would pass the utf16 version of the string to the function.
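A property like `Foo.utf16` amounts to an on-demand transcode. A rough model in Python (the `String` class and property names here are my own, illustrating the idea, not Derek's actual design):

```python
# Model of a string type that exposes its UTF encodings as properties;
# the transcode happens only when the property is actually read.
class String:
    def __init__(self, text):
        self._text = text  # internal encoding is an implementation detail

    @property
    def utf8(self):
        return self._text.encode("utf-8")

    @property
    def utf16(self):
        return self._text.encode("utf-16-le")

def some_func(buf):       # stands in for a function insisting on UTF-16
    return len(buf)

foo = String("Some string")
some_func(foo.utf16)      # the "compiler-inserted" transcode, in spirit
```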

As for declarations ...

  utf16 { // force RAM encoding to be utf16
   string Foo;
   string Bar;
  }
  string Qwerty; // RAM encoding is compiler choice.


-- 
Derek Parnell
Melbourne, Australia
24/11/2005 7:54:01 PM
November 24, 2005
Re: Unified String Theory..
Regan Heath wrote:

> * add a new type/alias "utf", this would alias utf8, 16 or 32. It  
> represents the application specific native encoding. This allows 
> efficient  code, like:
> 
> string s = "test";
> foreach(utf c; s) {
> }
> 
> regardless of the applications selected native encoding.

I will rewrite this with your changed names (cp*):

> * add a new type/alias "cpn", this would alias cp1, cp2 or cp4. It
> represents the application specific native encoding. This allows
> efficient  code, like:
>
> string s = "test";
> foreach(cpn c; s) {
> }
>
> regardless of the applications selected native encoding.

Say you instead have:

string s = "smörgåsbord";
foreach(cpn c; s) {
}

This code would then work on Win32 (with UTF-16 being the native 
encoding), but not on Linux (with UTF-8).

You have introduced platform dependence where there previously was none.

What do you gain by this?

As I see it, there are only two views you need on a unicode string:
a) The code units
b) The unicode characters

By your suggestion, there would be a third view:
c) The unicode characters that are encoded by a single code unit.

Why is this useful?
Should the "smörgåsbord"-example above throw an error?
Isn't what you want instead:
assert_only_contains_single_code_unit_characters_in_native_encoding(string)
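Oskar's suggested assertion could be sketched like this in Python (the function name is shortened, and the implementation is my guess at his intent): check that every character encodes to exactly one code unit in the chosen encoding.

```python
def only_single_code_unit_chars(s, encoding="utf-8", unit_size=1):
    # True if every character of s occupies exactly one code unit
    # in the given encoding.
    return all(len(ch.encode(encoding)) == unit_size for ch in s)

assert only_single_code_unit_chars("test")             # pure ASCII: yes
assert not only_single_code_unit_chars("smörgåsbord")  # ö, å take 2 UTF-8 bytes
assert only_single_code_unit_chars("smörgåsbord", "utf-16-le", 2)  # all BMP
```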

/Oskar
November 24, 2005
Re: Unified String Theory..
Regan Heath wrote:
> On Thu, 24 Nov 2005 09:23:21 +0100, Oskar Linde  
> <oskar.lindeREM@OVEgmail.com> wrote:
> 
>> Regan Heath wrote:
> 
>>> (b) Is this really ASCII or is it system dependant? i.e. Latin-1 or  
>>> similar. Is it ASCII values 127 or less perhaps? To be honest I'm 
>>> not  sure.
>>
>>
>> ASCII is equal to the first 128 code points in Unicode.
>> Latin-1 is equal to the first 256 code points in Unicode.
> 
> 
> And which does a C function expect? Or is that defined by the C 
> function?  Does strcmp care? Does strlen, strchr, ...?

This is not defined. strcmp doesn't care. strlen etc only counts bytes 
until '\0'. You can use latin-1, utf-8 or any 8-bit encoding.
This is why UTF-8 is so popular. You can just plug it in and almost 
everything that used to assume latin-1 or any 8-bit encoding will just 
work without any changes.
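The point that byte-oriented C functions are 8-bit clean is easy to demonstrate. In Python terms (standing in for strlen/strcmp, which just see bytes):

```python
s = "smörgåsbord".encode("utf-8")
# strlen counts bytes up to the terminator, not characters:
assert len(s) == 13   # 11 characters, but ö and å take 2 bytes each
# strcmp just compares bytes; identical UTF-8 sequences compare equal:
assert s == "smörgåsbord".encode("utf-8")
```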

Not even the OS cares very much. To the OS, things like a file name, 
file contents, usernames, etc are just a bunch of bytes. Different file 
systems may then define different encodings the file names should be 
interpreted in. This is just how the file name is presented to the user. 
(Transcoding to/from the terminal)

/Oskar
November 24, 2005
Re: Unified String Theory [READ THIS FIRST]
Derek, I must have done a terrible job explaining this, because you've  
completely misunderstood me; in fact, your counter-proposal is essentially  
what my proposal was intended to be.

More inline...

On Thu, 24 Nov 2005 20:15:56 +1100, Derek Parnell <derek@psych.ward> wrote:
> On Thu, 24 Nov 2005 21:46:50 +1300, Regan Heath wrote:
>
>> Ok, it appears I picked some really bad type names in my proposal and it
>> is causing some confusion.
>
> Regan,
> the idea stinks. Sorry, but that *is* the nice response.
>
> It is far more complicated than it needs to be. Maybe it's the name
> confusion, but I don't think so.
>
> When dealing with strings, almost nobody needs to deal with
> partial-characters.

I think you're confused. My proposal removes the need for dealing with  
partial characters completely, if you think otherwise then I've done a bad  
job explaining it.

> So we don't need to deal with the individual bytes that make up the
> characters in the various UTF encodings. Sure, we will need to know how  
> big a character is from time to time. For example, given a string  
> (regardless
> of encoding format), we might need to know how many bytes the third
> character uses. The answer will depend on the UTF encoding *and* the code
> point value.

Exactly my point, and the reason for the "cpn" alias.

> Mostly we won't even need to know the encoding format. We might, if that  
> is an interfacing requirement, and we might in some circumstances to  
> improve
> performance. But generally, we shouldn't care.

Yes, exactly.

> So how about we just have a string datatype called 'string'. The default
> encoding format in RAM is compiler dependant but we can on a declaration
> basis, define specific internal encoding format for a string.  
> Furthermore, we can access any of the three UTF encoding formats for a  
> string as a
> property of the string. The compiler would generate the call to transcode
> if required to. The string could also have array properties such that  
> each element addressed an entire character.

That, is exactly what I proposed.

> We can then dismiss wchar, wchar[], dchar, dchar[] entirely. And make  
> char and char[] array have the C/C++ semantics.

I proposed exactly that, except char[] should not exist either.
char and char* are all that are required.

Regan
November 24, 2005
Re: Unified String Theory..
On Thu, 24 Nov 2005 10:18:20 +0100, Oskar Linde  
<oskar.lindeREM@OVEgmail.com> wrote:
> Regan Heath wrote:
>
>> * add a new type/alias "utf", this would alias utf8, 16 or 32. It   
>> represents the application specific native encoding. This allows  
>> efficient  code, like:
>>  string s = "test";
>> foreach(utf c; s) {
>> }
>>  regardless of the applications selected native encoding.
>
> I will rewrite this with your changed names (cp*):
>
>  > * add a new type/alias "cpn", this would alias cp1, cp2 or cp4. It
>  > represents the application specific native encoding. This allows
>  > efficient  code, like:
>  >
>  > string s = "test";
>  > foreach(cpn c; s) {
>  > }
>  >
>  > regardless of the applications selected native encoding.
>
> Say you instead have:
>
> string s = "smörgåsbord";
> foreach(cpn c; s) {
> }
>
> This code would then work on Win32 (with UTF-16 being the native  
> encoding), but not on Linux (with UTF-8).

No. "string" would be UTF-8 encoded internally on both platforms.

My proposal stated that "cpn" would thus be an alias for "cp1" but clearly  
that idea isn't going to work in this case as (I'm assuming) it's  
impossible to represent some of those characters using a single byte. Java  
uses an int, maybe we should just do the same?

> You have introduced platform dependence where there previously was none.
> What do you gain by this?

No, there is no platform dependence. The choice of encoding is entirely up  
to the programmer, they choose a default encoding for each program they  
write, it defaults to UTF-8.

> As I see it, there are only two views you need on a unicode string:
> a) The code units
> b) The unicode characters

(a) is seldom required. (b) is the common case and thus the goal view, IMO.

> By your suggestion, there would be a third view:
> c) The unicode characters that are encoded by a single code unit.

(c) was intended to be equal to (b). It was intended that we have 3 types  
so that ASCII programs would not be forced to use an int sized variable  
for single character values. It seems we're stuck doing that.

> Why is this useful?

It's not, it's not what I intended.

> Should the "smörgåsbord"-example above throw an error?

No, certainly not.

> Isn't what you want instead:
> assert_only_contains_single_code_unit_characters_in_native_encoding(string)

I have no idea what you mean here.

Regan
November 24, 2005
Re: Unified String Theory..
Regan Heath wrote:
> On Thu, 24 Nov 2005 10:18:20 +0100, Oskar Linde  
> <oskar.lindeREM@OVEgmail.com> wrote:
> 
>> Regan Heath wrote:
>>
>>> * add a new type/alias "utf", this would alias utf8, 16 or 32. It   
>>> represents the application specific native encoding. This allows  
>>> efficient  code, like:
>>>  string s = "test";
>>> foreach(utf c; s) {
>>> }
>>>  regardless of the applications selected native encoding.
>>
>>
>> I will rewrite this with your changed names (cp*):
>>
>>  > * add a new type/alias "cpn", this would alias cp1, cp2 or cp4. It
>>  > represents the application specific native encoding. This allows
>>  > efficient  code, like:
>>  >
>>  > string s = "test";
>>  > foreach(cpn c; s) {
>>  > }
>>  >
>>  > regardless of the applications selected native encoding.
>>
>> Say you instead have:
>>
>> string s = "smörgåsbord";
>> foreach(cpn c; s) {
>> }
>>
>> This code would then work on Win32 (with UTF-16 being the native  
>> encoding), but not on Linux (with UTF-8).
> 
> 
> No. "string" would be UTF-8 encoded internally on both platforms.

> My proposal stated that "cpn" would thus be an alias for "cp1" but 

Ok. I assumed cpn would be the platform native (preferred) encoding.

> clearly  that idea isn't going to work in this case as (I'm assuming) 
> it's  impossible to represent some of those characters using a single 
> byte. Java  uses an int, maybe we should just do the same?

D uses dchar. Maybe it would be better to rename it to char (or maybe 
character), giving:

utf8  (todays char)
utf16 (todays wchar)
char  (todays dchar)

>> As I see it, there are only two views you need on a unicode string:
>> a) The code units
>> b) The unicode characters
> 
> (a) is seldom required. (b) is the common and thus goal view IMO.

Actually, I think it is the other way around. (b) is seldom required.
You can search, split, trim, parse, etc., D's char[] without any regard 
to encoding. This is the beauty of UTF-8 and the reason D strings all 
work on code units rather than characters.

When would you actually need character based indexing?
I believe the answer is less often than you think.
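Oskar's claim holds because UTF-8 is self-synchronizing: the bytes of a multi-byte character can never be mistaken for part of another character. A sketch in Python, operating on raw UTF-8 code units rather than characters:

```python
# Search and split work directly on UTF-8 code units, because a
# multi-byte sequence can never match at a wrong byte offset.
data = "smörgåsbord".encode("utf-8")
assert data.find("bord".encode("utf-8")) == 9   # byte offset, not char index
assert data.split("å".encode("utf-8")) == [b"sm\xc3\xb6rg", b"sbord"]
```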

/Oskar
November 24, 2005
Re: Unified String Theory
Congrats, Regan! Great job!

And the thread subject is simply a Killer!



If I understand you correctly, then the following would work:

string st = "aaa\u0041bbb\u00C4ccc\u0107ddd";   // aaaAbbbÄcccćddd
cp1 s3 = st[3];   // A
cp1 s7 = st[7];   // Ä
cp1 s11 = st[11]; // error, too narrow
cp2 s11 = st[11]; // ć

assert( s3 == 0x41 && s7 == 0xC4 && s11 == 0x107 );

So, s3 would contain "A", which the old system would store as utf8 with 
no problem. s3 is 8 bits.

s7 would contain "Ä", which the old system shouldn't have stored in 
8-bit (char) because it is too big, but with your proposal it would be 
ok, since the _code_point_ (i.e. the "value" of the character in 
Unicode) does fit in 8 bits. And _we_are_storing_ the codepoint, not the 
UTF character here, right?

s11 would error, since even the Unicode value is too big for 8 bits.

The second s11 assignment would be ok, since the Unicode value of ć fits 
in 16 bits.

And, st itself would be "regular" UTF-8 on a Linux, and (probably) 
UTF-16 on Windows.

Yes?
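The assert in that example is about code point values, which can be confirmed directly; here in Python (the cp1/cp2 types themselves are hypothetical, so only the numeric claims are checked):

```python
# Whether a character "fits" in 8 or 16 bits here depends on its
# Unicode code point value, not on how many bytes its UTF-8 form takes.
assert ord("A") == 0x41    # fits in 8 bits  (cp1)
assert ord("Ä") == 0xC4    # fits in 8 bits  (cp1), though UTF-8 needs 2 bytes
assert ord("ć") == 0x107   # needs 16 bits   (cp2)
```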
November 24, 2005
Re: Unified String Theory..
Regan, your proposal is absolutely too complex. I don't get it and I 
really don't like it. D is supposed to be a _simple_ language. Here's an 
alternative proposal:

-Allowed text string types:

char, char[] (we don't need silly aliases nor wchar/dchar)

-Text string implementation:

char - Unicode code unit (UTF-8; it's up to the compiler vendor to 
decide between 1-4 bytes and a full int)

char[] - array of char-types, thus a valid Unicode string encoded in 
UTF-8, no BOM is needed because all char[]s are UTF-8.

-Text string operations:

char a = 'ä', b = 'å';
char[] s = "åäö", t;

t ~= a;		// t == [ 'ä' ]
t ~= b;		// t == [ 'ä', 'å' ] == "äå"

s[1..3] == "äö"

foreach(char c; s) writefln(c);	// outputs: å \n ä \n ö \n

-I/O:

writef/writefln - does implicit conversion (utf-8 -> terminal encoding)
puts/gets -

File I/O - through UnicodeStream() (handles encoding issues)

-Conversion:

std.utf - two functions needed:

byte[] encode(char[] string, EncodingType et)
char[] decode(byte[] stream, EncodingType et)

-Compatibility:

This new char[] is fully compatible with C-language char*, when 0-127 
ASCII-values and a trailing zero-value are used.
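That compatibility claim is easy to verify: for code points 0-127, a UTF-8 string is byte-for-byte identical to ASCII. A Python sketch (not D) of both the compatible and the divergent case:

```python
# In the ASCII range, UTF-8 and ASCII bytes are identical, so a
# C-style char* API sees nothing unusual.
ascii_text = "Hello, world!"
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Outside ASCII the encodings diverge: a C API assuming latin-1
# would see two bytes where one character was meant.
assert "ä".encode("utf-8") == b"\xc3\xa4"
assert "ä".encode("latin-1") == b"\xe4"
```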

Access to Windows/Unix-API available (std.utf.[en/de]code)
Access to Unicode files available (std.stream.UnicodeStream)

-Advantages:

OS/compiler vendor independent
Easy to use

-Disadvantages:

Hard to implement (or is it? Walter seems to have problems with UTF-8 -- 
OTOH this proposal doesn't require you to implement strings using UTF-8; 
you can also use "fixed-width" UTF-16/32)
It's not super high performance (need to convert a lot on Windows & 
legacy systems)
Indexing problem (as UTF-8 streams are variable length, it's hard to 
tell the exact position of a single character. This affects all string 
operations except concatenating.)
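The indexing problem can be made concrete: finding where the Nth character starts in a UTF-8 buffer takes a linear scan over the bytes. A minimal sketch in Python (the function name is mine):

```python
def byte_offset_of_char(data: bytes, n: int) -> int:
    # Linear scan: count lead bytes (anything that is not a 0b10xxxxxx
    # continuation byte) until the n-th character starts.
    seen = 0
    for i, b in enumerate(data):
        if b & 0xC0 != 0x80:  # lead byte of a new character
            if seen == n:
                return i
            seen += 1
    raise IndexError(n)

data = "åäö".encode("utf-8")                # 3 characters, 6 bytes
assert byte_offset_of_char(data, 2) == 4    # third character starts at byte 4
```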

---

Please stop whining about the slowness of utf-conversions. If it's 
really so slow, I would certainly want to see some real world benchmarks.
November 24, 2005
Re: Unified String Theory [READ THIS FIRST]
Regan Heath wrote:
> 
> * add a new type/alias "cpn", this alias will be cp1, cp2 or cp4
> depending on the native encoding chosen. This allows efficient
> code, like:
> 
> string s = "test";
> foreach(cpn c; s) {
> }
> 
> * slicing string gives another string
> 
> * indexing a string gives a cp1, cp2 or cp4

I hope you are not implying that indexing would choose between cp1..4 
based on content? And if not, then the cpX would be either some 
"default", or programmer chosen? Now, that leads to Americans choosing 
cp1 all over the place, right?

(Ah, upon proofreading before posting, I only now noticed the cpn 
sentence at the top. I'll remark on it at the very end.)

---

While we are now intricately submerged in UTF and char width issues, one 
day, when D is a household word, programmers wouldn't have to even know 
about UTF and stuff.

Just like last summer, when none of us European D folk knew anything 
about UTF, and just wrote stuff like

    foo = "Hôtel California";

and such, without further ado.

This also means that (when D is perfect) no normal programmer ever 
touches string contents directly. They only use library routines for 
searching and all the other usual string operations.

That would make it unusual to use single "characters", although not 
exceptional.

If this is true, then we might consider blatantly skipping cp1 and cp2, 
and only having cp4 (possibly also renaming it utfchar).

Then it would be a lot simpler for the programmer, right? He'd have even 
less need to start researching in this UTF swamp. And everything would 
"just work".

This would make it possible for us to fully automate the extraction and 
insertion of single "characters" into our new strings.

    string foo = "gagaga";
    utfchar bar = '\UFE9D'; // you don't want to know the name :-)
    utfchar baf = 'a';
    foo ~= bar ~ baf;

(I admit the last line probably doesn't work currently, but it should, 
IMHO.) Anyhow, the point being that if the utfchar type is 32 bits, then 
it doesn't hurt anybody, and it also doesn't lead to gratuitous 
incompatibility with foreign characters -- which is the D aim all along.

For completeness, we could have the painting casts (as opposed to 
converting casts). They'd be for the (seldom) situations where the 
programmer _does_ want to do serious tinkering on our strings.

    ubyte[] myarr1 = cast(ubyte[])foo;
    ushort[] myarr2 = cast(ushort[]) foo;
    uint[] myarr3 = cast(uint[]) foo;

These give raw arrays, like exact images of the string. The burden of 
COW would lie on the programmer.

To get a sane array of utfchar you'd have to write

    utfchar[] myarr4 = cast(utfchar[])toUTF32(foo);

(This, of course could be nice to have as a library call. :-) )

While this might look tedious (for the library writer and for the 
library user), I think it is consistent, clear, and both easy to use and 
understand for the programmer not familiar with this D News Group.

---

The cpn remark: I think D programs should be (as much as possible) UTF 
clean, even if the programmer didn't come to think about it. This has 
the advantage that his programs won't break embarrassingly when a guy in 
China suddenly uses them.

It would also be quite nice if the programmer didn't have to think about 
such issues at all. Just code his stuff.

Having cpn be anything other than 32 bits will prevent this dream.
(Heh, and only having single chars as 32 bits would make writing the 
libraries so much easier, too, I think.)
November 24, 2005
Re: Unified String Theory..
> The "utf8", "utf16" and "utf32" types I refer to are essentially byte, 
> short and int. They cannot contain any code point, only those that fit (I 
> thought I said that?)

In that case I don't like your idea : )

It makes far more sense to have only 1 _character_ type, that holds any 
UNICODE character. Whether it comes from an utf8, utf16 or utf32 string 
shouldn't matter:

string s="Whatever";    //imagine it with a small circle on the a, comma 
under the t
foreach(uchar u; s) {}

Read "uchar" as "unicode char", essentially dchar, could in fact still be 
named dchar, I just didn't want to mix old/new terminology. The underlying 
type of "string" would be determined at compile time, but still convertible 
using properties (that part I liked very much).

D's "char" should go back to C's char, signed even. Many decisions in D 
were made to ease the porting of C code, so why this "char" got overridden 
beats me. char[] should then behave no differently from byte[] (except 
maybe the element being signed).

L.