Thread overview
String theory in D
Oct 25, 2004
Glen Perkins
Oct 25, 2004
Ben Hinkle
Oct 26, 2004
Glen Perkins
Oct 27, 2004
Ben Hinkle
Oct 27, 2004
Regan Heath
Oct 27, 2004
Ben Hinkle
Oct 27, 2004
Regan Heath
Oct 28, 2004
Ben Hinkle
Oct 28, 2004
Glen Perkins
Oct 28, 2004
Regan Heath
Oct 28, 2004
Walter
Oct 28, 2004
Glen Perkins
Oct 28, 2004
Regan Heath
Oct 28, 2004
Ben Hinkle
Oct 28, 2004
Regan Heath
Oct 29, 2004
Ben Hinkle
Oct 31, 2004
Regan Heath
Nov 03, 2004
ac
Oct 28, 2004
Sean Kelly
Oct 28, 2004
Regan Heath
Oct 29, 2004
Walter
Oct 29, 2004
Regan Heath
Oct 28, 2004
Walter
Oct 28, 2004
Regan Heath
Oct 29, 2004
Walter
Oct 29, 2004
Regan Heath
Oct 29, 2004
Walter
Oct 29, 2004
James McComb
Oct 29, 2004
Walter
Oct 29, 2004
Glen Perkins
Re: String theory in D (tchar)
Nov 06, 2004
Roald Ribe
Oct 29, 2004
Glen Perkins
Oct 25, 2004
Regan Heath
Oct 25, 2004
Regan Heath
Oct 29, 2004
Kevin Bealer
Nov 05, 2004
Lionello Lunesu
Nov 05, 2004
Lionello Lunesu
Oct 26, 2004
Regan Heath
Oct 27, 2004
Glen Perkins
Oct 27, 2004
ac
Oct 30, 2004
J C Calvarese
October 25, 2004
I'd heard a bit about D, but this is the first time I've taken a bit of time to look it over. I'm glad I did, because I love the design.

I am wondering about something, though, and that's the apparent decision to have three different standard string types, each with its encoding exposed to the developer. I've had some experience designing text models--I worked with Sun upgrading Java's string model from UCS-2 to UTF-16 and for Macromedia upgrading the string types within Flash and ColdFusion, for example--but every case has its unique constraints.

I don't know enough about D to be sure of the issues and constraints in this case, but I'm wondering if it wouldn't make sense to have a single standard "String" class for the majority of text handling plus something like char/wchar/dchar/ubyte arrays reserved for special cases.

In both Java and Flash we kept having to throw away brainstorming ideas because they implied changes to internal string implementation details that had unnecessarily--in my opinion--been exposed to programmers. I've become increasingly convinced that programmers don't need to know, much less be forced to decide, how most of their text is encoded. They should be thinking in terms of text semantically most of the time, without concerning themselves with its byte representation.

I see text handling as analogous to memory handling in the sense that I think the time has come to have the platform handle the general cases via automated internal mechanisms that are not exposed, while still allowing programmer manual  intervention for occasional special cases.

D already seems to have this memory model (very nice!), and it seems to me that the corresponding text model would be a single standard "String" class, whose internal encoding was the implementation's business, not the programmer's. The String would have the ability to produce explicitly encoded/formatted byte arrays for special cases, such as I/O, where encoding mattered. I would also want the ability to bypass Strings entirely on some occasions and use byte arrays directly. (By "byte arrays" I mean something like D's existing char[], wchar[], etc.)

Since the internal encoding of the standard String would not be exposed to the programmer, it could be optimized differently on every platform. I would probably implement my String class in UTF-16 on Windows and UTF-8 on Linux to make interactions with the OS and neighboring processes as lightweight as possible.

Then I would probably provide standard function wrappers for common OS calls such as getting directory listings, opening files, etc. These wrapper functions would pass text in the form of Strings. Source code that used only these functions would be portable across platforms, and since String's implementation would be optimized for its platform, this portable source code could produce nearly optimal object code on all platforms.

For calling OS functions directly, where you always need to have your text in a specific format, you could just have your Strings create an explicitly formatted byte sequence for you. A call to a Windows API function might pass something like "my_string.toUTF16()". Since the internal format would probably already be UTF-16, this "conversion" could be optimized away by the compiler, but it would leave you the freedom to change the underlying String implementation in the future without breaking anybody's code.
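In code, the idea might look something like this (a sketch only; the String class, its constructor, and both methods are hypothetical, while std.utf.toUTF16 and std.utf.toUTF8 are real Phobos functions):

import std.utf;

// a minimal sketch of the proposed String: the internal encoding is a
// private implementation choice, handed out in explicit form only on request
class String
{
    private wchar[] data;                        // say, UTF-16 on Windows

    this(char[] s) { data = std.utf.toUTF16(s); }

    wchar[] toUTF16() { return data; }           // already UTF-16: effectively free
    char[]  toUTF8()  { return std.utf.toUTF8(data); }
}

A Windows API call site would then read my_string.toUTF16(), exactly as described above, and the internal representation could change later without touching that call site.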

And, of course, you would still have the ability to use char[], wchar[], dchar[], and even ubyte[] directly when needed for special cases.

Having a single String to use for most text handling would make writing, reading, porting, and maintaining code much easier. Having an underlying encoding that isn't exposed would make it possible for implementers to optimize the standard String for the platform, so that programmers who used it would find code that was easier to write to begin with was also more performant when ported. This has huge implications for the creation of the rich libraries that make or break a language these days.

And if for no other reason, it seems to me that a new language should have a single, standard String class from the start just to avoid degenerating into the tangled hairball of conflicting string types that C++ text handling has become. Library creators and architects working in languages that have had a single, standard String class from the start doggedly use the standard String for everything. You could easily create your own alternative string classes for languages like Java or C#, but almost nobody does. As long as the standard String is good enough, it's just not worth the trouble of having to juggle multiple string types. All libraries and APIs in these languages use a single, consistent text model, which is a big advantage these days over C++.

Again, I realize that I may be overlooking any number of important issues that would make this argument inapplicable or irrelevant in this case, but I'm wondering if this would make sense for D.


October 25, 2004
Glen Perkins wrote:

> I'd heard a bit about D, but this is the first time I've taken a bit of time to look it over. I'm glad I did, because I love the design.
> 
> I am wondering about something, though, and that's the apparent decision to have three different standard string types, each with its encoding exposed to the developer. I've had some experience designing text models--I worked with Sun upgrading Java's string model from UCS-2 to UTF-16 and for Macromedia upgrading the string types within Flash and ColdFusion, for example--but every case has its unique constraints.

welcome.

> I don't know enough about D to be sure of the issues and constraints in this case, but I'm wondering if it wouldn't make sense to have a single standard "String" class for the majority of text handling plus something like char/wchar/dchar/ubyte arrays reserved for special cases.

There is a port of IBM's ICU unicode library underway and that will help fill in various unicode shortcomings of phobos. What else do you see a class doing that isn't in phobos?

> In both Java and Flash we kept having to throw away brainstorming ideas because they implied changes to internal string implementation details that had unnecessarily--in my opinion--been exposed to programmers. I've become increasingly convinced that programmers don't need to know, much less be forced to decide, how most of their text is encoded. They should be thinking in terms of text semantically most of the time, without concerning themselves with its byte representation.

are you referring to indexing and slicing being character lookup and not byte lookup?

> I see text handling as analogous to memory handling in the sense that I think the time has come to have the platform handle the general cases via automated internal mechanisms that are not exposed, while still allowing programmer manual  intervention for occasional special cases.
> 
> D already seems to have this memory model (very nice!), and it seems to me that the corresponding text model would be a single standard "String" class, whose internal encoding was the implementation's business, not the programmer's. The String would have the ability to produce explicitly encoded/formatted byte arrays for special cases, such as I/O, where encoding mattered. I would also want the ability to bypass Strings entirely on some occasions and use byte arrays directly. (By "byte arrays" I mean something like D's existing char[], wchar[], etc.)
> 
> Since the internal encoding of the standard String would not be exposed to the programmer, it could be optimized differently on every platform. I would probably implement my String class in UTF-16 on Windows and UTF-8 on Linux to make interactions with the OS and neighboring processes as lightweight as possible.

Aliases can introduce a symbol that can mean different things on different platforms:

// "Operating System" character
version (Win32) {
 alias wchar oschar;
} else {
 alias char oschar;
}
oschar[] a_string_in_the_OS_preferred_format;

> Then I would probably provide standard function wrappers for common OS calls such as getting directory listings, opening files, etc. These wrapper functions would pass text in the form of Strings. Source code that used only these functions would be portable across platforms, and since String's implementation would be optimized for its platform, this portable source code could produce nearly optimal object code on all platforms.

These should already be in phobos. If the aliases approach is used, all that is required is overloaded versions for char[] or wchar[].
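For example, a directory-listing wrapper might be sketched like this (listdir is a hypothetical name and the bodies are stubs; only the overloading pattern is the point):

version (Win32) { alias wchar oschar; }
else            { alias char  oschar; }

// hypothetical wrappers, overloaded per encoding
char[][]  listdir(char[]  path) { return null; /* e.g. opendir/readdir   */ }
wchar[][] listdir(wchar[] path) { return null; /* e.g. FindFirstFileW    */ }

void example()
{
    oschar[] path = "/tmp";              // literals fit char[] or wchar[]
    oschar[][] entries = listdir(path);  // the right overload per platform
}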

> For calling OS functions directly, where you always need to have your text in a specific format, you could just have your Strings create an explicitly formatted byte sequence for you. A call to a Windows API function might pass something like "my_string.toUTF16()". Since the internal format would probably already be UTF-16, this "conversion" could be optimized away by the compiler, but it would leave you the freedom to change the underlying String implementation in the future without breaking anybody's code.

There exist overloaded versions of std.utf.toUTF16 for char, wchar and dchar arrays. So calling toUTF16(my_string) would do what you propose. Changing the type of my_string would require a recompile but no code change.
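For instance (a sketch building on the oschar alias above; std.utf.toUTF16 is real Phobos, callApi is a hypothetical name):

import std.utf;

version (Win32) { alias wchar oschar; }
else            { alias char  oschar; }

void callApi(oschar[] my_string)
{
    // resolves to the identity overload when my_string is already wchar[],
    // and to a real conversion otherwise - same source either way
    wchar[] w = toUTF16(my_string);
}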

> And, of course, you would still have the ability to use char[], wchar[], dchar[], and even ubyte[] directly when needed for special cases.
> 
> Having a single String to use for most text handling would make writing, reading, porting, and maintaining code much easier. Having an underlying encoding that isn't exposed would make it possible for implementers to optimize the standard String for the platform, so that programmers who used it would find code that was easier to write to begin with was also more performant when ported. This has huge implications for the creation of the rich libraries that make or break a language these days.
> 
> And if for no other reason, it seems to me that a new language should have a single, standard String class from the start just to avoid degenerating into the tangled hairball of conflicting string types that C++ text handling has become. Library creators and architects working in languages that have had a single, standard String class from the start doggedly use the standard String for everything. You could easily create your own alternative string classes for languages like Java or C#, but almost nobody does. As long as the standard String is good enough, it's just not worth the trouble of having to juggle multiple string types. All libraries and APIs in these languages use a single, consistent text model, which is a big advantage these days over C++.
> 
> Again, I realize that I may be overlooking any number of important issues that would make this argument inapplicable or irrelevant in this case, but I'm wondering if this would make sense for D.

One disadvantage of a String class is that the methods of the class are fixed. With arrays and functions anyone can add a string "method". A class will actually reduce flexibility in the eyes of the user IMO. Another disadvantage is that classes in D are by reference (like Java) and so slicing will have to allocate memory - today a slice is a length and pointer to shared data so no allocation is needed. A String struct would be an option if a class isn't used, though.
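That flexibility looks like this in practice (shout is just an illustrative name):

// a free function whose first parameter is char[] ...
char[] shout(char[] s) { return s ~ "!"; }

void example()
{
    char[] t = "hello";
    char[] u = t.shout();  // ... callable with method syntax on the array
}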
October 25, 2004
Glen,

I think you make some very good points. In the past several people have argued for a single string type. Some may even have written one; I know it's on the cards.

In the past I have argued for implicit conversion between the existing string types; this would allow them to be used interchangeably and converted 'on the fly' where required. This idea can have performance issues, as it can cause a lot of excess conversions. My suggestion was in reaction to the impression that the 3 existing types were going to stay.

I think ideally having only one 'string' type would be best. The trick is making it efficient enough for those situations where that sort of thing matters, i.e. embedded software etc.

That said, a well-designed class that could be told what encoding to use internally (if required) might be efficient enough for 99% of cases, and in the last 1% a ubyte[] should perhaps be used?

If that class were to come into existence, I don't see the need for 3 char types; instead ubyte[], ushort[] and uint[] would/could be used by the string class internally to represent the data stored.

It's interesting to hear your views on this. I hope your post draws some of the older NG members with opinions on this out of the woodwork; it's been quiet here the last month or so.

Regan

On Mon, 25 Oct 2004 15:07:30 -0700, Glen Perkins <please.dont@email.com> wrote:
> [full quote of the original post snipped]



-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
October 25, 2004
On Tue, 26 Oct 2004 12:47:44 +1300, Regan Heath <regan@netwin.co.nz> wrote:
> I think you make some very good points. In the past several people have argued for a single string type. Some may even have written one, I know it's on the cards.

To clarify, I believe some people think one is required and will write one to attempt to prove that one is better. AFAIK Walter does not see the need for one and/or believes char, wchar, dchar to be better.

> [rest of the quoted message snipped]



-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
October 26, 2004
Glen Perkins wrote:

> I don't know enough about D to be sure of the issues and constraints in this case, but I'm wondering if it wouldn't make sense to have a single standard "String" class for the majority of text handling plus something like char/wchar/dchar/ubyte arrays reserved for special cases.

Since OOP is *optional* in D, it isn't a given that there should be a *class*?
(a String class is still useful, but not as the main implementation)

As for a "string" type alias, I think that's a very good idea...
http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/11821

> And if for no other reason, it seems to me that a new language should have a single, standard String class from the start just to avoid degenerating into the tangled hairball of conflicting string types that C++ text handling has become. Library creators and architects working in languages that have had a single, standard String class from the start doggedly use the standard String for everything. You could easily create your own alternative string classes for languages like Java or C#, but almost nobody does. As long as the standard String is good enough, it's just not worth the trouble of having to juggle multiple string types. All libraries and APIs in these languages use a single, consistent text model, which is a big advantage these days over C++.

There is no "string" type, and there is no "bool" type in D.
This seems to have been done by design, as Walter's explained ?

The recommended types to use are "char[]" for the usual strings
(even if wchar[] or even dchar[] is sometimes also useful to have)
and "bit" for booleans (even if char and int are sometimes used).

There isn't really a conflict, since all strings are Unicode
and all booleans follow the "zero is false, non-zero is true" rule.
But it does expose the underlying storage and implementation...

It seems the best that can be done at this point are *aliases*?
(and improving upon the D library support in Phobos and Deimos)

--anders
October 26, 2004
I think Glen's thoughts are excellent.

As long as we use D for smallish programs, library development, and such, it may seem obvious to continue using arrays to store sequences of characters (of the size of our choice for the project at hand).

Our aim (at least I think) is to have D usurp C, C++, and to some extent C# and Java. By that time D would be used in the Programming Industry. Once we are there it may seem equally obvious that a programmer should not have to spend time thinking about character sets or widths. A requisite for this is that there is a string class/type that Everyone Uses.

We don't have to skip our current character arrays and library functions, it just means that we really should create a default for the future. And this IMHO should be done pretty much along the lines Glen suggested.

Newcomers to D (newbies as well as Old Pros) should be directed to use this new string. This is what should be prominent and well described in the documentation. And we should move the current text manipulation docs to the hairier sections, right where OS-gurus, embedded programmers, performance pros, and metal-benders go looking. Oh yes, and library developers, too.

The default should be that everyone uses the Default string, and that only profiling should be used to decide whether some snippets should then be programmed with arrays (or whatever), as a last resort.


October 26, 2004
On Tue, 26 Oct 2004 13:34:23 +0200, Anders F Björklund <afb@algonet.se> wrote:
> Glen Perkins wrote:
>
>> I don't know enough about D to be sure of the issues and constraints in this case, but I'm wondering if it wouldn't make sense to have a single standard "String" class for the majority of text handling plus something like char/wchar/dchar/ubyte arrays reserved for special cases.
>
> Since OOP is *optional* in D, it isn't a given that there should be a *class*?
> (a String class is still useful, but not as the main implementation)

In that case, perhaps not a 'class', but a struct as Ben suggested, or, better yet, a built-in type like the current arrays, which we can extend in the same way as we can arrays. I think that is important.

> As for a "string" type alias, I think that's a very good idea...
> http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/11821

I don't like it:

1- I personally find 'utf_8' ugly and nasty to type.

2- The style guide mentions that 'meaningless type aliases should be avoided'. I think aliasing 'char' to 'utf_8' is meaningless because a char is a utf-8 type by definition.

3- I don't want 'more' character types, I want 'less'.

>> And if for no other reason, it seems to me that a new language should have a single, standard String class from the start just to avoid degenerating into the tangled hairball of conflicting string types that C++ text handling has become. Library creators and architects working in languages that have had a single, standard String class from the start doggedly use the standard String for everything. You could easily create your own alternative string classes for languages like Java or C#, but almost nobody does. As long as the standard String is good enough, it's just not worth the trouble of having to juggle multiple string types. All libraries and APIs in these languages use a single, consistent text model, which is a big advantage these days over C++.
>
> There is no "string" type, and there is no "bool" type in D.
> This seems to have been done by design, as Walter's explained ?

Yes and no. Walter has intentionally made the character types UTF ones, IMO a good decision; however, it has created a problem where they are not easily interchangeable, i.e. you have to call conversion functions all the time because some people use one while others use another.

I suggested implicit conversion between them to solve that. Walter sort of liked that idea, but has not done anything about it yet. A better solution IMO would be a single 'string' type which can handle 'being' in any encoding you need.
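The friction being described looks like this (takesWide is a hypothetical function standing in for any library; std.utf.toUTF16 is real):

import std.utf;

void takesWide(wchar[] s) {}   // some library that settled on wchar[]

void example()
{
    char[] s = "hello";        // my code settled on char[]
    takesWide(toUTF16(s));     // manual conversion at every such boundary
}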

> The recommended types to use is "char[]" for the usual strings,
> (even if wchar[] or even dchar[] is sometimes also useful to have)
> and "bit" for booleans. (even if char and int are sometimes used)
>
> There isn't really a conflict, since all strings are Unicode
> and all booleans follow the "zero is false, non-zero is true".
> But it does expose the underlying storage and implementation...

All strings are _not_ Unicode, strings can be in any encoding you want.
D currently has 3 'string' types (char,wchar,dchar) which are all Unicode.

There is no difference in my mind between a char[] and a ubyte[] array, except for the fact that the char[] array remembers that its contents are supposed to be UTF-8 and verifies that on occasion. So, a struct/class/whatever like:

// StringType is implied by the discussion; something like:
enum StringType { Utf8, Utf16, Utf32 }

struct string {
  StringType type;   // which encoding the union currently holds
  union {
    ubyte[]  bs;   // UTF-8 code units
    ushort[] ss;   // UTF-16 code units
    uint[]   us;   // UTF-32 code units (uint, not ulong: code units are 32 bits)
  }
}

could replace char, wchar, and dchar. It could do implicit conversions where required via 'cast' operators (do we have them yet?). It could handle many more encodings than the 3 handled by char, wchar, and dchar.

If such a type existed char, wchar, and dchar would become obsolete, there would be no need for them at all.
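For instance, conversions out of the union might be sketched like this (assuming the struct above and Phobos' std.utf; asUTF8 is an illustrative name, and it could equally be a method or a cast overload):

import std.utf;

char[] asUTF8(string s)
{
    switch (s.type)
    {
        case StringType.Utf8:  return cast(char[]) s.bs;
        case StringType.Utf16: return toUTF8(cast(wchar[]) s.ss);
        case StringType.Utf32: return toUTF8(cast(dchar[]) s.us);
        default: assert(0);
    }
}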

The only weakness a struct has is that you cannot extend it as you can the built-in arrays, e.g.:

void foo(char[] a, int b) {}
char[] bob;
bob.foo(1);  // calls the 'foo' function above, passing 'bob' as the 1st arg

This is a really useful feature; it is why IMO we need a partially built-in solution.

> It seems the best that can be done at this point are *aliases*?
> (and improving upon the D library support in Phobos and Deimos)

We can write a string struct/class/whatever and use that; if it becomes as popular as I imagine it will, it will likely be adopted into Phobos. Basically I'm saying, if we prove it's the right way to go, we just might convince Walter.

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
October 26, 2004
Regan Heath wrote:

> I don't like it:
> 
> 1- I personally find 'utf_8' ugly and nasty to type.

Actually it was utf8_t, utf16_t, utf32_t - but point taken :-)

> 2- The style guide mentions that 'meaningless type aliases should be avoided' I think aliasing 'char' to 'utf_8' is meaningless because a char is a utf-8 type by definition.
> 
> 3- I don't want 'more' character types, I want 'less'.

They were meant to complement the standard int aliases - in stdint.d:
int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t

They were not meant as "pretty", more like: self-explanatory
(explains what type it is: utf/int, and how many bits it is)

Didn't intend to change any built-in type names, like char/wchar/dchar
or byte/short/int/long. Just offer *one* "official" alias for each type.
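I.e. the proposal amounted to something like this (reconstructing it from the names mentioned above):

// modeled on std.stdint's int8_t etc. - one "official" alias per type
alias char  utf8_t;
alias wchar utf16_t;
alias dchar utf32_t;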


What did you think about the "string" (char[]) and "ustring" (wchar[]) ?

> All strings are _not_ Unicode, strings can be in any encoding you want.
> D currently has 3 'string' types (char,wchar,dchar) which are all Unicode.

I meant the string types that interact with "quotes" and the ~ operator.

You are right in that one *could* store strings in ubyte[] or void[]...

> If such a type existed char, wchar, and dchar would become obsolete, there would be no need for them at all.

Unless you like type safety ? As in: chars and ints being different ?

They are of the same bit size as ubyte, ushort and uint - that's true.

> We can write a string struct/class/whatever and use that; if it becomes as popular as I imagine it will, it will likely be adopted into Phobos. Basically I'm saying, if we prove it's the right way to go, we just might convince Walter.

Currently Walter *has* picked the char[] type as the basic string type.
Deimos has, inspired by the ICU library, picked wchar[] as the basis...
(difference being that char[] is best for ASCII, wchar[] for Unicode)

Says http://oss.software.ibm.com/icu/userguide/icufaq.html:
> UTF-8 is 50% smaller than UTF-16 for US-ASCII, but UTF-8 is
> 50% larger than UTF-16 for East and South Asian scripts.
> There is no memory difference for Latin extensions, [...]


I just thought "main(string[] args)" better than "main(char[][] args)" ?
(just as I think the "bool" alias to be better than the built-in "bit")
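The alias itself would be one line (a sketch; no such alias ships in Phobos today):

alias char[] string;      // one "official" name for the default string type

void main(string[] args)  // reads better than main(char[][] args)
{
}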

But I'm not sure I like a "magic" class with a hidden run-time cost...

--anders
October 26, 2004
"Ben Hinkle" <bhinkle4@juno.com> wrote in message
news:clk269$haj$1@digitaldaemon.com...
> Glen Perkins wrote:
>
> welcome.

Thanks.

> There is a port of IBM's ICU unicode library underway and that will help fill in various unicode shortcomings of phobos. What else do you see a class doing that isn't in phobos?

I don't know enough to comment at this point. I don't even know how modularity works for compiled executables in D, and I don't want to propose something that would violate D's priorities by, for example, creating a heavyweight string full of ICU features that would end up being statically linked into every little "hello, world" written in D, ruining the goal of tiny executables if, for example, that is a high priority in D.

If there's no chance of a standard string class for general string operations in D, then there's no point in designing one. If there is a chance, then the design would have to start with the priorities and constraints of this particular language.

My sense is that a string class similar to that in C#, but noncommittal regarding its internal encoding, would be nice for a language like D.

>> ...I've become increasingly convinced that programmers don't need to know, much less be forced to decide, how most of their text is encoded. They should be thinking in terms of text semantically most of the time, without concerning themselves with its byte representation.
>
> are you referring to indexing and slicing being character lookup and not byte lookup?

Yes, that's a specific example of what I'm referring to, which is the general notion of just thinking about algorithms for working with the text in terms of text itself without regard to how the computer might be representing that text inside (except in the minority of cases where you MUST work explicitly with the representation).

And though it's probably too radical for D (so nobody freak out), we may well evolve to the point where the most reasonable default for walking through the "characters" in general text data is something like 'foreach char ch in mystring do {}', where the built-in "char" datatype in the language is a variable-length entity designed to hold a complete grapheme. Only where optimization was required would you drop down to the level of 'foreach codepoint cp in mytext do {}', where mytext was defined as 'codepoint[] mytext', or even more radically to 'foreach byte b in mytext do {}', where mytext was defined as 'byte[] mytext'.
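As it happens, D's foreach can already express the middle level of this, if I'm reading the spec right: iterating a char[] with a dchar loop variable decodes whole code points (graphemes would still need library support on top). A sketch:

void example()
{
    char[] mytext = "naïve";    // stored as UTF-8

    foreach (dchar cp; mytext)  // decoded: one iteration per code point
    {
        // cp is a complete code point, never a UTF-8 fragment
    }

    foreach (char b; mytext)    // raw: one iteration per UTF-8 code unit
    {
    }
}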

Once again, I'm not proposing that for D; I'm just promoting the general notion of keeping the developer's mind on the text and off of the representation details to the extent that it is *reasonable*.


>> Since the internal encoding of the standard String would not be exposed to the programmer, it could be optimized differently on every platform. I would probably implement my String class in UTF-16 on Windows and UTF-8 on Linux to make interactions with the OS and neighboring processes as lightweight as possible.
>
> Aliases can introduce a symbol that can mean different things on different platforms:
>
> // "Operating System" character
> version (Win32) {
>     alias wchar oschar;
> } else {
>     alias char oschar;
> }
> oschar[] a_string_in_the_OS_preferred_format;

Thanks for pointing out this feature. I like it. It provides a mechanism for manual optimization at the cost of greater complexity for those special cases where optimization is called for. You could have different string representations for different zones in your app, labeled by zone name: oschar for internal and OS API calls, xmlchar for an XML I/O boundary etc., so you could change the OS format from OS to OS while leaving the XML format unchanged.

I can't help thinking, though, that it would be best reserved for optimization cases, with a simple works-everywhere, called "string" everywhere, string class for the general case. Otherwise, your language tutorials would be teaching you that a string is "char[]" but real production code would almost always be based on locally-invented names for string types. Libraries, which are also trying hard to be real production quality code, would use the above alias approach and invent their own names. Not just at points you needed to manually optimize but literally everywhere you did anything with a string internally, you'd have to choose among the three standard names, char, wchar, and dchar, plus your own custom oschar and xmlchar, plus your GUI library's gchar or kchar, and your ICU library's unichar, plus a database orachar designed to match the database encoding, etc.

You could easily end up with so many conversions going on between types locally optimized for each zone in your app that you are globally unoptimized.

> One disadvantage of a String class is that the methods of the class are fixed. With arrays and functions anyone can add a string "method". A class will actually reduce flexibility in the eyes of the user IMO. Another disadvantage is that classes in D are by reference (like Java) and so slicing will have to allocate memory - today a slice is a length and pointer to shared data so no allocation is needed. A String struct would be an option if a class isn't used, though.

It's true what you're saying about the relative lack of flexibility of built-in methods vs. external functions. You can always apply functions to strings, though, and the conservative approach would be to have a few clearly important methods in the string, implement other operations as functions that take string arguments, and over time consider migrating those operations into the string itself.

Another possibility might be to have this "oschar" approach above actually built-in, with everybody (starting from the first "hello, world" tutorial) encouraged to use that one by default. That's tricky, though, because when you asked for mystring[3] from your oschar-based string, what would you get? People would expect the third text character, but as you know it would depend on the platform, and would not have any useful meaning in general, which seems pretty awkward for a standard string. It doesn't seem very useful to present something in an array format without the individual elements of the array being very useful. You could make them useful by making dchar[] the default, but everybody would probably fuss about the wasted memory, and production code would end up using char or wchar. So that brings us back to a string class where operator overloading could make the [] array-type access yield consistent, complete codepoints on every platform.
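That last idea, sketched (a struct for brevity; std.utf's stride and decode are real Phobos functions, everything else is illustrative):

import std.utf;

// [] yields a complete code point on every platform, whatever the storage
struct String
{
    char[] data;  // suppose the implementation chose UTF-8

    dchar opIndex(size_t n)  // note: O(n), unlike plain array indexing
    {
        size_t i = 0;
        while (n--)
            i += stride(data, i);  // bytes in the code point starting at i
        return decode(data, i);    // the code point starting at i
    }
}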

I'm sympathetic to performance arguments. That would be one of the big attractions of D. I still can't help thinking that sticking to a single string class shared by almost all of your tutorials, your own code, your downloaded snippets, and all of your libraries might not only be the easiest for programmers to work with but could result in apps that tended to be at least as performant as the existing approach.

October 27, 2004
"Anders F Björklund" <afb@algonet.se> wrote in message news:clmimg$fvd$1@digitaldaemon.com...


> What did you think about the "string" (char[]) and "ustring" (wchar[]) ?

I don't think you were asking me, but my concern applies to any "let a hundred flowers bloom" design approach for strings. If you have multiple string types with no dominant leader, plus an "alias" feature, plus strong support for OOP but no standard string class, you are almost begging for a crazy quilt landscape of diverse and incompatible string types. I'd be concerned that most large applications would end up dealing with more string types than they wanted with no significant performance gains to show for it.

> Currently Walter *has* picked the char[] type as the basic string type.
> Deimos has, inspired by the ICU library, picked wchar[] as the basis...
> (difference being that char[] is best for ASCII, wchar[] for Unicode)
>
> Says http://oss.software.ibm.com/icu/userguide/icufaq.html:
>> UTF-8 is 50% smaller than UTF-16 for US-ASCII, but UTF-8 is
>> 50% larger than UTF-16 for East and South Asian scripts.
>> There is no memory difference for Latin extensions, [...]

There is so much room for "well, not necessarily" in all of these statements, most programmers understand the issues so little, and it usually matters so little, that it's a bit unfortunate to have a design that *requires* programmers to repeatedly make this decision. Different people, even smart ones, will choose differently, choices that may as well be random for all the difference it usually makes. Once again, I'm afraid that code will get more complicated than necessary with no compensating payoff. And I couldn't avoid the complexity by just choosing wisely myself, because every library author would be free to make his own decisions, and you need a lot of libraries to make a language useful. I could have unnecessary and performance-sapping format conversions taking place at every library call.




