View mode: basic / threaded / horizontal-split · Log in · Help
August 25, 2008
Why Strings as Classes?
In another thread (about array append performance) I mentioned that 
Strings ought to be implemented as classes rather than as simple builtin
arrays. Superdan asked why. Here's my response...

I'll start with a few of the softball, easy reasons.

For starters, with strings implemented as character arrays, writing 
library code that accepts and operates on strings is a bit of a pain in 
the neck, since you always have to write templates and template code is 
slightly less readable than non-template code. You can't distribute your 
code as a DLL or a shared object, because the template instantiations 
won't be included (unless you create wrapper functions with explicit 
template instantiations, bloating your code size, but more importantly 
tripling the number of functions in your API).

Another good low-hanging argument is that strings are frequently used as 
keys in associative arrays. Every insertion and retrieval in an 
associative array requires a hashcode computation. And since D strings 
are just dumb arrays, they have no way of memoizing their hashcodes. 
We've already observed that D assoc arrays are less performant than even 
Python maps, so the extra cost of lookup operations is unwelcome.

But much more important than either of those reasons is the lack of 
polymorphism on character arrays. Arrays can't have subclasses, and they 
can't implement interfaces.

A good example of what I'm talking about can be seen in the Phobos and 
Tango regular expression engines. At least the Tango implementation 
matches against all string types (the Phobos one only works with char[] 
strings).

But what if I want to consume a 100 MB logfile, counting all lines that 
match a pattern?

Right now, to use the either regex engine, I have to read the entire 
logfile into an enormous array before invoking the regex search function.

Instead, what if there was a CharacterStream interface? And what if all 
the text-handling code in Phobos & Tango was written to consume and 
return instances of that interface?

A regex engine accepting a CharacterStream interface could process text 
from string literals, file input streams, socket input streams, database 
records, etc, etc, etc... without having to pollute the API with a bunch 
of casts, copies, and conversions. And my logfile processing application 
would consume only a tiny fraction of the memory needed by the character 
array implementation.

Most importantly, the contract between the regex engine and its 
consumers would provide a well-defined interface for processing text, 
regardless of the source or representation of that text.

Along a similar vein, I've worked on a lot of parsers over the past few 
years, for domain specific languages and templating engines, and stuff 
like that. Sometimes it'd be very handy to define a "Token" class that 
behaves exactly like a String, but with some additional behavior. 
Ideally, I'd like to implement that Token class as an implementor of the 
CharacterStream interface, so that it can be passed directly into other 
text-handling functions.

But, in D, with no polymorphic text handling, I can't do that.

As one final thought... I suspect that mutable/const/invariant string 
handling would be much more conveniently implemented with a 
MutableCharacterStream interface (as an extended interface of 
CharacterStream).

Any function written to accept a CharacterStream would automatically 
accept a MutableCharacterStream, thanks to interface polymorphism, 
without any casts, conversions, or copies. And various implementors of 
the interface could provide buffered implementations operating on 
in-memory strings, file data, or network data.

Coding against the CharacterStream interface, library authors wouldn't 
need to worry about const-correctness, since the interface wouldn't 
provide any mutator methods.

But then again, I haven't used any of the const functionality in D2, so 
I can't actually comment on relative usability of compiler-enforced 
immutability versus interface-enforced immutability.

Anyhow, those are some of my thoughts... I think there are a lot of 
compelling reasons for de-coupling the specification of string handling 
functionality from the implementation of that functionality, primarily 
for enabling polymorphic text-processing.

But memoized hashcodes would be cool too :-)

--benji
August 25, 2008
Re: Why Strings as Classes?
Oh, man, I forgot one other thing... And it's a biggie...

The D _Arrays_ page says that "A string is an array of characters. 
String literals are just an easy way to write character arrays."

http://digitalmars.com/d/1.0/arrays.html

In my previous post, I also use the "character array" terminology.

Unfortunately, though, it's just not true.

A char[] is actually an array of UTF-8 encoded octets, where each 
character may consume one or more consecutive elements of the array. 
Retrieving the str.length property may or may not tell you how many 
characters are in the string. And pretty much any code that tries to 
iterate character-by-character through the array elements is 
fundamentally broken.

Take a look at this code, for example:

------------------------------------------------------------------
import tango.io.Stdout;

void main() {

   // Create a string with UTF-8 content
   char[] str = "mötley crüe";
   Stdout.formatln("full string value: {}", str);

   Stdout.formatln("len: {}", str.length);
   // --> "len: 13" ... but there are only 11 characters!

   Stdout.formatln("2nd char: '{}'", str[1]);
   // --> "2nd char: ''" ... where'd my character go?

   Stdout.formatln("first 3 chars: '{}'", str[0..3]);
   // --> "first 3 chars: 'mö'" ... why only 2?

   char o_umlat = 'ö';
   Stdout.formatln("char value: '{}'", o_umlat);
   // --> "char value: ''" ... where's my char?

}
------------------------------------------------------------------

So you can't actually iterate the the char elements of a char[] without 
risking that you'll turn your string data into garbage. And you can't 
trust that the length property tells you how many characters there are. 
And you can't trust that an index or a slice will return valid data.

Also: take a look at the Phobos string "find" functions:

  int find(char[] s, dchar c);
  int ifind(char[] s, dchar c);
  int rfind(char[] s, dchar c);
  int irfind(char[] s, dchar c);

Huh?

To find a character in a char[] array, you have to use a dchar?

To me, that's like looking for a long within an int[] array.

So.. If a char[] actually consists of dchar elements, does that mean I 
can append a dchar to a char[] array?

  dchar u_umlat = 'ü';
  char[] newString = "mötley crüe" ~ u_umlat;

No. Of course not. The compiler complains that you can't concatenate a 
dchar to a char[] array. Even though the "find" functions indicate that 
the array is truly a collection of dchar elements.

Now, don't get me wrong. I understand why the string is encoded as 
UTF-8. And I understand that the encoding prevents accurate element 
iteration, indexing, slicing, and all the other nice array goodies.

The existing D string implementation is exactly what I'd expect to see 
inside the guts of a string class, because encodings are important and 
efficiency is important. But those implementation details shouldn't be 
exposed through a public API.

To claim that D strings are actually usable as character arrays is more 
than a little spurious, since direct access of the array elements can 
return fragmented garbage bytes.

If accurate string manipulation is impossible without a set of 
special-purpose functions, then I'll argue that the implementation is 
already equivalent to that of a class, but without any of the niceties 
of encapsulation and polymorphism.

--benji
August 25, 2008
Re: Why Strings as Classes?
Benji Smith Wrote:

> In another thread (about array append performance) I mentioned that 
> Strings ought to be implemented as classes rather than as simple builtin
> arrays. Superdan asked why. Here's my response...

well then allow me to retort.

> I'll start with a few of the softball, easy reasons.
> 
> For starters, with strings implemented as character arrays, writing 
> library code that accepts and operates on strings is a bit of a pain in 
> the neck, since you always have to write templates and template code is 
> slightly less readable than non-template code. You can't distribute your 
> code as a DLL or a shared object, because the template instantiations 
> won't be included (unless you create wrapper functions with explicit 
> template instantiations, bloating your code size, but more importantly 
> tripling the number of functions in your API).

so u mean with a class the encoding char/wchar/dchar won't be an issue anymore. that would be hidden behind the wraps. cool.

problem is that means there's an indirection cost for every character access. oops. so then apps that decided to use some particular encoding consistently must pay a price for stuff they don't use.

but if u have strings like today it's a no-brainer to define a class that does all that stuff. u can then use that class whenever you feel. it would be madness to put that class in the language definition. at best it's a candidate for the stdlib.

so that low-hangin' argument of yers ain't that low-hangin' after all. unless u call a hanged deadman low-hangin'.

> Another good low-hanging argument is that strings are frequently used as 
> keys in associative arrays. Every insertion and retrieval in an 
> associative array requires a hashcode computation. And since D strings 
> are just dumb arrays, they have no way of memoizing their hashcodes. 
> We've already observed that D assoc arrays are less performant than even 
> Python maps, so the extra cost of lookup operations is unwelcome.

again you want to build larger components from smaller components. you can build a string with memoized hashcode from a string without memoized hashcode. but you can't build a string without memoized hashcode from a string with memoized hashcode. but wait there's more. the extra field is paid for regardless. so what numbers do you have to back up your assertion that it's worth paying that cost for everything except hashtables.

> But much more important than either of those reasons is the lack of 
> polymorphism on character arrays. Arrays can't have subclasses, and they 
> can't implement interfaces.

that's why you can always define a class that does all those good things.

by the same arg why isn't int a class. the point is you can always create class Int that does what an int does, slower but more flexible. if all you had was class Int you'd be in slowland.

> A good example of what I'm talking about can be seen in the Phobos and 
> Tango regular expression engines. At least the Tango implementation 
> matches against all string types (the Phobos one only works with char[] 
> strings).
> 
> But what if I want to consume a 100 MB logfile, counting all lines that 
> match a pattern?
>
> Right now, to use the either regex engine, I have to read the entire 
> logfile into an enormous array before invoking the regex search function.
> 
> Instead, what if there was a CharacterStream interface? And what if all 
> the text-handling code in Phobos & Tango was written to consume and 
> return instances of that interface?

what exactly is the problem there aside from a library issue.

> A regex engine accepting a CharacterStream interface could process text 
> from string literals, file input streams, socket input streams, database 
> records, etc, etc, etc... without having to pollute the API with a bunch 
> of casts, copies, and conversions. And my logfile processing application 
> would consume only a tiny fraction of the memory needed by the character 
> array implementation.

library problem. or maybe you want to build character stream into the language too.

> Most importantly, the contract between the regex engine and its 
> consumers would provide a well-defined interface for processing text, 
> regardless of the source or representation of that text.
> 
> Along a similar vein, I've worked on a lot of parsers over the past few 
> years, for domain specific languages and templating engines, and stuff 
> like that. Sometimes it'd be very handy to define a "Token" class that 
> behaves exactly like a String, but with some additional behavior. 
> Ideally, I'd like to implement that Token class as an implementor of the 
> CharacterStream interface, so that it can be passed directly into other 
> text-handling functions.
> 
> But, in D, with no polymorphic text handling, I can't do that.

of course you can. you just don't want to for the sake of building a fragile argument.

> As one final thought... I suspect that mutable/const/invariant string 
> handling would be much more conveniently implemented with a 
> MutableCharacterStream interface (as an extended interface of 
> CharacterStream).
> 
> Any function written to accept a CharacterStream would automatically 
> accept a MutableCharacterStream, thanks to interface polymorphism, 
> without any casts, conversions, or copies. And various implementors of 
> the interface could provide buffered implementations operating on 
> in-memory strings, file data, or network data.
> 
> Coding against the CharacterStream interface, library authors wouldn't 
> need to worry about const-correctness, since the interface wouldn't 
> provide any mutator methods.

sounds great. so then go ahead and make the characterstream thingie. the language gives u everything u need to make it clean and fast.

> But then again, I haven't used any of the const functionality in D2, so 
> I can't actually comment on relative usability of compiler-enforced 
> immutability versus interface-enforced immutability.
> 
> Anyhow, those are some of my thoughts... I think there are a lot of 
> compelling reasons for de-coupling the specification of string handling 
> functionality from the implementation of that functionality, primarily 
> for enabling polymorphic text-processing.
> 
> But memoized hashcodes would be cool too :-)

sorry dood each and every argument talks straight against your case. if i had any doubts, you just convinced me that a builtin string class would be a mistake.
August 26, 2008
Re: Why Strings as Classes?
Benji Smith Wrote:

> Oh, man, I forgot one other thing... And it's a biggie...
> 
> The D _Arrays_ page says that "A string is an array of characters. 
> String literals are just an easy way to write character arrays."
> 
> http://digitalmars.com/d/1.0/arrays.html
> 
> In my previous post, I also use the "character array" terminology.
> 
> Unfortunately, though, it's just not true.
> 
> A char[] is actually an array of UTF-8 encoded octets, where each 
> character may consume one or more consecutive elements of the array. 
> Retrieving the str.length property may or may not tell you how many 
> characters are in the string. And pretty much any code that tries to 
> iterate character-by-character through the array elements is 
> fundamentally broken.

try this:

foreach (dchar c; str)
{
   process c
}

> Take a look at this code, for example:
> 
> ------------------------------------------------------------------
> import tango.io.Stdout;
> 
> void main() {
> 
>     // Create a string with UTF-8 content
>     char[] str = "mötley crüe";
>     Stdout.formatln("full string value: {}", str);
> 
>     Stdout.formatln("len: {}", str.length);
>     // --> "len: 13" ... but there are only 11 characters!
> 
>     Stdout.formatln("2nd char: '{}'", str[1]);
>     // --> "2nd char: ''" ... where'd my character go?
> 
>     Stdout.formatln("first 3 chars: '{}'", str[0..3]);
>     // --> "first 3 chars: 'mö'" ... why only 2?
> 
>     char o_umlat = 'ö';
>     Stdout.formatln("char value: '{}'", o_umlat);
>     // --> "char value: ''" ... where's my char?
> 
> }
> ------------------------------------------------------------------
> 
> So you can't actually iterate the the char elements of a char[] without 
> risking that you'll turn your string data into garbage. And you can't 
> trust that the length property tells you how many characters there are. 
> And you can't trust that an index or a slice will return valid data.

you can iterate with foreach or lib functions. an index or slice won't return valid data indeed, but it couldn't anyway. there's no o(1) indexing into a string unless it's utf32.

> Also: take a look at the Phobos string "find" functions:
> 
>    int find(char[] s, dchar c);
>    int ifind(char[] s, dchar c);
>    int rfind(char[] s, dchar c);
>    int irfind(char[] s, dchar c);
> 
> Huh?
> 
> To find a character in a char[] array, you have to use a dchar?
> 
> To me, that's like looking for a long within an int[] array.

because you're wrong. you look for a dchar which can represent all characters in an array of a given encoding. the comparison is off.

> So.. If a char[] actually consists of dchar elements, does that mean I 
> can append a dchar to a char[] array?
> 
>    dchar u_umlat = 'ü';
>    char[] newString = "mötley crüe" ~ u_umlat;
> 
> No. Of course not. The compiler complains that you can't concatenate a 
> dchar to a char[] array. Even though the "find" functions indicate that 
> the array is truly a collection of dchar elements.

that's a bug in the compiler. report it.

> Now, don't get me wrong. I understand why the string is encoded as 
> UTF-8. And I understand that the encoding prevents accurate element 
> iteration, indexing, slicing, and all the other nice array goodies.

i know you understand. you should also understand 

> The existing D string implementation is exactly what I'd expect to see 
> inside the guts of a string class, because encodings are important and 
> efficiency is important. But those implementation details shouldn't be 
> exposed through a public API.

exactly at this point your argument kinda explodes. yes, you should see that stuff inside the guts of a string. which means builtin strings should be just arrays that you build larger stuff from. but wait. that's exactly what happens right now.

> To claim that D strings are actually usable as character arrays is more 
> than a little spurious, since direct access of the array elements can 
> return fragmented garbage bytes.

agreed.

> If accurate string manipulation is impossible without a set of 
> special-purpose functions, then I'll argue that the implementation is 
> already equivalent to that of a class, but without any of the niceties 
> of encapsulation and polymorphism.

and without the disadvantages.
August 26, 2008
Re: Why Strings as Classes?
Reply to superdan,

> sorry dood each and every argument talks straight against your case.
> if i had any doubts, you just convinced me that a builtin string class
> would be a mistake.
> 

OTOH as standard lib string class...
August 26, 2008
Re: Why Strings as Classes?
Reply to superdan,

>> The existing D string implementation is exactly what I'd expect to see
>> inside the guts of a string class, because encodings are important and
>> efficiency is important. But those implementation details shouldn't be
>> exposed through a public API.
> 
> exactly at this point your argument kinda explodes. yes, you should
> see that stuff inside the guts of a string. which means builtin
> strings should be just arrays that you build larger stuff from. but
> wait. that's exactly what happens right now.
> 

Ditto, D is a *systems language* It's *supposed* to have access to the lowest 
level representation and build stuff on top of that
August 26, 2008
Re: Why Strings as Classes?
Sorry, dan. You're wrong.

superdan wrote:
> again you want to build larger components from smaller components.

Good APIs define public interfaces and hide implementation details, 
usually providing a default general-purpose implementation while 
allowing third-parties to define special-purpose implementations to suit 
their needs.

In D, the text handling is defined by the data format (unicode byte 
sequences) rather than by the interface, while providing no polymorphic 
mechanism for alternate implementations.

It's the opposite of good API design.

The regular expression engine accepts character arrays. And that's it. 
You want to match on a regex pattern, character-by-character, from an 
input stream? Tough nuts. It's not possible.

The new JSON parser in the Tango library operates on templated string 
arrays. If I want to read from a file or a socket, I have to first slurp 
the whole thing into a character array, even though the 
character-streaming would be more practical.

Parsers, formatters, console IO, socket IO... Anything that provides an 
iterable sequence of characters ought to comply with an interface 
facilitating polymorphic text processing. In some cases, there might be 
a slight memory/speed tradeoff. But in many more cases, the benefit of 
iterating over the transient characters in a stream would be much faster 
and require a tiny fraction of the memory of the character array.

There are performance benefits to be found on both sides of the coin.

Anyhow, I totally agree with you when you say that "larger components" 
should be built from "smaller components".

But the "small components" are the *interfaces*, not the implementation 
details.

--benji
August 26, 2008
Re: Why Strings as Classes?
BCS wrote:
> Ditto, D is a *systems language* It's *supposed* to have access to the 
> lowest level representation and build stuff on top of that

But in this "systems language", it's a O(n) operation to get the nth 
character from a string, to slice a string based on character offsets, 
or to determine the number of characters in the string.

I'd gladly pay the price of a single interface vtable lookup to turn all 
of those into O(1) operations.

--benji
August 26, 2008
Re: Why Strings as Classes?
Benji Smith Wrote:

> Sorry, dan. You're wrong.

well just sayin' it ain't makin' it. hope there's some in the way of a proof henceforth.

> superdan wrote:
> > again you want to build larger components from smaller components.
> 
> Good APIs define public interfaces and hide implementation details, 
> usually providing a default general-purpose implementation while 
> allowing third-parties to define special-purpose implementations to suit 
> their needs.

sure thing. all this gave me a warm fuzzy feelin' right there.

> In D, the text handling is defined by the data format (unicode byte 
> sequences) rather than by the interface, while providing no polymorphic 
> mechanism for alternate implementations.
> 
> It's the opposite of good API design.

wait a minute. you got terms all mixed up. and that's a recurring problem that mashes your argument badly. first you say `in d'. so then i assume there's a problem in the language. but then you go on with what looks like a library thing. sure enuff you end with `api' which is 100% a library thingy.

so if you have any beef say `in phobos' or `in tango'. `in d' must refer to the language proper.

the core lang is supposed to give you the necessary stuff to do that nice api design that gave me half an erection in ur first paragraph. if the language gives me the classes and apis and stuff then things will be slow for everyone and no viagra and no herbal supplement ain't gonna make stuff hard around here.

i think d strings are the cleverest thing yet. not too low level like pointers. not too high level like classes and stuff. just the right size. pun not intended.

> The regular expression engine accepts character arrays. And that's it. 
> You want to match on a regex pattern, character-by-character, from an 
> input stream? Tough nuts. It's not possible.

there are multiple problems with this argument of yours. first it's that you swap horses in the midstream. that, according to the cowboy proverb, is unrecommended. you switch from strings to streams. don't.

about streams. same in perl. it's a classic problem. you must first read as much text as you know could match. then you match. and nobody complains.

what you say is possible. it hasn't been written that way 'cause it's damn hard. motherintercoursin' hard. regexes must backtrack and when they do they need pronto access to the stuff behind. if that's in a file, yeah `tough nuts' eh.

> The new JSON parser in the Tango library operates on templated string 
> arrays. If I want to read from a file or a socket, I have to first slurp 
> the whole thing into a character array, even though the 
> character-streaming would be more practical.

non seqitur. see stuff above.

> Parsers, formatters, console IO, socket IO... Anything that provides an 
> iterable sequence of characters ought to comply with an interface 
> facilitating polymorphic text processing. In some cases, there might be 
> a slight memory/speed tradeoff. But in many more cases, the benefit of 
> iterating over the transient characters in a stream would be much faster 
> and require a tiny fraction of the memory of the character array.

so what u want is a streaming interface and an implementation of it that spans a string. for the mother of god i can't figure what this has to do with the language, and what stops you from writing it.

> There are performance benefits to be found on both sides of the coin.

no. all benefits only of my side of the coin. my side of the coin includes your side of the coin.

> Anyhow, I totally agree with you when you say that "larger components" 
> should be built from "smaller components".

cool.

> But the "small components" are the *interfaces*, not the implementation 
> details.

quite when i thought i drove a point home... dood we need to talk. you have all core language, primitives, libraries, and app code confused.
August 26, 2008
Re: Why Strings as Classes?
Reply to Benji,

> BCS wrote:
> 
>> Ditto, D is a *systems language* It's *supposed* to have access to
>> the lowest level representation and build stuff on top of that
>> 
> But in this "systems language", it's a O(n) operation to get the nth
> character from a string, to slice a string based on character offsets,
> or to determine the number of characters in the string.
> 
> I'd gladly pay the price of a single interface vtable lookup to turn
> all of those into O(1) operations.
> 
> --benji
> 

Then borrow, buy, steal or build a class that does that /on top of the D 
arrays/

No one has said that this should not be available, just that it should not 
/replace/ what is available
« First   ‹ Prev
1 2 3 4 5
Top | Discussion index | About this forum | D home