August 26, 2008
BCS wrote:
> Reply to Benji,
> 
>> BCS wrote:
>>
>>> Ditto, D is a *systems language* It's *supposed* to have access to
>>> the lowest level representation and build stuff on top of that
>>>
>> But in this "systems language", it's a O(n) operation to get the nth
>> character from a string, to slice a string based on character offsets,
>> or to determine the number of characters in the string.
>>
>> I'd gladly pay the price of a single interface vtable lookup to turn
>> all of those into O(1) operations.
>>
>> --benji
>>
> 
> Then borrow, buy, steal or build a class that does that /on top of the D arrays/
> 
> No one has said that this should not be available, just that it should not /replace/ what is available

The point is that the new string class would be incompatible with the *hundreds* of existing functions that process character arrays.

Why don't strings qualify for polymorphism?

Am I the only one who thinks the existing tradeoff is a fool's bargain?

--benji
August 26, 2008
Benji Smith Wrote:

> BCS wrote:
> > Ditto, D is a *systems language* It's *supposed* to have access to the lowest level representation and build stuff on top of that
> 
> But in this "systems language", it's a O(n) operation to get the nth character from a string, to slice a string based on character offsets, or to determine the number of characters in the string.
> 
> I'd gladly pay the price of a single interface vtable lookup to turn all of those into O(1) operations.

dood. i dunno where to start. allow me to answer from multiple angles.

1. when was the last time looking up one char in a string or computing length was your bottleneck.

2. you talk as if o(1) happens by magic that d currently disallows.

3. maybe i don't want to blow the size of my string by a factor of 4 if i'm just interested in some occasional character search.

4. implement all that nice stuff you wanna. nobody put a gun to yer head not to. understand you can't put a gun to my head to pay the price.
August 26, 2008
Reply to Benji,

> BCS wrote:
> 
>> Reply to Benji,
>> 
>>> BCS wrote:
>>> 
>>>> Ditto, D is a *systems language* It's *supposed* to have access to
>>>> the lowest level representation and build stuff on top of that
>>>> 
>>> But in this "systems language", it's a O(n) operation to get the nth
>>> character from a string, to slice a string based on character
>>> offsets, or to determine the number of characters in the string.
>>> 
>>> I'd gladly pay the price of a single interface vtable lookup to turn
>>> all of those into O(1) operations.
>>> 
>>> --benji
>>> 
>> Then borrow, buy, steal or build a class that does that /on top of
>> the D arrays/
>> 
>> No one has said that this should not be available, just that it
>> should not /replace/ what is available
>> 
> The point is that the new string class would be incompatible with the
> *hundreds* of existing functions that process character arrays.
> 

That is an issue with (and *only* with) "the *hundreds* of existing functions that process character arrays".


August 26, 2008
superdan wrote:
>> But the "small components" are the *interfaces*, not the implementation details.
> 
> quite when i thought i drove a point home... dood we need to talk. you have all core language, primitives, libraries, and app code confused.

The standard libraries are in a grey area between language the language spec and application code. There are all sorts implicit "interfaces" in exposed by the builtin types (and there's also plenty of core language functionality implemented in the standard lib... take the GC, for example).

You act there's no such thing as an interface for a builtin language feature.

With strings implemented as raw arrays, they take on the array API...

slicing: broken
indexing: busted
iterating: fucked
length: you guessed it

I don't think the internals of the string representation should be any different. UTF-8 arrays? Fine by me. Just don't make me look at the malformed, mis-sliced bytes. Provide an API (yes, implemented in the standard lib, but specified by the language spec) that actually makes sense for text data.

(Incidentally, this is the same reason I think the builtin dynamic arrays should be classes implementing a standard List interface, and the associative arrays should be classes implementing a Map interface. The language implementations are nice, but they're not polymorphic, and that makes it a pain in the ass to extend them.)

--benji
August 26, 2008
Reply to Benji,


> The new JSON parser in the Tango library operates on templated string
> arrays. If I want to read from a file or a socket, I have to first
> slurp the whole thing into a character array, even though the
> character-streaming would be more practical.
> 

Unless you are only going to parse the start of the file or are going to be throwing away most of it *while you parse it, not after* The best way to parse a file is to load it all in one OS system call and then run a slicing parser (like the Tango XML parser) on that. 

One memory allocation and one load or a mmap, and then only the meta structures get allocated later.


August 26, 2008
superdan wrote:
> but if u have strings like today it's a no-brainer to define a class that does all that stuff. u can then use that class whenever you feel. it would be madness to put that class in the language definition. at best it's a candidate for the stdlib.

Instead, the runtime has to know how to convert between utf8, utf16, and utf32. Encodings are not a trivial matter, either.
August 26, 2008
Reply to Benji,

> slicing: broken

works as defined

> indexing: busted

works as defined

> iterating: fucked

works as defined and with foreach(dchar) as you want

> length: you guessed it

works as defined

> Provide an API (yes, implemented in the
> standard lib, but specified by the language spec) that actually makes
> sense for text data.

BTW everything in phobos under std.* is /not part of the D language spec/.

> 
> (Incidentally, this is the same reason I think the builtin dynamic
> arrays should be classes implementing a standard List interface, and
> the associative arrays should be classes implementing a Map interface.
> The language implementations are nice, but they're not polymorphic,
> and that makes it a pain in the ass to extend them.)
> 

A system language MUST have arrays that are not classes or anything near that thick. If you must have that sort of interface, pick a different language, because D isn't intended to work that way.

> --benji
> 


August 26, 2008
superdan wrote:
> Benji Smith Wrote:
> 
>> BCS wrote:
>>> Ditto, D is a *systems language* It's *supposed* to have access to the lowest level representation and build stuff on top of that
>> But in this "systems language", it's a O(n) operation to get the nth character from a string, to slice a string based on character offsets, or to determine the number of characters in the string.
>>
>> I'd gladly pay the price of a single interface vtable lookup to turn all of those into O(1) operations.
> 
> dood. i dunno where to start. allow me to answer from multiple angles.
> 
> 1. when was the last time looking up one char in a string or computing length was your bottleneck.
> 
> 2. you talk as if o(1) happens by magic that d currently disallows.
> 
> 3. maybe i don't want to blow the size of my string by a factor of 4 if i'm just interested in some occasional character search.
> 
> 4. implement all that nice stuff you wanna. nobody put a gun to yer head not to. understand you can't put a gun to my head to pay the price.

Geez, man, you just keep missing the point, over and over again.

Let me make one point, blisteringly clear: I don't give a shit about the   data format. You want the fastest strings in the universe, implemented with zero-byte magic beans and burned into the local ROM. Fantastic! I'm completely in favor of it.

Presumably. people will be so into those strings that they'll write a shitload of functionality for them. Parsing, searching, sorting, indexing... the motherload.

One day, I come along, and I'd like to perform some text processing. But all of my string data comes from non-magic-beans data sources. I'd like to implement a new kind of string class that supports my data. I'm not going to push my super-slow string class on anybody else, because I know how concerned with performance you are.

But check this out... you can have your fast class, and I can have my slow class, and they can both implement the same interface. Like this:

interface CharSequence {
  int find(CharSequence needle);
  int rfind(CharSequence needle);
  // ...
}

class ZeroByteFastMagicString : CharSequence {
  // ...
}

class SuperSlowStoneTabletString : CharSequence {
  // ...
}

Now we can both use the same string functions. Just by implementing an interface, I can use the same text-processing as your hyper-compiler-optimized builtin arrays.

But only if the interface exists.

And only if library authors write their text-processing code against that interface.

That's the point.

A good API allows multiple implementations to make use of the same algorithms. Application authors can choose their own tradeoffs between speed, memory consumption, and functionality.

A rigid builtin implementation, with no interface definition, locks everybody into the same choices.

--benji
August 26, 2008
Reply to Benji,

> But check this out... you can have your fast class, and I can have my
> slow class, and they can both implement the same interface. Like this:
> 

No, you can't.

The overhead needed to implement that is EXACTLY what we are unwilling to use.

I want an indexed array load

x = arr[i];

to be:

-- load arr.ptr into a reg
-- add i to that reg
-- indirect load that into i


3 ASM ops.

If you can get that and what you want, go get a PhD. You will have earned it.


August 26, 2008
Benji Smith wrote:
> superdan wrote:
>> Benji Smith Wrote:
>>
>>> BCS wrote:
>>>> Ditto, D is a *systems language* It's *supposed* to have access to the lowest level representation and build stuff on top of that
>>> But in this "systems language", it's a O(n) operation to get the nth character from a string, to slice a string based on character offsets, or to determine the number of characters in the string.
>>>
>>> I'd gladly pay the price of a single interface vtable lookup to turn all of those into O(1) operations.
>>
>> dood. i dunno where to start. allow me to answer from multiple angles.
>>
>> 1. when was the last time looking up one char in a string or computing length was your bottleneck.
>>
>> 2. you talk as if o(1) happens by magic that d currently disallows.
>>
>> 3. maybe i don't want to blow the size of my string by a factor of 4 if i'm just interested in some occasional character search.
>>
>> 4. implement all that nice stuff you wanna. nobody put a gun to yer head not to. understand you can't put a gun to my head to pay the price.
> 
> Geez, man, you just keep missing the point, over and over again.
> 
> Let me make one point, blisteringly clear: I don't give a shit about the   data format. You want the fastest strings in the universe, implemented with zero-byte magic beans and burned into the local ROM. Fantastic! I'm completely in favor of it.
> 
> Presumably. people will be so into those strings that they'll write a shitload of functionality for them. Parsing, searching, sorting, indexing... the motherload.
> 
> One day, I come along, and I'd like to perform some text processing. But all of my string data comes from non-magic-beans data sources. I'd like to implement a new kind of string class that supports my data. I'm not going to push my super-slow string class on anybody else, because I know how concerned with performance you are.
> 
> But check this out... you can have your fast class, and I can have my slow class, and they can both implement the same interface. Like this:
> 
> interface CharSequence {
>   int find(CharSequence needle);
>   int rfind(CharSequence needle);
>   // ...
> }
> 
> class ZeroByteFastMagicString : CharSequence {
>   // ...
> }
> 
> class SuperSlowStoneTabletString : CharSequence {
>   // ...
> }
> 
> Now we can both use the same string functions. Just by implementing an interface, I can use the same text-processing as your hyper-compiler-optimized builtin arrays.
> 
> But only if the interface exists.
> 
> And only if library authors write their text-processing code against that interface.
> 
> That's the point.
> 
> A good API allows multiple implementations to make use of the same algorithms. Application authors can choose their own tradeoffs between speed, memory consumption, and functionality.
> 
> A rigid builtin implementation, with no interface definition, locks everybody into the same choices.
> 
> --benji

Superdan is confusing the issues here. The main argument against your proposal (besides backwards compatibility, of course) is that every access would require a virtual call, which can be fairly slow.