August 26, 2008
BCS wrote:
> Reply to Benji,
> 
> 
>> The new JSON parser in the Tango library operates on templated string
>> arrays. If I want to read from a file or a socket, I have to first
>> slurp the whole thing into a character array, even though the
>> character-streaming would be more practical.
>>
> 
> Unless you are only going to parse the start of the file or are going to be throwing away most of it *while you parse it, not after*, the best way to parse a file is to load it all in one OS system call and then run a slicing parser (like the Tango XML parser) on that.
> One memory allocation and one load or a mmap, and then only the meta structures get allocated later.

There are cases where you might want to parse an XML file that won't fit easily in main memory. I think a stream-processing SAX parser would be a good addition to (perhaps not a replacement for) the existing one.
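
To make the stream-processing idea concrete, here is a minimal sketch in present-day D (D2). It is not a SAX parser and not Tango's API; `OpenTagCounter` and `feed` are names made up for illustration. The only point is that the consumer sees one chunk at a time and carries a tiny amount of state across chunk boundaries, so the whole document never has to sit in memory.

```d
// Hypothetical streaming consumer: it is fed the input one chunk at a time
// and keeps only a one-byte lookbehind between chunks.
struct OpenTagCounter
{
    size_t count;
    private bool pendingLt;   // true if the last byte seen was '<'

    void feed(const(char)[] chunk)
    {
        foreach (c; chunk)
        {
            // "<x" opens an element; "</", "<!" and "<?" do not.
            if (pendingLt && c != '/' && c != '!' && c != '?') ++count;
            pendingLt = (c == '<');
        }
    }
}

void main()
{
    auto xml = "<a><b>text</b><c/></a>";
    OpenTagCounter counter;

    // Fixed-size slices stand in for successive reads from a file or socket.
    for (size_t i = 0; i < xml.length; i += 4)
    {
        auto end = i + 4 < xml.length ? i + 4 : xml.length;
        counter.feed(xml[i .. end]);
    }
    assert(counter.count == 3);
}
```

A slicing parser can't work this way because its results point into the buffer, which is exactly why it wants the whole file loaded first.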
August 26, 2008
Benji Smith Wrote:

> superdan wrote:
> > Benji Smith Wrote:
> > 
> >> BCS wrote:
> >>> Ditto, D is a *systems language* It's *supposed* to have access to the lowest level representation and build stuff on top of that
> >> But in this "systems language", it's an O(n) operation to get the nth character from a string, to slice a string based on character offsets, or to determine the number of characters in the string.
> >>
> >> I'd gladly pay the price of a single interface vtable lookup to turn all of those into O(1) operations.
> > 
> > dood. i dunno where to start. allow me to answer from multiple angles.
> > 
> > 1. when was the last time looking up one char in a string or computing length was your bottleneck.
> > 
> > 2. you talk as if o(1) happens by magic that d currently disallows.
> > 
> > 3. maybe i don't want to blow the size of my string by a factor of 4 if i'm just interested in some occasional character search.
> > 
> > 4. implement all that nice stuff you wanna. nobody put a gun to yer head not to. understand you can't put a gun to my head to pay the price.
> 
> Geez, man, you just keep missing the point, over and over again.

relax. believe me i'm tryin', maybe you could put it a better way and meet me in the middle.

> Let me make one point, blisteringly clear: I don't give a shit about the data format. You want the fastest strings in the universe, implemented with zero-byte magic beans and burned into the local ROM. Fantastic! I'm completely in favor of it.

so far so good.

> Presumably, people will be so into those strings that they'll write a shitload of functionality for them. Parsing, searching, sorting, indexing... the motherlode.

cool.

> One day, I come along, and I'd like to perform some text processing. But all of my string data comes from non-magic-beans data sources. I'd like to implement a new kind of string class that supports my data. I'm not going to push my super-slow string class on anybody else, because I know how concerned with performance you are.

i'm in nirvana.

> But check this out... you can have your fast class, and I can have my slow class, and they can both implement the same interface. Like this:
> 
> interface CharSequence {
>    int find(CharSequence needle);
>    int rfind(CharSequence needle);
>    // ...
> }
> 
> class ZeroByteFastMagicString : CharSequence {
>    // ...
> }
> 
> class SuperSlowStoneTabletString : CharSequence {
>    // ...
> }
> 
> Now we can both use the same string functions. Just by implementing an interface, I can use the same text-processing as your hyper-compiler-optimized builtin arrays.

but maestro. the interface call is already what's costing.

> But only if the interface exists.
> 
> And only if library authors write their text-processing code against that interface.
> 
> That's the point.

then there was none. sorry.

> A good API allows multiple implementations to make use of the same algorithms. Application authors can choose their own tradeoffs between speed, memory consumption, and functionality.
> 
> A rigid builtin implementation, with no interface definition, locks everybody into the same choices.

no. this is just wrong. perfectly backwards in fact. a low-level builtin allows unbounded architectures with control over efficiency.
August 26, 2008
Robert Fraser Wrote:

> Benji Smith wrote:
> > superdan wrote:
> >> Benji Smith Wrote:
> >>
> >>> BCS wrote:
> >>>> Ditto, D is a *systems language* It's *supposed* to have access to the lowest level representation and build stuff on top of that
> >>> But in this "systems language", it's an O(n) operation to get the nth character from a string, to slice a string based on character offsets, or to determine the number of characters in the string.
> >>>
> >>> I'd gladly pay the price of a single interface vtable lookup to turn all of those into O(1) operations.
> >>
> >> dood. i dunno where to start. allow me to answer from multiple angles.
> >>
> >> 1. when was the last time looking up one char in a string or computing length was your bottleneck.
> >>
> >> 2. you talk as if o(1) happens by magic that d currently disallows.
> >>
> >> 3. maybe i don't want to blow the size of my string by a factor of 4 if i'm just interested in some occasional character search.
> >>
> >> 4. implement all that nice stuff you wanna. nobody put a gun to yer head not to. understand you can't put a gun to my head to pay the price.
> > 
> > Geez, man, you just keep missing the point, over and over again.
> > 
> > Let me make one point, blisteringly clear: I don't give a shit about the data format. You want the fastest strings in the universe, implemented with zero-byte magic beans and burned into the local ROM. Fantastic! I'm completely in favor of it.
> > 
> > Presumably, people will be so into those strings that they'll write a shitload of functionality for them. Parsing, searching, sorting, indexing... the motherlode.
> > 
> > One day, I come along, and I'd like to perform some text processing. But all of my string data comes from non-magic-beans data sources. I'd like to implement a new kind of string class that supports my data. I'm not going to push my super-slow string class on anybody else, because I know how concerned with performance you are.
> > 
> > But check this out... you can have your fast class, and I can have my slow class, and they can both implement the same interface. Like this:
> > 
> > interface CharSequence {
> >   int find(CharSequence needle);
> >   int rfind(CharSequence needle);
> >   // ...
> > }
> > 
> > class ZeroByteFastMagicString : CharSequence {
> >   // ...
> > }
> > 
> > class SuperSlowStoneTabletString : CharSequence {
> >   // ...
> > }
> > 
> > Now we can both use the same string functions. Just by implementing an interface, I can use the same text-processing as your hyper-compiler-optimized builtin arrays.
> > 
> > But only if the interface exists.
> > 
> > And only if library authors write their text-processing code against that interface.
> > 
> > That's the point.
> > 
> > A good API allows multiple implementations to make use of the same algorithms. Application authors can choose their own tradeoffs between speed, memory consumption, and functionality.
> > 
> > A rigid builtin implementation, with no interface definition, locks everybody into the same choices.
> > 
> > --benji
> 
> Superdan is confusing the issues here. The main argument against your proposal (besides backwards compatibility, of course) is that every access would require a virtual call, which can be fairly slow.

i'm not confusin'. mentioned the efficiency thing a number of times, didn't seem to faze him a bit. so i tried some more viewpoints.
August 26, 2008
BCS:
> If you must have that sort of interface, pick a different language, because D isn't intended to work that way.

I suggest Benji try C# 3+. Despite all the problems it has, the borg-like nature of such software, etc., it will be used way more than D, and it has all the nice things Benji asks for.

Bye,
bearophile
August 26, 2008
superdan wrote:
> relax. believe me i'm tryin', maybe you could put it a better way and meet me in the middle.

Okay. I'll try :)

Think about a collection API.

The container classes are all written to satisfy a few basic primitive operations: you can get an item at a particular index, you can iterate in sequence (either forward or in reverse). You can insert items into a hashtable or retrieve them by key. And so on.

Someone else comes along and writes a library of algorithms. The algorithms can operate on any container that implements the necessary operations.

When someone clever comes along and writes a new sorting algorithm, I can plug my new container class right into it, and get the algorithm for free. Likewise for the guy with the clever new collection class.

We don't bat an eye at the idea of containers & algorithms connecting to one another using a reciprocal set of interfaces. In most cases, you get a performance **benefit** because you can mix and match the container and algorithm implementations that most suit your needs. You can design your own performance solution, rather than being stuck with a single "low level" implementation that might be good for the general case but isn't ideal for your problem.

Over in another message BCS said he wants an array index to compile to 3 ASM ops. Cool, I'm all for it.

I don't know a whole lot about the STL, but my understanding is that most C++ compilers are smart enough that they can produce the same ASM from an iterator moving over a vector as incrementing a pointer over an array.

So the default implementation is damn fast.

But if someone else, with special design constraints, needs to implement a custom container template, it's no problem. As long as the container provides a function for getting iterators to the container elements, it can consume any of the STL algorithms too, even if the performance isn't as good as the performance for a vector.

There's no good reason the same technique couldn't provide both speed and API flexibility for text processing.
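
Here is a minimal sketch of what I mean, in present-day D (D2) using templates rather than a runtime interface. Nothing here is Phobos or Tango code; `findUnit` and `StoneTabletString` are names made up for illustration. The algorithm only assumes that its argument has `.length` and indexing, so the builtin array and a custom type both plug into it, and the builtin case involves no virtual call.

```d
// Hypothetical generic search: it needs only .length and indexing on Seq.
// It compares element by element (code units for builtin strings).
int findUnit(Seq)(Seq haystack, dchar needle)
{
    foreach (i; 0 .. haystack.length)
        if (haystack[i] == needle) return cast(int) i;
    return -1;
}

// A made-up "slow" string that stores its text reversed, just to show a
// custom type satisfying the same implicit interface.
struct StoneTabletString
{
    private dstring reversed;

    this(dstring text)
    {
        foreach_reverse (c; text) reversed ~= c;
    }

    size_t length() const { return reversed.length; }
    dchar opIndex(size_t i) const { return reversed[$ - 1 - i]; }
}

void main()
{
    assert(findUnit("hello", 'l') == 2);                      // builtin array
    assert(findUnit(StoneTabletString("hello"d), 'l') == 2);  // custom type
}
```

The catch is that the "interface" is checked at compile time per instantiation, so you can't swap implementations at runtime the way you could with a real `CharSequence` interface.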

--benji
August 26, 2008
bearophile wrote:
> BCS:
>> If you must have that sort of interface, pick a different language, because D isn't intended to work that way.
> 
> I suggest Benji try C# 3+. Despite all the problems it has, the borg-like nature of such software, etc., it will be used way more than D, and it has all the nice things Benji asks for.
> 
> Bye,
> bearophile

Yep, I like C# a lot. I think it's very well-designed, with the language and libraries dovetailing nicely together.

I'm using D on my current project because I need to distribute libraries on both Windows and Linux, with C linkage.

And D is a helluva lot more pleasant than C/C++, even if there is a lot about D that I find lacking.

--benji
August 26, 2008
"superdan" <super@dan.org> wrote in message news:g8vh9b$fko$1@digitalmars.com...
> Benji Smith Wrote:
>> No. Of course not. The compiler complains that you can't concatenate a
>> dchar to a char[] array. Even though the "find" functions indicate that
>> the array is truly a collection of dchar elements.
>
> that's a bug in the compiler. report it.

I did, a long time ago. #111 if I'm not mistaken.

L. 

August 26, 2008
Benji Smith wrote:
> BCS wrote:
>> Reply to Benji,
>>
>>> BCS wrote:
>>>
>>>> Ditto, D is a *systems language* It's *supposed* to have access to the lowest level representation and build stuff on top of that
>>>>
>>> But in this "systems language", it's an O(n) operation to get the nth character from a string, to slice a string based on character offsets, or to determine the number of characters in the string.
>>>
>>> I'd gladly pay the price of a single interface vtable lookup to turn all of those into O(1) operations.
>>>
>>> --benji
>>>
>>
>> Then borrow, buy, steal or build a class that does that /on top of the D arrays/
>>
>> No one has said that this should not be available, just that it should not /replace/ what is available
> 
> The point is that the new string class would be incompatible with the *hundreds* of existing functions that process character arrays.
> 
> Why don't strings qualify for polymorphism?

-------------------------------------------
wchar[] foo="text"w;

int indexOf(char[] str,char ch){
    foreach(int idx,char c;str)
        if(c==ch) return idx;
    return -1;
}

void main() {
    assert(indexOf(foo, 'x')==2);
}
-------------------------------------------

If that does compile, it shouldn't.  The best way to get that to work is to use a template.  Templates can be annoying.  A String class could simplify the different kinds of string inherent in D.  The String class would (should) internally know what kind of string it is (wchar, char, dchar) and know how to reconcile those differences when operations are called on it.
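
For reference, a minimal sketch of the template route, assuming present-day D (D2). The name `indexOfUnit` is made up here and is not a library function. It compares code units, so for char[] it is only trustworthy for ASCII characters; anything wider would need decoding first.

```d
// Hypothetical helper: one definition covers char[], wchar[] and dchar[].
// It compares code units without decoding, so for char[] (UTF-8) it is
// only meaningful when searching for ASCII characters.
int indexOfUnit(Char)(const(Char)[] str, dchar ch)
{
    foreach (idx, c; str)
        if (c == ch) return cast(int) idx;
    return -1;
}

void main()
{
    wstring foo = "text"w;
    assert(indexOfUnit(foo, 'x') == 2);       // wchar string
    assert(indexOfUnit("text", 'x') == 2);    // char string
    assert(indexOfUnit("text"d, 'q') == -1);  // dchar string, not found
}
```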

@Benji
If you want a String class, why don't you write one?  It's a fairly
simple task, even high-school CS students do it quite routinely in C++
(which is a lot more unwieldy for OOP than D is).

A very successful instance of Strings-as-objects is present in Java.
I'd suggest trying to duplicate that functionality.  Then you could
easily write wrappers on existing libraries to use the new String object.



August 26, 2008
On Mon, 25 Aug 2008 20:52:04 -0400, Benji Smith wrote:

> superdan wrote:
>>> But the "small components" are the *interfaces*, not the implementation details.
>> 
>> quite when i thought i drove a point home... dood we need to talk. you have all core language, primitives, libraries, and app code confused.
> 
> The standard libraries are in a grey area between the language spec and application code. There are all sorts of implicit "interfaces" exposed by the builtin types (and there's also plenty of core language functionality implemented in the standard lib... take the GC, for example).
> 
> You act like there's no such thing as an interface for a builtin language feature.
> 
> With strings implemented as raw arrays, they take on the array API...
> 
> slicing: broken
> indexing: busted
> iterating: fucked
> length: you guessed it
> 
> I don't think the internals of the string representation should be any different. UTF-8 arrays? Fine by me. Just don't make me look at the malformed, mis-sliced bytes. Provide an API (yes, implemented in the standard lib, but specified by the language spec) that actually makes sense for text data.
> 
> (Incidentally, this is the same reason I think the builtin dynamic arrays should be classes implementing a standard List interface, and the associative arrays should be classes implementing a Map interface. The language implementations are nice, but they're not polymorphic, and that makes it a pain in the ass to extend them.)
> 
> --benji

On the language spec vs standard library. While the GC is implemented in the standard library, I do not believe the spec says it has to be (though I don't think it is possible otherwise). So the spec could state that strings should be implemented your way, but it shouldn't.

On another note, I must say this has been quite a turnaround. There have been many posts in the past with people arguing over having a String class; I think those people have been staying out of this thread. But nonetheless it is nothing new.
August 26, 2008
Benji Smith:
> Yep, I like C# a lot. I think it's very well-designed, with the language and libraries dovetailing nicely together.

In the past I have said that C# 3.5/4 has some small ideas that D may enjoy copying. But probably having a complex coherent OOP structure from the bottom up isn't one of them. You must understand that D is lower level than C#, it means it's designed for people that like to suffer more :-) D is designed mostly for people coming from C and C++, and it must be fit to be used procedurally/functionally without any OOP too.

So D isn't C#, and this means what you ask isn't a good fit for it. Note that the situation isn't set in stone: some time ago, for example, there was a person who wanted to program in a Python-like way on the .NET platform and was unhappy with C#. He created the Boo language. It's not widespread, and it has a few small design mistakes, but overall it's not a bad language; it's quite usable for its purposes. So you can create your own language fit for your purposes... Do you know the Vala language? It looks like C#, but compiles to C... it's probably still in beta stage, but it may be closer to your dream language.

Another approach you may follow is to reinvent just the standard library/runtime of D to make it look more like the C# you like :-) Seeing it from outside, Tango too seems already closer to the Java std lib more than Phobos (but I may be wrong). I like Python, so I am writing a large lib that no one else uses that has partially the purpose of making D look like Python :-)

Bye,
bearophile