Ode to the set (page 2) - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » D » Ode to the set (page 2)

August 22, 2001

Re: Ode to the set

Posted by Charles Hixson
in reply to Russell Bornschlegel

Charles Hixson

Posted in reply to Russell Bornschlegel

Russell Bornschlegel wrote:
> Charles Hixson wrote:
> 
>>If not, then if it came to a choice my
>>vote would be for the inclusion of B+Trees as a standard language
>>feature.  I would want to use them far more frequently, and they are
>>much more work to implement.
>>
> 
> Are you interested in the primitive functionality of B+Trees, or of
> some particular implementation-dependent characteristic of them?
> 
> In a garbage-collected language like D, where you have few or no performance guarantees to begin with, it doesn't seem necessary to make a big deal over the latter. 
> 
Yes.  I want to store data on files, and access it in random order.  And remember it from session to session.  An external database can do this, but installing it remotely can be a large hassle, and it's often overkill for what I need.

> 
>>One thing that most of these seem to do is to allow one to store data by
>>means of some non-numeric keying function (usually a string, but lets
>>try to keep this general).  Another is to test the structure for the
>>existence of that key.  Another is to retrieve data associated with that
>>key.  Another is to remove a key, and its associated data.  Anything
>>else primitive-common?
>>
> 
> Yes: iterate over the whole collection, performing an arbitrary operation on each element. This is the only operation I see in your list that isn't supported by associative arrays in the D spec. 
> 
> A set can be represented by an associative array where the contents of an existing element are unused -- all you care about is the existence or absence of an element. 
> 
> Consider that the STL set and map containers are normally implemented
> as some flavor of binary tree, while Perl's associative array, which does essentially the same thing, is normally referred to and implemented as a hash. They're used the same way, but have slightly different performance characteristics -- that most people aren't worried about, because they're lazy and would otherwise be doing linear searches on arrays or linked lists, so the near-constant
> lookup time of a hash or the log2 lookup time of a tree is a huge
> win for them. Trees can also maintain the collection in order, while hashes can't. 
> 
You're right.  Ok, in addition to the "in" operator we need an "each" or "for_each" operator.  But that could be implemented as a function, though an operator would be nicer.

Remember, that if these structures are implemented as classes, then additional functions can be added as needed, by sub-classing.  (E.g., is  an each operator sensible on a set?  Only the common sensible ones are primitive-common.)

Part of the question here is what are the inheritance characteristics. How does one specify a generic class parameter?  A set of integers and a set of strings have slightly different characteristics.  Does the inheritance mechanism allow for this specialization?  If I define a set of wheels, what is it's relationship to a set of bicycle wheels? (Assuming that BicycleWheel is a descendant from Wheel.)  Can I compare membership between the two sets?  etc.

Another question is what kind of languate is D?  Clearly STL/C++ and Python would have much different answers if I asked them "What kind of object can I put into a set?" and "What kind of object can I get back out of a set?"  Python can look at the thing you get back and tell what it is.  C++ depends on the definition of the set to tell it what kind of object it got back.  C++ is faster.  Python is more flexible.

Now my personal needs are for a language that produces stand-alone code.  Python has problems with this (though that's being worked on).  To me, speed is a secondary consideration.  I'm currently working on a dataset of about 1500 cases with a basically N^2 complexity.  It takes about 10-15 minutes to run using Eiffel code with all of the checks on.  I'd like a faster execution, but it's not a killer.  (Most of that time comes from hash table processing.  I'm searching for a better alternative.)  But it MUST be stand alone code.  People need to be able to run it from out on the server, and I may never have seen their computer.  Scratch Python.  Scratch all interpretive languages.  Scratch all languages that require massive dll's.  etc.  So that's how I notice what features are optimum.  Garbage collection is a real winner.  A B+Tree would really help, as would any method for maintaining persistent variables... help, but not be decisive.  Stand-alone code is decisive.

> 
>>Then the operations are, essentially a[key]=x, x=a[key], x in a, and
>>a[key]=null.
>>
> 
> The D-spec associative array specifies "delete a[key]" instead of "a[key] = null", and I think you meant "key in a" instead of "x in a", but yeah.  
> 
> Note -- I do agree that the language should support adding new data structures for your particular needs; to do them reasonably requires:
> 
> - operator overloading to make them pretty, including overloading operator delete, operator in, and operator [].
> 
> - templates or the equivalent to make them reusable and type-safe.
> 
> It looks like you aren't going to get both these features in early
> releases of D, unfortunately...
> 
> -Russell B
> 
Well, Walter said that it was alpha, and the features were subject to change.  So now's the time to speak up.  If I don't at least mention what I feel is important, then it's only my fault if it doesn't happen.  If it doesn't happen anyway ... all languages are design tradeoffs. You examine what's around, and you see what you can make do what you need done.  (And I'm more interested in the Linux version than in the Windows version, so what's in the early releases isn't THAT significant.  But if I don't get people thinking about it, then it probably won't be in the later one's either.)

August 22, 2001

Re: Ode to the set

Posted by Russell Bornschlegel
in reply to Russell Bornschlegel

Russell Bornschlegel

Posted in reply to Russell Bornschlegel

Please excuse reply-to-self.

Russell Bornschlegel wrote:
> Yes: iterate over the whole collection, performing an arbitrary operation on each element. This is the only operation I see in your list that isn't supported by associative arrays in the D spec.

My mistake: myarray.keys returns the keys of an associative array, over which you can iterate to your hearts content.

-Russell B

August 22, 2001

Re: Ode to the set

Posted by Russell Bornschlegel
in reply to Daniel Mantione

Russell Bornschlegel

Posted in reply to Daniel Mantione

Daniel Mantione wrote:
> 
> Russell Bornschlegel wrote:
> > For the general case, are D's associative arrays good enough, Daniel?
> 
> They are a cool feature, but how would you use them as a set replacement?

Associative array of bit; 1 = exists, 0 = unexists.

 bit[ char ] myset;     // set of char

 // do we need to explicitly zero things here?

 myset[ 'x' ] = 1;      // 'x' exists in set
 myset[ 'y' ] = 1;      // 'y' exists in set
 myset[ 'j' ] = 1;      // 'j' exists in set

 if (myset['j'] == 1)  { ... }

 myset['y'] = 0;  // take y out of the set

 if (myset['y'] == 1)  { ... }

Alternately, ignore the contents of each element, and use the existence of an element in the associative array as set membership.

 int[ char ] myset;     // set of char

 myset[ 'x' ] = 1;      // 'x' exists in set
 myset[ 'y' ] = 42;     // 'y' exists in set
 myset[ 'j' ] = 3721;   // 'j' exists in set

 if ('j' in myset)  { ... }

 delete myset['y'];  // take y out of the set

 if ('y' in myset)  { ... }

The D spec is silent on the implementation details for associative arrays, but they'd likely be binary trees (ordered keys, lookup time normally O(log2(n))) or hashes (disordered keys, lookup time normally O(1)).

The D compiler could conceivably do weird optimizations on bit-type associative arrays.

-RB

August 22, 2001

Re: Ode to the set

Posted by Russell Bornschlegel
in reply to Charles Hixson

Russell Bornschlegel

Posted in reply to Charles Hixson

Charles Hixson wrote:
> 
> Russell Bornschlegel wrote:
> > Charles Hixson wrote:
> >>One thing that most of these seem to do is to allow one to store data by means of some non-numeric keying function (usually a string, but lets try to keep this general).  Another is to test the structure for the existence of that key.  Another is to retrieve data associated with that key.  Another is to remove a key, and its associated data.  Anything else primitive-common?
> >>
> >
> > Yes: iterate over the whole collection, performing an arbitrary operation on each element. This is the only operation I see in your list that isn't supported by associative arrays in the D spec.
> 
> You're right.  Ok, in addition to the "in" operator we need an "each" or "for_each" operator.  But that could be implemented as a function, though an operator would be nicer.

Turns out we have it in the D spec:

  for (int i = 0; i < assoc_array.keys.length; i++)
  {
      operate_on( assoc_array[ assoc_array.keys[i] ] );
  }

> Another question is what kind of languate is D?  Clearly STL/C++ and Python would have much different answers if I asked them "What kind of object can I put into a set?" and "What kind of object can I get back out of a set?"  Python can look at the thing you get back and tell what it is.  C++ depends on the definition of the set to tell it what kind of object it got back.  C++ is faster.  Python is more flexible.

I know you're concerned with the builtin functionality, but it's worth mentioning that you can get the Pythonish functionality (if I understand what you're talking about -- I don't know Python) from C++ in a number of ways if you're willing to do some work. An STL container can store references to polymorphic types.

C++ can do just about anything, wrap that functionality in a class, and tuck it away -- the only problem is that someone has to write that class in the first place. The code to do it, apparently, is much smaller than the Ada equivalent, but that's not saying a lot. ;)

> Now my personal needs are for a language that produces stand-alone code.
>   Python has problems with this (though that's being worked on).  To me,
> speed is a secondary consideration.  I'm currently working on a dataset
> of about 1500 cases with a basically N^2 complexity. [problem description
> snipped]

All I can say is "C++ and STL maps." There's a reason C++ is widely used, and its syntactic elegance sure ain't it...

> > It looks like you aren't going to get both these features in early releases of D, unfortunately...
>
> Well, Walter said that it was alpha, and the features were subject to change.  So now's the time to speak up.  If I don't at least mention what I feel is important, then it's only my fault if it doesn't happen.

All granted. My concern right now is that a one-man, no-budget effort
to develop a new language can get derailed very easily by people screaming
"don't even bother unless you're going to put Feature X in!" I want Walter
to get a usable compiler out, _then_ we can beat on it and complain in
a more informed manner :)

-Russell B

August 22, 2001

Re: Ode to the set

Posted by Daniel Mantione
in reply to Russell Bornschlegel

Daniel Mantione

Posted in reply to Russell Bornschlegel

Hi,

No, this won't do as replacement; it is much to cumbersome to use. You are right that a set allways needs to be a bitmap in unoptimized form. The compiler can then decide if it optimizes the bitmap away.

Try to code my first example with such an array. It's no replacement.

Furthermore, sets also have union, difference and subset operations, which you don't want to hard code.

Daniel

Russell Bornschlegel wrote:

>> They are a cool feature, but how would you use them as a set replacement?
> 
> Associative array of bit; 1 = exists, 0 = unexists.
> 
>  bit[ char ] myset;     // set of char
> 
>  // do we need to explicitly zero things here?
> 
>  myset[ 'x' ] = 1;      // 'x' exists in set
>  myset[ 'y' ] = 1;      // 'y' exists in set
>  myset[ 'j' ] = 1;      // 'j' exists in set
> 
>  if (myset['j'] == 1)  { ... }
> 
>  myset['y'] = 0;  // take y out of the set
> 
>  if (myset['y'] == 1)  { ... }
> 
> Alternately, ignore the contents of each element, and use the existence of an element in the associative array as set membership.
> 
>  int[ char ] myset;     // set of char
> 
>  myset[ 'x' ] = 1;      // 'x' exists in set
>  myset[ 'y' ] = 42;     // 'y' exists in set
>  myset[ 'j' ] = 3721;   // 'j' exists in set
> 
>  if ('j' in myset)  { ... }
> 
>  delete myset['y'];  // take y out of the set
> 
>  if ('y' in myset)  { ... }
> 
> The D spec is silent on the implementation details for
> associative arrays, but they'd likely be binary trees (ordered
> keys, lookup time normally O(log2(n))) or hashes (disordered keys,
> lookup time normally O(1)).
> 
> The D compiler could conceivably do weird optimizations on bit-type associative arrays.
> 
> -RB

August 22, 2001

Re: Ode to the set

Posted by Charles Hixson
in reply to Russell Bornschlegel

Charles Hixson

Posted in reply to Russell Bornschlegel

Russell Bornschlegel wrote:
> Charles Hixson wrote:
> 
>>Russell Bornschlegel wrote:
>>
>>>Charles Hixson wrote:
>>>
>>>>One thing that most of these seem to do is to allow one to store data by
>>>>means of some non-numeric keying function (usually a string, but lets
>>>>try to keep this general).  Another is to test the structure for the
>>>>existence of that key.  Another is to retrieve data associated with that
>>>>key.  Another is to remove a key, and its associated data.  Anything
>>>>else primitive-common?
>>>>
>>>>
>>>Yes: iterate over the whole collection, performing an arbitrary
>>>operation on each element. This is the only operation I see in your
>>>list that isn't supported by associative arrays in the D spec.
>>>
>>You're right.  Ok, in addition to the "in" operator we need an "each" or
>>"for_each" operator.  But that could be implemented as a function,
>>though an operator would be nicer.
>>
> 
> Turns out we have it in the D spec:
> 
>   for (int i = 0; i < assoc_array.keys.length; i++)
>   {
>       operate_on( assoc_array[ assoc_array.keys[i] ] );
>   }
That makes it possible (well, easier, it was already possible).  That doesn't make it clean to use.  Contrast that with:

   assoc_array.for_each.key(k)
     operate_on (k);

The syntax of that example is a bit wierd, as I am mangling several different languages, but I trust the meaning is clear.  The question here is how the key should be passed to the operate_on method.
> 
>...
> 
> C++ can do just about anything, wrap that functionality in a class, and tuck it away -- the only problem is that someone has
> to write that class in the first place. The code to do it, apparently, is much smaller than
> the Ada equivalent, but that's not saying a lot. ;)

Actually, except in the sense of algorithmic completeness, there are a lot of things that neither C++ nor any other statically bound language can do.  But there's been a big speed and portability tradeoff, so languages have tended to divide into two camps, the dynamic linkers and the static linkers.  C++, Ada, Eiffel, that kind are static linkers. Smalltalk, Lisp, Python, Ruby, that kind are the dynamic linkers.  There are reasons for the split, and neither side is able to really capture the features of the other (though there are a lot of wild claims made).  Java is a rather interesting mix, sort of standing half way between them.  It attempts to get the advantages of both sides, and ends up getting a lot of the disadvantages of each, also.  As well as some unique strengths.  (Which it tends to ignore in favor of being familiar to C programmers, but look at Kiev to see some of the possibilities.)
> 
> 
>>...
>>
> 
> All I can say is "C++ and STL maps." There's a reason C++ is widely used, and its syntactic elegance sure ain't it...

Doing it by hand in Eiffel is, for me, easier and more portable than using C++ with the STL.  Possibly if I had to throw a Mac into the equation, my answer would be different, but so far I haven't needed that.  I guess that if someone tells me it has to run on a Mac, I may tell them "Yellow Dog Linux".  It's not a good answer, but I haven't really studied that question.  Or maybe by that time Python will be able to do stand-alone compiles.  Or Ruby.  But it won't be C++ until after they fix their inheritance rules mess.
> 
> 
>...
> 
> All granted. My concern right now is that a one-man, no-budget effort to develop a new language can get derailed very easily by people screaming
> "don't even bother unless you're going to put Feature X in!" I want Walter
> to get a usable compiler out, _then_ we can beat on it and complain in a more informed manner :)
> 
> -Russell B
> 
Which is why I try to keep my suggestions to what seems to me as simple syntactic sugar.  Maybe I'm wrong.  Maybe I get overenthusiastic.  But I try.
E.g., most of this discussion was about what minimal changes could make lots of standard data structures easy and clean to use.  Most of these would be library features.  Inheritance is already scheduled to be a lot more dynamic than in C++, garbage collection is already built in (eliminating a lot of work for destructors).  Etc.
  So what is being discussed is whether or not a couple of prepositions could be added, and whether or not they would be useful enough to bother with.  But with decently flexible inheritance and overloadable user defined operators, then, say, :in: and :each: could be defined within a class tree, and all data structures could just implement their own flavor of it.  So it would all reduce to libraries.  But if it had to be written "in( xxx )" and "each ( xxx )" then it would be much uglier and less friendly.

August 22, 2001

Re: Ode to the set

Posted by Daniel Mantione
in reply to Charles Hixson

Daniel Mantione

Posted in reply to Charles Hixson

Charles Hixson wrote:

> It's hard to argue that sets wouldn't be nice to have.  So would AVL-trees, and B+Trees.  And other classical data structures.  Perhaps a better approach would be to emphasize ways that make the language easy to extend with these features.  If not, then if it came to a choice my vote would be for the inclusion of B+Trees as a standard language feature.  I would want to use them far more frequently, and they are much more work to implement.

Nonsense. The set is an elementary type, just like the integer or pointer and no library can easily replace it, just like you cannot easily implement a float in a library. In fact, with any example you would be able to give, a the rewrite of that example with a set if almost allways superior in efficiency and/or maintainability and never has worse efficiency.

It is true that mathematically, a set can consist of everything, but this is not done so in programming languages. It is true that a mathematically behaving set (that can store everything) can better be implemented as a set. With operator overloading you can make it behave as a set.

Now the elementary datatype. A set is a bitmap in all programming languages that support it. In Pascal a set consits of a maximum of 256 different items, because the compiler uses a byte to enumerate them. This is a stupid limitation of Pascal, but the designers definitely got the idea. The built-in set should support types that can be enumerated.

For example:

enum    colours {red, green, blue};

.. would do as the basetype of a set.

char[];

.. would not do. You cannot enumerate all possible strings.

char[2];

.. perhaps would do. That would be a bitmap of 65536 entries.

Greetings,

Daniel Mantione

August 22, 2001

Re: Ode to the set

Posted by Russell Bornschlegel
in reply to Daniel Mantione

Russell Bornschlegel

Posted in reply to Daniel Mantione

Daniel Mantione wrote:

> No, this won't do as replacement; it is much to cumbersome to use. You are right that a set allways needs to be a bitmap in unoptimized form. The compiler can then decide if it optimizes the bitmap away.
> 
> Try to code my first example with such an array. It's no replacement.

Your Ns_EncodeUrl() example? In C, that's a mess for a number of reasons, not the least of which is C's inadequate string handling.

In terms of performance efficiency, in C, the set lookup is going to get swamped by the sprintf. In Pascal, the set lookup may be faster but it'll be drowned out over the long haul by array bounds checking. :)

In terms of implementation cleanliness, Pascal wins out here. In D, I think the _usage_ of the set is just as clean; if assoc-array-of-bits is optimized specially, then performance should be basically equal, and that just leaves the instantiation of the set.

So, I concede the desirability of static range-initialization in associative arrays in D:

  char[] hexlate = "0123456789ABCDEF";
  bit[char] safechars;

  // we want a way of saying this in a static initializer
  safechars[ 'a'..'z' ] = 1;
  safechars[ 'A'..'Z' ] = 1;
  safechars[ '0'..'9' ] = 1;

  // all others left undefined/null/nonexistent

  char[] source, destination;

  for (i = 0; i < source.length(); i++)
  {
    char c = source[i];
    if (c in safechars)
    {
        // append character
        destination.append( c );
	// Walter: does destination += c do the Right Thing?
    }
    else
    {
        // append %## code
        destination.append( '%%' );
        destination.append( hexlate[ ( (c >> 4) & 0x0F ) ] );   // upper 4 bits of c as hex digit
        destination.append( hexlate[ (c & 0x0F) ] );            // lower 4 bits of c as hex digit
    }
  }

Which seems reasonably clean to me. It's not clear to me from my reading of the D spec whether Walter intended range initialization to work on associative arrays.

-RB

August 22, 2001

Re: Ode to the set

Posted by Charles Hixson
in reply to Daniel Mantione

Charles Hixson

Posted in reply to Daniel Mantione

Daniel Mantione wrote:
> Charles Hixson wrote:
> 
> 
>>It's hard to argue that sets wouldn't be nice to have.  So would
>>AVL-trees, and B+Trees.  And other classical data structures. ...
>>
> 
> Nonsense. The set is an elementary type, just like the integer
> or pointer and no library can easily replace it, just like you
> cannot easily implement a float in a library. In fact, ...
> 
> .. perhaps would do. That would be a bitmap of 65536 entries.
> 
> Greetings,
> 
> Daniel Mantione
> 

Considering that bit arrays are an included part of the language, I fail to understand why a set could not be implemented as a library.  Given that the compiler would have all the necessary information, it should (this might be optomistic) be able to optimize references to your "SmallSet" class as inline calls.  So the problem is just syntax, which is what I was discussing.  If that was answered, then I didn't understand the answer.

Also, neither strings, associative arrays, nor sets are elementary types.  They are composits.  A reasonable way to model a string is as an array of characters.  A reasonable way to model an associative array is as a hash table.  And a reasonable way to model a small set is as a bit vector.  But even these aren't the elementary types.  If you wish evidence, consider C++ without the STL, or other complex libraries.  You can build each of those types out of the elements present, yet none of them are present.  And the things that you build them out of are simpler constructs that have been inherited from C.

Now, perhaps what you meant was that it seems to you vastly preferable that a set type be built-in to the language.  If so, then you have the right of it.  I can't argue about what your opinion is.  But others may legitimately have different opinions.

What I am proposing is that the language support the operations needed to facilitate an easy use of sets, among other data structures.  And that the implementation be parcelled out to the constructors of the libraries.  This seems to me a much more efficient approach, as well as a more reasonable one.  But then I don't often feel the need for sets. I even want to use bags more often than I want to use sets, and usually I prefer that my data be in some sorted form with attatched data structures.  So I am also biased.

August 22, 2001

Re: Ode to the set

Posted by Russell Bornschlegel
in reply to Charles Hixson

Russell Bornschlegel

Posted in reply to Charles Hixson

Charles Hixson wrote:
> 
> Russell Bornschlegel wrote:
> > Turns out we have it in the D spec:
> >
> >   for (int i = 0; i < assoc_array.keys.length; i++)
> >   {
> >       operate_on( assoc_array[ assoc_array.keys[i] ] );
> >   }
> That makes it possible (well, easier, it was already possible).  That doesn't make it clean to use.  Contrast that with:
> 
>     assoc_array.for_each.key(k)
>       operate_on (k);
> 
> The syntax of that example is a bit wierd, as I am mangling several different languages, but I trust the meaning is clear.  The question here is how the key should be passed to the operate_on method.

Hmm. How about:

   foreach (keytype k in assoc_array.keys)
   {
       operate_on( k );
   }

This overloads "in" somewhat, but the foreach should be a sufficient hint to the parser; the keytype (int, string, whatever) makes the k look like a declaration like any other.

> > C++ can do just about anything, wrap that functionality in a class, and tuck it away...
> 
> Actually, except in the sense of algorithmic completeness, there are a lot of things that neither C++ nor any other statically bound language can do.

Frankly, I've never used a dynamically linked language, I don't think. Can you give me a short example of a cool thing you can do with dynamic linking?

>    So what is being discussed is whether or not a couple of prepositions
> could be added, and whether or not they would be useful enough to bother
> with.  But with decently flexible inheritance and overloadable user
> defined operators, then, say, :in: and :each: could be defined within a
> class tree, and all data structures could just implement their own
> flavor of it.  So it would all reduce to libraries.  But if it had to be
> written "in( xxx )" and "each ( xxx )" then it would be much uglier and
> less friendly.

Okay. I freely admit to being unafraid of ugly constructs, and to having been thoroughly warped by using C++. I firmly believe that no matter how elegant the language can be, some damn fool is going to use two-space indents, asinine bracketing styles, uncommented state machines implemented via numeric literal cases in huge nested switches, or other constructs of equal value, and that I'll be forced to maintain their code. C++ is the least of my problems this week. :)

-RB

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation