Less coding

Re: Less coding + possible module bug

Feb 27, 2008

bearophile

February 26, 2008

Less coding

Posted by bearophile

Permalink

bearophile

Permalink

D is a system language, so probably most of you are quite interested in fast running programs. I understand that some people here may not appreciate or be interested in some of the qualities of the scripting languages (as Python and the like). But I can show a little example. It's an extreme example, uncommon, and you may think I have cheated a bit, so you don't have to take it too much seriously, but it looks nice and interesting.

I have seen this C code: http://nickmudge.info/c/hashtest.c.txt

It tests how many collisions are produced by seven different hash functions. It doesn't test their running time (and the demo strings are "wrong", because they are random, while it's better to test them on some real text), but for me it was interesting enough to do a D translation. It generates an array of distinct random strings of random length, computes hash values for them, and counts how many repeated results there are. It contains functions like:

linear_array_search
check_key_duplicates
get_random_character
make_random_strings

The original code tests only hash7, but I have added tests for all of them. Those four functions, plus the main() require about 80 lines (almost 95 if you don't use function pointers to shorten the code that calls the various hashing functions).

Using my d libs I have translated those four functions plus the main like this:

void main() {
    fastSeed = 10;

    string chars = letters ~ digits;
    int[string] tests;
    do {
        auto txt = map((int i){ return fastChoice(chars);}, xrange(fastRandInt(1, 24)));
        tests[txt] = 1;
    } while (tests.length < 10000);

    foreach(nf, func; [&hash2, &hash3, &hash4, &hash5, &hash6, &hash7]) {
        auto keys = fromKeys(xmap(func, tests), 0);
        putr("hash", nf+2, ": ", tests.length - keys.length);
    }
}


Explanations of the code:
- map() takes a function, and an one or more iterable(s), and it return an array of the function applied to every item of the iterable. If the iterable has a length attribute, it pre-allocates the result, so with a bit of inlining it's often almost as fast as a normal loop.
- xrange() return an iterable object, it just returns successive numbers in an interval, with a step too.
- randInt is fast enough, and it gives an integer in the closed integer interval. The seed is set to the current time() by default, but you may set the seed too.
- xmap() is like map, but like all the x-something it returns a lazy iterable object, that yields the items mapped by the given function.
- fromKeys() takes an iterable and a constant, and it returns an associative array with constant keys (it's similar to the useful fromkeys() method of python dicts. I hope D AAs will grow some useful methods).

The following code is for an unreleased yet version of my d libs, I am writing a set()/Set(), that is a class (and function helper) that just wraps D AAs to give a useful set semantics with many simple but useful methods:

void main() {
    string chars = letters ~ digits;
    auto tests = Set!(int);
    do {
        auto txt = map((int i){ return choice(chars);}, xrange(randInt(1, 24)));
        tests.add(txt);
    } while (tests.length < 50000);

    foreach(nf, func; [&hash2, &hash3, &hash4, &hash5, &hash6, &hash7])
        putr("hash", nf+2, ": ", tests.length - set(xmap(func, tests)).length);
}

Notes:
- On my PC with DMD (the version of the code without set()) the creation of the tests[] requires about 1 second with n=50_000. I have seen that using a lower level syle of coding it can be lowered to about 0.5 seconds (in most cases the speed loss is lower than 50%), but it's just initialization time, it runs only once, so I don't care much for for the extra half second. Losing that half second I gain a shorter code, that's faster to write and debug.
- The original C version uses linear scans, instead of the D hashes, so it's much slower, on my PC the D version is 15 times faster than the C one, with N = 10000, compiled with MinGW 4.2.1 (-O3) and the latest DMD.
- The faster Python version (with the Psyco JIT and array.array) needs about 3.5 seconds to generate the test data alone, with n=50000 still.
- I have removed the hash1 from all tests because it was too much slow for the C version.
- If I replace choice(chars) with choice(letters ~ digits) the code is becomes 4-5 times slower, because DMD 1.026 doesn't join those two constant strings (dynamic arrays) at compile/initialization time.
- I know that that C code runs everywhere, and it doesn't need external libs, while my code needs a lib, and I know you can write good libs in C too, that can help you shorten & simplify you code a lot. And I know you can write hashes in C too, or you can use the GNU data structures already written.
- The running time for the hash functions is lower for the C version, because GCC optimizes that code better.

Bye,
bearophile

Using the set it can written like this too: void main() { string chars = letters ~ digits; auto tests = Set!(int); do { tests ~= map((int i){ return choice(chars);}, xrange(randInt(1, 24))); } while (tests.length < 50000); foreach(nf, func; [&hash2, &hash3, &hash4, &hash5, &hash6, &hash7]) putr("hash", nf+2, ": ", tests.length - set(xmap(func, tests)).length); } Bye, bearophile

I like your collection of higher order functions, perhaps you can get them added to the core library in the future, or even better, Walter can build them into the language (ha). If anything, D really needs a 'range'. Something like: >int[] arr = [0..10]; //arr would contain the digits 0-9 That would be pretty cool, and it makes sense since it follows array initialization syntax, and looks similar to using an array slice. Then with that I could experience what I think would be happiness as I used: >foreach(i; [0..10]){...} for my loops instead of old >for (int i = 0; i < 10; ++i){...} Which frankly I'm tired of doing after seeing and using other languages that deal with it so much better (perl, python, php, ruby, haskell, lisp, even visual basic) It would also be nice to loop just a certain amount of times without needing the counter: >foreach([1..10]){...} A bit ugly I guess, but using foreach(1..10) brings in inconsistency.. Ahem, it seems I've sort of gone off on a tangent here. But yeah, HOP and functional stuff is awesome and would be nice to see being added to D, even if only as part of the library.

On Wed, 27 Feb 2008 11:01:57 +0000, Jarrod wrote: >stuff Oh my, I should look at 2.0 more often. std.algorithm is packed full of HOP functions, and a foreach range. I probably should have looked before posting, this is what I get for not keeping track of 2.0 :( Still, I have one valid point left about arrays being initialized with a range, but basically you can ignore the above post =/

Jarrod: > I like your collection of higher order functions, perhaps you can get them added to the core library in the future, or even better, Walter can build them into the language (ha). Thank you. I think most of that stuff doesn't need to become built-in (I think among them only two things may enjoy being supported by the compiler: an *exact* copy of Python list comps and generators. Partial or approximate copies are not worth). What I'd like to see in the compiler isn't more HOFs, but better basic tools to *build* and use such high-order things. AST macros may help. Other useful tools: - Built-in (or in phobos, if it can be used in a transparent way) ability to iterate on two generators in parallel. - Better type inference, to allow shorter anonymous function syntax and allow more automatic specialization: for example having defined many different foo() templated functions, I'd like to use it as &foo, instead of &foo!(uint, long). - A more logically sound module system (*). > std.algorithm is packed full of HOP functions, and a foreach range. D 2.x allows foreach(i; 1..10) too, but I am using D 1.x, and most (95%?) of my stuff isn't present in that std.algorithm. It's nothing revolutionary, but they seem to work well enough, and I have kept the usage very easy because they can be useful (almost) only if they are easy to remember :-) =============================================== (*) beside the other bugs/problems shown in the past by me regarding the module system, I may have found another problem in it. The following ones are 3 modules, the first is the main one: module main_module; import foo1; import foo2; void main() { str(1); } ------------------ module foo1; private import std.string: str = toString; ------------------ module foo2; string str(TyArgs...)(TyArgs args) { return ""; } ------------------ If compiled it produces: main_module.d(4): Error: foo1.str at foo1.d(2) conflicts with foo2.str(TyArgs...) at foo2.d(2) The "private" in foo1 must be unnecessary in a usable module system, but even adding it the "str" name in the "foo1" namespace becomes imported anyway in the "main_module" namespace, and it gives conflict. Bye, bearophile

Forums