Sometimes 100 lines of code can fix your ctfe perf

Thread participants: Stefan Koch, Bienlein, Ali Çehreli, Adam D Ruppe, Alexandru Ermicioi, Paul Backus, user1234, Per Nordlöw (plus a follow-up sub-thread, "__COUNTER__ (was: Sometimes 100 lines of code can fix your ctfe perf)", started Aug 26, 2021).
August 24, 2021

Good Morning,

I just saw the need to concatenate a long-ish string at ctfe.

TL;DR: writing your own specialized chunking appender is way faster than naive CTFE appending,
and using the Phobos-provided Appender seems to be a bad idea ;) at least at CTFE.

As you may know, when you do this the naive way:

enum x = () {
    string result;
    foreach (_; 0 .. n_iterations)
        result ~= "whatever";
    return result;
}();

it scales quite poorly as n_iterations grows,
and eventually you run out of memory and time.

It turns out CTFE really doesn't like appending, because it doesn't preallocate buffers or keep capacity around.

So what if we made our own string builder that does the appending in whole chunks?
In about 70 lines of code we can write our own builder that appends strings:

module stringBuilder;
struct StringBuilder
{
    enum CHUNK_SIZE = 4096;
    char[CHUNK_SIZE] fixedBuffer;
    size_t fixedBufferUsed;

    char[CHUNK_SIZE][] chunks;
    size_t[] chunk_lengths;

    ushort last_chunk_used;

    void appendString(const(char)[] s)
    {
        // can we use our preallocated fixed buffer?
        if (!last_chunk_used)
        {
            auto fixedBufferEnd = fixedBufferUsed + s.length;
            if (fixedBufferEnd < CHUNK_SIZE)
            {
                fixedBuffer[fixedBufferUsed .. fixedBufferEnd] = s;
                fixedBufferUsed = fixedBufferEnd;
            }
        }
        else
        {
        LappendToLastChunk:
            auto chunkBegin = chunk_lengths[last_chunk_used - 1];
            auto chunkEnd = chunkBegin + s.length;
            if (chunkEnd < CHUNK_SIZE)
            {
                chunks[last_chunk_used - 1][chunkBegin .. chunkEnd] = s;
                chunk_lengths[last_chunk_used - 1] = chunkEnd;
            }
            else if (s.length <= CHUNK_SIZE)
            {
                chunks.length = last_chunk_used + 1;
                last_chunk_used += 1;
                goto LappendToLastChunk;
            }
            else
            {
                assert(0,
                        "string to be appended is longer than "
                        ~ CHUNK_SIZE.stringof ~ " this is not supported currently");
            }
        }
    }

    string releaseBuffer()
    {
        size_t result_length = fixedBufferUsed;
        foreach (i; 0 .. last_chunk_used - 1)
        {
            result_length += chunk_lengths[i];
        }

        char[] result;
        result.length = result_length;
        result[0 .. fixedBufferUsed] = fixedBuffer[0 .. fixedBufferUsed];
        size_t result_begin_pos = fixedBufferUsed;

        foreach (i; 0 .. last_chunk_used - 1)
        {
            const chunk_used = chunk_lengths[i];
            result[result_begin_pos .. result_begin_pos + chunk_used] = chunks[i][0 .. chunk_used];
            result_begin_pos += chunk_used;
        }
        return cast(string) result;
    }
}

That wasn't too bad, was it?
But will such a simple piece of code really be able to speed us up?

Let's do a quick perf test.

import stringBuilder;
static if (is(typeof(import ("N_ITERATIONS"))))
{
    enum n_iterations = mixin(import("N_ITERATIONS"));
}
else
{
    enum n_iterations = 4096;
}

pragma(msg, "N: ", n_iterations);

enum big_string =
() {
    version (StringBuilder)
    {
        StringBuilder b;
    }
    else version (Appender)
    {
        import std.array : Appender;
        Appender!(string) result;
    }
    else
    {
        string result;
    }

    foreach (_; 0 .. n_iterations)
    {
        version(StringBuilder)
        {
            b.appendString("Let us see ... how meta can you go?\n");
        }
        else version (Appender)
        {
            result ~= "Let us see ... how meta can you go?\n";
        }
        else
        {
            result ~= "Let us see ... how meta can you go?\n";
        }
    }
    version (StringBuilder)
    {
        return b.releaseBuffer();
    }
    else version (Appender)
    {
        return result.opSlice();
    }
    else
        return result;
} ();

pragma(msg, big_string);

And here are the results for various values of N_ITERATIONS. The count is read via the string import above, so it can be set per run by writing the number into a file called N_ITERATIONS and passing -J with that file's directory to the compiler.
(Times are in milliseconds and are the time dmd -c takes.)

|   N   |  stringBuilder  | naive   | Appender                    |
|-------|-----------------|---------|-----------------------------|
| 32    | 28.1            | 28      | 53                          |
| 128   | 29.0            | 27.3    | 58                          |
| 512   | 29.4            | 29.7    | 120.4                       |
| 1024  | 30.6            | 33.7    | 304.4                       |
| 2048  | 31.8            | 51.2    | 1055                        |
| 4096  | 35.1            | 120.8   | 4338                        |
| 8192  | 41.0            | 368.1   | 18146                       |
| 16384 | 52.5            | 1353.1  | inf (it ran out of memory)  |
August 24, 2021

This is nice. It would be good if your StringBuilder made it into Phobos somehow, as it is of general use :-). Java's StringBuilder class was first added in JDK 1.5, which was released in 2004 ...

August 24, 2021

On Tuesday, 24 August 2021 at 14:07:45 UTC, Stefan Koch wrote:

> Good Morning,
>
> I just saw the need to concatenate a long-ish string at ctfe.
>
> TL;DR: writing your own specialized chunking appender is way faster than naive CTFE appending, and using the Phobos-provided Appender seems to be a bad idea ;) at least at CTFE.

I should have tested the code more thoroughly!
The reason why the string builder had such a flat performance curve was A BUG.
It just did not concatenate anything to the buffer after the initial fixed buffer was full.

This issue is now fixed, and fixing it had the pleasant side effect of removing the gotos from the code.

The fixed version looks like this:

struct StringBuilder
{
    enum CHUNK_SIZE = 4096;
    char[CHUNK_SIZE] fixedBuffer;
    size_t fixedBufferUsed;

    char[CHUNK_SIZE][] chunks;
    size_t[] chunk_lengths;

    ushort last_chunk_used;

    void addToNewChunk(const(char)[] s)
    {
            if (s.length <= CHUNK_SIZE)
            {
                chunks.length = last_chunk_used + 1;
                chunk_lengths.length = last_chunk_used + 1;
                last_chunk_used += 1;
                appendToLastChunk(s);
            }
            else
            {
                assert(0,
                        "string to be appended is longer than "
                        ~ CHUNK_SIZE.stringof ~ " this is not supported currently");
            }
    }

    void appendToLastChunk(const(char)[] s)
    {
            auto chunkBegin = chunk_lengths[last_chunk_used - 1];
            auto chunkEnd = chunkBegin + s.length;
            if (chunkEnd < CHUNK_SIZE)
            {
                chunks[last_chunk_used - 1][chunkBegin .. chunkEnd] = s;
                chunk_lengths[last_chunk_used - 1] = chunkEnd;
            }
            else
                addToNewChunk(s);

    }

    void appendString(const(char)[] s)
    {

        // can we use our preallocated fixed buffer?
        if (!last_chunk_used)
        {
            auto fixedBufferEnd = fixedBufferUsed + s.length;
            if (fixedBufferEnd < CHUNK_SIZE)
            {
                fixedBuffer[fixedBufferUsed .. fixedBufferEnd] = s;
                fixedBufferUsed = fixedBufferEnd;
            }
            else
            {
                addToNewChunk(s);
            }
        }
        else
            appendToLastChunk(s);
    }

    string releaseBuffer()
    {
        size_t result_length = fixedBufferUsed;
        foreach (i; 0 .. last_chunk_used)
        {
            result_length += chunk_lengths[i];
        }

        char[] result;
        result.length = result_length;
        result[0 .. fixedBufferUsed] = fixedBuffer[0 .. fixedBufferUsed];
        size_t result_begin_pos = fixedBufferUsed;

        foreach (i; 0 .. last_chunk_used) // copy every used chunk, including the last one
        {
            const chunk_used = chunk_lengths[i];
            result[result_begin_pos .. result_begin_pos + chunk_used] = chunks[i][0 .. chunk_used];
            result_begin_pos += chunk_used;
        }

        return cast(string) result;
    }

}
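
As a quick sanity check of the fixed version, here is a minimal sketch of my own (not part of the original benchmark; it assumes the struct lives in the same stringBuilder module as before) that forces both the fixed-buffer path and the chunk path and verifies the result at compile time:

import stringBuilder;

// 200 appends of a 36-byte string = 7200 bytes, which overflows the
// 4096-byte fixed buffer and spills into the chunk array.
enum sanityCheck = () {
    StringBuilder b;
    foreach (_; 0 .. 200)
        b.appendString("Let us see ... how meta can you go?\n");
    return b.releaseBuffer();
}();

static assert(sanityCheck.length == 200 * 36);
static assert(sanityCheck[$ - 36 .. $] == "Let us see ... how meta can you go?\n");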

And now the resulting table is somewhat different as well:

|   N   |  stringBuilder  | naive   | Appender                    |
|-------|-----------------|---------|-----------------------------|
| 32    | 27.8            | 27.8    | 53                          |
| 128   | 29.8            | 27.2    | 54.1                        |
| 512   | 33.7            | 29.8    | 121                         |
| 1024  | 39.8            | 34.4    | 313.9                       |
| 2048  | 52.0            | 52.9    | 1047                        |
| 4096  | 77.0            | 123.9   | 4350                        |
| 8192  | 127.3           | 378.2   | 18146                       |
| 10000 | 151.1           | 548.6   | 26796                       |
| 16384 | 233.7           | 1386    | inf (it ran out of memory)  |

I inserted another row (N = 10000), which is about the largest N the Appender can do before I run out of memory.
Or memeory, which is special memory to be used in memes.

We can now see that the string builder behaves linearly rather than almost constant, which is much more in line with what I would have expected.

Those were the results for CTFE;
let's now look at how things behave at regular runtime.
We compile with ldc -O3, and this time we only care about large N (the table starts at 16384).
We also add an N == 0 row to account for compile time, since we are measuring compile time and runtime together;
I will write out only the difference from the N == 0 row.
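
For reference, here is a sketch of the kind of runtime driver this implies. It is my own reconstruction (the post does not show one), so the exact N is an assumption; the binary is timed externally, and the driver just builds the string once and prints its length so the work cannot be optimized away.

// built with ldc -O3, selecting the implementation with a version switch;
// the N == 0 build is subtracted as a baseline
import std.stdio : writeln;
import stringBuilder;

void main()
{
    enum n_iterations = 1_048_576;   // one of the N values from the table below

    version (StringBuilder)
    {
        StringBuilder b;
        foreach (_; 0 .. n_iterations)
            b.appendString("Let us see ... how meta can you go?\n");
        auto s = b.releaseBuffer();
    }
    else version (Appender)
    {
        import std.array : Appender;
        Appender!string result;
        foreach (_; 0 .. n_iterations)
            result ~= "Let us see ... how meta can you go?\n";
        auto s = result[];
    }
    else
    {
        string s;
        foreach (_; 0 .. n_iterations)
            s ~= "Let us see ... how meta can you go?\n";
    }

    writeln(s.length);   // observable result so the loop is not elided
}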

|   N   |  stringBuilder  | naive   | Appender |
|-------|-----------------|---------|----------|
| 0     |-147.0           |-100.0   |-228.7    |
| 16384 | 12.5            | 6.2     | 3.7      |
| 32768 | 23.5            | 7.9     | 5.0      |
| 65536 | 19.6            | 7.6     | 5.6      |
| 131072| 37.0            | 15.2    | 14.7     |
| 262144| 44.4            | 24.3    | 16.3     |
| 524288| 108.8           | 30.6    | 21.1     |
|1048576| 236.5           | 55.4    | 43.7     |
|2097152| 369.4           | 101.1   | 72.3     |
|4194304| 823.2           | 198.0   | 150.1    |

We can see that the runtime performance of the string builder is quite bad as we scale,
and that the naive version is quite good even under extreme stress.
The Appender is somewhat faster at runtime than the naive version, but not by a whole lot,
and it takes twice the compile time.

So for code that is meant to run at runtime, a quick ~= in the code is not bad at all.
If it is used very heavily, the Appender can make a difference at runtime as well.
But it should absolutely be avoided if you are going to run the same code at CTFE.
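
If the same function really does have to run in both worlds, one option (my own suggestion, not something measured above) is to branch on the built-in __ctfe variable, so that CTFE gets a plain append (or the chunked builder) while runtime gets the Appender:

import std.array : appender;

string repeatLine(string line, size_t n)
{
    if (__ctfe)
    {
        // CTFE path: plain ~= (or the chunked StringBuilder from above)
        string result;
        foreach (_; 0 .. n)
            result ~= line;
        return result;
    }
    else
    {
        // runtime path: the Appender keeps capacity around between appends
        auto buf = appender!string();
        foreach (_; 0 .. n)
            buf.put(line);
        return buf[];
    }
}

enum atCompileTime = repeatLine("whatever", 8);        // takes the CTFE branch
// auto atRunTime = repeatLine("whatever", 1_000_000); // would take the Appender branch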

The real lesson though is to be very skeptical of extremely good results, and to verify behavior as well as performance! :)

Cheers,

Stefan

August 24, 2021

On Tuesday, 24 August 2021 at 14:07:45 UTC, Stefan Koch wrote:

> Good Morning,

Would it be motivated to have the compiler deduce that result ~= "whatever" can be lowered to a reallocating append, because result isn't aliased:

enum x = () {
    string result;
    foreach (_; 0 .. n_iterations)
        result ~= "whatever";
    return result;
}();
August 24, 2021

On Tuesday, 24 August 2021 at 16:16:50 UTC, Per Nordlöw wrote:

> On Tuesday, 24 August 2021 at 14:07:45 UTC, Stefan Koch wrote:
>
>> Good Morning,
>
> Would it be motivated to have the compiler deduce that result ~= "whatever" can be lowered to a reallocating append, because result isn't aliased:
>
> enum x = () {
>     string result;
>     foreach (_; 0 .. n_iterations)
>         result ~= "whatever";
>     return result;
> }();

Of course you could try to recognize certain patterns and treat them specially,
but that complicates the compiler, and now your compile-time performance depends on a bunch of heuristics, which in my experience have a habit of failing in the most inconvenient ways and situations.

August 24, 2021
On 8/24/21 9:06 AM, Stefan Koch wrote:

> The reason why the string builder had such a flat performance curve was A BUG.

I suspected that as soon as I saw your initial post. :) Happens to me too.

> results for CTFE;

I know it's not the same use case but I've standardized on the following style for string formatting both for CTFE and run time:

  auto s = format!"hello %s world %s"(a, b);

Not that I will change my style but I always wonder how it compares to the following horrible one. :)

  string s;
  s ~= "hello ";
  s ~= a;        // Let's assume 'a' and 'b' are strings
  s ~= " world ";
  s ~= b;

Thank you,
Ali

August 24, 2021
On Tuesday, 24 August 2021 at 16:52:30 UTC, Ali Çehreli wrote:
> I know it's not the same use case but I've standardized on the following style for string formatting both for CTFE and run time:
>
>   auto s = format!"hello %s world %s"(a, b);

yikes that's like the worst thing in the whole D world. brutally slow by all metrics.
August 24, 2021
On Tuesday, 24 August 2021 at 16:52:30 UTC, Ali Çehreli wrote:
> On 8/24/21 9:06 AM, Stefan Koch wrote:
>
> > The reason why the string builder had such a flat performance curve was A BUG.
>
> I suspected that as soon as I saw your initial post. :) Happens to me too.
>
> > results for CTFE;
>
> I know it's not the same use case but I've standardized on the following style for string formatting both for CTFE and run time:
>
>   auto s = format!"hello %s world %s"(a, b);
>
> Not that I will change my style but I always wonder how it compares to the following horrible one. :)
>
>   string s;
>   s ~= "hello ";
>   s ~= a;        // Let's assume 'a' and 'b' are strings
>   s ~= " world ";
>   s ~= b;
>
> Thank you,
> Ali

The compile-time performance of std.format is bad.
Just importing the module costs you a lot.

Instantiating some number of templates per format string and parameter type set is also not going to make you particularly happy.

As for the runtime performance, I couldn't comment;
I assume it depends on your use case.
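
For the two-strings case in Ali's example, the template-free alternative is plain concatenation. This is just a sketch of my own for comparison (assuming a and b are strings, as in the comment above); it needs no std.format import and no per-call template instantiations:

// no std.format import needed; ~ works at CTFE and at runtime,
// it only becomes a problem when used in a hot CTFE loop
string hello(string a, string b)
{
    return "hello " ~ a ~ " world " ~ b;
}

enum s = hello("compile", "time");   // evaluated at compile time
static assert(s == "hello compile world time");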
August 24, 2021
On Tuesday, 24 August 2021 at 16:58:26 UTC, Adam D Ruppe wrote:
> On Tuesday, 24 August 2021 at 16:52:30 UTC, Ali Çehreli wrote:
>> I know it's not the same use case but I've standardized on the following style for string formatting both for CTFE and run time:
>>
>>   auto s = format!"hello %s world %s"(a, b);
>
> yikes that's like the worst thing in the whole D world. brutally slow by all metrics.

Best here, from a usability point of view, is the text function for simple use cases:

text("Hello ", a, " world ", b);

Hopefully it performs better at CTFE than format.

Regards,
Alexandru.
August 24, 2021

On Tuesday, 24 August 2021 at 19:35:39 UTC, Alexandru Ermicioi wrote:

> On Tuesday, 24 August 2021 at 16:58:26 UTC, Adam D Ruppe wrote:
>
>> On Tuesday, 24 August 2021 at 16:52:30 UTC, Ali Çehreli wrote:
>>
>>> I know it's not the same use case but I've standardized on the following style for string formatting both for CTFE and run time:
>>>
>>>   auto s = format!"hello %s world %s"(a, b);
>>
>> yikes that's like the worst thing in the whole D world. brutally slow by all metrics.
>
> Best here, from a usability point of view, is the text function for simple use cases:
>
> text("Hello ", a, " world ", b);
>
> Hopefully it performs better at CTFE than format.

Looks like it'd be worse. Internally, format uses formattedWrite with an Appender as the output range [1], which means that (in the best case) it can write its result directly into the output buffer without allocating any memory for intermediate results. text, on the other hand, converts each of its non-string arguments to a string separately before writing them to its output buffer [2], so even in the best case it will require O(arguments) additional allocations compared to format.

I am honestly kind of surprised that text works like this--before looking, I'd have expected both to use the same mechanism under the hood. Maybe it's necessary to avoid some edge-case inconsistency between value.to!string and value.format!"%s"?

[1] https://phobos.dpldocs.info/source/std.format.write.d.html#L681
[2] https://dpldocs.info/experimental-docs/source/std.conv.d.html#L4233
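
To make the difference concrete, here is a minimal sketch of my own (an illustration of the two strategies described above, not code from Phobos): writing every piece straight into one output buffer versus converting each argument to an intermediate string first.

import std.array : appender;
import std.conv : to;
import std.format : formattedWrite;

// format-style: one buffer, the formatted pieces are written directly into it
string viaFormattedWrite(int a, int b)
{
    auto buf = appender!string();
    buf.formattedWrite!"a = %s, b = %s"(a, b);
    return buf[];
}

// text-style: each non-string argument becomes its own temporary string
// before being appended, costing an extra allocation per argument
string viaIntermediateStrings(int a, int b)
{
    auto buf = appender!string();
    buf.put("a = ");
    buf.put(a.to!string);   // intermediate allocation
    buf.put(", b = ");
    buf.put(b.to!string);   // and another one
    return buf[];
}

unittest
{
    assert(viaFormattedWrite(1, 2) == viaIntermediateStrings(1, 2));
}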
