Jump to page: 1 2
Thread overview
Q: fast stack allocation?
May 30, 2004
Kris
Re: fast stack allocation?
May 30, 2004
Walter
May 30, 2004
Kris
May 30, 2004
Walter
May 30, 2004
Kris
May 30, 2004
Arcane Jill
May 30, 2004
Kris
May 30, 2004
Walter
May 31, 2004
Kris
May 31, 2004
Kris
Jun 02, 2004
Matthew
Jun 02, 2004
Norbert Nemec
Jun 04, 2004
Matthew
May 31, 2004
J Anderson
May 31, 2004
Kris
Jun 01, 2004
Norbert Nemec
May 30, 2004
How does one allocate an array on the stack without incurring the wrath of a "clear everything to zero" loop? I've tried the obvious, such as

char[Size] array;

and

char[] array = (cast(char *) alloca(Size))[0..Size];

both of which insist on clearing the content to zero. I appreciate deterministic state as much as the next person, but there are times when it gets in the way of execution speed ...

- Kris


May 30, 2004
"Kris" <someidiot@earthlink.dot.dot.dot.net> wrote in message news:c9c1l3$1tfp$1@digitaldaemon.com...
> How does one allocate an array on the stack without incurring the wrath of
a
> "clear everything to zero" loop? I've tried the obvious, such as
>
> char[Size] array;
>
> and
>
> char[] array = (cast(char *) alloca(Size))[0..Size];
>
> both of which insist on clearing the content to zero. I appreciate deterministic state as much as the next person, but there are times when
it
> gets in the way of execution speed ...

char[] array;
array = (cast(char *) alloca(Size))[0..Size];


May 30, 2004
alloca() still does a < rep movs dword ptr [edi],dword ptr [esi] >, albeit a shorter loop than < char[Size] array > emits.

Surely there must be a simple means of circumventing all this overhead? I mean, I'm allocating on the stack so as to avoid heap management: yet still have all this to contend with. Whatever happened to <sub sp, Size> type stack allocation? There's still a real need for that ...

- Kris


"Walter" <newshound@digitalmars.com> wrote in message news:c9c7p0$260b$1@digitaldaemon.com...
>
> "Kris" <someidiot@earthlink.dot.dot.dot.net> wrote in message news:c9c1l3$1tfp$1@digitaldaemon.com...
> > How does one allocate an array on the stack without incurring the wrath
of
> a
> > "clear everything to zero" loop? I've tried the obvious, such as
> >
> > char[Size] array;
> >
> > and
> >
> > char[] array = (cast(char *) alloca(Size))[0..Size];
> >
> > both of which insist on clearing the content to zero. I appreciate deterministic state as much as the next person, but there are times when
> it
> > gets in the way of execution speed ...
>
> char[] array;
> array = (cast(char *) alloca(Size))[0..Size];
>
>


May 30, 2004
"Kris" <someidiot@earthlink.dot.dot.dot.net> wrote in message news:c9d2n6$7hp$1@digitaldaemon.com...
> alloca() still does a < rep movs dword ptr [edi],dword ptr [esi] >, albeit
a
> shorter loop than < char[Size] array > emits.

That loop doesn't initialize the array, it only adjusts the stack frame to make room. The array itself is uninitialized.

>Surely there must be a simple means of circumventing all this overhead? I mean, I'm allocating on the stack so as to avoid heap management: yet still have all this to contend with. Whatever happened to <sub sp, Size> type stack allocation? There's still a real need for that ...

Uninitialized variables are a huge source of hidden bugs in C/C++. Getting rid of that is a design goal of D.


May 30, 2004
I have absolutely no argument with your point regarding uninitialized variables, but there are occasions where a "practical programmer needs a practical language".

That adjustment of the stack frame is a 44 byte *copy* loop. With today's 12+ multiplier CPUs, that's a noticeable amount of time spent waiting for memory access ...

What I'm doing here is making a scratchpad available for lower-level functions to format content into. This has to be done on a thread-by-thread basis, so the stack is a safe location to house said scratchpad. The alternative is for me to provide a 'freelist' of preallocated buffers ... in the latter case, the freelist would likely end up (over time) with as many instances as there are threads running. If the number of threads is quite large, and the buffer is of a reasonable size,  you can end up eating gobs of memory instead of simply placing the buffer on the stack in the first place. To get around this, you end up creating a "reaper" for the freelist.

This is not a good solution. To make matters worse, freelists must be synchronized which adds more overhead. To add insult to injury, a synchronized method often includes the hidden equivalent of a try/catch to ensure the monitor is released. All this, just because I wanted to quickly allocate a scratchpad on the stack? Good grief :-)

This issue has arisen simply because I want to ensure the overhead for runtime logging/tracing is minimal. Surely this is a valid and practical goal?

To illustrate, here's a simplified example:

        void test ()
        {
                // provide a scratchpad on the stack
                char[40] output;

                // do something with the returned slice
                uitoa (output, 17);
        }

        static char[] uitoa (char[] s, uint i)
        {
                uint len = s.length;
                while (i && len)
                      {
                      s[--len] = i % 10 + '0';
                      i /= 10;
                      }
                return s[len..s.length];
        }

That is, rather than uitoa() allocating from the heap (a la Phobos) it
populates and returns a slice on the provided buffer. You must realize that
the initialization of the caller's stack array is actually likely to cost
more (in time) than the uitoa() call itself for small numbers ...

There has to be a better solution, and I'd much appreciate your suggestions.

- Kris


"Walter" <newshound@digitalmars.com> wrote in message news:c9d510$akc$2@digitaldaemon.com...
>
> "Kris" <someidiot@earthlink.dot.dot.dot.net> wrote in message news:c9d2n6$7hp$1@digitaldaemon.com...
> > alloca() still does a < rep movs dword ptr [edi],dword ptr [esi] >,
albeit
> a
> > shorter loop than < char[Size] array > emits.
>
> That loop doesn't initialize the array, it only adjusts the stack frame to make room. The array itself is uninitialized.
>
> >Surely there must be a simple means of circumventing all this overhead? I mean, I'm allocating on the stack so as to avoid heap management: yet
still
> >have all this to contend with. Whatever happened to <sub sp, Size> type stack allocation? There's still a real need for that ...
>
> Uninitialized variables are a huge source of hidden bugs in C/C++. Getting rid of that is a design goal of D.
>
>


May 30, 2004
In article <c9d8n1$fq6$1@digitaldaemon.com>, Kris says...
>
>I have absolutely no argument with your point regarding uninitialized variables, but there are occasions where a "practical programmer needs a practical language".

inline assembler?

Jill


May 30, 2004
Yeah, I didn't dare mention that one AJ 'cos it seems ridiculous to go there just for a fast stack array :-)

I'm not saying that D should follow the C/C++ mentality here (for fast stack arrays), but surely this is a valid issue?

- Kris

"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:c9da9d$hp3$1@digitaldaemon.com...
> In article <c9d8n1$fq6$1@digitaldaemon.com>, Kris says...
> >
> >I have absolutely no argument with your point regarding uninitialized variables, but there are occasions where a "practical programmer needs a practical language".
>
> inline assembler?
>
> Jill
>
>


May 30, 2004
"Kris" <someidiot@earthlink.dot.dot.dot.net> wrote in message news:c9d8n1$fq6$1@digitaldaemon.com...
> I have absolutely no argument with your point regarding uninitialized variables, but there are occasions where a "practical programmer needs a practical language".
>
> That adjustment of the stack frame is a 44 byte *copy* loop. With today's 12+ multiplier CPUs, that's a noticeable amount of time spent waiting for memory access ...

Copying a few bytes that are on the stack are in the cache already.

> What I'm doing here is making a scratchpad available for lower-level functions to format content into. This has to be done on a
thread-by-thread
> basis, so the stack is a safe location to house said scratchpad. The alternative is for me to provide a 'freelist' of preallocated buffers ...
in
> the latter case, the freelist would likely end up (over time) with as many instances as there are threads running. If the number of threads is quite large, and the buffer is of a reasonable size,  you can end up eating gobs of memory instead of simply placing the buffer on the stack in the first place. To get around this, you end up creating a "reaper" for the
freelist.
>
> This is not a good solution. To make matters worse, freelists must be synchronized which adds more overhead. To add insult to injury, a synchronized method often includes the hidden equivalent of a try/catch to ensure the monitor is released. All this, just because I wanted to quickly allocate a scratchpad on the stack? Good grief :-)
>
> This issue has arisen simply because I want to ensure the overhead for runtime logging/tracing is minimal. Surely this is a valid and practical goal?
>
> To illustrate, here's a simplified example:
>
>         void test ()
>         {
>                 // provide a scratchpad on the stack
>                 char[40] output;
>
>                 // do something with the returned slice
>                 uitoa (output, 17);
>         }
>
>         static char[] uitoa (char[] s, uint i)
>         {
>                 uint len = s.length;
>                 while (i && len)
>                       {
>                       s[--len] = i % 10 + '0';
>                       i /= 10;
>                       }
>                 return s[len..s.length];
>         }
>
> That is, rather than uitoa() allocating from the heap (a la Phobos) it
> populates and returns a slice on the provided buffer. You must realize
that
> the initialization of the caller's stack array is actually likely to cost
> more (in time) than the uitoa() call itself for small numbers ...
>
> There has to be a better solution, and I'd much appreciate your
suggestions.

In most cases, a good optimizer will be able to eliminate the redundant initialization. But not all cases. If the array is just a few bytes long, I doubt you'll be able to detect it in a real app. For large arrays, alloca() is the way to go.

I think the price of a bit of an occaisional over-initialization is worth paying in exchange for more reliable programs. And I speak as one who has a mania for detailed optimization and have a reputation for using dirty tricks to get it.


May 31, 2004
"Walter" wrote
> Copying a few bytes that are on the stack are in the cache already.

If so, then fair point ... (though it would hurt a typical MCU).

> I think the price of a bit of an occaisional over-initialization is worth paying in exchange for more reliable programs. And I speak as one who has
a
> mania for detailed optimization and have a reputation for using dirty
tricks
> to get it.

I was going to pull you on the latter, but thought it might be out-of-order <g>. However, let's not confuse the current issue with a dirty trick; it's a perfectly reasonable and legitimate thing to do. And it's not even close to manic optimization. If you disagree with those two points, then please say so :-)

Let me ask a question (or two), and then I'll drop it :: suppose there were some reasonable assignment syntax that permitted the programmer to omit default array initialization under carefully considered circumstances, would you still claim that alloca() is the right way to go, even with its overhead? For all purposes?

If nothing else, the exercise uncovered two bugs (in the buglist) ...

- Kris


May 31, 2004
"Kris"  wrote in
> If nothing else, the exercise uncovered two bugs (in the buglist) ...

Errrr ... perhaps one bug (should have checked first).


« First   ‹ Prev
1 2