May 24, 2013
On 24 May 2013 19:40, Dmitry Olshansky <dmitry.olsh@gmail.com> wrote:

> 24-May-2013 09:02, Manu wrote:
>
>> On 24 May 2013 14:11, Marco Leise <Marco.Leise@gmx.de <mailto:Marco.Leise@gmx.de>> wrote:
>>
>>     On Thu, 23 May 2013 20:21:47 -0400,
>>     "Jonathan M Davis" <jmdavisProg@gmx.com
>>     <mailto:jmdavisProg@gmx.com>> wrote:
>>
>>
>>      > At some point, we're probably going to need to
>>      > benchmark stuff more aggressively and optimize Phobos in general
>>     more, because
>>      > it's the standard library. And eliminating unnecessary memory
>>     allocations
>>      > definitely goes along with that.
>>      >
>>      > - Jonathan M Davis
>>
>>     On a related note, a while back I benchmarked the naive Phobos
>>     approach to create a Windows API (wchar) string from a D
>>     string with using alloca to convert the string on a piece of
>>     stack memory like this: http://dpaste.1azy.net/b60d37d4
>>     IIRC it was 13(!) times faster for ~100 chars of English text
>>     and 5 times for some multi-byte characters.
>>     I think this approach is too hackish for Phobos, but it
>>     demonstrates that there is much room.
>>
>>
>> I don't think it's hack-ish at all, that's precisely what the stack is
>> there for. It would be awesome for people to use alloca in places that
>> it makes sense.
>> Especially in cases where the function is a leaf or leaf-stem (ie, if
>> there is no possibility of recursion), then using the stack should be
>> encouraged.
>> For safety, obviously phobos should do something like:
>>    void[] buffer = bytes < reasonable_anticipated_buffer_size ?
>> alloca(bytes) : new void[bytes];
>>
>> toStringz is a very common source of allocations. This alloca approach would be great in those cases, filenames in particular.
>>
>
> Alternatively just make a TLS buffer as scratchpad and use that everywhere.


How is that any different than just using the stack in practice?
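The stack-or-heap fallback Manu sketches above might look like this in C; `STACK_LIMIT` and `count_nonspace` are illustrative names for this sketch, not Phobos API:

```c
#include <alloca.h>
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative threshold; a real library would tune this. */
#define STACK_LIMIT 4096

/* Copies the input into a scratch buffer that lives on the stack for
 * small sizes, falling back to the heap for large ones. */
size_t count_nonspace(const char *src, size_t len)
{
    int on_heap = len >= STACK_LIMIT;
    char *buf = on_heap ? (char *)malloc(len) : (char *)alloca(len);
    memcpy(buf, src, len);

    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] != ' ')
            n++;

    if (on_heap)
        free(buf); /* alloca memory is reclaimed automatically on return */
    return n;
}
```

The one caveat with this pattern (in C or D) is that alloca must execute in the frame that owns the buffer, so the helper cannot be factored out into a separate allocation function.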


May 24, 2013
On 24 May 2013 19:49, Jacob Carlborg <doob@me.com> wrote:

> On 2013-05-23 23:42, Joseph Rushton Wakeling wrote:
>
>  I'm also in agreement with Manu.  There may well already be bugs for some
>> of
>> them -- e.g. there is one for toUpperInPlace which he referred to, and the
>> source of the allocation is clear and is even responsible for other bugs:
>> http://d.puremagic.com/issues/show_bug.cgi?id=9629
>>
>
> toUpper/lower cannot be made in place if it should handle all Unicode. Some characters will change their length when converted to/from uppercase. Examples of these are the German double S and some Turkish I.


ß and SS are both actually 2 bytes, so it works in UTF-8 at least! ;)


May 24, 2013
On 24 May 2013 20:24, Regan Heath <regan@netmail.co.nz> wrote:

> On Fri, 24 May 2013 01:11:17 +0100, Manu <turkeyman@gmail.com> wrote:
>
>> /agree, except the issue I raised, when ~ is used in phobos.
>> That means that function is now off-limits. And there's no way to know
>> which functions they are...
>>
>
> It's not the allocation caused by ~ which is the issue though is it, it's the collection it might trigger, right?
>

Yes, but the unpredictability is the real concern. It's hard to control
something that you don't know about.
If the phobos function can avoid the allocation, then why not avoid it?


So what you really need are 3 main things:
>
> 1. A way to prevent the GC collecting until a given point(*).
> 2. A way to limit the GC collection time.
> 3. For phobos functions to be optimised to not allocate or to use alloca
> where possible.
>
> #1 should be trivial.
>

I think we can already do this.

#2 is much harder to achieve in the general case (I believe).
>

The incremental(+precise) GC idea, I think this would be the silver bullet
for games!

#3 is not essential but desirable and could be added/improved over time.
>

Yes, I think effort to improve this would be universally appreciated.


(*) Until the collection point the GC would ask the OS for more memory (a
> new pool or page) or fail and throw an Error.  Much like in Leandro's concurrent GC talk/example where he talks about eager allocation.
>

Bear in mind, most embedded hardware does not have virtual memory, and
often a fairly small hard limit.
If we are trying to manually sequence out allocations and collects, like
schedule collects when you change scenes on a black screen or something for
instance, then you can't have random phobos functions littering small
allocations all over the place.
The usefulness of #1 depends largely on #3.


In order to make #2 easier to achieve I had an idea, not sure how workable
> this is..
>
> Let's imagine you can mark a thread as not stopped by the pause-the-world.
>  Let's imagine it still does allocations which we want to collect at some
> stage.  How would this work?
>
> 1. The GC would remove the thread stack and global space from its list of roots scanned by normal collections.  It would not pause it on normal collections.
>
> 2. (*) above would be in effect, the first allocation in the thread would cause the GC to create a thread local pool, this pool would not be shared by other threads (no locking required, not scanned by normal GC collections).  This pool could be pre-allocated by a new GC primitive "GC.allocLocalPool();" for efficiency.  Allocation would come from this thread-local pool, or trigger a new pool allocation - so minimal locking should be required.
>
> 3. The thread would call a new GC primitive at the point where collection was desired i.e. "GC.localCollect(size_t maxMicroSecs);".  This collection would be special, it would not stop the thread, but would occur inline.  It would only scan the thread local pool and would do so with an enforced upper bound collection time.
>
> 4. There are going to be issues around 'shared' /mutable/ data, e.g.
>
>  - The non-paused thread accessing it (esp during collection)
>  - If the thread allocated 'shared' data
>
> I am hoping that if the thread main function is marked as @notpaused (or similar) then the compiler can statically verify neither of these occur and produce a compile time error.
>
> So, that's the idea.  I don't know the current GC all that well so I've probably missed something crucial.  I doubt this idea is revolutionary and it is perhaps debatable whether the complexity is worth the effort, also whether it actually makes placing an upper bound on the collection any easier.
>
> Thoughts?


It sounds kinda complex... but I'm not qualified to comment.


May 24, 2013
On Fri, May 24, 2013 at 09:34:41AM +1000, Manu wrote:
[...]
> Just to be clear, while I've heard many have, I've NEVER argued for
> removing the GC. I think that's a hallmark of a modern language. I
> want to use the GC in games, but it needs to have performance
> characteristics that are applicable to realtime and embedded use.
> Those are:
> 1. Can't stop the world.
> 2. Needs tight controls, enable/disable, and the allocators interface
> so alternative memory sources can be used in many places.
> 3. Needs to (somehow) run incrementally. I'm happy to budget a few
> hundred µs per frame, but not a millisecond every 10 frames, or 1
> second every 1000.
>     It can have 1-2% of overall frame time each frame, but it can't
> have 10-100% of random frames here and there. This results in
> framerate spikes.

Makes sense, so basically the GC should not cause jittery framerates, but should distribute its workload across frames so that the framerate is more-or-less constant?


> The GC itself can be much less efficient than the existing GC if it wants; it's only important that it can be halted at fine-grained intervals, and that it will eventually complete its collect cycle over the long-term.
>
> I know that an incremental GC like this is very complex, but I've never heard of any real experiments, so maybe it's not impossible?

Is there a hard upper limit to how much time the GC can take per frame? Is it acceptable to use, say, a millisecond every frame as long as it's *every* frame and not every 10 frames (which causes jitter)?

For me, I'm also interested in incremental GCs -- for time-sensitive applications (even if it's just soft realtime, not hard), long stop-the-world pauses are really disruptive. I'd rather have the option of a somewhat larger memory footprint and a less efficient GC (in terms of rate of memory reclamation) if it can be incremental, rather than a very efficient GC that introduces big pauses every now and then. I'm even willing to settle for lower framerates if it means I don't have to deal with framerate spikes that makes the result jittery and unpleasant.


T

-- 
"I suspect the best way to deal with procrastination is to put off the procrastination itself until later. I've been meaning to try this, but haven't gotten around to it yet. " -- swr
May 24, 2013
I might just add that there are some other important targets as well in the vein of this discussion.

DLL's *still* don't work properly. druntime/phobos still don't really work
as dll's.
They are getting some attention, but it's been a really long standing and
seriously major issue. Shared libraries are like, important!




May 24, 2013
On 2013-05-24, 16:24, Peter Alexander wrote:

> On Friday, 24 May 2013 at 13:37:36 UTC, Joseph Rushton Wakeling wrote:
>> (To be honest, feels a bit of a design flaw in Unicode that character length can
>> change between lower- and uppercase.)
>
> Unfortunately it's either that or lose compatibility with ASCII. Lower case dotted-i needs to be one byte for ASCII, and upper case dotted-i isn't ASCII, so it needs to be more than one byte.

One could certainly have two different lowercase dotted I's, with one
mapping to I and the other to İ, and their Unicode values close to the
upper-case versions.

-- 
Simen
May 24, 2013
On 25 May 2013 00:58, H. S. Teoh <hsteoh@quickfur.ath.cx> wrote:

> On Fri, May 24, 2013 at 09:34:41AM +1000, Manu wrote:
> [...]
> > Just to be clear, while I've heard many have, I've NEVER argued for
> > removing the GC. I think that's a hallmark of a modern language. I
> > want to use the GC in games, but it needs to have performance
> > characteristics that are applicable to realtime and embedded use.
> > Those are:
> > 1. Can't stop the world.
> > 2. Needs tight controls, enable/disable, and the allocators interface
> > so alternative memory sources can be used in many places.
> > 3. Needs to (somehow) run incrementally. I'm happy to budget a few
> > hundred µs per frame, but not a millisecond every 10 frames, or 1
> > second every 1000.
> >     It can have 1-2% of overall frame time each frame, but it can't
> > have 10-100% of random frames here and there. This results in
> > framerate spikes.
>
> Makes sense, so basically the GC should not cause jittery framerates, but should distribute its workload across frames so that the framerate is more-or-less constant?
>

Precisely.

> > The GC itself can be much less efficient than the existing GC if it
> > wants; it's only important that it can be halted at fine-grained intervals, and that it will eventually complete its collect cycle over the long-term.
> >
> > I know that an incremental GC like this is very complex, but I've never heard of any real experiments, so maybe it's not impossible?
>
> Is there a hard upper limit to how much time the GC can take per frame? Is it acceptable to use, say, a millisecond every frame as long as it's *every* frame and not every 10 frames (which causes jitter)?
>

Errr, well, 1ms is about 6% of a 16.7ms (60fps) frame; that's quite a long time.
I'd be feeling pretty uneasy about any library that claimed to want 6% of
the whole game time, and didn't offer any visual/gameplay benefits...
Maybe if the GC happened to render some sweet water effects, or perform
some awesome cloth physics or something while it was at it ;)
I'd say 6% is too much for many developers.

I think 2% sacrifice for simplifying memory management would probably get
through without much argument.
That's ~300µs... a few hundred microseconds seems reasonable. Maybe a
little more if targeting 30fps.
If it stuck to that strictly, I'd possibly even grant it permission to stop
the world...

For me, I'm also interested in incremental GCs -- for time-sensitive
> applications (even if it's just soft realtime, not hard), long stop-the-world pauses are really disruptive. I'd rather have the option of a somewhat larger memory footprint and a less efficient GC (in terms of rate of memory reclamation) if it can be incremental, rather than a very efficient GC that introduces big pauses every now and then. I'm even willing to settle for lower framerates if it means I don't have to deal with framerate spikes that makes the result jittery and unpleasant.
>

One important detail to consider for realtime usage, is that it's very
unconventional to allocate at runtime at all...
Perhaps a couple of short lived temp buffers each frame, and the occasional
change in resources as you progress through a world (which are probably not
allocated in GC memory anyway).
Surely the relatively high temporal consistency of the heap across cycles
can be leveraged here somehow to help?


May 24, 2013
24-May-2013 18:35, Manu wrote:
> How is that any different than just using the stack in practice?

It can be passed across function boundaries up and down, and can grow arbitrarily large without blowing up the stack.
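A sketch of what such a TLS scratchpad could look like, here in C11 (`_Thread_local`); the names and growth policy are illustrative, not an actual druntime design:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A thread-local scratch buffer: unlike alloca memory it survives across
 * function boundaries and can grow arbitrarily, at the cost of one live
 * allocation per thread. */
static _Thread_local char  *scratch     = NULL;
static _Thread_local size_t scratch_cap = 0;

char *get_scratch(size_t need)
{
    if (need > scratch_cap) {
        char *p = realloc(scratch, need); /* realloc(NULL, n) acts as malloc */
        if (!p)
            return NULL;
        scratch = p;
        scratch_cap = need;
    }
    return scratch;
}

/* A callee can hand the same buffer back up to its caller... */
char *fill_greeting(void)
{
    char *buf = get_scratch(16);
    strcpy(buf, "hello");
    return buf; /* still valid after return -- alloca memory would not be */
}
```

The trade-off versus the stack: no per-call setup cost and no frame-lifetime restriction, but the buffer is clobbered by the next call that uses it, so it cannot be held across calls into code that also uses the scratchpad.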

-- 
Dmitry Olshansky
May 24, 2013
24-May-2013 18:38, Manu wrote:
> On 24 May 2013 19:49, Jacob Carlborg <doob@me.com <mailto:doob@me.com>>
> wrote:
>     toUpper/lower cannot be made in place if it should handle all
>     Unicode. Some characters will change their length when convert
>     to/from uppercase. Examples of these are the German double S and
>     some Turkish I.
>
>
> ß and SS are both actually 2 bytes, so it works in UTF-8 at least! ;)

Okay, here you go - a UTF-8 table of cased sin :)

Codepoint - upper-case - lower-case
0x01e9e : 0x000df - 3 : 2
0x0023a : 0x02c65 - 2 : 3
0x0023e : 0x02c66 - 2 : 3
0x02c7e : 0x0023f - 3 : 2
0x02c7f : 0x00240 - 3 : 2
0x02c6f : 0x00250 - 3 : 2
0x02c6d : 0x00251 - 3 : 2
0x02c70 : 0x00252 - 3 : 2
0x0a78d : 0x00265 - 3 : 2
0x0a7aa : 0x00266 - 3 : 2
0x02c62 : 0x0026b - 3 : 2
0x02c6e : 0x00271 - 3 : 2
0x02c64 : 0x0027d - 3 : 2
0x01e9e : 0x000df - 3 : 2
0x02c62 : 0x0026b - 3 : 2
0x02c64 : 0x0027d - 3 : 2
0x0023a : 0x02c65 - 2 : 3
0x0023e : 0x02c66 - 2 : 3
0x02c6d : 0x00251 - 3 : 2
0x02c6e : 0x00271 - 3 : 2
0x02c6f : 0x00250 - 3 : 2
0x02c70 : 0x00252 - 3 : 2
0x02c7e : 0x0023f - 3 : 2
0x02c7f : 0x00240 - 3 : 2
0x0a78d : 0x00265 - 3 : 2
0x0a7aa : 0x00266 - 3 : 2

And this is only with 1:1 mapping.

Generated by:

void main(){
    import std.uni, std.utf, std.stdio;
    char[4] buf;
    foreach(dchar ch; unicode.Cased_Letter.byCodepoint){
        dchar upper = toUpper(ch);
        dchar lower = toLower(ch);

        size_t uLen = encode(buf, upper);
        size_t lLen = encode(buf, lower);
        if(uLen != lLen)
            writefln("0x%05x : 0x%05x - %d : %d", upper, lower, uLen, lLen);
    }
}



-- 
Dmitry Olshansky
May 24, 2013
On Fri, 24 May 2013 15:50:43 +0100, Manu <turkeyman@gmail.com> wrote:
> On 24 May 2013 20:24, Regan Heath <regan@netmail.co.nz> wrote:
>
> It sounds kinda complex... but I'm not qualified to comment.

Yeah, there is complexity.  It all boils down to whether it is possible using modern GC techniques (precise, incremental, etc.) to perform a collection in 300µs as you require.  If a full collection cannot be done in that time, perhaps a smaller subset can - that is where I was heading with this idea.

R

-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/