February 03, 2016
On Wednesday, February 03, 2016 15:20:21 Marc Schütz wrote:
> I assume you also don't want to allow access to the underlying storage (most likely `char[]`?). I don't see this as a good thing at all. Without the ability of getting at least a `const(char)[]` out of it, RCString would be next to useless in practice.

Would it? Most string code is really supposed to be range-based at this point anyway, in which case, the fact that it doesn't convert to some variant of char[] is pretty much a non-issue. The only real problem is the code that takes something like string or const(char)[] explicitly, and maybe some of that code shouldn't be doing that.

In any case, with opIndex and length gone, what we get with RCString is really more a container of code units that gives us various types of ranges over it rather than a string, which means that complaining that you can't get a char[] out of it is a bit like complaining that you can't get a char[] out of an Array!char or a DList!char. Sure, there are cases where it won't work, but in most cases, it will, because most stuff is range-based.

> > * Immutable does not play well with reference counting. I'm of a mind to reject immutable rcstring for now and figure out later how to go about it. Then const rcstring is okay because we always consider const a view on mutable strings (even though they're gone). We'll cast const away when manipulating the refcount.

Does immutable even make sense for a ref-counted object? You can't share a ref-counted object across threads without shared anyway, because the ref-count needs to be protected to be thread-safe. So, why not just use const? It's just as good as immutable if you have no mutable references, and you can't share across threads implicitly anyway. I'm sure that it merits a larger discussion, but at the moment, I'm inclined to believe that immutable simply doesn't make sense for ref-counted objects.

- Jonathan M Davis


_______________________________________________
Dlang-study mailing list
Dlang-study@puremagic.com
http://lists.puremagic.com/cgi-bin/mailman/listinfo/dlang-study

February 04, 2016

Am Wed, 3 Feb 2016 11:24:16 -0500
schrieb Andrei Alexandrescu <andrei@erdani.com>:

> On 02/03/2016 10:20 AM, Marc Schütz wrote:
> > I assume you also don't want to allow access to the underlying storage (most likely `char[]`?). I don't see this as a good thing at all. Without the ability of getting at least a `const(char)[]` out of it, RCString would be next to useless in practice.
> 
> Could you please substantiate that?

Well, no existing code will know about RCString. That leaves only rangified functions, which currently aren't even a large part of Phobos, and probably make up even less of third-party libraries. That can be changed, of course, but it's a lot of work, and it often makes code considerably uglier, more complicated, less readable, and therefore more error-prone.

> 
> I was thinking we could provide access to the underlying slice as a @system primitive. Code using it would be @system or rebrand it as @trusted etc.

That's an option. I wonder if we should even make it implicit via `alias this`. It surely is more ergonomic, and we don't lose safety.

> 
> >> * Since string-compatibility is off the table, how about we fix string's issues with autodecoding? RCString should offer no indexed access and no length. Instead it offers the ranges byCodeUnit, byChar, byWChar, and byDChar. The first one does not do any decoding and offers length and random access. (What should be its element type?) The other ones are bidirectional ranges that do the appropriate decoding.
> >
> > Definitely. I would prefer this even for normal strings... The element type of `byCodeUnit` should be char/wchar/dchar, of course.
> 
> I guess we could add byOctet that iterates ubytes.

std.string.representation could be extended to accept any string-like
type:
http://dlang.org/phobos/std_string.html#.representation

> 
> >> * Immutable does not play well with reference counting. I'm of a mind to reject immutable rcstring for now and figure out later how to go about it. Then const rcstring is okay because we always consider const a view on mutable strings (even though they're gone). We'll cast const away when manipulating the refcount.
> >
> > As noted by others, this is not a good idea. We can't solve this recurring problem by ignoring it.
> 
> RCString neither tries to solve the matter, nor ignore it. It simply acknowledges it. Once we do have a solution (possibly informed by RCString itself), we will use it to "turn on" immutable RCString, which will be a great test for the feature and a non-breaking improvement. It will all work out nicely.
> 
> The point here is to break the stalemate by taking a step in a good direction.


February 06, 2016
On 02/03/2016 05:46 AM, Михаил Страшун wrote:
> Idea of "next generation" opt-in string replacement sounds very appealing on its own. However, I don't see it fitting for `std.experimental.lifetime` because to act as a full blown string replacement (and to be eventually enhanced by compiler support) it needs to be part of druntime.

Right, this has to become part of druntime at some point, e.g. so we can use it in the exception hierarchy.




February 06, 2016
On 02/02/2016 10:40 PM, Andrei Alexandrescu wrote:
> Hello everyone,
> 
> * call it rcstring or RCString? The first makes it closer to "string", the other is politically correct.

You can defer that decision. I'd initially go w/ RCString and use rcstring iff it becomes part of the language.

> * Characters are small so no need to return them by reference. Because of this, making RCString @safe should be possible in current D. However, this also makes RCString not a plug-in replacement for string (which may after all be a good thing)
> 
> * Since string-compatibility is off the table, how about we fix string's issues with autodecoding? RCString should offer no indexed access and no length. Instead it offers the ranges byCodeUnit, byChar, byWChar, and byDChar.

Please don't repeat the mistake; use byUTF!char, byUTF!wchar, and byUTF!dchar instead.

> The first one does not do any decoding and offers length and random access. (What should be its element type?)

As byCodeUnit accesses the representation, the element type of code units should be unsigned integers.

ubyte[] <-> char[]
ushort[] <-> wchar[]
uint[] <-> dchar[]
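
For reference, a minimal sketch of how this mapping already looks for built-in strings in Phobos, using `std.string.representation` and `std.utf.byCodeUnit`. It only illustrates the two candidate element types; RCString itself does not exist here, so nothing below is its API.

```d
import std.range.primitives : ElementType;
import std.string : representation;
import std.utf : byCodeUnit;

void main()
{
    string s = "héllo"; // 'é' occupies two UTF-8 code units

    // representation reinterprets the same memory as unsigned integers.
    immutable(ubyte)[] raw = s.representation;
    static assert(is(typeof("a"w.representation) == immutable(ushort)[]));
    static assert(is(typeof("a"d.representation) == immutable(uint)[]));
    assert(raw.length == 6);
    assert(raw[1] == 0xC3 && raw[2] == 0xA9);

    // byCodeUnit keeps the char element type but does no decoding,
    // and it offers length and random access.
    auto units = s.byCodeUnit;
    static assert(is(ElementType!(typeof(units)) == immutable(char)));
    assert(units.length == 6);
    assert(units[0] == 'h');
}
```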

> * Immutable does not play well with reference counting. I'm of a mind to reject immutable rcstring for now and figure out later how to go about it. Then const rcstring is okay because we always consider const a view on mutable strings (even though they're gone). We'll cast const away when manipulating the refcount.

Are you only talking about RCString itself or the underlying buffer? It seems that many languages choose immutable strings (the buffer); it would be good to get a better picture of the different approaches (immutable, mutable, CoW) and of how to construct immutable strings (first build, then freeze?).





February 06, 2016
On 02/03/2016 12:52 PM, Jonathan M Davis wrote:
> On Wednesday, February 03, 2016 06:46:53 Михаил Страшун wrote:
>> Element type of `byCodeUnit` should be `ubyte` in my opinion so that it becomes clear each separate element is not a valid char on its own.
> By definition, char is a UTF-8 code unit, wchar is a UTF-16 code unit, and dchar is a UTF-32 code unit, and so code is supposed to be able to assume that. So, I don't see why it would make sense to use ubyte for a code unit. We already have types which are explicitly for code units.
>
> Now, by that same token, having the I/O stuff use ubyte rather than char (as you suggested elsewhere in your post) does make a lot of sense precisely because there's no guarantee that what's read in is actually in UTF-8, and any code where it's not sure really should be using ubyte, ushort, or uint instead of char, wchar, or dchar. Having the I/O functions assume UTF-8 was definitely a mistake IMHO, much as it usually works. But the strings themselves are supposed to be UTF-8, UTF-16, or UTF-32. So, IMHO, RCString should be operating on chars, wchars, or dchars and not ubytes, ushorts, or uints.
>
> - Jonathan M Davis

You are absolutely correct from the point of view of the initial language definition.

However, my main concern about a `char` element type for such a range is how it will interact with the old string/char[] types (which won't be changed). It will cause the same non-obvious corner case in generic code:

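import std.array : array;
import std.range.primitives : ElementType, isInputRange;

void foo (R) (R range)
    if (isInputRange!R)
{
    auto eager_sequence = range.array;
    // will fail if range is a code unit range, because the resulting
    // char array is auto-decoded and reports dchar as its element type:
    static assert (is(ElementType!R == ElementType!(typeof(eager_sequence))));
}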

I think not having to add special cases for new strings in Phobos should be a major design goal.




February 06, 2016
On 02/02/2016 11:46 PM, Михаил Страшун wrote:
> Not sure what you mean about "no need to return them by reference"
> though. Does that apply only to byX ranges or you want to make the whole
> string effectively unmodifiable? In other words, how the idiom of
> mutable reusable buffer will look like?

Characters in a string may be modifiable by means of opAssign, opOpAssign etc.

> When it comes to encoding, there is also issue of how lacking is current
> support of non-UTF encodings in Phobos.

D uses UTF for strings. Vivid anecdotes aside, we really can't be everything to everyone. Your friend could have written a translator to UTF in a few lines. The DNA optimization points at performance bugs in Phobos that, as far as I know, have been fixed or are fixable by rote. I think this non-UTF requirement would just stretch things too far and smacks of solving the wrong problem.

>> * Immutable does not play well with reference counting. I'm of a mind
>> to reject immutable rcstring for now and figure out later how to go
>> about it. Then const rcstring is okay because we always consider const
>> a view on mutable strings (even though they're gone). We'll cast const
>> away when manipulating the refcount.
>
> This one is the toughest in my opinion. Putting aside my own opinion and
> preferences, you should have answers on several points if pursuing this way:
>
> * What are cases for const if one wants to prohibit immutable for a
> given type?

Const is a non-modifiable view on data that may otherwise be mutable.

> Being a wildcard for mutable/immutable is main idea behind
> const.

No, the "view" aspect is. Const is good for passing things to functions that only want to look at them.
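
A minimal sketch of the "view" reading, assuming nothing beyond built-in slices: the same function accepts mutable and immutable data, and const only promises that the callee will not write through its parameter. The name `countSpaces` is made up for illustration.

```d
// const(char)[] is a read-only view; the data behind it may still be
// mutable through some other reference.
size_t countSpaces(const(char)[] text)
{
    size_t n;
    foreach (char c; text)
        if (c == ' ') ++n;
    return n;
}

void main()
{
    char[] buffer = "a b c".dup; // mutable data
    string literal = "d e";      // immutable data

    assert(countSpaces(buffer) == 2);
    assert(countSpaces(literal) == 1);

    buffer[0] = ' ';             // the owner can still mutate its buffer
    assert(countSpaces(buffer) == 3);
}
```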

> Everything else is just making compiler happy when it forces
> const on you (like `this` pointer within in/out contracts).
> * As a consequence, how will compiler ensure in/out contracts won't
> affect refcounting state for `this` if it becomes legal to cast const
> away and mutate?

I am sorry but I don't understand this question. To the extent I do, I do not have an answer for the time being.

> * How do you envision efficient cross-thread sharing of rcstring if
> immutability is out of the question?

Initially no sharing will be allowed. Following the initial implementation we may add implementation for the "shared" qualifier for rcstring.

> * If one can't support immutability for something relatively simple and
> specialized like char array, doesn't it effectively kill the concept of
> immutable containers important for multi-threading?

Back to basics: immutability in D has always been in intent "real" immutability. That means the bytes of an immutable object are effectively read-only after initialization. Composition implies that all bytes of all members of an immutable object are immutable.

The advantage of this is that we can share with minimal barriers (we only need to make sure data is not shared before initialization has finished). The disadvantage is that reference counting is not compatible with immutable objects.

We can't deliver two contradictory guarantees at the same time.

> Right now I am of opinion that issues with immutability highlight issues
> of your desired approach to const and should not be discarded that
> easily.

What is my desired approach to const?



Andrei

February 06, 2016
I am sorry that I keep arguing about things that may look unimportant to you, but so far it looks like a major effort will go into designing a thing that will only be helpful in a few situations and won't change the overall situation much.

>> Not sure what you mean about "no need to return them by reference" though. Does that apply only to byX ranges or you want to make the whole string effectively unmodifiable? In other words, how the idiom of mutable reusable buffer will look like?
>
> Characters in a string may be modifiable by means of opAssign, opOpAssign etc.

Makes sense. And just to be extra sure - opSliceAssign is planned to be allowed too, right? If yes, that should be enough for most scenarios. Interaction with C bindings may become complicated though, do you have any vision about it?

>> When it comes to encoding, there is also issue of how lacking is current support of non-UTF encodings in Phobos.
>
> D uses UTF for strings. Vivid anecdotes aside, we really can't be everything to everyone. Your friend could have written a translator to UTF in a few lines. The DNA optimization points at performance bugs in Phobos that, as far as I know, have been fixed or are fixable by rote. I think this non-UTF requirement would just stretch things too far and smacks of solving the wrong problem.

From a purely technical point of view you are perfectly right. But does that make the fact that potential users leave disappointed any better? UTF isn't a silver bullet, and a good standard library should either support other encodings naturally or have good documentation which shows idioms to transcode input to Unicode without sacrificing performance.

Remember: not every string is text, but right now Phobos allows you only to choose between raw bytes and UTF-8 when it comes to stuff like File.byLineCopy - it is hardly a surprise that library authors prefer the convenience of the latter and ignore other options. A standard library doesn't only provide some bits of functionality - it influences the minds and habits of developers of 3rd-party libraries.

Considering each new incompatible change of same feature domain comes with exponential user resistance, this is pretty much last chance to get string semantics as future-proof as possible.

>> * What are cases for const if one wants to prohibit immutable for a given type?
>
> Const is a non-modifiable view on data that may otherwise be mutable.

Again, I'd like to read confirmation from Walter on this because I recall different statements from him in the past on this topic. Also, my own experience of trying to use const in such a manner (== effectively logical const, like in C++) is rather bad, and if that was the intention, it feels like a major design PITA that is much too intrusive for the declared goals. Physical immutability guarantees add at least some justification for it being that demanding.

>> Everything else is just making compiler happy when it forces
>> const on you (like `this` pointer within in/out contracts).
>> * As a consequence, how will compiler ensure in/out contracts won't
>> affect refcounting state for `this` if it becomes legal to cast const
>> away and mutate?
>
> I am sorry but I don't understand this question. To the extent I do, I do not have an answer for the time being.

You may call me paranoid but I was thinking about this :)

void foo ( )
in
{
    // the compiler currently qualifies "this" inside a contract with const,
    // which guarantees that enabling/disabling contracts has no
    // (accidental) effect on class/struct semantics. If passing
    // const "this" to a function may actually change a refcount, it
    // may add to a contract's impact in a subtle way
    bar(this);
}
body
{
    // ...
}

>> * How do you envision efficient cross-thread sharing of rcstring if immutability is out of the question?
>
> Initially no sharing will be allowed. Following the initial implementation we may add implementation for the "shared" qualifier for rcstring.

That is one of the main decision topics for any string replacement. If sharing support is not even supposed to be discussed, what is the point of the case study?

I have seen plenty of successful thread-local implementations of reference counted strings. It is multi-threading that makes things complicated - and committing to a new standard design which does not plan for sharing from the very beginning is a good way to ensure it will not be usable in such a way.

>> * If one can't support immutability for something relatively simple and specialized like char array, doesn't it effectively kill the concept of immutable containers important for multi-threading?
>
> Back to basics: immutability in D has always been in intent "real" immutability. That means the bytes of an immutable object are effectively read-only after initialization. Composition implies that all bytes of all members of an immutable object are immutable.
>
> The advantage of this is that we can share with minimal barriers (we only need to make sure data is not shared before initialization has finished). The disadvantage is that reference counting is not compatible with immutable objects.
>
> We can't deliver two contradictory guarantees at the same time.

I know what immutability is in D, but that doesn't really answer my question :) Right now I am aware of two truly scaling approaches to sharing in D:

- `@safe Unique!T` which allows multi-threaded ownership transfer (not actually supported by Phobos yet, but all prerequisites seem to be there)
- immutable (both directly and by making an immutable copy from mutable data) + atomics

Anything that involves locking a mutex on method calls (like it tends to happen with all straightforward shared RC implementations) destroys performance so hard it is hardly even considered an option these days.

So considering you are willing to abandon immutability and uniqueness support still has a long way to go, what does remain? Will the new "standard" string type be incapable of lock-free sharing?

On a related topic:

Why do you completely discard external reference counting approach (i.e. storing refcount in GC/allocator internal data structures bound to allocated memory blocks)? Is there any paper explaining pitfalls of such concept?

BR,
Dicebot



February 06, 2016
On 2/2/2016 8:46 PM, Михаил Страшун wrote:
> When it comes to encoding, there is also issue of how lacking is current support of non-UTF encodings in Phobos.

This is a deliberate choice. Phobos is designed so that UTF is the only encoding supported. Other encodings are expected to:

   other => toUTF => usePhobos => toOther

i.e. translate to UTF, do the processing, and then translate back to whatever encoding is desired.
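
A minimal sketch of that pipeline, assuming Latin-1 as the "other" encoding and using `std.encoding.transcode` for both conversions. The helper name `upperLatin1` and the choice of `toUpper` as the processing step are made up for illustration.

```d
import std.encoding : Latin1String, transcode;
import std.uni : toUpper;

// other => toUTF => usePhobos => toOther, with Latin-1 as "other".
Latin1String upperLatin1(Latin1String input)
{
    string utf8;
    transcode(input, utf8);          // other => toUTF
    string processed = utf8.toUpper; // usePhobos
    Latin1String output;
    transcode(processed, output);    // toOther
    return output;
}

void main()
{
    // "café" encoded as Latin-1: one byte per character.
    immutable(ubyte)[] raw = [0x63, 0x61, 0x66, 0xE9];
    immutable(ubyte)[] expected = [0x43, 0x41, 0x46, 0xC9]; // "CAFÉ"

    auto result = upperLatin1(cast(Latin1String) raw);
    assert(cast(immutable(ubyte)[]) result == expected);
}
```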

I have experience with other encodings, like Shift-JIS. May they all burn in hell. If you think that people forgetting they are dealing with UTF is a problem, imagine all the other encodings, and their peculiar weirdnesses that **** up every piece of code that is not explicitly set up to handle them.


February 06, 2016
On 02/06/2016 04:51 PM, Михаил Страшун wrote:
> I am sorry that I keep arguing about things that may looks unimportant
> to you but so far it looks like a major effort will go into designing a
> thing that will only be helpful in a few situations and won't change
> overall situation much.

Of course. Sadly it may be the case that I don't have solutions for such things, not that I don't find them important.

>> Characters in a string may be modifiable by means of opAssign,
>> opOpAssign etc.
>
> Makes sense. And just to be extra sure - opSliceAssign is planned to be
> allowed too, right? If yes, that should be enough for most scenarios.

That should be easy to support.

> Interaction with C bindings may become complicated though, do you have
> any vision about it?

A @system method offering access to the underlying buffer.

>>> When it comes to encoding, there is also issue of how lacking is current
>>> support of non-UTF encodings in Phobos.
>>
>> D uses UTF for strings. Vivid anecdotes aside, we really can't be
>> everything to everyone. Your friend could have written a translator to
>> UTF in a few lines. The DNA optimization points at performance bugs in
>> Phobos that, as far as I know, have been fixed or are fixable by rote. I
>> think this non-UTF requirement would just stretch things too far and
>> smacks of solving the wrong problem.
>
> From a purely technical point of view you are perfectly right. But does
> that make the fact that potential users leave disappointed any better?

What greener pastures do they leave for? We should take a page from the languages that support multiple encodings seamlessly.

>> Const is a non-modifiable view on data that may otherwise be mutable.
>
> Again, I'd like to read confirmation from Walter on this because I
> recall different statements from him in the past on this topic.

Yeah, he's on board.

> Also my
> own experience of trying to use const in such manner (== effectively
> logical const, like in C++) is rather bad and if it was the intention,
> it feels like major design PITA that is much too intrusive for declared
> goals. Physical immutability guarantees add at least some justification
> for it being that demanding.

We don't know how to make things work otherwise.

> You may call me paranoid but I was thinking about this :)
>
> void foo ( )
> in
> {
>      // the compiler currently qualifies "this" inside a contract with const,
>      // which guarantees that enabling/disabling contracts has no
>      // (accidental) effect on class/struct semantics. If passing
>      // const "this" to a function may actually change a refcount, it
>      // may add to a contract's impact in a subtle way
>      bar(this);
> }
> body
> {
>      // ...
> }

I'm not sure what to say about this. We'll cross that bridge when we come to it.

>> Initially no sharing will be allowed. Following the initial
>> implementation we may add implementation for the "shared" qualifier for
>> rcstring.
>
> That is one of main decision topics for any string replacement. If
> sharing support is not even supposed to be discussed, what is the point
> of the case study?

We're free to discuss it. All I said is I don't know how to do it. So the logical thing to do is explicitly not support it; when we do know how to do it, the addition can be made without breaking any code.

> I have seen plenty of successful thread-local implementations of
> reference counted strings. It is multi-threading that makes things
> complicated - and committing to a new standard design which does not plan
> for sharing from the very beginning is a good way to ensure it will not
> be usable in such way.

The design is not "not planned" for sharing. The plan is to explicitly not do it for now. "We will drive around Stuttgart" is not "We did not plan for Stuttgart".

If any part of the API might impede future expansion into sharing semantics, by all means the point needs to be raised. Generally it is known where matters lie - reference count updates. As long as we don't expose some gnarly details to the user we should be safe for future extension.

>> We can't deliver two contradictory guarantees at the same time.
>
> I know what immutability is in D, but that doesn't really answer my
> question :) Right now I am aware of two truly scaling approaches to
> sharing in D:
>
> - `@safe Unique!T` which allows multi-threaded ownership transfer (not
> actually supported by Phobos yet, but all prerequisites seem to be there)
> - immutable (both directly and by making immutable copy from mutable
> data) + atomics
>
> Anything that involves locking a mutex on method calls (like it tends to
> happen with all straightforward shared RC implementations) destroys
> performance so hard it is hardly even considered an option these days.
>
> So considering you are willing to abandon immutability and uniqueness
> support still has a long way to go, what does remain?

There is no abandoning of immutability.

> Will new
> "standard" string type be incapable of lock-free sharing?

There is one standard string type today that is shareable lock-free and uses the garbage collector.

> On a related topic:
>
> Why do you completely discard external reference counting approach (i.e.
> storing refcount in GC/allocator internal data structures bound to
> allocated memory blocks)? Is there any paper explaining pitfalls of such
> concept?

One reason for creating this forum was to have a smaller confined circle for design discussions outside the corrosive atmosphere of the forum. A place where intellectual discussion, careful consideration, and pushing forward the state of affairs prevail over having the louder voice, demonstrating competence, or winning arguments. In this smaller circle, I kindly but firmly invite everyone to steer clear of assumptions on the state of mind of other participants such as "things that look unimportant to you", "you are willing to abandon immutability", or "you completely discard". These do little else than put the other on the defensive and derail the discussion. Instead of furthering the technical argument, the other now needs to explain that in fact no, they don't want to do these things. Thank you.

Also: I am not using the Socratic method here. If I put forward an idea, design etc. that has shortcomings it simply means I don't know how to do better. Therefore, pointing out the shortcomings will not push things forward; too many of those and we're back to stalemate. The best way is to propose better ideas.

Storing refcounts separately in the allocator is definitely possible but my understanding is it just moves the problem elsewhere. The "poison cast" that takes immutability away from the reference counter in order to manipulate it moves from rcstring's code to the allocator. I see an upside to that - we could move the allocator to "the language" and guarantee things about it (such as it can cast immutability away). What other advantages of the scheme do you see?


Andrei

February 07, 2016
On 02/07/2016 01:47 AM, Andrei Alexandrescu wrote:
> On 02/06/2016 04:51 PM, Михаил Страшун wrote:
>> Why do you completely discard external reference counting approach (i.e. storing refcount in GC/allocator internal data structures bound to allocated memory blocks)? Is there any paper explaining pitfalls of such concept?

> Storing refcounts separately in the allocator is definitely possible but my understanding is it just moves the problem elsewhere. The "poison cast" that takes immutability away from the reference counter in order to manipulate it moves from rcstring's code to the allocator. I see an upside to that - we could move the allocator to "the language" and guarantee things about it (such as it can cast immutability away). What other advantages of the scheme do you see?

If the reference count is stored inside allocator metadata, no cast is necessary, as the relevant memory will never be allocated as immutable (the allocator is in full control of how its own metadata is stored).

Consider two simplified examples, with internal and external refcount:

struct RCInternal
{
    private size_t rc;
    private void[] data;

    ~this () { --this.rc; }
}

struct RCExternal
{
    alias allocator = theGlobalAllocator;

    private void[] data;

    ~this () { allocator.decRef(data.ptr); }
}

In the first case it is impossible for user code to create an `immutable RCInternal instance` because that would violate physical immutability. It would work if the struct definition was somehow changed to `private immutable void[]` and the host kept being declared as plain mutable `RCInternal`, but that is not very generic.

In the second case, however, an `immutable RCExternal instance` will be able to satisfy the language's immutability requirements and thus work more or less in the same style as immutable GC-collected data. To support that, the allocator could either reserve a mutable refcount block before requesting a block of immutable memory - or keep a separate refcount table which uses the allocated block pointer as a lookup key.

Advantages:
- API remains very similar to existing GC-based data
- can keep physical immutability
- allows interesting designs for multi-threaded sharing with separate
thread-local and thread-global refcounts (i.e. thread-global refcount
gets decremented with a mutex only when thread-local one goes to 0, not
upon every inc/dec)

Disadvantages:
- for thread-local mutable data quite likely to have worse performance
because data locality is worse (need to load to CPU cache two different
memory pages to work with one reference-counted instance)
- puts extra complexity pressure on allocator implementations
- bad encapsulation, matching allocator API must be @system
- incompatible with idiom of storing reference to allocator as runtime
interface inside data structures, requires some form of global allocator
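
A minimal sketch of the "separate refcount table" variant, assuming a thread-local associative array keyed by block address. `RefTable` and `incRef` are hypothetical names (only `decRef` appears in the example above), and a real allocator would keep this metadata in its own structures rather than in a GC-backed table. The point is only that the count lives in mutable memory, so an immutable payload needs no cast.

```d
// Hypothetical stand-in for allocator-side refcount metadata.
struct RefTable
{
    private size_t[const(void)*] counts;

    void incRef(const(void)* block)
    {
        if (auto c = block in counts)
            ++*c;
        else
            counts[block] = 1;
    }

    // Returns true when the last reference is gone and the block
    // could be handed back to the allocator.
    bool decRef(const(void)* block)
    {
        auto c = block in counts;
        assert(c !is null, "unknown block");
        if (--*c > 0)
            return false;
        counts.remove(block);
        return true;
    }
}

void main()
{
    RefTable table;
    immutable(char)[] payload = "hello".idup; // physically immutable data

    table.incRef(payload.ptr); // first owner
    table.incRef(payload.ptr); // second owner, no cast on the payload

    assert(!table.decRef(payload.ptr)); // one owner left
    assert(table.decRef(payload.ptr));  // last owner released
}
```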