[Dlang-study] [rcstring] Defining rcstring
February 02, 2016
Hello everyone,


Let's reboot this list with a simpler discussion: rcstring, a reference-counted string. My initial thoughts:

* make it part of a nascent std.experimental.lifetime module/package

* call it rcstring or RCString? The first makes it closer to "string", the other is politically correct.

* Characters are small so no need to return them by reference. Because of this, making RCString @safe should be possible in current D. However, this also makes RCString not a plug-in replacement for string (which may after all be a good thing)

* Since string-compatibility is off the table, how about we fix string's issues with autodecoding? RCString should offer no indexed access and no length. Instead it offers the ranges byCodeUnit, byChar, byWChar, and byDChar. The first one does not do any decoding and offers length and random access. (What should be its element type?) The other ones are bidirectional ranges that do the appropriate decoding.

* Immutable does not play well with reference counting. I'm of a mind to reject immutable rcstring for now and figure out later how to go about it. Then const rcstring is okay because we always consider const a view on mutable strings (even though they're gone). We'll cast const away when manipulating the refcount.

* I don't have the small string optimization implemented yet, but obviously the definition of the type should allow it.
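
The range-based access described in the list above can be tried out on built-in strings today. What follows is a minimal sketch, assuming Phobos' `std.utf` helpers (spelled `byWchar` and `byDchar` there, slightly differently from the names proposed above). It only illustrates the idea on `string` and is not an RCString API:

```d
import std.range.primitives : walkLength;
import std.utf : byCodeUnit, byDchar, byWchar;

void main()
{
    // "h\u00E9llo" has 5 code points; U+00E9 takes two UTF-8 code units.
    string s = "h\u00E9llo";

    // No decoding: the code-unit view offers length and random access.
    auto units = s.byCodeUnit;
    assert(units.length == 6);
    assert(units[0] == 'h');

    // Decoding views are walked as ranges and expose no length of their own.
    assert(s.byDchar.walkLength == 5); // UTF-32 code units
    assert(s.byWchar.walkLength == 5); // UTF-16 code units (no surrogates here)
}
```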


Thoughts?

Andrei
_______________________________________________
Dlang-study mailing list
Dlang-study@puremagic.com
http://lists.puremagic.com/cgi-bin/mailman/listinfo/dlang-study

February 02, 2016
On Tuesday, February 02, 2016 16:40:26 Andrei Alexandrescu wrote:
> * call it rcstring or RCString? The first makes it closer to "string", the other is politically correct.

Since it's not built-in, I really see no reason to break the naming conventions - especially if it's not a drop-in replacement for string.

> * Characters are small so no need to return them by reference. Because of this, making RCString @safe should be possible in current D. However, this also makes RCString not a plug-in replacement for string (which may after all be a good thing)

I take it that you mean that any functions that RCString might have (like opIndex or front or whatever) always return individual characters by value? That could be a bit annoying upon occasion, but I don't think that it's all that big a deal - particularly if it means that it can be @safe.

> * Since string-compatibility is off the table, how about we fix string's issues with autodecoding? RCString should offer no indexed access and no length. Instead it offers the ranges byCodeUnit, byChar, byWChar, and byDChar. The first one does not do any decoding and offers length and random access. (What should be its element type?) The other ones are bidirectional ranges that do the appropriate decoding.

I think that it's a great idea to make it so that the programmer has to ask for byCodeUnit, byChar, or whatever they want in terms of decoding. That will make dealing with RCStrings less error-prone (though it would be nice if we could do something similar to string).

As for the element type, I assume that you're talking about how it stores them internally rather than the external element type? Since if you have to use byChar or byCodeUnit or whatever to access the elements, then the element type does (on some level at least) become an implementation detail. I would have assumed that we'd either make it char (in which case the documentation should probably make that clear for efficiency purposes, but it wouldn't really matter to the API), or we would templatize RCString on the character type, in which case, byCodeUnit would be over that character type, and the other ranges would decode as necessary.

And between those two options, I'd favor templatizing it. I definitely think that most code should use char as the code unit, but some of the Windows-centric folks are going to want wchar to avoid decoding to talk to Windows system calls, and if you're doing anything where you do want to operate at the code point level quite a bit, then having dchar as the code unit (and thus avoiding decoding altogether) would be desirable for efficiency. And all it really costs us is that using RCString is a bit more verbose, because you have to do RCString!char instead of RCString, but maybe we can come up with good aliases if that's a problem.
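
To make the templated option concrete, here is a rough sketch of a string type parameterized on its code unit type. Every name and the memory layout are assumptions made for illustration; this is not the design proposed in the thread, it makes no attempt at `@safe` or thread safety, and it leaves out the decoding ranges entirely:

```d
import core.stdc.stdlib : free, malloc;

// Hypothetical sketch only: layout and member names are assumptions.
struct RCString(C)
if (is(C == char) || is(C == wchar) || is(C == dchar))
{
    private C* ptr;       // malloc'd code units
    private size_t len;   // number of code units
    private size_t* refs; // reference count shared by all copies

    this(const(C)[] s)
    {
        if (s.length == 0) return; // empty strings own nothing
        ptr = cast(C*) malloc(s.length * C.sizeof);
        refs = cast(size_t*) malloc(size_t.sizeof);
        assert(ptr !is null && refs !is null, "out of memory");
        ptr[0 .. s.length] = s[];
        len = s.length;
        *refs = 1;
    }

    // Copying shares the buffer and bumps the count.
    this(this) { if (refs !is null) ++*refs; }

    // The last owner to go away releases the buffer.
    ~this()
    {
        if (refs !is null && --*refs == 0)
        {
            free(ptr);
            free(refs);
        }
    }

    // Random-access view over the raw code units; elements come back by value.
    auto byCodeUnit()
    {
        static struct Range
        {
            private RCString owner; // holds a reference so the buffer stays alive
            private size_t lo, hi;
            bool empty() const { return lo == hi; }
            C front() const { return owner.ptr[lo]; }
            void popFront() { ++lo; }
            size_t length() const { return hi - lo; }
            C opIndex(size_t i) const
            {
                assert(i < hi - lo, "index out of range");
                return owner.ptr[lo + i];
            }
        }
        return Range(this, 0, len);
    }
}

void main()
{
    auto a = RCString!char("abc");
    auto b = a;            // shares the buffer with a
    auto r = b.byCodeUnit; // the range holds a third reference
    assert(r.length == 3);
    assert(r[1] == 'b');

    auto w = RCString!wchar("abc"w); // same code, UTF-16 code units
    assert(w.byCodeUnit.length == 3);
}
```

With a layout like this, `byCodeUnit` iterates over whatever `C` is, and the cost to the user is writing `RCString!char` rather than `RCString`.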

> * Immutable does not play well with reference counting. I'm of a mind to reject immutable rcstring for now and figure out later how to go about it. Then const rcstring is okay because we always consider const a view on mutable strings (even though they're gone). We'll cast const away when manipulating the refcount.

I would point out that it's undefined behavior to cast away const from a variable and mutate it, and the spec is very explicit about that:

http://dlang.org/spec/const3.html

So, unless you get Walter to agree that that should no longer be undefined behavior, and then we update the spec and make sure that the compiler doesn't do anything anymore that treats it as undefined behavior, then casting away const to mutate is not something that we should ever be doing. And Walter has generally been very adamant that const needs to not have any backdoors, so I don't know how easy it's going to be to convince him.

As nice as it is to be able to depend on a const variable not mutating a mutable one, there are an annoying number of places where we can't use const as long as casting away const and mutating is undefined. So, such a change may very well be for the better, but it _is_ a change, and we certainly shouldn't be introducing anything into Phobos that does it unless we change the spec and make sure that the compiler is in line with the change.
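
A small sketch of why the cast comes up at all. The `Counted` struct is a hypothetical stand-in for a reference-counted handle, not anything from the proposal. Because const is transitive, the count cannot be incremented through a const handle, so a const RCString could only touch its refcount by casting const away:

```d
// Hypothetical stand-in for a reference-counted handle.
struct Counted
{
    size_t* refs; // points at the shared count
}

void main()
{
    size_t n = 1;

    Counted m = Counted(&n);
    ++*m.refs; // fine through a mutable handle
    assert(n == 2);

    const Counted c = m;
    // const is transitive, so the count is const when reached through c.
    static assert(is(typeof(c.refs) == const(size_t*)));
    // Incrementing it does not compile without a cast.
    static assert(!__traits(compiles, ++*c.refs));
}
```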


On another note, how does this relate to discussions on adding reference counting into the language? I would assume that this can be done with or without that, but does it affect the API in any way that someone using it would care about if we introduce it with library-based reference counting and then later change it to use a new language construct if/when we add a reference counting mechanism to the language?

- Jonathan M Davis


February 03, 2016
On 02/02/2016 11:40 PM, Andrei Alexandrescu wrote:
> Hello everyone,
>
> Let's reboot this list with a simpler discussion: rcstring, a reference-counted string. My initial thoughts:

Quick proposal before discussion gets hot - I see several orthogonal topics in your initial mail and it would be nice to actually create a different mail thread for each to stay focused. It is just too easy to start going in circles when topics are interleaved.

The way I read it there are 3 different topics that have no direct relation to each other but all need to be addressed as part of [rcstring]:

1. actual reference counting implementation (has it been discussed/decided before?)
2. making use of the opportunity of a new string type and getting its encoding semantics right
3. interaction of `const` with reference counting and allowing casting away const to mutate the refcount

There is also the topic of idiomatic sharing semantics for the new string type, but it feels closely tied to (3).

Andrei, if that distinction makes sense to you, would you mind branching your first mail into 3 new roots for discussion? For example: "[rcstring] [rc]", "[rcstring] [API]", "[rcstring] [(im)mutability]".



February 02, 2016
On 02/02/2016 06:57 PM, Jonathan M Davis wrote:
> On another note, how does this relate to discussions on adding reference
> counting into the language?

Thanks for your comments. This is a small and useful step that we can take and accumulate experience with before changing the language. -- Andrei

February 02, 2016
On 02/02/2016 07:12 PM, Михаил Страшун wrote:
> Andrei, if that distinction makes sense to you, would you mind branching
> your first mail into 3 new roots for discussion? For example:
> "[rcstring] [rc]", "[rcstring] [API]", "[rcstring] [(im)mutability]".

Sorry, I don't find that sensible. Once we have a prototype going, forking discussion may be in order. For now I don't know how to do one without simultaneously minding the others. -- Andrei

February 03, 2016
On 02/03/2016 02:34 AM, Andrei Alexandrescu wrote:
> On 02/02/2016 07:12 PM, Михаил Страшун wrote:
>> Andrei, if that distinction makes sense to you, would you mind branching your first mail into 3 new roots for discussion? For example: "[rcstring] [rc]", "[rcstring] [API]", "[rcstring] [(im)mutability]".
>
> Sorry, I don't find that sensible. Once we have a prototype going, forking discussion may be in order. For now I don't know how to do one without simultaneously minding the others. -- Andrei

NP, it was just a random proposal. I'll respond to the original mail in detail a tiny bit later in that case.



February 02, 2016
On Tuesday, February 02, 2016 19:31:55 Andrei Alexandrescu wrote:
> On 02/02/2016 06:57 PM, Jonathan M Davis wrote:
> > On another note, how does this relate to discussions on adding reference counting into the language?
>
> Thanks for your comments. This is a small and useful step that we can
> take and accumulate experience with before changing the language. -- Andrei

And I see no problem with that. I just think that we should make sure that we don't do something with RCString that's likely to be incompatible with switching to built-in reference counting if that ever gets added. And I would expect that that would probably be invisible to how RCString is normally used, but I think that we should make sure that we keep that in mind in case there is something that ends up in the design which would cause problems when changing to built-in reference counting, since we would presumably want to avoid such problems.

- Jonathan M Davis


February 03, 2016
Quick remark: I am out of the loop on the latest RC discussions and have no idea how it is going to be implemented in a compiler-friendly way. So for now I'll assume that it just magically works and focus on derived topics.

On 02/02/2016 11:40 PM, Andrei Alexandrescu wrote:
> * make it part of a nascent std.experimental.lifetime module/package
>
> * call it rcstring or RCString? The first makes it closer to "string", the other is politically correct.
>
> * Characters are small so no need to return them by reference. Because of this, making RCString @safe should be possible in current D. However, this also makes RCString not a plug-in replacement for string (which may after all be a good thing)
>
> * Since string-compatibility is off the table, how about we fix
> string's issues with autodecoding? RCString should offer no indexed
> access and no length. Instead it offers the ranges byCodeUnit, byChar,
> byWChar, and byDChar. The first one does not do any decoding and
> offers length and random access. (What should be its element type?)
> The other ones are bidirectional ranges that do the appropriate decoding.

Opinion
-------

The idea of a "next generation" opt-in string replacement sounds very appealing on its own. However, I don't see it fitting in `std.experimental.lifetime`, because to act as a full-blown string replacement (and to be eventually enhanced by compiler support) it needs to be part of druntime. Putting an initial proof-of-concept implementation into `std.experimental` is a good approach, but I don't think it makes sense to promise from the very beginning that it will go to a matching `std.lifetime`.

I absolutely support explicit `byCodeUnit`/`byChar` requirement (skipping naming bikeshedding for now) but it also needs to work naturally with `byGrapheme`. Importing all std.uni is certainly an overkill but when writing documentation for new modules it must be mentioned as most "correct" option for multi-language text processing. Element type of `byCodeUnit` should be `ubyte` in my opinion so that it becomes clear each separate element is not a valid char on its own.

Not sure what you mean about "no need to return them by reference" though. Does that apply only to the byX ranges, or do you want to make the whole string effectively unmodifiable? In other words, what will the idiom of a mutable reusable buffer look like?

Related Stories
---------------

When it comes to encoding, there is also the issue of how lacking the current support for non-UTF encodings in Phobos is. I want to share two stories from personal experience when it was an issue:

1. A friend of mine got very excited about Pegged from DConf videos and wanted to try it out for a project consisting of a bunch of very small DSL implementations. However, after initial experiments he quickly found out that Pegged always uses char[] internally with auto-decoding and offers no way to change it. It wasn't even a performance issue - my friend needed the resulting parser to work with extended ASCII text input, which isn't valid UTF at all. He abandoned the idea of using D/Pegged when he discovered this and switched to a solution he liked less but which actually worked.

2. A while ago I have been helping to optimize small piece of bioinformatics code which was processing DNA sequence from a huge text file. Initial performance was surprisingly bad and one of biggest speedups (~ 2x-3x) was achieved by casting read data to ubyte[] and reimplementing Phobos functions from std.string that expect `string` arguments to also accept ubyte[] - because the text

Both stories show one recurring issue with existing string design - simply using `string` or `char[]` every time you think about text is so easy that even experienced developers keep forgetting it must represent Unicode text and that other options may be more applicable. Requiring explicit `byChar` is one good way to encourage thinking about used encoding but I think it is also very important for new `rcstring` to also support other kinds of encodings without too much added hassle.
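
As a small illustration of the byte-level workaround from the second story, the sketch below assumes Phobos' `std.string.representation` and uses made-up sample data. It views a string as raw bytes so that range algorithms run over code units without any decoding:

```d
import std.algorithm.searching : count;
import std.string : representation;

void main()
{
    string dna = "GATTACAGG"; // made-up sample input

    // Reinterpret the string as bytes; no copy and no UTF decoding.
    immutable(ubyte)[] bytes = dna.representation;
    assert(bytes.length == 9);

    // Generic algorithms now see ubyte elements instead of decoded dchars.
    assert(bytes.count(cast(ubyte) 'G') == 3);
}
```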

Proposals
---------

For me idealised sequence of events could look like this:

1. Put draft implementation of `rcstring` into
`std.experimental.rcstring` making it templated on encoding policy
2. When it looks good, move implementation to druntime and very slowly
start using it internally there
3. Make `utf8string` an alias for `rcstring!(Encoding.UTF8)` and use it
as only string type within druntime itself (it doesn't need to support
any others).
4. Enhance std.encoding to provide any additional rcstring encodings,
most importantly raw ASCII
5. Review Phobos to deprecate uncalled-for assumptions for `char[]`,
especially when working with I/O (files, stdin)

It would take a very long time of course but I always prefer to know a long-term picture for any new design.

> * Immutable does not play well with reference counting. I'm of a mind to reject immutable rcstring for now and figure out later how to go about it. Then const rcstring is okay because we always consider const a view on mutable strings (even though they're gone). We'll cast const away when manipulating the refcount.

This one is the toughest in my opinion. Putting aside my own opinion and preferences, you should have answers on several points if pursuing this way:

* What are cases for const if one wants to prohibit immutable for a
given a type? Being a wildcard for mutable/immutable is main idea behind
const. Everything else is just making compiler happy when it forces
const on you (like `this` pointer within in/out contracts).
* As a consequence, how will compiler ensure in/out contracts won't
affect refcounting state for `this` if it becomes legal to cast const
away and mutate?
* How do you envision efficient cross-thread sharing of rcstring if
immutability is out of the question?
* If one can't support immutability for something relatively simple and
specialized like char array, doesn't it effectively kill the concept of
immutable containers important for multi-threading?

Right now I am of the opinion that the issues with immutability highlight issues with your desired approach to const and should not be discarded that easily. But it is a feeling caused more by lack of understanding than by hard proof, thus I am interested in how you envision it all fitting together. I am also most interested to learn what Walter has to say on the topic, especially in regards to how such changes to const would affect code generation and the optimizations available to the compiler.

> * I don't have the small string optimization implemented yet, but obviously the definition of the type should allow it.

Sounds fairly uncontroversial.

Best Regards,
Dicebot



February 03, 2016
On Wednesday, February 03, 2016 06:46:53 Михаил Страшун wrote:
> Element type of `byCodeUnit` should be `ubyte` in my opinion so that it becomes clear each separate element is not a valid char on its own.

By definition, char is a UTF-8 code unit, wchar is a UTF-16 code unit, and dchar is a UTF-32 code unit, and so code is supposed to be able to assume that. So, I don't see why it would make sense to use ubyte for a code unit. We already have types which are explicitly for code units.

Now, by that same token, having the I/O stuff use ubyte rather than char (as you suggested elsewhere in your post) does make a lot of sense precisely because there's no guarantee that what's read in is actually in UTF-8, and any code where it's not sure really should be using ubyte, ushort, or uint instead of char, wchar, or dchar. Having the I/O functions assume UTF-8 was definitely a mistake IMHO, much as it usually works. But the strings themselves are supposed to be UTF-8, UTF-16, or UTF-32. So, IMHO, RCString should be operating on chars, wchars, or dchars and not ubytes, ushorts, or uints.

- Jonathan M Davis



February 03, 2016
On Tuesday, 2 February 2016 at 21:40:26 UTC, Andrei Alexandrescu wrote:
> * call it rcstring or RCString? The first makes it closer to "string", the other is politically correct.

RCString

> * Characters are small so no need to return them by reference. Because of this, making RCString @safe should be possible in current D. However, this also makes RCString not a plug-in replacement for string (which may after all be a good thing)

I assume you also don't want to allow access to the underlying storage (most likely `char[]`?). I don't see this as a good thing at all. Without the ability of getting at least a `const(char)[]` out of it, RCString would be next to useless in practice.

>
> * Since string-compatibility is off the table, how about we fix string's issues with autodecoding? RCString should offer no indexed access and no length. Instead it offers the ranges byCodeUnit, byChar, byWChar, and byDChar. The first one does not do any decoding and offers length and random access. (What should be its element type?) The other ones are bidirectional ranges that do the appropriate decoding.

Definitely. I would prefer this even for normal strings... The element type of `byCodeUnit` should be char/wchar/dchar, of course.

>
> * Immutable does not play well with reference counting. I'm of a mind to reject immutable rcstring for now and figure out later how to go about it. Then const rcstring is okay because we always consider const a view on mutable strings (even though they're gone). We'll cast const away when manipulating the refcount.

As noted by others, this is not a good idea. We can't solve this recurring problem by ignoring it.
