std.experimental.collections.rcstring and its integration in Phobos
July 17, 2018
So we managed to revive the rcstring project and it's already a PR for Phobos:

https://github.com/dlang/phobos/pull/6631 (still WIP though)

The current approach in short:

- uses the new @nogc, @safe and nothrow Array from the collections library (check Eduardo's DConf18 talk)
- uses reference counting
- _no_ range by default (it needs an explicit `.by!{d,w,}char`) (as in no auto-decoding by default)

Still to be done:

- integration in Phobos (the current idea is to generate additional overloads for rcstring)
- performance
- use of static immutable rcstring in fully @nogc
- extensive testing

Especially the "seamless" integration in Phobos will be challenging.
I made a rough listing of all symbols that one would expect to be usable with an rcstring type (https://gist.github.com/wilzbach/d74712269f889827cff6b2c7a08d07f8). It's more than 200.
As rcstring isn't a range by default, but one expects `"foo".rcstring.equal("foo")` to work, overloads for all these symbols would need to be added.

What do you think about this approach? Do you have a better idea?
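
For readers unfamiliar with the idea behind `.by!{d,w,}char`: the same "pick your view explicitly" approach can be approximated today on built-in strings with `std.utf.byCodeUnit`, `std.utf.byDchar` and `std.uni.byGrapheme`. The sketch below uses only those existing Phobos functions, not the rcstring API from the PR; the helper names are made up for illustration.

```d
import std.algorithm.comparison : equal;
import std.range.primitives : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

/// Number of UTF-8 code units in `s` (helper name is illustrative).
size_t codeUnitCount(string s) { return s.byCodeUnit.walkLength; }

/// Number of code points in `s`.
size_t codePointCount(string s) { return s.byDchar.walkLength; }

/// Number of graphemes (user-perceived characters) in `s`.
size_t graphemeCount(string s) { return s.byGrapheme.walkLength; }

unittest
{
    // "a" followed by a combining diaeresis (U+0308), then "b"
    string s = "a\u0308b";
    assert(codeUnitCount(s) == 4);  // 1 + 2 + 1 bytes of UTF-8
    assert(codePointCount(s) == 3);
    assert(graphemeCount(s) == 2);

    // Comparing two explicit code-unit views involves no decoding.
    assert(s.byCodeUnit.equal("a\u0308b".byCodeUnit));
}
```

The three counts differ for the same string, which is why a default view has to be chosen deliberately or not at all.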
July 17, 2018
On Tuesday, July 17, 2018 15:21:30 Seb via Digitalmars-d wrote:
> So we managed to revive the rcstring project and it's already a PR for Phobos:
>
> https://github.com/dlang/phobos/pull/6631 (still WIP though)
>
> The current approach in short:
>
> - uses the new @nogc, @safe and nothrow Array from the
> collections library (check Eduardo's DConf18 talk)
> - uses reference counting
> - _no_ range by default (it needs an explicit `.by!{d,w,}char`)
> (as in no auto-decoding by default)
>
> Still to be done:
>
> - integration in Phobos (the current idea is to generate
> additional overloads for rcstring)
> - performance
> - use of static immutable rcstring in fully @nogc
> - extensive testing
>
> Especially the "seamless" integration in Phobos will be
> challenging.
> I made a rough listing of all symbols that one would expect to be
> usable with an rcstring type
> (https://gist.github.com/wilzbach/d74712269f889827cff6b2c7a08d07f8). It's
> more than 200. As rcstring isn't a range by default, but one expects
> `"foo".rcstring.equal("foo")` to work, overloads for all these
> symbols would need to be added.
>
> What do you think about this approach? Do you have a better idea?

If it's not a range by default, why would you expect _anything_ which operates on ranges to work with rcstring directly? IMHO, if it's not a range, then range-based functions shouldn't work with it, and I don't see how they even _can_ work with it unless you assume code units, or code points, or graphemes as the default. If it's designed to not be a range, then it should be up to the programmer to call the appropriate function on it to get the appropriate range type for a particular use case, in which case, you really shouldn't need to add much of any overloads for it.

- Jonathan M Davis

July 17, 2018
On Tuesday, 17 July 2018 at 16:58:37 UTC, Jonathan M Davis wrote:
> On Tuesday, July 17, 2018 15:21:30 Seb via Digitalmars-d wrote:
>> [...]
>
> If it's not a range by default, why would you expect _anything_ which operates on ranges to work with rcstring directly? IMHO, if it's not a range, then range-based functions shouldn't work with it, and I don't see how they even _can_ work with it unless you assume code units, or code points, or graphemes as the default. If it's designed to not be a range, then it should be up to the programmer to call the appropriate function on it to get the appropriate range type for a particular use case, in which case, you really shouldn't need to add much of any overloads for it.
>
> - Jonathan M Davis

Well, there are a few cases where the range type doesn't matter and one can simply compare bytes, e.g.

equal (e.g. "ä" == "ä" <=> [195, 164] == [195, 164])
commonPrefix
find
...

Of course this assumes that there's no normalization necessary, but the current auto-decoding assumes this too.
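
A rough illustration of the byte-wise case, using `std.string.representation` on built-in strings as a stand-in for rcstring's internal bytes (the helper names are hypothetical):

```d
import std.algorithm.comparison : equal;
import std.algorithm.searching : commonPrefix, find;
import std.string : representation;

/// True if both strings consist of exactly the same UTF-8 bytes.
bool sameBytes(string a, string b)
{
    return equal(a.representation, b.representation);
}

/// Length in bytes of the prefix shared by `a` and `b`.
size_t commonPrefixBytes(string a, string b)
{
    return commonPrefix(a.representation, b.representation).length;
}

unittest
{
    // U+00E4 ("ä") is the UTF-8 byte sequence [195, 164].
    assert(equal("\u00E4".representation, [195, 164]));
    assert(sameBytes("\u00E4", "\u00E4"));
    assert(commonPrefixBytes("\u00E4bc", "\u00E4bd") == 3);
    // find on raw bytes returns the remainder starting at the match.
    assert(find("\u00E4bc".representation, "c".representation).length == 1);
}
```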
July 17, 2018
On 2018-07-17 17:21, Seb wrote:

> - _no_ range by default (it needs an explicit `.by!{d,w,}char`) (as in no auto-decoding by default)
> 
> What do you think about this approach? Do you have a better idea?

I vote for .by!char to be the default.

-- 
/Jacob Carlborg
July 17, 2018
On Tuesday, July 17, 2018 17:28:19 Seb via Digitalmars-d wrote:
> On Tuesday, 17 July 2018 at 16:58:37 UTC, Jonathan M Davis wrote:
> > On Tuesday, July 17, 2018 15:21:30 Seb via Digitalmars-d wrote:
> >> [...]
> >
> > If it's not a range by default, why would you expect _anything_ which operates on ranges to work with rcstring directly? IMHO, if it's not a range, then range-based functions shouldn't work with it, and I don't see how they even _can_ work with it unless you assume code units, or code points, or graphemes as the default. If it's designed to not be a range, then it should be up to the programmer to call the appropriate function on it to get the appropriate range type for a particular use case, in which case, you really shouldn't need to add much of any overloads for it.
> >
> > - Jonathan M Davis
>
> Well, there are a few cases where the range type doesn't matter and one can simply compare bytes, e.g.
>
> equal (e.g. "ä" == "ä" <=> [195, 164] == [195, 164])
> commonPrefix
> find
> ...

That effectively means treating rcstring as a range of char by default rather than not treating it as a range by default. And if we then do that only with functions that overload on rcstring rather than making rcstring actually a range of char, then why aren't we just treating it as a range of char in general?

IMHO, the fact that so many algorithms currently special-case on arrays of characters is one reason that auto-decoding has been a disaster, and adding a bunch of overloads for rcstring is just compounding the problem. Algorithms should properly support arbitrary ranges of characters, and then rcstring can be passed to them by calling one of the functions on it to get a range of code units, code points, or graphemes to get an actual range - either that, or rcstring should default to being a range of char. Going halfway and making it work with some functions via overloads really doesn't make sense.

Now, if we're talking about functions that really operate on strings and not ranges of characters (and thus do stuff like append), then that becomes a different question, but we've mostly been trying to move away from functions like that in Phobos.
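
As a minimal sketch of what "properly support arbitrary ranges of characters" can look like, the function below is constrained on the element type rather than on any concrete string type. `countSpaces` is a made-up example, not a Phobos function.

```d
import std.range.primitives : ElementType, isInputRange;
import std.traits : isSomeChar;
import std.utf : byCodeUnit, byDchar;

/// Counts ASCII spaces in any input range of characters.
size_t countSpaces(R)(R chars)
if (isInputRange!R && isSomeChar!(ElementType!R))
{
    size_t n;
    foreach (c; chars)
    {
        if (c == ' ')
            ++n;
    }
    return n;
}

unittest
{
    string s = "a \u00E4 b";
    assert(countSpaces(s.byCodeUnit) == 2); // range of char
    assert(countSpaces(s.byDchar) == 2);    // range of dchar
}
```

The caller decides which view to pass; the algorithm needs no overload per string type.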

> Of course this assumes that there's no normalization necessary, but the current auto-decoding assumes this too.

You can still normalize with auto-decoding (the code units - and thus code points - are in a specific order even when encoded, and that order can be normalized), and really, anyone who wants fully correct string comparisons needs to be normalizing their strings. With that in mind, rcstring probably should support normalization of its internal representation.
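
To make the normalization point concrete, here is a small sketch with `std.uni.normalize` on built-in strings (the helper `equalNormalized` is hypothetical): the precomposed and decomposed spellings of "ä" differ byte for byte but compare equal once both are in NFC.

```d
import std.uni : NFC, normalize;

/// Compares two strings after bringing both into NFC.
bool equalNormalized(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

unittest
{
    string composed = "\u00E4";    // precomposed U+00E4
    string decomposed = "a\u0308"; // "a" + combining diaeresis
    assert(composed != decomposed);               // different code units
    assert(equalNormalized(composed, decomposed)); // same text after NFC
}
```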

- Jonathan M Davis


July 17, 2018
On Tuesday, 17 July 2018 at 15:21:30 UTC, Seb wrote:
> So we managed to revive the rcstring project and it's already a PR for Phobos:
>
> [snip]
>

I'm glad this is getting worked on. It feels like something that D has been working towards for a while.

Unfortunately, I haven't (yet) watched the collections video at DConf and don't see a presentation on the website. Because of that, I don't really understand some of the design decisions.

For instance, I also don't really understand how RCIAllocator is different from the old IAllocator (the documentation could use some work, IMO). It looks like RCIAllocator is part of what drives the reference counting semantics, but it also looks like Array has some support for reference counting, such as addRef, that invokes RCIAllocator somehow. But Array also has some support for gc_allocator as the default, so my cursory examination suggests that Array is not really intended to be an RCArray...

So at that point I started wondering why not just have String as an alias of Array, akin to how D does it for dynamic arrays to strings currently. If there is stuff in rcstring now that isn't in Array, then that could be included in Array as a compile-time specialization for the relevant types (at the cost of bloating Array). And then leave it up to the user how to allocate.
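
For reference, this is how the built-in string types relate to dynamic arrays, followed by what the analogous alias might look like. The `Array` struct here is a hypothetical placeholder written for this sketch, not the type from the collections library.

```d
// The built-in string types are plain aliases of dynamic arrays.
static assert(is(string == immutable(char)[]));
static assert(is(wstring == immutable(wchar)[]));
static assert(is(dstring == immutable(dchar)[]));

/// Hypothetical stand-in for the collections `Array`; not the real type.
struct Array(T)
{
    T[] payload;
    size_t length() const { return payload.length; }
}

/// The analogous alias: a string is just an Array of immutable chars.
alias String = Array!(immutable(char));

unittest
{
    auto s = String("abc");
    assert(s.length == 3);
}
```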

I think part of the above design decision connects in with why rcstring stores the data as ubytes, even for wchar and dchar. Recent comments suggest that it is related to auto-decoding. My sense is that an rcstring that does not have auto-decoding, even if it requires more work to get working with Phobos, is a better solution over the long run.
July 17, 2018
On 7/17/18 12:58 PM, Jonathan M Davis wrote:
> If it's not a range by default, why would you expect _anything_ which
> operates on ranges to work with rcstring directly?

Many functions do not care about the range aspect, but do care about the string aspect. Consider e.g. chdir.
July 18, 2018
On 18/07/2018 5:41 AM, Jacob Carlborg wrote:
> On 2018-07-17 17:21, Seb wrote:
> 
>> - _no_ range by default (it needs an explicit `.by!{d,w,}char`) (as in no auto-decoding by default)
>>
>> What do you think about this approach? Do you have a better idea?
> 
> I vote for .by!char to be the default.

I'm thinking .as!T

So we can cover, ubyte/char/wchar/dchar, string/wstring/dstring all in one.

I think whatever we expose as the default for string/wstring/dstring however should be settable. e.g.

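```
struct RCString(DefaultStringType=string) {
	immutable(ubyte)[] data; // placeholder storage, assumed for this sketch
	T as(T)() { return cast(T) data; } // proposed conversion, body assumed
	DefaultStringType asDefault() { return as!DefaultStringType; }
	alias asDefault this; // `alias this` requires a plain member name
}
```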

Which is a perfect example of what my named parameter DIP is for ;)
July 18, 2018
On Tuesday, 17 July 2018 at 15:21:30 UTC, Seb wrote:
> So we managed to revive the rcstring project and it's already a PR for Phobos:
>
> https://github.com/dlang/phobos/pull/6631 (still WIP though)
>
> The current approach in short:
>
> - uses the new @nogc, @safe and nothrow Array from the collections library (check Eduardo's DConf18 talk)
> - uses reference counting
> - _no_ range by default (it needs an explicit `.by!{d,w,}char`) (as in no auto-decoding by default)
>
> [snip]
>
> What do you think about this approach? Do you have a better idea?

I don't know the goals/role rcstring is expected to play, especially wrt existing string/character facilities. Perhaps you could describe more?

Strings are central to many applications, so I'm wondering about things like whether rcstring is intended as a replacement for string that would be used by most new programs, and whether applications would use arrays and ranges of char together with rcstring, or rcstring would be used for everything.

Perhaps it's too early for these questions, and the current goal is simpler. For example, adding a meaningful collection class that is @nogc, @safe and ref-counted that can be used as a proving ground for the newer memory management facilities being developed.

Such simpler goals would be quite reasonable. What's got me wondering about the larger questions are the comments about ranges and autodecoding. If rcstring is intended as a vehicle for general @nogc handling of character data and/or for reducing the impact of autodecoding, then it makes sense to consider from those perspectives.

--Jon
July 17, 2018
On Tuesday, July 17, 2018 22:45:33 Andrei Alexandrescu via Digitalmars-d wrote:
> On 7/17/18 12:58 PM, Jonathan M Davis wrote:
> > If it's not a range by default, why would you expect _anything_ which operates on ranges to work with rcstring directly?
>
> Many functions do not care about the range aspect, but do care about the string aspect. Consider e.g. chdir.

It doesn't care about strings either. It operates on a range of characters. If a function is just taking a value as input and isn't storing it or mutating its elements, then a range of characters works perfectly fine and is more flexible than any particular type - and IMHO shouldn't then be having overloads for particular ranges of characters or string types if we can avoid it. If we're talking about a function that's really operating on a string as a string and doing things like appending as opposed to doing range-based operations, then maybe overloading for other string types makes sense rather than requiring an array of characters. But if it's just taking a string and reading it? That has no need to operate on strings specifically and should be operating on a range of characters - something that we've been moving towards with Phobos.

As such, I don't think that it generally makes sense for functions in Phobos to be explicitly accepting rcstring unless it's actually a range. If it's not actually a range, then such functions should already work with it by calling the appropriate function to get a range over it without needing to special-case anything.
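
A sketch of that arrangement, under the assumption of a string type that is deliberately not a range: `OpaqueString` and `countSeparators` are hypothetical names invented for this example, and `byChar` here is a member of the stand-in type, not the rcstring API. The read-only function has no overload for the string type, yet works once the caller asks for an explicit view.

```d
import std.range.primitives : ElementType, isInputRange;
import std.traits : isSomeChar;
import std.utf : byCodeUnit;

/// Hypothetical string type that is deliberately not a range.
struct OpaqueString
{
    string payload;

    /// Explicit view over the UTF-8 code units.
    auto byChar() { return payload.byCodeUnit; }
}

/// Read-only function written against ranges of characters only.
size_t countSeparators(R)(R path)
if (isInputRange!R && isSomeChar!(ElementType!R))
{
    size_t n;
    foreach (c; path)
    {
        if (c == '/')
            ++n;
    }
    return n;
}

// The type itself does not satisfy the range interface.
static assert(!isInputRange!OpaqueString);

unittest
{
    auto dir = OpaqueString("usr/local/bin");
    // Passing the string directly is rejected at compile time...
    static assert(!__traits(compiles, countSeparators(dir)));
    // ...while the explicit view works without any special-casing.
    assert(countSeparators(dir.byChar) == 2);
}
```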

- Jonathan M Davis
