Ranges of char and wchar - D Programming Language Discussion Forum

May 08, 2014

Ranges of char and wchar

Posted by Andrei Alexandrescu

A discussion is building around https://github.com/D-Programming-Language/phobos/pull/2149, which is a nice initiative by Walter to allow Phobos users to avoid or control memory allocation.

The first iteration of the pull request copied the inputs into an output range.

The second instance (right now) creates an input range that lazily creates the result. The element type of that range is the encoding type of the first argument (i.e. char or wchar most of the time). This is different from string/wstring/etc element-wise iteration because it'll be done code unit-wise, not code point-wise.
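A minimal sketch of the difference between the two iteration modes, assuming a Phobos recent enough to ship std.utf.byCodeUnit (which postdates this thread). The helper names codePoints and codeUnits are made up for illustration:

```d
import std.range : walkLength;
import std.utf : byCodeUnit;

// Length as seen by range algorithms: a string is auto-decoded to code points.
size_t codePoints(string s) { return s.walkLength; }

// Length as seen by a code-unit range: one element per UTF-8 code unit.
size_t codeUnits(string s) { return s.byCodeUnit.walkLength; }

void main()
{
    string name = "h\u00E9llo.txt";     // U+00E9 takes two UTF-8 code units
    assert(codePoints(name) == 9);
    assert(codeUnits(name) == 10);
    assert(name.length == 10);          // .length also counts code units
}
```

A lazy function whose element type is char or wchar behaves like the second helper: consumers see code units, not decoded characters.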

We need a robust idiom for doing such string manipulation without allocation, for which setExtension is just an example. Going the output range route has nice things going for it because the output range decides the encoding in advance and then accepts via put() calls any encoding, with only the minimum transcoding needed.
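To make the output range behaviour concrete, here is a small sketch using std.array.appender as the sink. The function joinAsUtf16 is a hypothetical helper, not anything from the pull request; the point is only that the sink fixes the target encoding and put() accepts sources in other encodings:

```d
import std.array : appender;

// Hypothetical helper: the sink decides on UTF-16 up front.
wstring joinAsUtf16(string dir, dstring name)
{
    auto sink = appender!wstring();
    sink.put(dir);    // UTF-8 source, transcoded to UTF-16 as it is written
    sink.put('/');    // a single ASCII code unit
    sink.put(name);   // UTF-32 source, transcoded likewise
    return sink.data;
}

void main()
{
    assert(joinAsUtf16("dir", "h\u00E9llo.txt"d) == "dir/h\u00E9llo.txt"w);
}
```

Note that the whole result is produced eagerly inside the function, which is the trade-off discussed next.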

However output range means the string operation will be done eagerly, whereas lazy has advantages (nice piping, saving on work etc).

On the other hand, there's the risk of becoming "more catholic than the Pope" by insisting on lazy string processing. Most string operations are eager, and insisting on a general framework for lazy encoded operations on strings may be an exaggeration.

D{iscuss,estroy}!


Andrei

May 08, 2014

Re: Ranges of char and wchar

Posted by H. S. Teoh
in reply to Andrei Alexandrescu

On Thu, May 08, 2014 at 10:46:12AM -0700, Andrei Alexandrescu via Digitalmars-d wrote:
> A discussion is building around https://github.com/D-Programming-Language/phobos/pull/2149, which is a nice initiative by Walter to allow Phobos users to avoid or control memory allocation.
> 
> First instance of the pull request copied the inputs into an output range.
> 
> The second instance (right now) creates an input range that lazily
> creates the result.

I've thought about input ranges vs. output ranges for a bit.  I think it doesn't make sense for functions that process data to take an output range: output ranges are data sinks, and should only be used for the endpoint of a data processing pipeline. Since the string function doesn't know whether or not it's the last in a pipeline (only the calling code can know this), it should return an input range. If the user code wants to put the result into an output range, then it should simply use std.algorithm.copy.

This way, you maximize the usability of the function -- it can participate in UFCS chains, compose with other std.algorithm functions, etc.
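As a rough sketch of that division of labour: the processing stage returns an input range, and only the caller at the end of the pipeline picks a sink and calls std.algorithm.copy. The names upperCased and materialize are hypothetical, and upper-casing stands in for any string operation:

```d
import std.algorithm : copy, map;
import std.array : appender;
import std.ascii : toUpper;
import std.utf : byCodeUnit;

// Lazy stage: returns an input range of code units and allocates nothing.
auto upperCased(string s)
{
    return s.byCodeUnit.map!(c => toUpper(c));
}

// End of the pipeline: the caller chooses the sink and copies into it.
string materialize(string s)
{
    auto sink = appender!string();
    // copy returns the target range, so read the data from its result.
    return upperCased(s).copy(sink).data;
}

void main()
{
    assert(materialize("file.txt") == "FILE.TXT");
}
```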

[...]
> We need a robust idiom for doing such string manipulation without allocation, for which setExtension is just an example. Going the output range route has nice things going for it because the output range decides the encoding in advance and then accepts via put() calls any encoding, with only the minimum transcoding needed.

The problem with this approach is that it hampers usage in UFCS pipelines.

> However output range means the string operation will be done eagerly, whereas lazy has advantages (nice piping, saving on work etc).
> 
> On the other hand, there's the risk of becoming "more catholic than the Pope" by insisting on lazy string processing. Most string operations are eager, and insisting on a general framework for lazy encoded operations on strings may be an exaggeration.
[...]

In terms of usability, my opinion is that it makes most sense to return an input range. Let the user decide when the result should be copied into an output range (via std.algorithm.copy).

Compare the following for constructing a path from a directory name, a filename, and an extension:

Case 1: setExtension takes an output range:

	// Look how ugly this is:
	string dirname = ...;
	string filename = ...;

	// Need temp buffer to store result
	char[128] result;
	char[] outputRange = result[];

	dirname.copy(outputRange);
	setExtension(filename, ".ext", outputRange);

	writeln(result);

Case 2: setExtension takes an input range:

	// Look how clean this is:
	string dirname = ...;
	string filename = ...;

	writeln(chain(dirname, setExtension(filename, ".ext")));

In case 1, the user has to manually create various intermediate buffers to store intermediate results. I used a trivial example here, but in application code, the processing you need is usually far more complex. This means creating lots of intermediate buffers, making sure you link the right ones together, etc. The code becomes very verbose, and becomes a maintenance nightmare (which of the tmp1, tmp2, tmp3 buffers refer to which fragment of the result again? Oh oops, I think I passed the wrong output range to setExtension).

In case 2, the user decides when a buffer is needed and when it's not. The function calls chain very nicely. The code is more readable, and easy to maintain (and needless allocations -- including temporary static buffers -- are eliminated).
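For anyone who wants to try case 2, here is a self-contained sketch. lazySetExt is a hypothetical stand-in for a lazy setExtension, not the code from the pull request, and it deliberately ignores corner cases such as dots in directory names:

```d
import std.algorithm : equal;
import std.range : chain;
import std.string : lastIndexOf;
import std.utf : byCodeUnit;

// Hypothetical stand-in for a lazy setExtension (simplified on purpose).
auto lazySetExt(string path, string ext)
{
    immutable dot = path.lastIndexOf('.');
    auto stem = dot < 0 ? path : path[0 .. dot];
    // Nothing is copied: the result just walks the two slices in turn.
    return chain(stem.byCodeUnit, ext.byCodeUnit);
}

void main()
{
    string dirname = "dir/";
    string filename = "file.txt";

    // No intermediate buffer anywhere in the pipeline.
    auto full = chain(dirname.byCodeUnit, lazySetExt(filename, ".ext"));
    assert(full.equal("dir/file.ext"));
}
```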

T

-- 
Nearly all men can stand adversity, but if you want to test a man's character, give him power. -- Abraham Lincoln

May 08, 2014

Re: Ranges of char and wchar

Posted by Walter Bright
in reply to Andrei Alexandrescu

On 5/8/2014 10:46 AM, Andrei Alexandrescu wrote:
> A discussion is building around
> https://github.com/D-Programming-Language/phobos/pull/2149, which is a nice
> initiative by Walter to allow Phobos users to avoid or control memory allocation.

The setExtension() function is itself not very important, but what is important is an example for how to put together ranges.

Some design goals:

1. purity, @safe, nothrow, @nogc
2. composability
3. have them work in a consistent way, so there's less for a user to learn
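A sketch of what goal 1 can look like in practice, assuming std.utf.byCodeUnit is available. The function lowered is hypothetical. Working on code units matters here: iterating the string directly would auto-decode, and decoding can throw, which rules out nothrow and @nogc.

```d
import std.algorithm : equal, map;
import std.ascii : toLower;
import std.utf : byCodeUnit;

// Hypothetical lazy ASCII lower-casing with all four attributes spelled out.
auto lowered(const(char)[] s) pure nothrow @safe @nogc
{
    return s.byCodeUnit.map!(c => toLower(c));
}

// The consumer can carry the same attributes, since nothing allocates or throws.
bool check() pure nothrow @safe @nogc
{
    return lowered("FILE.TXT").equal("file.txt".byCodeUnit);
}

void main()
{
    assert(check());
}
```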

May 08, 2014

Re: Ranges of char and wchar

Posted by Nick Sabalausky
in reply to Andrei Alexandrescu

On 5/8/2014 1:46 PM, Andrei Alexandrescu wrote:
>
> However output range means the string operation will be done eagerly,
> whereas lazy has advantages (nice piping, saving on work etc).

Isn't eagerness what we have array() for?

Also, while naming is often just bikeshedding, in this case I think we really need to address it. It's unfortunate that the new InputRange version can't be overloaded with the non-range version, because distinguishing them with "setExt" vs "setExtension" is just a non-option as far as I'm concerned.

But IMO that's the minor stuff, I think there's a bigger elephant in the room here:

The various benefits of doing it as an InputRange instead of OutputRange, combined with the terrible mess that resulted from converting such a straightforward function from OutputRange to InputRange is yet another clear indication that we REALLY need a coroutine-like/C#-like way to create InputRanges. Do we really expect simple algorithms exploding into messes like that to become the norm in D? I certainly hope not. I'd rather just go back to opApply.

I'm convinced this is a real problem for D moving forward, and I don't like continually seeing it get swept under the same "too unimportant for right now" rug as things that actually belong under there (like named parameters, ast macros, or pattern matching). Even C# is making us look really, really bad here, and C# can't even do generic code worth a damn.

May 08, 2014

Re: Ranges of char and wchar

Posted by H. S. Teoh
in reply to Walter Bright

On Thu, May 08, 2014 at 11:27:54AM -0700, Walter Bright via Digitalmars-d wrote:
> On 5/8/2014 10:46 AM, Andrei Alexandrescu wrote:
> >A discussion is building around https://github.com/D-Programming-Language/phobos/pull/2149, which is a nice initiative by Walter to allow Phobos users to avoid or control memory allocation.
> 
> The setExtension() function is itself not very important, but what is important is an example for how to put together ranges.
> 
> Some design goals:
> 
> 1. purity, @safe, nothrow, @nogc
> 2. composability
> 3. have them work in a consistent way, so there's less for a user to learn

I think having setExtension() return an input range is the best solution. It will satisfy all 3 requirements: it's easy to make it pure, safe, nothrow, nogc (since it lazily generates the result); it's easy to compose with UFCS chains, and it's consistent with the rest of the range-based functions in Phobos.

Having setExtension (or any string function) take an output range would break (2) -- you have to allocate intermediate buffers to serve as output ranges if you want to do further processing of the result.

If setExtension returns an input range, then you can just use std.algorithm.copy to copy the result to an output range should you need to. Going the other way, you can't easily convert an output range into an input range.

There is the issue of decoding, though. Perhaps setExtension should take a compile-time argument specifying which char type is desired in the result? So you could do:

	string filename, ext;
	wstring pathname_w = filename.setExtension!wchar(ext).array;
	dstring pathname_d = filename.setExtension!dchar(ext).array;

This way, if the output char type is identical to the input char type, the function can bypass decoding.
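Something close to this can be sketched with std.utf.byUTF, which postdates this thread. lazySetExtAs is a hypothetical helper, not the API from the pull request; when the requested type matches the source encoding, byUTF has nothing to transcode:

```d
import std.algorithm : equal;
import std.range : chain, walkLength;
import std.string : lastIndexOf;
import std.utf : byCodeUnit, byUTF;

// Hypothetical: a lazy setExtension whose output code unit type C is
// chosen by the caller at compile time.
auto lazySetExtAs(C)(string path, string ext)
{
    immutable dot = path.lastIndexOf('.');
    auto stem = dot < 0 ? path : path[0 .. dot];
    return chain(stem.byCodeUnit, ext.byCodeUnit).byUTF!C;
}

void main()
{
    string filename = "hell\u00F6.txt";

    // Same logical result, different encodings.
    assert(lazySetExtAs!wchar(filename, ".ext").equal("hell\u00F6.ext"w));
    assert(lazySetExtAs!dchar(filename, ".ext").equal("hell\u00F6.ext"d));

    // U+00F6 is two UTF-8 code units but a single UTF-16 code unit.
    assert(lazySetExtAs!char(filename, ".ext").walkLength == 10);
    assert(lazySetExtAs!wchar(filename, ".ext").walkLength == 9);
}
```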

T

-- 
"Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next. -- (Stolen from the net)

May 08, 2014

Re: Ranges of char and wchar

Posted by Walter Bright
in reply to Andrei Alexandrescu

On 5/8/2014 10:46 AM, Andrei Alexandrescu wrote:
> However output range means the string operation will be done eagerly, whereas
> lazy has advantages (nice piping, saving on work etc).

The lazy vs eager issue seems at first blush to be an esoteric, potayto-potahto bikeshed.

But I think it's more interesting than that.

Lazy means that the computation is divided into two pieces:

1. creating an engine to do the calculations
2. feeding the data to the engine to generate a result

This, of course, reminds me of regex. By creating the regex engine separately, and then using that engine for multiple sets of data, one can afford to spend extra computational effort at optimizing the engine. Creating the engine can even be done at compile time (as Dmitry's most excellent regex engine does).
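The two-step split looks roughly like this with std.regex. The function extensions and its pattern are made up for illustration; swapping regex(...) for ctRegex!(...) would move engine construction to compile time:

```d
import std.regex : matchFirst, regex;

// Hypothetical helper: build the engine once, then feed it many inputs.
string[] extensions(string[] paths)
{
    auto engine = regex(`\.(\w+)$`);        // step 1: create the engine
    string[] result;
    foreach (p; paths)
    {
        auto m = matchFirst(p, engine);     // step 2: feed data to the engine
        result ~= m.empty ? "" : m[1];
    }
    return result;
}

void main()
{
    assert(extensions(["a.txt", "b", "c.tar.gz"]) == ["txt", "", "gz"]);
}
```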

Engines can make it easier to parallelize computation.

And lastly, it's easy to make an eager version by wrapping a lazy one. But to make a lazy version from an eager one means reimplementing it.
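A sketch of the easy direction, using a hypothetical lazySetExt as the lazy building block (simplified, and not the code from the pull request). The eager version is a one-line wrapper that drains the range into a sink:

```d
import std.algorithm : copy;
import std.array : appender;
import std.range : chain;
import std.string : lastIndexOf;
import std.utf : byCodeUnit;

// Hypothetical lazy version: returns a range of code units.
auto lazySetExt(string path, string ext)
{
    immutable dot = path.lastIndexOf('.');
    auto stem = dot < 0 ? path : path[0 .. dot];
    return chain(stem.byCodeUnit, ext.byCodeUnit);
}

// Eager version: wrap the lazy one and collect its output.
string eagerSetExt(string path, string ext)
{
    return lazySetExt(path, ext).copy(appender!string()).data;
}

void main()
{
    assert(eagerSetExt("file.txt", ".ext") == "file.ext");
    assert(eagerSetExt("file", ".ext") == "file.ext");
}
```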

May 08, 2014

Re: Ranges of char and wchar

Posted by Walter Bright
in reply to H. S. Teoh

On 5/8/2014 11:19 AM, H. S. Teoh via Digitalmars-d wrote:
> In case 1, the user has to manually create various intermediate buffers
> to store intermediate results. I used a trivial example here, but in
> application code, the processing you need is usually far more complex.
> This means creating lots of intermediate buffers, making sure you link
> the right ones together, etc. The code becomes very verbose, and
> becomes a maintenance nightmare (which of the tmp1, tmp2, tmp3 buffers
> refer to which fragment of the result again? Oh oops, I think I passed
> the wrong output range to setExtension).
>
> In case 2, the user decides when a buffer is needed and when it's not.
> The function calls chain very nicely. The code is more readable, and
> easy to maintain (and needless allocations -- including temporary static
> buffers -- are eliminated).

I think you nailed it.

Being able to eliminate temporary buffers is a big win. The fastest way to manage allocated memory is to not need allocated memory.

May 08, 2014

Re: Ranges of char and wchar

Posted by Nick Sabalausky
in reply to H. S. Teoh

On 5/8/2014 2:19 PM, H. S. Teoh via Digitalmars-d wrote:
>
> This way, you maximize the usability of the function -- it can
> participate in UFCS chains, compose with other std.algorithm functions,
> etc.
>

I think that actually raises an issue worth considering. I posit that the results of an InputRange setExtension don't make any sense with most std.algorithm functions:

What's an InputRange? It's a set of elements. So what are the "elements" of a setExtension? Uhh...

The result of setExtension really just has exactly one "logical" element: the resulting filesystem path. But that's not necessarily the *actual* result. The actual result is the "elements" of "however the hell setExtension's internal algorithm happened to feel like plopping them out". What are you going to do, zip() them, map() them, sort() them? All completely meaningless.

The "elements" of setExtension's InputRange are NOT defined. At least not in any well-encapsulated way. And really, they SHOULDN'T be defined.

So we have setExtension, and anything that follows its model, returning what's more or less an undefined result that's basically useless for pretty much anything besides plopping straight into array() or copy().

If the big benefit of going InputRange instead of OutputRange for such functions is just simply call chaining, then maybe we're solving the wrong problem and should instead be looking at ways to improve the usage of OutputRange-based functions.

May 08, 2014

Re: Ranges of char and wchar

Posted by Nick Sabalausky
in reply to Walter Bright

On 5/8/2014 2:46 PM, Walter Bright wrote:
>
> But to make a lazy version from an eager one means reimplementing it.
>

Or yield()-ing inside the eager one's sink.

And note also there is such a thing as a stackless "fiber", so I'm not certain a full-fledged context-switching fiber would necessarily be required (although it might - stackless fibers do have limitations).
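A rough sketch of the yield-inside-the-sink idea, assuming std.concurrency.Generator, which was added to Phobos after this thread and is built on a full stackful Fiber rather than a stackless one. Both eagerSetExt and lazyFromEager are hypothetical names:

```d
import std.algorithm : equal;
import std.concurrency : Generator, yield;
import std.string : lastIndexOf;

// Hypothetical eager version: pushes every code unit into a sink delegate.
void eagerSetExt(string path, string ext, scope void delegate(char) sink)
{
    immutable dot = path.lastIndexOf('.');
    auto stem = dot < 0 ? path : path[0 .. dot];
    foreach (char c; stem) sink(c);
    foreach (char c; ext) sink(c);
}

// Lazy wrapper: the sink yields, which suspends the eager code until the
// consumer asks for the next element.
auto lazyFromEager(string path, string ext)
{
    return new Generator!char({
        eagerSetExt(path, ext, (char c) { yield(c); });
    });
}

void main()
{
    assert(lazyFromEager("file.txt", ".ext").equal("file.ext"));
}
```

The eager function is reused unchanged, at the cost of allocating a fiber and a closure, so this does not meet a @nogc goal.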

May 08, 2014

Re: Ranges of char and wchar

Posted by John Colvin
in reply to Andrei Alexandrescu

On Thursday, 8 May 2014 at 17:46:06 UTC, Andrei Alexandrescu wrote:
> A discussion is building around https://github.com/D-Programming-Language/phobos/pull/2149, which is a nice initiative by Walter to allow Phobos users to avoid or control memory allocation.
>
> First instance of the pull request copied the inputs into an output range.
>
> The second instance (right now) creates an input range that lazily creates the result. The element type of that range is the encoding type of the first argument (i.e. char or wchar most of the time). This is different from string/wstring/etc element-wise iteration because it'll be done code unit-wise, not code point-wise.
>
> We need a robust idiom for doing such string manipulation without allocation, for which setExtension is just an example. Going the output range route has nice things going for it because the output range decides the encoding in advance and then accepts via put() calls any encoding, with only the minimum transcoding needed.
>
> However output range means the string operation will be done eagerly, whereas lazy has advantages (nice piping, saving on work etc).
>
> On the other hand, there's the risk of becoming "more catholic than the Pope" by insisting on lazy string processing. Most string operations are eager, and insisting on a general framework for lazy encoded operations on strings may be an exaggeration.
>
> D{iscuss,estroy}!
>
>
> Andrei

A little thing I've been thinking about on a related note:

someOutputRange <| someInputRange;

equivalent to

someInputRange |> someOutputRange;

which is conceptually the same as

someInputRange.copy(someOutputRange);

perhaps implemented by introducing new operators opSink (in the output range) and opSource (in the input range). These would be invoked as a pair, opSource returning an input range which is passed to opSink.

If not implemented, opSource defaults to `auto opSource() { return this; }` or even `alias opSource = this;`. opSink would default to a std.algorithm.copy style thing (can't really be std.algorithm.copy itself because you don't want the dependency).

I'm not sure what ground this really breaks, if any, but it sure looks nice to me.
