March 29, 2014 Re: protocol for using InputRanges | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | 29-Mar-2014 06:47, Andrei Alexandrescu пишет: > On 3/28/14, 11:40 AM, Dmitry Olshansky wrote: >> 28-Mar-2014 22:29, Walter Bright пишет: >>> On 3/28/2014 10:11 AM, Dmitry Olshansky wrote: >>>> WAT? The overhead is in issuing system calls, you'd want to do as >>>> little of them >>>> as possible. Reading byte by byte is an exemplar of idiocy in I/O code. >>> >>> That's why we have things like byLine(). >>> >> >> Which uses C's BUFFERED I/O and it reads from it byte by byte via getc. >> Even though sys calls are amortized by C runtime, we have a function >> call per byte. No wonder it's SLOW. > > byLine doesn't use getc. -- Andrei Indeed not every version. Linux looks almost OK. There is still this stuff in empty: https://github.com/D-Programming-Language/phobos/blob/master/std/stdio.d#L1604 And looking through the code I see that Win64, OSX and FreeBSD versions still use getc. Win32 hacks right into DMC run-time, and on linux getdelim is used. The point about ranges only ever making sense over buffered I/O still stands. Preferably not C runtime. -- Dmitry Olshansky |
March 29, 2014 Re: protocol for using InputRanges | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | 29-Mar-2014 02:05, Walter Bright пишет: > On 3/28/2014 11:40 AM, Dmitry Olshansky wrote: >> Which uses C's BUFFERED I/O and it reads from it byte by byte via >> getc. Even >> though sys calls are amortized by C runtime, we have a function call >> per byte. >> No wonder it's SLOW. > > How about a PR to fix it? Sorry, I'm not inclined to do any work on hacking arbitrary C runtime libraries. Too much work on reverse engineering and making sure it stays in sync. We need new I/O package and hopefully Steven has something brewing. -- Dmitry Olshansky |
March 29, 2014 Re: protocol for using InputRanges | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrei Alexandrescu | On Saturday, 29 March 2014 at 00:14:40 UTC, Andrei Alexandrescu wrote:
> It increasingly seems to me we need to do this. -- Andrei
Question: Should this new restriction only be applied to *Input*Ranges? If so, then it might make more sense.
Ranges that also satisfy the *Forward* interface (which has save), or specifically, the *Output* interface (that doesn't even have empty) should not have this obligation. In particular, output ranges *write*, so usually can't even buffer data.
*Input*-only Ranges are the ones that are usually implemented over streams, and that are most subject to the "test empty" issue too, so they are the ones that need this restriction.
Could this maybe be a good solution for everyone?
|
March 29, 2014 Re: protocol for using InputRanges | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On Sunday, 23 March 2014 at 00:50:34 UTC, Walter Bright wrote:
> It's become clear to me that we've underspecified what an InputRange is.
As someone who has missed this discussion, can I just say this: can we please document the result somewhere, and possibly even announce it clearly so that people know that something has changed?
|
March 29, 2014 Re: protocol for using InputRanges | ||||
---|---|---|---|---|
| ||||
Posted in reply to Steven Schveighoffer | On Saturday, 29 March 2014 at 01:36:46 UTC, Steven Schveighoffer wrote:
> On Fri, 28 Mar 2014 07:47:22 -0400, Marc Schütz <schuetzm@gmx.net> wrote:
>> On Thursday, 27 March 2014 at 16:12:36 UTC, monarch_dodra wrote:
>>> If you initialized the next element in both constructor and popFront, then you'd get rid of both these checks.
>>>
>>
>> ... but of course lose laziness.
>
> In this case, laziness is not critical. Decoding the element is an O(1) operation, and when looping through, you will decode it anyway.
>
> When processing a 20 element string, the cost of checking to see if you have decoded on every empty or front call may override the front-loaded cost of decoding the first element on construction. It's sure to add to the cost if you are processing all 20 elements, since you decode them all anyway.
>
> On other ranges, it's more important when the first element costs a lot to fetch. HOWEVER, it's not critically important to delay that unless you are not going to process that element. For example, if you are foreach'ing over all the elements, the delay doesn't matter.
>
> I'd rather the higher level code decide whether to delay or not, depending on the situation. Requiring a protocol change for such detailed knowledge seems unbalanced.
I was more thinking of the fact that you need to read something on construction, rather than on consumption, and this reading might be noticeable. There was the example of `stdin.byLine().filter(...)` (or something similar, don't remember exactly), which reads from stdin on construction. This changes the behaviour of the program, because the read operation will (probably) block.
I'd suggest to make it a requirement for ranges and algorithms _not_ to start consuming the underlying data until one of empty/front/popFront is called, even if that has a negative effect on performance. That's why I was asking for performance numbers, to see whether there even is an effect. If there isn't, that's just another argument for adding that requirement.
This is then, IMO, a very acceptable additional burden to place on the writers of ranges. I agree, however, that it's not a good idea to change the range protocol, i.e. what _users_ of ranges have to abide by. That would be a breaking change, and it would be an especially bad one because there I see no way to detect that a user failed to call `empty` in an iteration if they knew that there are more elements available.
|
March 30, 2014 Re: protocol for using InputRanges | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | Am Fri, 28 Mar 2014 19:23:29 -0700 schrieb Walter Bright <newshound2@digitalmars.com>: > On 3/28/2014 3:42 AM, Steven Schveighoffer wrote: > >> I'm also curious what a generic read() should do when the stream is empty. > > > > Returns 0 elements read. I guess we all have a clear concept of streams in our mind from all kinds of programming languages. They typically operate on bytes, have an EOF flag and offer read/write methods that you pass a byte pointer and a length into. The result is the count of bytes read or written. Optionally they have an "available" property and handle any combination of the following: o all basic data types of the programming language o POD structs o formatted strings o bitwise operations o seeking after calculating offsets They are used for I/O on heterogeneous data like binary file formats or network protocols. Ranges on the other hand work on sequences of items of the same type, which is a small subset of what streams are supposed to support. While there should be a connection between both worlds, one cannot replace the other. There will always be a separate raw stream type in Phobos. > Meaning it must also write through a pointer if it did read something. > > How is that faster than a 1 element buffer? You can write your stream in such a way that you "map" a memory area. E.g. you call "stream.waitForSoManyBytes(123);" and then "ubyte[] arr = stream.mapMemory(123);" where arr is then a slice into the stream's internal buffer. (This also requires a mechanism to release the buffer after use, so the stream can reuse it: stream.doneWithSoManyBytes(123);) -- Marco |
March 30, 2014 Re: protocol for using InputRanges | ||||
---|---|---|---|---|
| ||||
Posted in reply to Luís Marques | ""Luís Marques " wrote in message news:exmfmqftpykoaxdgluit@forum.dlang.org... > - Is it allowed for an InputRange to become empty and then not empty again? For a finite RandomAccessRange to increase in length? E.g.: Not by itself - if a range's empty has returned true, calling empty repeatedly should continue to return true. If you change it externally (ie not by using the range interface) then it can become non-empty. eg int[] arr; assert(arr.empty); arr ~= 3; // no longer empty, but we used a method outside the range interface to change it. I don't think a random number generator or some hardware device producing more data would be a good reason to change empty - range.empty is not asking 'is there more data ready', it's asking 'are there items left to iterate over'. > - Is it allowed to not call front? E.g.: Yes. |
March 30, 2014 Re: protocol for using InputRanges | ||||
---|---|---|---|---|
| ||||
Posted in reply to monarch_dodra | On Saturday, 29 March 2014 at 09:06:45 UTC, monarch_dodra wrote:
> On Saturday, 29 March 2014 at 00:14:40 UTC, Andrei Alexandrescu wrote:
>> It increasingly seems to me we need to do this. -- Andrei
>
> Question: Should this new restriction only be applied to *Input*Ranges? If so, then it might make more sense.
>
> Ranges that also satisfy the *Forward* interface (which has save), or specifically, the *Output* interface (that doesn't even have empty) should not have this obligation. In particular, output ranges *write*, so usually can't even buffer data.
>
> *Input*-only Ranges are the ones that are usually implemented over streams, and that are most subject to the "test empty" issue too, so they are the ones that need this restriction.
>
> Could this maybe be a good solution for everyone?
Did this conversation die the instant I actually had something smart to say?
@andralex, @WalterBright ? Thoughts?
|
March 30, 2014 Re: protocol for using InputRanges | ||||
---|---|---|---|---|
| ||||
Posted in reply to Tobias Müller | On Friday, 28 March 2014 at 23:14:56 UTC, Tobias Müller wrote:
> On Thursday, 27 March 2014 at 20:49:16 UTC, Walter Bright wrote:
>> On 3/27/2014 12:21 PM, Rainer Schuetze wrote:
>>> This loop is intuitive. Not being allowed to call empty or front multiple times
>>> or not at all is unintuitive. They should not be named as if they are properties
>>> then.
>>
>> I can concede that. But I can't concede being able to call front without first calling empty, or calling popFront without calling empty and front, or requiring 'pump priming' in the constructor.
>
> Disclaimer: I'm a C++ programmer just lurking here, I've never
> actually used D.
>
> I find it very counter-intuitive that 'empty' is required before
> front or popFront.
> Since 'pump priming' in the constructor isn't wanted either, i'd
> suggest the following protocol:
>
> while (popFront())
> {
> front;
> }
>
> popFront is then required to return !empty.
> 'empty' as a separate property getter can stay but is not
> required for the protocol.
>
> This way, it's clear that the work to fetch the next element is
> always done in popFront.
>
> Generally I find dependecies between functions problematic that
> require a specific call sequence. If they can be removed, they
> should.
>
> With my proposed solution, there's still one minor dependency,
> namely that front is not valid before the first popFront. This
> could be solved by again combining the two, as proposed by
> someone else in this thread.
>
> Tobi
First sorry for my english.
I agree.
All work is done in "bool popFront()" (well, this name may no longer be the most appropriate).
In this function the first / next item is obtained and caches. Returns true if any element, false otherwise.
"front" return de cached element as now, as many times as desired and without side effects.
"empty" function is no longer necessary, but it might be useful to keep changing the return type for example to int:
1 - Sure there is more elements
0 - Sure there is NO more elements
-1 - I don't known. Try popFront to know if more element
Of course this function as "front", without side effects
|
March 31, 2014 Re: protocol for using InputRanges | ||||
---|---|---|---|---|
| ||||
Posted in reply to Marc Schütz | On Sat, 29 Mar 2014 17:02:30 -0400, Marc Schütz <schuetzm@gmx.net> wrote:
> On Saturday, 29 March 2014 at 01:36:46 UTC, Steven Schveighoffer wrote:
>> On Fri, 28 Mar 2014 07:47:22 -0400, Marc Schütz <schuetzm@gmx.net> wrote:
>>> On Thursday, 27 March 2014 at 16:12:36 UTC, monarch_dodra wrote:
>>>> If you initialized the next element in both constructor and popFront, then you'd get rid of both these checks.
>>>>
>>>
>>> ... but of course lose laziness.
>>
>> In this case, laziness is not critical. Decoding the element is an O(1) operation, and when looping through, you will decode it anyway.
>>
>> When processing a 20 element string, the cost of checking to see if you have decoded on every empty or front call may override the front-loaded cost of decoding the first element on construction. It's sure to add to the cost if you are processing all 20 elements, since you decode them all anyway.
>>
>> On other ranges, it's more important when the first element costs a lot to fetch. HOWEVER, it's not critically important to delay that unless you are not going to process that element. For example, if you are foreach'ing over all the elements, the delay doesn't matter.
>>
>> I'd rather the higher level code decide whether to delay or not, depending on the situation. Requiring a protocol change for such detailed knowledge seems unbalanced.
>
> I was more thinking of the fact that you need to read something on construction, rather than on consumption, and this reading might be noticeable. There was the example of `stdin.byLine().filter(...)` (or something similar, don't remember exactly), which reads from stdin on construction. This changes the behaviour of the program, because the read operation will (probably) block.
Blocking operations may have a noticeable impact, but they may not. It depends on what you do between construction and processing.
For example, if you have:
foreach(l; stdin.byLine)
Lazily fetching the first line makes no operational difference whatsoever, even if it blocks, because you're immediately going to process it.
But if it *is* lazily fetched, you are paying some possibly small, but nonzero, cost for that laziness that you didn't need to pay.
This is why I said it's not critically important to delay unless you are not going to process the first element. Because there is no primitive to "prime" the range, we must do it on a property fetch, or on construction. But after that, popFront is a perfectly reasonable place to get the next element.
All this fuss is over the *first* element in the range. I think providing a mechanism to decide whether you want it now or later is reasonable. For example a lazy range wrapper.
Note, however, the case Andrei was arguing about was one of the string decoders. When you are blocking for input, hell, even if you aren't blocking, but need to call system calls to get it, the performance cost of checking a boolean every call is negligible. But when you are decoding 1-4 bytes of data in memory, checking a boolean becomes significant.
-Steve
|
Copyright © 1999-2021 by the D Language Foundation