July 26, 2016

On 07/26/2016 10:18 AM, Steven Schveighoffer via Digitalmars-d-learn wrote:
> On 7/26/16 12:58 PM, Charles Hixson via Digitalmars-d-learn wrote:
>
>> Ranges aren't free, are they? If so then I should probably use stdfile,
>> because that is probably less likely to change than core.stdc.stdio.
>
> Do you mean slices?
>
>> When I see "f.rawRead((&item)[0 .. 1])" it looks to me as if unneeded code
>> is being generated explicitly to be thrown away.  (I don't like using
>> pointer/length either, but it's actually easier to understand than this
>> kind of thing, and this LOOKS like it's generating extra code.)
>
> This is probably a misunderstanding on your part.
>
> &item is accessing the item as a pointer. Since the compiler already has it as a reference, this is a noop -- just an expression to change the type.
>
> [0 .. 1] is constructing a slice out of a pointer. It's done all inline by the compiler (there is no special _d_constructSlice function), so that is very very quick. There is no bounds checking, because pointers do not have bounds checks.
>
> So there is pretty much zero overhead for this. Just push the pointer and length onto the stack (or registers, not sure of ABI), and call rawRead.
>
>> That said, perhaps I should use stdio anyway.  When doing I/O it's the
>> disk speed that's the really slow part, and that so dominates things
>> that worrying about trivialities is foolish.  And since it's going to be
>> wrapped anyway, the ugly will be confined to a very small routine.
>
> Having written a very templated io library (https://github.com/schveiguy/iopipe), I can tell you that in my experience, the slowdown comes from 2 things: 1) spending time calling the kernel, and 2) not being able to inline.
>
> This of course assumes that proper buffering is done. Buffering should mitigate most of the slowdown from the disk. It is expensive, but you amortize the expense by buffering.
>
> C's i/o is pretty much as good as it gets for an opaque non-inlinable system, as long as your requirements are simple enough. The std.stdio code should basically inline into the calls you should be making, and it handles a bunch of stuff that optimizes the calls (such as locking the file handle for one complex operation).
>
> -Steve
Thanks.  Since there isn't any excess overhead I guess I'll use stdio.  Buffering, however, isn't going to help at all since I'm doing random I/O.  I know that most of the data the system reads from disk is going to end up getting thrown away, since my records will generally be smaller than 8K, but there's no help for that.

July 26, 2016
On 7/26/16 1:57 PM, Charles Hixson via Digitalmars-d-learn wrote:

> Thanks.  Since there isn't any excess overhead I guess I'll use stdio.
> Buffering, however, isn't going to help at all since I'm doing
> random I/O.  I know that most of the data the system reads from disk is
> going to end up getting thrown away, since my records will generally be
> smaller than 8K, but there's no help for that.
>

Even for doing random I/O buffering is helpful. It depends on the size of your items.

Essentially, to read 10 bytes from a file probably costs the same as reading 100,000 bytes from a file. So may as well buffer that in case you need it.

Now, C i/o's buffering may not suit your exact needs. So I don't know how it will perform. You may want to consider mmap which tells the kernel to link pages of memory directly to disk access. Then the kernel is doing all the buffering for you. Phobos has support for it, but it's pretty minimal from what I can see: http://dlang.org/phobos/std_mmfile.html

-Steve
July 26, 2016
On 07/26/2016 11:31 AM, Steven Schveighoffer via Digitalmars-d-learn wrote:
> On 7/26/16 1:57 PM, Charles Hixson via Digitalmars-d-learn wrote:
>
>> Thanks.  Since there isn't any excess overhead I guess I'll use stdio.
>> Buffering, however, isn't going to help at all since I'm doing
>> random I/O.  I know that most of the data the system reads from disk is
>> going to end up getting thrown away, since my records will generally be
>> smaller than 8K, but there's no help for that.
>>
>
> Even for doing random I/O buffering is helpful. It depends on the size of your items.
>
> Essentially, to read 10 bytes from a file probably costs the same as reading 100,000 bytes from a file. So may as well buffer that in case you need it.
>
> Now, C i/o's buffering may not suit your exact needs. So I don't know how it will perform. You may want to consider mmap which tells the kernel to link pages of memory directly to disk access. Then the kernel is doing all the buffering for you. Phobos has support for it, but it's pretty minimal from what I can see: http://dlang.org/phobos/std_mmfile.html
>
> -Steve
I've considered mmapfile often, but when I read the documentation I end up realizing that I don't understand it.  So I look up memory mapped files in other places, and I still don't understand it.  It looks as if the entire file is stored in memory, which is not at all what I want, but I also can't really believe that's what's going on.  I know that there was an early form of this in a version of BASIC (the version that RISS was written in, but I don't remember which version that was) and in *that* version array elements were read in as needed.  (It wasn't spectacularly efficient.)  But memory mapped files don't seem to work that way, because people keep talking about how efficient they are.  Do you know a good introductory tutorial?  I'm guessing that "window size" might refer to the number of bytes available, but what if you need to append to the file?  Etc.

A part of the problem is that I don't want this to be a process with an arbitrarily high memory use.  Buffering would be fine, if I could use it, but for my purposes sequential access is likely to be rare, and the working layout of the data in RAM doesn't (can't reasonably) match the layout on disk.  IIUC (this is a few decades old) the system buffer size is about 8K.  I expect to never need to read that large a chunk, but I'm going to try to keep the chunks in multiples of 1024 bytes, and, if it's reasonable, to exactly 1024 bytes.  So I should never need two reads or writes for a chunk.  I guess to be sure of this I'd better make sure the file header is also 1024 bytes.  (I'm guessing that the seek to position results in the appropriate buffer being read into the system buffer, so if my header were 512 bytes I might occasionally need to do double reads or writes.)

I'm guessing that memory mapped files trade off memory use against speed of access, and for my purposes that's probably a bad trade, even though databases are doing that more and more.  I'm likely to need all the memory I can lay my hands on, and even then thrashing wouldn't surprise me.  So a fixed buffer size seems a huge advantage.
July 26, 2016
On Tuesday, 26 July 2016 at 19:30:35 UTC, Charles Hixson wrote:
> It looks as if the entire file is stored in memory, which is not at all what I want, but I also can't really believe that's what's going on.


It is just mapped to virtual memory without actually being loaded into physical memory, so when you access the array it returns, the kernel loads a page of the file into memory, but it doesn't do that until it actually has to.

Think of it as being like this:

struct MagicFile {
    FILE* fp; // assume an already-opened file handle
    ubyte[] opIndex(size_t idx) {
        auto buffer = new ubyte[](some_block_length);
        fseek(fp, idx, SEEK_SET);
        fread(buffer.ptr, buffer.length, 1, fp);
        return buffer;
    }
}


And something analogous for writing, but instead of being done with overloaded operators in D, it is done with the MMU hardware by the kernel (and the kernel also does smarter buffering than this little example).


> A part of the problem is that I don't want this to be a process with an arbitrarily high memory use.

The kernel will automatically handle physical memory usage too, similarly to a page file. If you haven't read a portion of the file recently, it will discard that page, since it can always read it again off disk if needed, but if you do have memory to spare, it will keep the data in memory for faster access later.


So basically the operating system handles a lot of the details which makes it efficient.


Growing a memory mapped file is a bit tricky though, you need to unmap and remap. Since it is an OS concept, you can always look for C or C++ examples too, like here: http://stackoverflow.com/questions/4460507/appending-to-a-memory-mapped-file/4461462#4461462
July 26, 2016
On 7/26/16 3:30 PM, Charles Hixson via Digitalmars-d-learn wrote:
> On 07/26/2016 11:31 AM, Steven Schveighoffer via Digitalmars-d-learn wrote:

>> Now, C i/o's buffering may not suit your exact needs. So I don't know
>> how it will perform. You may want to consider mmap which tells the
>> kernel to link pages of memory directly to disk access. Then the
>> kernel is doing all the buffering for you. Phobos has support for it,
>> but it's pretty minimal from what I can see:
>> http://dlang.org/phobos/std_mmfile.html
>>
> I've considered mmapfile often, but when I read the documentation I end
> up realizing that I don't understand it.  So I look up memory mapped
> files in other places, and I still don't understand it.  It looks as if
> the entire file is stored in memory, which is not at all what I want,
> but I also can't really believe that's what's going on.

Of course that isn't what is happening :)

What happens is that the kernel says memory page 0x12345 (or whatever) is mapped to the file. Then when you access a mapped page, the system memory management unit gets a page fault (because that memory isn't loaded), which triggers the kernel to load that page of memory. Kernel sees that the memory is really mapped to that file, and loads the page from the file instead. As you write to the memory location, the page is marked dirty, and at some point, the kernel flushes that page back to disk.

Everything is done behind the scenes and is in tune with the filesystem itself, so you get a little extra benefit from that.

> I know that
> there was an early form of this in a version of BASIC (the version that
> RISS was written in, but I don't remember which version that was) and in
> *that* version array elements were read in as needed.  (It wasn't
> spectacularly efficient.)  But memory mapped files don't seem to work
> that way, because people keep talking about how efficient they are.  Do
> you know a good introductory tutorial?  I'm guessing that "window size"
> might refer to the number of bytes available, but what if you need to
> append to the file?  Etc.

To be honest, I'm not super familiar with actually using them, I just have a rough idea of how they work. The actual usage you will have to look up.

> A part of the problem is that I don't want this to be a process with an
> arbitrarily high memory use.

You should know that you can allocate as much memory as you want, as long as you have address space for it, and you won't actually map that to physical memory until you use it. So the management of the memory is done lazily, all supported by the MMU hardware. This is true for actual memory too!

Note that the only "memory" you are using for the mmaped file are page buffers in the kernel which are likely already being used to buffer the disk reads. It's not like it's loading the entire file into memory, and probably doesn't even load all sequential pages into memory. It only loads the ones you use.

I'm pretty much at my limit for knowledge of this subject (and maybe I have a few things incorrect), I'm sure others here know much more. I suggest you play a bit with it to see what the performance is like. I have also heard that it's very fast.

-Steve
July 27, 2016
On Tuesday, 26 July 2016 at 16:35:26 UTC, Charles Hixson wrote:
> That's sort of what I have in mind, but I want to do what in Fortran would be (would have been?) called record I/O, except that I want a file header that specifies a few things like magic number, records allocated, head of free list, etc.  In practice I don't see any need for record size not known at compile time...except that if there are different
> versions of the program, they might include different things, so, e.g., the size of the file header might need to be variable.

It looks like you want a serialization library. There are some: http://wiki.dlang.org/Serialization_Libraries
July 26, 2016
On 07/26/2016 12:53 PM, Adam D. Ruppe via Digitalmars-d-learn wrote:
> On Tuesday, 26 July 2016 at 19:30:35 UTC, Charles Hixson wrote:
>> It looks as if the entire file is stored in memory, which is not at all what I want, but I also can't really believe that's what's going on.
>
>
> It is just mapped to virtual memory without actually being loaded into physical memory, so when you access the array it returns, the kernel loads a page of the file into memory, but it doesn't do that until it actually has to.
>
> Think of it as being like this:
>
> struct MagicFile {
>     FILE* fp; // assume an already-opened file handle
>     ubyte[] opIndex(size_t idx) {
>         auto buffer = new ubyte[](some_block_length);
>         fseek(fp, idx, SEEK_SET);
>         fread(buffer.ptr, buffer.length, 1, fp);
>         return buffer;
>     }
> }
>
>
> And something analogous for writing, but instead of being done with overloaded operators in D, it is done with the MMU hardware by the kernel (and the kernel also does smarter buffering than this little example).
>
>
>> A part of the problem is that I don't want this to be a process with an arbitrarily high memory use.
>
> The kernel will automatically handle physical memory usage too, similarly to a page file. If you haven't read a portion of the file recently, it will discard that page, since it can always read it again off disk if needed, but if you do have memory to spare, it will keep the data in memory for faster access later.
>
>
> So basically the operating system handles a lot of the details which makes it efficient.
>
>
> Growing a memory mapped file is a bit tricky though, you need to unmap and remap. Since it is an OS concept, you can always look for C or C++ examples too, like herE: http://stackoverflow.com/questions/4460507/appending-to-a-memory-mapped-file/4461462#4461462
O, dear.  It was sounding like such an excellent approach until this
last paragraph, but growing the file is going to be one of the common
operations.  (Certainly at first.)  It sounds as if that means the file
needs to be closed and re-opened for extensions.  And I quote from
https://www.gnu.org/software/libc/manual/html_node/Memory_002dmapped-I_002fO.html:
    Function: void * mremap (void *address, size_t length, size_t
    new_length, int flag)

    Preliminary: | MT-Safe | AS-Safe | AC-Safe | See POSIX Safety
    Concepts
    <https://www.gnu.org/software/libc/manual/html_node/POSIX-Safety-Concepts.html#POSIX-Safety-Concepts>.

    This function can be used to change the size of an existing memory
    area. address and length must cover a region entirely mapped in the
    same mmap statement. A new mapping with the same characteristics
    will be returned with the length new_length.

    ...

    This function is only available on a few systems. Except for
    performing optional optimizations one should not rely on this
    function.
So I'm probably better off sticking to using a seek based i/o system.


July 27, 2016
On Wednesday, 27 July 2016 at 02:20:57 UTC, Charles Hixson wrote:
> O, dear.  It was sounding like such an excellent approach until this
> last paragraph, but growing the file is going to be one of the common
> operations.  (Certainly at first.) (...)
> So I'm probably better off sticking to using a seek based i/o system.

Not necessarily. The usual approach is to over-allocate your file so you don't need to grow it that often. This is the exact same strategy used by D's dynamic arrays and grow-able array-backed lists in other languages - the difference between list length and capacity.

There is no built-in support for this in std.mmfile afaik. But it's not hard to do yourself.
July 27, 2016
On 07/27/2016 06:46 AM, Rene Zwanenburg via Digitalmars-d-learn wrote:
> On Wednesday, 27 July 2016 at 02:20:57 UTC, Charles Hixson wrote:
>> O, dear.  It was sounding like such an excellent approach until this
>> last paragraph, but growing the file is going to be one of the common
>> operations.  (Certainly at first.) (...)
>> So I'm probably better off sticking to using a seek based i/o system.
>
> Not necessarily. The usual approach is to over-allocate your file so you don't need to grow it that often. This is the exact same strategy used by D's dynamic arrays and grow-able array-backed lists in other languages - the difference between list length and capacity.
>
> There is no built-in support for this in std.mmfile afaik. But it's not hard to do yourself.
>
Well, that would mean I wouldn't need to reopen the file so often, but it sure wouldn't eliminate the need to re-open it.  And it would add considerable complexity.  Possibly that would be an optimal approach once the data was mainly collected, but I won't want to re-write this bit at that point.