Reading a structured binary file? (page 2) - D Programming Language Discussion Forum

August 03, 2013

Re: Reading a structured binary file?

Posted by Gary Willoughby
in reply to Gary Willoughby

On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby wrote:
> This sounds like a great idea, but once the file has been opened as an MmFile, how do I convert it to a ubyte[] so that the std.bitmanip functions work with it?

I'm currently doing this:

	auto file = new MmFile("file.dat");
	ubyte[] buffer = cast(ubyte[])file[];
	buffer.read!uint(); //etc.

Is this the approach you would recommend?
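
For reference, a self-contained version of the snippet above might look like the following. It is only a minimal sketch: the helper name `readTwoUints`, the sample file name and its contents are made up for the demonstration. Note that `read` decodes big-endian by default, so a little-endian format would need `read!(uint, Endian.littleEndian)` instead.

```d
import std.bitmanip : read;
import std.file : remove, tempDir, write;
import std.mmfile : MmFile;
import std.path : buildPath;

/// Reads two consecutive big-endian uints from the start of a mapped file.
uint[2] readTwoUints(string path)
{
    auto file = new MmFile(path);   // read-only mapping
    scope (exit) destroy(file);     // unmap and close deterministically

    // The cast reinterprets the mapped void[]; no bytes are copied here.
    ubyte[] buffer = cast(ubyte[]) file[];

    // Each read!uint consumes 4 bytes from the front of `buffer`.
    uint first = buffer.read!uint();
    uint second = buffer.read!uint();
    return [first, second];
}

void main()
{
    // Hypothetical sample file, created only for this demonstration.
    string path = buildPath(tempDir(), "mmfile_read_demo.dat");
    ubyte[] sample = [0, 0, 0, 1, 1, 2, 3, 4];
    write(path, sample);
    scope (exit) remove(path);

    assert(readTwoUints(path) == [1u, 0x01020304u]);
}
```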

August 03, 2013

Re: Reading a structured binary file?

Posted by John Colvin
in reply to Gary Willoughby

On Saturday, 3 August 2013 at 18:23:58 UTC, Gary Willoughby wrote:
> On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby wrote:
>> This sounds like a great idea, but once the file has been opened as an MmFile, how do I convert it to a ubyte[] so that the std.bitmanip functions work with it?
>
> I'm currently doing this:
>
> 	auto file = new MmFile("file.dat");
> 	ubyte[] buffer = cast(ubyte[])file[];
> 	buffer.read!uint(); //etc.
>
> Is this how you would recommend?

That defeats the object of memory mapping, as the [] at the end of cast(ubyte[])file[] implies copying the whole file into memory.

3 options I can think of:
1) copy read from std.bitmanip and modify it to work nicely with MmFile
2) write a wrapper for MmFile to let it work nicely with read
3) rewrite/modify MmFile

I would love to do 3) at some point, but I'm too busy at the moment.
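
Option 2 could be sketched roughly as below. `MmFileRange` is a hypothetical adapter, not something in Phobos: it presents an MmFile as an input range of ubyte, which is all that `std.bitmanip.read` requires. It goes through `MmFile.opIndex` for every byte, so it favours simplicity over speed.

```d
import std.bitmanip : read;
import std.file : remove, tempDir, write;
import std.mmfile : MmFile;
import std.path : buildPath;
import std.range.primitives : isInputRange;

/// Hypothetical adapter, not part of Phobos: presents an MmFile as an
/// input range of ubyte so that std.bitmanip.read can consume it.
struct MmFileRange
{
    private MmFile file;
    private ulong pos;   // offset of the next unread byte

    this(MmFile file) { this.file = file; }

    @property bool empty() { return pos >= file.length; }
    @property ubyte front() { return file[pos]; }   // uses MmFile.opIndex
    void popFront() { ++pos; }
}

static assert(isInputRange!MmFileRange);

void main()
{
    // Hypothetical sample file, created only for this demonstration.
    string path = buildPath(tempDir(), "mmfile_range_demo.dat");
    ubyte[] sample = [0, 0, 0, 42, 0xAB, 0xCD];
    write(path, sample);
    scope (exit) remove(path);

    auto file = new MmFile(path);
    scope (exit) destroy(file);

    auto r = MmFileRange(file);
    assert(r.read!uint() == 42);         // big-endian by default
    assert(r.read!ushort() == 0xABCD);
    assert(r.empty);
}
```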

August 03, 2013

Re: Reading a structured binary file?

Posted by monarch_dodra
in reply to H. S. Teoh

On Friday, 2 August 2013 at 23:51:27 UTC, H. S. Teoh wrote:
> On Fri, Aug 02, 2013 at 06:38:20PM -0500, captaindet wrote:
> [...]
>> FWIW
>> i have to deal with big data files that can be a few GB. for some data
>> analysis software i wrote in C a while back i did some testing with
>> caching and such. turns out that for Win7-64 the automatic caching
>> done by the OS is really good and any attempt to speed things up
>> actually slowed it down. no kidding, i have seen more than 2GB of data
>> being automatically cached. of course the system RAM must be larger
>> than the file size (if i remember my tests correctly by a factor of
>> ~2, but this is maybe not a linear relationship, i did not actually
>> change the RAM just the size of the data file) and it will hold it in
>> the cache only as long as there are no concurrent applications
>> requiring RAM or caching. i guess my point is, if your target is Win7
>> and your files are >5x smaller than the installed RAM i would not
>> bother at all trying to optimize file access. i suppose -nix machine
>> will do a similar good job these days.
> [...]
>
> IIRC, Linux has been caching files (or disk blocks, rather) in memory
> since the days of Win95. Of course, memory in those days was much
> scarcer, but file sizes were smaller too. :) There's still a cost to
> copy the kernel buffers into userspace, though, which should not be
> disregarded. But if you use mmap, then you're essentially accessing that
> memory cache directly, which is as good as it gets.
>
> I don't know how well mmap works on windows, though, IIRC it doesn't
> have the same semantics as Posix, so you could accidentally run into
> performance issues by using it the wrong way on windows.
>
>
> T

I did some benchmarking a while back with user bioinfornatics. He had to read some pretty large files, preferably in very little time. Our observations showed that my algorithm was *much* faster under Windows than under Linux.

What we observed is that under Windows, as soon as you open a file for reading, Windows starts buffering the file in a parallel thread.

What we did was create two threads. The first did nothing but read the file, store it into chunks of memory, and then pass those chunks to a worker thread. The worker thread did the parsing proper.

Doing this *halved* the Linux runtime, tying it with the "monothreaded" Windows run time. Windows saw no change.
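
The two-thread layout described above could look roughly like this with std.concurrency. This is not the code from the linked thread, just a sketch of the same idea: the chunk size, the `bool` end-of-file marker and the byte-summing "parser" are all placeholders.

```d
import std.concurrency : receive, send, spawn, thisTid, Tid;
import std.file : remove, tempDir, write;
import std.path : buildPath;
import std.stdio : File;

/// Reader thread: only reads chunks from disk and forwards them.
void readerThread(string path, Tid worker)
{
    auto f = File(path, "rb");
    // byChunk reuses its buffer, so each chunk is copied (idup) before sending.
    foreach (ubyte[] chunk; f.byChunk(64 * 1024))
        send(worker, chunk.idup);
    f.close();
    send(worker, true);   // end-of-file marker (a convention of this sketch)
}

/// Worker side: receives chunks and "parses" them; here it just sums bytes.
ulong sumBytesThreaded(string path)
{
    spawn(&readerThread, path, thisTid);

    ulong sum;
    bool done;
    while (!done)
    {
        receive(
            (immutable(ubyte)[] chunk) { foreach (b; chunk) sum += b; },
            (bool eof) { done = true; }
        );
    }
    return sum;
}

void main()
{
    // Hypothetical sample file: 200,000 bytes, each with the value 1.
    string path = buildPath(tempDir(), "two_thread_demo.dat");
    auto sample = new ubyte[](200_000);
    sample[] = 1;
    write(path, sample);
    scope (exit) remove(path);

    assert(sumBytesThreaded(path) == 200_000);
}
```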

FYI, the full thread is here:
forum.dlang.org/thread/gmfqwzgtjfnqiajghmsx@forum.dlang.org

August 06, 2013

Re: Reading a structured binary file?

Posted by Jesse Phillips
in reply to Gary Willoughby

On Saturday, 3 August 2013 at 18:23:58 UTC, Gary Willoughby wrote:
> On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby wrote:
>> This sounds like a great idea, but once the file has been opened as an MmFile, how do I convert it to a ubyte[] so that the std.bitmanip functions work with it?
>
> I'm currently doing this:
>
> 	auto file = new MmFile("file.dat");
> 	ubyte[] buffer = cast(ubyte[])file[];
> 	buffer.read!uint(); //etc.
>
> Is this how you would recommend?

You will need to slice just the amount of data you want; otherwise you're effectively doing std.file.read(). The slice doesn't need to cover a single value (as in the example); it could be a block of data that is then parsed individually for its pieces.

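    auto file = new MmFile("file.dat");
    size_t indexInFile = 0; // current read position in the file
    // Slice only the bytes needed for the next value.
    ubyte[] buffer = cast(ubyte[])file[indexInFile .. indexInFile + uint.sizeof];
    indexInFile += uint.sizeof;
    buffer.read!uint(); //etc.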

The only way I see to advance through the file is to keep an index of where you're currently reading from. This actually works perfectly for the FileRange I mentioned in the previous post. I'm not familiar with how MmFile manages its memory, but hopefully there isn't any buffer reuse; otherwise a stored slice could be overwritten (not an issue for value data, but it would be for string data).
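
To make the index bookkeeping less error-prone, it could be wrapped in a small helper. `MmCursor` and its `next` method below are hypothetical names for this sketch, not part of Phobos; the decoding itself is done by `std.bitmanip.peek`, which, like `read`, assumes big-endian unless told otherwise.

```d
import std.bitmanip : peek;
import std.file : remove, tempDir, write;
import std.mmfile : MmFile;
import std.path : buildPath;
import std.system : Endian;

/// Hypothetical helper: tracks the current offset into a mapped file.
struct MmCursor
{
    MmFile file;
    ulong pos;   // index of the next unread byte

    /// Decodes one T at the current position and advances past it.
    T next(T, Endian endian = Endian.bigEndian)()
    {
        // Slice exactly T.sizeof bytes: [pos, pos + T.sizeof).
        auto bytes = cast(const(ubyte)[]) file[pos .. pos + T.sizeof];
        pos += T.sizeof;
        return bytes.peek!(T, endian)();
    }
}

void main()
{
    // Hypothetical sample file: a big-endian uint, then a little-endian ushort.
    string path = buildPath(tempDir(), "mmfile_cursor_demo.dat");
    ubyte[] sample = [0, 0, 1, 0, 0x34, 0x12];
    write(path, sample);
    scope (exit) remove(path);

    auto file = new MmFile(path);
    scope (exit) destroy(file);

    auto cur = MmCursor(file);
    assert(cur.next!uint() == 256);
    assert(cur.next!(ushort, Endian.littleEndian)() == 0x1234);
    assert(cur.pos == 6);
}
```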

August 10, 2013

Re: Reading a structured binary file?

Posted by Jonathan M Davis
in reply to Gary Willoughby

On Saturday, August 03, 2013 20:23:55 Gary Willoughby wrote:
> On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby wrote:
> > This sounds like a great idea, but once the file has been opened as an MmFile, how do I convert it to a ubyte[] so that the std.bitmanip functions work with it?
> 
> I'm currently doing this:
> 
> 	auto file = new MmFile("file.dat");
> 	ubyte[] buffer = cast(ubyte[])file[];
> 	buffer.read!uint(); //etc.
> 
> Is this how you would recommend?

Yeah. That's how you do it.

- Jonathan M Davis

August 10, 2013

Re: Reading a structured binary file?

Posted by Jonathan M Davis
in reply to John Colvin

On Saturday, August 03, 2013 23:10:12 John Colvin wrote:
> On Saturday, 3 August 2013 at 18:23:58 UTC, Gary Willoughby wrote:
> > On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby
> > 
> > wrote:
> >> This sounds like a great idea, but once the file has been opened as an MmFile, how do I convert it to a ubyte[] so that the std.bitmanip functions work with it?
> > 
> > I'm currently doing this:
> > 	auto file = new MmFile("file.dat");
> > 	ubyte[] buffer = cast(ubyte[])file[];
> > 	buffer.read!uint(); //etc.
> > 
> > Is this how you would recommend?
> 
> That defeats the object of memory mapping, as the [] at the end of cast(ubyte[])file[] implies copying the whole file into memory.

Are you sure about that? Maybe I'm just not familiar enough with mmap, but I don't see anything in MmFile which would result in it copying the whole file into memory. I guess that I'll have to do some more reading up on mmap. Certainly, if slicing it like that copies it all into memory, that's a big problem.

- Jonathan M Davis

August 10, 2013

Re: Reading a structured binary file?

Posted by H. S. Teoh
in reply to monarch_dodra

On Sat, Aug 03, 2013 at 11:29:01PM +0200, monarch_dodra wrote:
> On Friday, 2 August 2013 at 23:51:27 UTC, H. S. Teoh wrote:
> >On Fri, Aug 02, 2013 at 06:38:20PM -0500, captaindet wrote: [...]
> >>FWIW
> >>i have to deal with big data files that can be a few GB. for some
> >>data analysis software i wrote in C a while back i did some testing
> >>with caching and such. turns out that for Win7-64 the automatic
> >>caching done by the OS is really good and any attempt to speed
> >>things up actually slowed it down. no kidding, i have seen more than
> >>2GB of data being automatically cached. of course the system RAM
> >>must be larger than the file size (if i remember my tests correctly
> >>by a factor of ~2, but this is maybe not a linear relationship, i
> >>did not actually change the RAM just the size of the data file) and
> >>it will hold it in the cache only as long as there are no concurrent
> >>applications requiring RAM or caching. i guess my point is, if your
> >>target is Win7 and your files are >5x smaller than the installed RAM
> >>i would not bother at all trying to optimize file access. i suppose
> >>-nix machine will do a similar good job these days.
> >[...]
> >
> >IIRC, Linux has been caching files (or disk blocks, rather) in memory since the days of Win95. Of course, memory in those days was much scarcer, but file sizes were smaller too. :) There's still a cost to copy the kernel buffers into userspace, though, which should not be disregarded. But if you use mmap, then you're essentially accessing that memory cache directly, which is as good as it gets.
> >
> >I don't know how well mmap works on windows, though, IIRC it doesn't have the same semantics as Posix, so you could accidentally run into performance issues by using it the wrong way on windows.
[...]
> I did some benchmarking a while back with user bioinfornatics. He had to read some pretty large files, preferably in very little time. Our observations showed that my algorithm was *much* faster under Windows than under Linux.

Sorry, I lost the context of this discussion, what algo are you referring to?


> What we observed is that under Windows, as soon as you open a file for reading, Windows starts buffering the file in a parallel thread.
> 
> What we did was create two threads. The first did nothing but read the file, store it into chunks of memory, and then pass those chunks to a worker thread. The worker thread did the parsing proper.
> 
> Doing this *halved* the Linux runtime, tying it with the "monothreaded" Windows run time. Windows saw no change.

Interesting. I wonder if you could, under Linux, mmap a file then have one thread access the first byte of each file block while another thread does the real work with the data.
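
Something along these lines, perhaps. This is an untested idea rather than a measured optimization: the 4096-byte page size is an assumption, `sumWithPrefetch` and `touchSink` are made-up names, and whether the helper thread actually stays ahead of the worker depends entirely on the OS and the workload.

```d
import core.thread : Thread;
import std.file : remove, tempDir, write;
import std.mmfile : MmFile;
import std.path : buildPath;

__gshared ubyte touchSink;   // keeps the page-touching loads observable

/// Sums all bytes of a mapped file while a helper thread touches one byte
/// per page, in the hope that the OS faults the pages in early.
ulong sumWithPrefetch(string path, size_t pageSize = 4096)
{
    auto file = new MmFile(path);
    scope (exit) destroy(file);
    auto bytes = cast(const(ubyte)[]) file[];

    // Helper thread: read the first byte of every page-sized block.
    auto toucher = new Thread({
        ubyte acc;
        for (size_t i = 0; i < bytes.length; i += pageSize)
            acc ^= bytes[i];
        touchSink = acc;
    });
    toucher.start();

    // "Real work" on this thread: here, just a byte sum.
    ulong sum;
    foreach (b; bytes)
        sum += b;

    toucher.join();   // the mapping must outlive the helper thread
    return sum;
}

void main()
{
    // Hypothetical sample file: 100,000 bytes, each with the value 3.
    string path = buildPath(tempDir(), "mmfile_prefetch_demo.dat");
    auto sample = new ubyte[](100_000);
    sample[] = 3;
    write(path, sample);
    scope (exit) remove(path);

    assert(sumWithPrefetch(path) == 300_000);
}
```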


> FYI, the full thread is here: forum.dlang.org/thread/gmfqwzgtjfnqiajghmsx@forum.dlang.org

I'll take a look, thanks.


T

-- 
The diminished 7th chord is the most flexible and fear-instilling chord. Use it often, use it unsparingly, to subdue your listeners into submission!

August 10, 2013

Re: Reading a structured binary file?

Posted by H. S. Teoh

On Sat, Aug 03, 2013 at 02:25:23PM -0700, Jonathan M Davis wrote:
> On Saturday, August 03, 2013 23:10:12 John Colvin wrote:
> > On Saturday, 3 August 2013 at 18:23:58 UTC, Gary Willoughby wrote:
> > > On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby
> > > 
> > > wrote:
> > >> This sounds like a great idea, but once the file has been opened as an MmFile, how do I convert it to a ubyte[] so that the std.bitmanip functions work with it?
> > > 
> > > I'm currently doing this:
> > > 	auto file = new MmFile("file.dat");
> > > 	ubyte[] buffer = cast(ubyte[])file[];
> > > 	buffer.read!uint(); //etc.
> > > 
> > > Is this how you would recommend?
> > 
> > > That defeats the object of memory mapping, as the [] at the end of cast(ubyte[])file[] implies copying the whole file into memory.
> 
> Are you sure about that? Maybe I'm just not familiar enough with mmap, but I don't see anything in MmFile which would result in it copying the whole file into memory. I guess that I'll have to do some more reading up on mmap.  Certainly, if slicing it like that copies it all into memory, that's a big problem.
[...]

I think he meant that the OS will have to load the entire file into memory if you sliced the mmap'ed file, not that you'll copy all the data.

I'm not certain this is true, though, because slicing as I understand it only returns the address of the start of the mmap'ed addresses coupled with its length. I don't think the OS will actually load anything into memory until you reference an address within that mmap'ed range. And even then, only those disk blocks that correspond with the referenced addresses will actually be loaded -- this is the point of virtual memory, after all.
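
A quick way to check the "pointer plus length" part is to compare addresses. The sketch below assumes MmFile's default window of 0, where the whole file is mapped as a single region; that is an observation about the current implementation rather than a documented guarantee. The function name and sample file are made up. It says nothing about when the OS actually pages data in, only that slicing and casting do not copy.

```d
import std.file : remove, tempDir, write;
import std.mmfile : MmFile;
import std.path : buildPath;

/// Returns true if slices and casts of the mapping share the same memory.
bool sliceSharesMapping(string path)
{
    auto file = new MmFile(path);
    scope (exit) destroy(file);

    void[] whole = file[];                   // pointer + length into the mapping
    auto bytes = cast(const(ubyte)[]) whole; // reinterpreting cast, no copy
    auto part = cast(const(ubyte)[]) file[4 .. 8];

    // Same start address and length after the cast.
    bool sameStart = cast(const(void)*) bytes.ptr is whole.ptr;
    bool sameLength = bytes.length == whole.length;

    // The sub-slice points 4 bytes further into the same mapping.
    bool sameRegion = part.ptr is bytes.ptr + 4;

    return sameStart && sameLength && sameRegion;
}

void main()
{
    // Hypothetical sample file: the bytes 0 through 15.
    string path = buildPath(tempDir(), "mmfile_slice_demo.dat");
    auto sample = new ubyte[](16);
    foreach (i, ref b; sample)
        b = cast(ubyte) i;
    write(path, sample);
    scope (exit) remove(path);

    assert(sliceSharesMapping(path));
}
```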

T

-- 
The computer is only a tool. Unfortunately, so is the user. -- Armaphine, K5

August 10, 2013

Re: Reading a structured binary file?

Posted by Jonathan M Davis

On Saturday, August 03, 2013 14:31:16 H. S. Teoh wrote:
> On Sat, Aug 03, 2013 at 02:25:23PM -0700, Jonathan M Davis wrote:
> > On Saturday, August 03, 2013 23:10:12 John Colvin wrote:
> > > On Saturday, 3 August 2013 at 18:23:58 UTC, Gary Willoughby wrote:
> > > > On Saturday, 3 August 2013 at 18:14:47 UTC, Gary Willoughby
> > > > 
> > > > wrote:
> > > >> This sounds like a great idea, but once the file has been opened as an MmFile, how do I convert it to a ubyte[] so that the std.bitmanip functions work with it?
> > > > 
> > > > I'm currently doing this:
> > > > 	auto file = new MmFile("file.dat");
> > > > 	ubyte[] buffer = cast(ubyte[])file[];
> > > > 	buffer.read!uint(); //etc.
> > > > 
> > > > Is this how you would recommend?
> > > 
> > > > That defeats the object of memory mapping, as the [] at the end of cast(ubyte[])file[] implies copying the whole file into memory.
> > 
> > Are you sure about that? Maybe I'm just not familiar enough with mmap, but I don't see anything in MmFile which would result in it copying the whole file into memory. I guess that I'll have to do some more reading up on mmap.  Certainly, if slicing it like that copies it all into memory, that's a big problem.
> 
> [...]
> 
> I think he meant that the OS will have to load the entire file into memory if you sliced the mmap'ed file, not that you'll copy all the data.
> 
> I'm not certain this is true, though, because slicing as I understand it only returns the address of the start of the mmap'ed addresses coupled with its length. I don't think the OS will actually load anything into memory until you reference an address within that mmap'ed range. And even then, only those disk blocks that correspond with the referenced addresses will actually be loaded -- this is the point of virtual memory, after all.

That's what I thought that mmap did, but it's not something that I've studied in detail.

Aside from that though, my main complaint about MmFile is the fact that it's a class when it really should be a reference-counted struct. At some point, we should probably create MMFile or somesuch which _is_ a reference counted struct and then deprecate MmFile. But if we do that, then we should be sure of whatever other changes the implementation needs and do those with it.

- Jonathan M Davis

August 10, 2013

Re: Reading a structured binary file?

Posted by H. S. Teoh
in reply to Jesse Phillips

On Tue, Aug 06, 2013 at 06:48:12AM +0200, Jesse Phillips wrote: [...]
> The only way I see to advance through the file is to keep an index of where you're currently reading from. This actually works perfectly for the FileRange I mentioned in the previous post. I'm not familiar with how MmFile manages its memory, but hopefully there isn't any buffer reuse; otherwise a stored slice could be overwritten (not an issue for value data, but it would be for string data).

I don't know about D's MmFile specifically, but AFAIK it maps directly onto the OS's mmap(), which maps a portion of your program's address space to the data on disk. This means that the memory is managed by the OS, and addresses will not change out from under you.

In the underlying physical memory, pages may get swapped out and reused, but this is invisible to your program, since referencing them will cause the OS to swap the pages back in, so you'll never end up with invalid pointers. The worst that could happen is the I/O performance hit associated with swapping. Such is the utility of virtual memory.


T

-- 
Error: Keyboard not attached. Press F1 to continue. -- Yoon Ha Lee, CONLANG
