Thread overview
MmFile

Jul 18, 2005  Kevin Bealer
Jul 18, 2005  Ben Hinkle
Jul 19, 2005  Kevin Bealer
Jul 19, 2005  Ben Hinkle
Jul 19, 2005  Kevin Bealer
Oct 26, 2005  Kevin Bealer
Aug 14, 2005  Ben Hinkle
July 18, 2005
This is on Linux (32-bit AMD CPU), with version 1.126 of dmd.

There are several problems with MmFile.

1. The MmFile class produces output (via printf() on line 227 of mmfile.d) when the file to be opened is not found.

2. Despite this error on open(), there seem to be attempts to stat and then map the file.  The stat fails, but for some reason the mmap seems to work, even though it is passed -1 as the file descriptor, as shown below in the strace output.

Looking at the source code for MmFile, I can't see how this is even possible; nevertheless, the strace run shows it.  It should throw an exception in errNo() in the same control block as the printf(), but for some reason it continues running and dispatching system calls.

3. The file size variable is specified as size_t, and there is no way to specify a starting point.  There should probably be a pair of 64-bit offsets instead, i.e. begin and end, or start and length.  A 32-bit machine can use 64-bit files by mapping slices of them.
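(To make the slice idea concrete, here is a small C sketch -- not part of MmFile, names are mine -- of the offset arithmetic involved: the 64-bit begin offset has to be rounded down to a page boundary before it can be handed to mmap(), and the caller's data then starts delta bytes into the mapping.)

```c
#define _FILE_OFFSET_BITS 64   /* make off_t 64-bit even on a 32-bit target */
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* For a requested [begin, begin+length) range of a large file, compute the
 * page-aligned offset that mmap() requires.  On return, *map_len is the
 * length to map and *delta is how far into the mapping the data starts. */
static off_t aligned_slice(off_t begin, size_t length,
                           size_t *map_len, size_t *delta)
{
    long page = sysconf(_SC_PAGESIZE);
    off_t aligned = begin - (begin % page);
    *delta = (size_t)(begin - aligned);
    *map_len = length + *delta;
    return aligned;
}
```

Only the window itself has to fit in the 32-bit address space; the offset can address the whole file.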


Small section of strace output:

:open("mmaperr.dq", O_RDONLY)            = -1 ENOENT (No such file or directory)
:fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
:mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7d7b000
:write(1, "\topen error, errno = 2\n", 23        open error, errno = 2
:) = 23

Source code:

:import std.stdio;
:import std.mmfile;
:import std.file;
:
:int main(char[][] x)
:{
:    MmFile mmf;
:    try {
:        mmf = new MmFile("mmaperr.dq");
:    }
:    catch(FileException e) {
:        writefln("A");
:    }
:
:    writefln("B");
:    delete mmf;
:    writefln("C");
:    return 0;
:}

Kevin


July 18, 2005
> 3. The file size variable is specified as "size_t", and there is no
> starting
> point specifiable.  There should probably be a pair of 64 bit offsets
> instead,
> ie begin and end, or start and length.  A 32 bit machine can use 64 bit
> files by
> mapping in slices of them.

Agreed. Users should be given the option of not mapping the whole file. I'd
like to see something like
  class MmFile {
    private ulong start, len;
    ...
    ubyte opIndex(ulong i) {
      // should do bounds checking when implemented for real
      if (i not in mapped region) {
        unmap current region
        map region [len*(i/len) len*((i+1)/len))
      }
      return data[i - start];
    }
    void[] opSlice(ulong i, ulong j) {
      if (slice not in mapped region) {
        unmap current region
        map region[i j) if j-i>len otherwise map len block
      }
      ...
    }
    ...
  }
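(In C terms, the remap-on-miss idea above looks roughly like the sketch below.  The names are illustrative, not from any existing API; the window length must be a multiple of the page size so the mmap() offset stays aligned, and a real version would add bounds and error checking.)

```c
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* A sliding read-only window over a file, remapped whenever an access
 * falls outside the currently mapped region.  'len' must be a nonzero
 * multiple of the page size. */
typedef struct {
    int fd;               /* open file descriptor */
    uint64_t start;       /* file offset of the mapped window */
    size_t len;           /* window length */
    unsigned char *data;  /* current mapping, or NULL */
} Window;

/* Return byte i of the file, remapping the window if necessary. */
static unsigned char window_index(Window *w, uint64_t i)
{
    if (w->data == NULL || i < w->start || i >= w->start + w->len) {
        if (w->data != NULL)
            munmap(w->data, w->len);
        w->start = (i / w->len) * w->len;   /* align the window to len */
        w->data = mmap(NULL, w->len, PROT_READ, MAP_PRIVATE,
                       w->fd, (off_t)w->start);
        assert(w->data != MAP_FAILED);
    }
    return w->data[i - w->start];
}
```

Slicing works the same way, with the extra case of a requested slice larger than the window.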


July 19, 2005
In article <dbg5i0$2qfl$1@digitaldaemon.com>, Ben Hinkle says...
>
>> 3. The file size variable is specified as "size_t", and there is no
>> starting
>> point specifiable.  There should probably be a pair of 64 bit offsets
>> instead,
>> ie begin and end, or start and length.  A 32 bit machine can use 64 bit
>> files by
>> mapping in slices of them.
>
>Agreed. Users should be given the option of not mapping the whole file. I'd like to see something like
>  class MmFile {
>    private ulong start, len;
>    ...
>    ubyte opIndex(ulong i) {
>      // should do bounds checking when implemented for real
>      if (i not in mapped region) {
>        unmap current region
>        map region [len*(i/len) len*((i+1)/len))
>      }
>      return data[i - start];
>    }
>    void[] opSlice(ulong i, ulong j) {
>      if (slice not in mapped region) {
>        unmap current region
>        map region[i j) if j-i>len otherwise map len block
>      }
>      ...
>    }
>    ...
>  }

I have written something like this in C++ at work, but bigger and performance focused.  I call it the "atlas" code because it is a collection of maps (atlas, get it?).  It keeps a table of reference-counted mapped areas.  The data sets are often large, in the neighborhood of 13 GB for the largest (it will probably be 25% larger by the end of the year).

The individual files are always less than 1 GB, but working with slices larger than 128 MB proved very cumbersome -- there is always an mmap() call that fails for indistinct reasons, possibly internal fragmentation.  You can have 1.5 GB of data, but if you don't slice it up small, the process tanks eventually.

(The code and all its dependencies are public domain and CVSable, so if anyone wants to reuse it, let me know and I'll drop a pointer.)

An interesting optimization if anyone uses this technique -- if the file contains a lot of uneven regions (as our files always do), map an extra few blocks on the end of each slice.  That way you almost never need to "piece together" parts of two mmapped regions.  Data "on the border" don't need their own mmap() call.  This technique almost *halves* the number of mmap calls at minimal cost and without copying any bytes.
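(As a C sketch of the arithmetic -- sizes and names are illustrative, not from the atlas code: each slice is mapped with some extra overlap on its end, so a lookup only falls back to piecing two mappings together when a request straddles a slice boundary by more than the overlap.)

```c
#include <assert.h>
#include <stdint.h>

/* Slices of SLICE bytes are mapped with OVERLAP extra bytes on the end, so
 * a request that crosses a slice boundary by less than OVERLAP bytes can
 * still be served from the earlier slice with a single mapping. */
enum { SLICE = 1 << 20, OVERLAP = 1 << 14 };   /* example sizes */

/* Returns the index of the slice that can serve [begin, end), or -1 if the
 * request straddles too far and needs two mappings pieced together. */
static long slice_for(uint64_t begin, uint64_t end)
{
    long s = (long)(begin / SLICE);
    uint64_t slice_end = (uint64_t)(s + 1) * SLICE + OVERLAP;
    return (end <= slice_end) ? s : -1;
}
```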

Kevin



July 19, 2005
> I have written something like this in C++ at work, but bigger and
> performance
> focused.  I call it the "atlas" code because it is a collection of maps
> (atlas,
> get it?).  It keeps a table of reference counted mapped areas.  The data
> sets
> are often large, in the neighborhood of 13gb for the largest (it will
> probably
> be 25% larger by end of year).
>
> The individual files are always less than 1 GB, but working with more than
> 128MB
> slices proved very cumbersome -- there is always an mmap() call that fails
> for
> indistinct reasons, possibly internal fragmentation.  You can have 1.5 gb
> of
> data, but if you don't slice it up small, the process tanks eventually.
>
> (The code and all its dependencies are public domain and CVSable, so if
> anyone
> wants to reuse it, let me know and I'll drop a pointer.)

does it have a URL?

> An interesting optimization if anyone uses this technique -- if the file
> contains a lot of uneven regions (as our files always do), map an extra
> few
> blocks on the end of each slice.  That way you almost never need to "piece
> together" parts of two mmapped regions.  Data "on the border" don't need
> their
> own mmap() call.  This technique almost *halves* the number of mmap calls
> at
> minimal cost and without copying any bytes.

Good idea. You know, I don't mean to be nosy but it sounds like you would be
a great person to fix up MmFile. :-) I've never used mmfiles to the extent
you have and who knows how much experience other folks around here have with
large dataset handling.
I know it sucks when you post some bugs or enhancement requests and the
response is "you have the code, fix it up and send it to Walter" but please
consider it if you have the time.


July 19, 2005
In article <dbiqat$29q4$1@digitaldaemon.com>, Ben Hinkle says...
>
>> I have written something like this in C++ at work, but bigger and
>> performance
>> focused.  I call it the "atlas" code because it is a collection of maps
>> (atlas,
>> get it?).  It keeps a table of reference counted mapped areas.  The data
>> sets
>> are often large, in the neighborhood of 13gb for the largest (it will
>> probably
>> be 25% larger by end of year).
>>
>> The individual files are always less than 1 GB, but working with more than
>> 128MB
>> slices proved very cumbersome -- there is always an mmap() call that fails
>> for
>> indistinct reasons, possibly internal fragmentation.  You can have 1.5 gb
>> of
>> data, but if you don't slice it up small, the process tanks eventually.
>>
>> (The code and all its dependencies are public domain and CVSable, so if
>> anyone
>> wants to reuse it, let me know and I'll drop a pointer.)
>
>does it have a URL?

The parts I described are at http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/seqdbatlas_8hpp-source.html and http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/seqdbatlas_8cpp-source.html.

To get the code to actually compile, you would probably need bigger pieces of the toolkit.

Kevin

>> An interesting optimization if anyone uses this technique -- if the file
>> contains a lot of uneven regions (as our files always do), map an extra
>> few
>> blocks on the end of each slice.  That way you almost never need to "piece
>> together" parts of two mmapped regions.  Data "on the border" don't need
>> their
>> own mmap() call.  This technique almost *halves* the number of mmap calls
>> at
>> minimal cost and without copying any bytes.
>
>Good idea. You know, I don't mean to be nosy but it sounds like you would be
>a great person to fix up MmFile. :-) I've never used mmfiles to the extent
>you have and who knows how much experience other folks around here have with
>large dataset handling.
>I know it sucks when you post some bugs or enhancement requests and the
>response is "you have the code, fix it up and send it to Walter" but please
>consider it if you have the time.

I've been looking for a project; I'll have a look at it.

Kevin


August 14, 2005
I debugged the issue, and the fstat and mmap are due to the writef.  For example, try
: void main() {
:   printf("hi\n");
: }
and you'll see the same fstat/mmap before the print.  It only happens before the first printf/writef - subsequent prints don't fstat/mmap.  (Presumably the C library is statting the descriptor and allocating its I/O buffer on first use.)

>2. The second is that despite this error on open(), there seem to be attempts to stat, then map the file.  The stat fails, but for some reason the mmap seems to work, even though it is passed "-1" as the file descriptor, as shown below in the strace output.
>
>Looking at the source code for MmFile, I can't see how this is even possible, nevertheless, the strace run shows it -- it should throw an exception in errNo() in the same control block as the printf() but for some reason continues running and dispatching the system calls.


October 26, 2005
In article <dbjn30$mlo$1@digitaldaemon.com>, Kevin Bealer says...
>
>I've been looking for a project; I'll have a look at it.
>
>Kevin


Sorry - I said I would look at this and I did, but I got distracted by various other aspects of my life - some important, most not.  In any case, I have written code for this for Windows and Linux.

The basic thing I changed was to allow 64 bit files to be mapped given arbitrary start and end offsets.

I could describe my changes, and/or email the code I have to someone.  Or I could merge the differences in and test further, if it is still needed.  I saw something about mmfile fixes, but I haven't checked how complete they are.

(I haven't done anything else in D, or even kept up on reading the forum, for several months, so my compiler isn't even up to date anymore.  I have a couple of half-finished projects that have been stagnating all this time too.)

Kevin