Thread overview
Saving and loading large data sets easily and efficiently
Sep 30
Brett
Oct 01
JN
Oct 03
Brett
September 30
I have some large computations whose data set is around 10GB and which take several minutes to run. Rather than regenerating the same data on every run, can I simply save it to disk?

The data is organized in arrays and structs. It's just numbers/POD, except that some arrays hold pointers to elements in other arrays (so the structs are not duplicated).

Hence any saving routine would have to take this into account and properly reference the referenced structs rather than duplicate them...

That is why it is not as simple as writing out the array data. Ideally it would write binary to save space.

Essentially, pointers would become file offsets, so in some sense it is as if one had a memory map where ptr = 0 is the start of the data structure and ptr = 34 references byte 34. All pointers in the data structure would be relative to the main structure (no heap allocations except for the arrays of struct pointers).
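
To illustrate the pointer-to-offset idea, here is a minimal sketch in D. The struct S and the helper names toIndex and toPointer are hypothetical, chosen only for this example. It assumes every pointer really does point into one contiguous array.

```d
struct S { int id; double value; }   // hypothetical POD element

/// Converts a pointer into `data` to an element index.
size_t toIndex(const(S)[] data, const(S)* p)
{
    assert(p >= data.ptr && p < data.ptr + data.length, "pointer is outside data");
    return cast(size_t)(p - data.ptr);   // pointer difference counts elements, not bytes
}

/// Converts an element index back to a pointer into `data`.
S* toPointer(S[] data, size_t index)
{
    return &data[index];   // bounds-checked indexing
}

unittest
{
    auto data = [S(1, 1.5), S(2, 2.5), S(3, 3.5)];
    S* p = &data[2];
    size_t i = toIndex(data, p);
    assert(i == 2);
    assert(toPointer(data, i) is p);   // round trip gives the same pointer
}
```

The byte offset within a file that starts with the raw array would be i * S.sizeof. The unittest block can be run with, for example, dmd -unittest -main -run on the file.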

So it isn't much more difficult than POD, but it would still be a little more work to write... I'm hoping that there is something already out there that can do this. It should be


The way the data is structured is that I have a master array of non-ptr structs.

E.g.,

S[] Data;
S*[] OtherStuff;

then every pointer points to an element in Data. I did not use ints as "pointers" for a specific non-relevant reason, but I should be able to convert every pointer to an index by simply subtracting the offset. [Technically I do not know if this is occurring but it should]

OtherStuff's elements just reference Data's elements.


I imagine it wouldn't be that difficult to write out the Data. Save Data to a file, then append the rest of the info; all pointers in memory can be converted to offsets on disk by a simple ptr - Data.ptr. It still requires handling some other issues, though, as I do have some associative arrays:

S*[int] MoreStuff;

So I'm looking for a more robust solution that will handle any future expansions.
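
The scheme described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the struct S, the Loaded struct, the save/load names and the file layout (a 64-bit count before each section) are all invented for this example, every pointer is assumed to be non-null and to point into Data, and the file is only readable on a machine with the same endianness and struct layout.

```d
import std.stdio : File;

struct S { int id; double value; }   // hypothetical POD element, no pointers inside

struct Loaded { S[] data; S*[] otherStuff; S*[int] moreStuff; }

void save(string path, S[] data, S*[] otherStuff, S*[int] moreStuff)
{
    auto f = File(path, "wb");

    void writeCount(size_t n) { ulong[1] c; c[0] = n; f.rawWrite(c[]); }

    // Element index of p within data (assumes p points into data).
    ulong indexOf(S* p)
    {
        assert(p >= data.ptr && p < data.ptr + data.length, "not in Data");
        return cast(ulong)(p - data.ptr);
    }

    writeCount(data.length);
    f.rawWrite(data);                       // raw struct bytes

    writeCount(otherStuff.length);
    foreach (p; otherStuff) { ulong[1] v; v[0] = indexOf(p); f.rawWrite(v[]); }

    writeCount(moreStuff.length);
    foreach (key, p; moreStuff)             // each entry as (key, index)
    {
        int[1] k; k[0] = key;
        ulong[1] v; v[0] = indexOf(p);
        f.rawWrite(k[]);
        f.rawWrite(v[]);
    }
}

Loaded load(string path)
{
    auto f = File(path, "rb");
    Loaded r;

    size_t readCount() { ulong[1] c; f.rawRead(c[]); return cast(size_t) c[0]; }

    r.data = new S[readCount()];
    if (r.data.length) f.rawRead(r.data);   // rawRead rejects empty buffers

    r.otherStuff.length = readCount();
    foreach (ref p; r.otherStuff)
    {
        ulong[1] v; f.rawRead(v[]);
        p = &r.data[cast(size_t) v[0]];     // index back to pointer
    }

    foreach (_; 0 .. readCount())
    {
        int[1] k; ulong[1] v;
        f.rawRead(k[]);
        f.rawRead(v[]);
        r.moreStuff[k[0]] = &r.data[cast(size_t) v[0]];
    }
    return r;
}

unittest
{
    import std.file : remove, tempDir;
    import std.path : buildPath;

    auto data = [S(1, 1.5), S(2, 2.5), S(3, 3.5)];
    S*[] other = [&data[2], &data[0]];
    S*[int] more = [42: &data[1]];

    auto path = buildPath(tempDir, "dataset_sketch.bin");
    scope (exit) remove(path);

    save(path, data, other, more);
    auto r = load(path);
    assert(r.data == data);
    assert(r.otherStuff[0] is &r.data[2]);  // points into the reloaded array
    assert(r.moreStuff[42].id == 2);
}
```

The sketch does not check for truncated files, and each new pointer-bearing container needs its own section in save and load, so it does not by itself address future expansions.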

October 01
On Monday, 30 September 2019 at 20:10:21 UTC, Brett wrote:
> So it isn't much more difficult than POD, but it would still be a little more work to write... I'm hoping that there is something already out there that can do this. It should be

I'm afraid there's nothing like this available. Of the serialization libraries that I know, msgpack-d and cerealed don't store references and instead duplicate the pointed-to content. Orange does preserve references, but it doesn't support a binary output format, only XML, so it isn't a good fit for your data.
October 03
On Monday, 30 September 2019 at 20:10:21 UTC, Brett wrote:
[...]
> The way the data is structured is that I have a master array of non-ptr structs.
>
> E.g.,
>
> S[] Data;
> S*[] OtherStuff;
>
> then every pointer points to an element in Data. I did not use ints as "pointers" for a specific non-relevant reason [...]

I would seriously consider turning that around and working with indices primarily, then taking the address of an indexed element whenever you do need a pointer for that specific non-relevant reason. It makes I/O trivial, and it is safer too.

size_t[] OtherStuff;
size_t[int] MoreStuff;

Bastiaan.
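
A minimal sketch of the index-based layout suggested above, to show why the I/O becomes simple. The struct S, the Entry and Loaded structs, the save/load names and the three-files-per-data-set layout are assumptions made for this example only. Since size_t differs between 32-bit and 64-bit builds, the files are only portable between builds of the same width.

```d
import std.file : read, write;

struct S { int id; double value; }        // hypothetical POD element
struct Entry { int key; size_t index; }   // flat form of one MoreStuff entry

struct Loaded { S[] data; size_t[] otherStuff; size_t[int] moreStuff; }

void save(string base, S[] data, size_t[] otherStuff, size_t[int] moreStuff)
{
    Entry[] entries;
    foreach (key, index; moreStuff) entries ~= Entry(key, index);

    // No pointers anywhere, so each array is written out byte for byte.
    write(base ~ ".data", data);
    write(base ~ ".other", otherStuff);
    write(base ~ ".more", entries);
}

Loaded load(string base)
{
    Loaded r;
    r.data = cast(S[]) read(base ~ ".data");
    r.otherStuff = cast(size_t[]) read(base ~ ".other");
    foreach (e; cast(Entry[]) read(base ~ ".more")) r.moreStuff[e.key] = e.index;
    return r;
}

unittest
{
    import std.file : remove, tempDir;
    import std.path : buildPath;

    auto data = [S(1, 1.5), S(2, 2.5), S(3, 3.5)];
    size_t[] other = [2, 0];
    size_t[int] more = [42: 1];

    auto base = buildPath(tempDir, "index_sketch");
    scope (exit) foreach (ext; [".data", ".other", ".more"]) remove(base ~ ext);

    save(base, data, other, more);
    auto r = load(base);
    assert(r.data == data);
    assert(r.otherStuff == other);

    // Take the address only when a pointer is actually needed.
    S* p = &r.data[r.moreStuff[42]];
    assert(p.id == 2);
}
```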
October 03
On Thursday, 3 October 2019 at 14:38:35 UTC, Bastiaan Veelo wrote:
> On Monday, 30 September 2019 at 20:10:21 UTC, Brett wrote:
> [...]
>> The way the data is structured is that I have a master array of non-ptr structs.
>>
>> E.g.,
>>
>> S[] Data;
>> S*[] OtherStuff;
>>
>> then every pointer points to an element in Data. I did not use ints as "pointers" for a specific non-relevant reason [...]
>
> I would seriously consider turning that around and working with indices primarily, then taking the address of an indexed element whenever you do need a pointer for that specific non-relevant reason. It makes I/O trivial, and it is safer too.
>
> size_t[] OtherStuff;
> size_t[int] MoreStuff;
>
> Bastiaan.

No, it doesn't.