January 21, 2022
On Friday, 21 January 2022 at 03:57:01 UTC, H. S. Teoh wrote:
>
> std.array.appender is your friend.
>
> T

:-)

// --

void ProcessRecords
(in int[][int][] recArray, const(string) fname)
{
    auto file = File(fname, "w");
    scope(exit) file.close;

    Appender!string bigString = appender!string;
    bigString.reserve(recArray.length);
    debug { writefln("bigString.capacity is %s", bigString.capacity); }

    void processRecord(const(int) id, const(int)[] values)
    {
        bigString ~= id.to!string ~ values.format!"%(%s,%)" ~ "\n";
    }

    foreach(ref const record; recArray)
    {
        foreach (ref rp; record.byPair)
        {
            processRecord(rp.expand);
        }
    }

    file.write(bigString[]);
}
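
The per-record temporaries built by `~` in the function above can be avoided by formatting straight into the appender. A minimal sketch, assuming the same `int[][int][]` record layout as in the post; `recordsToText` is a hypothetical helper name, not something from the thread:

```d
import std.array : appender;
import std.format : formattedWrite;

// Hypothetical helper (not from the thread): renders every record as
// "id,v1,v2,...\n" and returns the whole text as one string.
string recordsToText(const int[][int][] recArray)
{
    auto buf = appender!string;
    foreach (ref const record; recArray)
    {
        foreach (id, values; record)
        {
            // Formats directly into buf, so no intermediate strings
            // are concatenated for each record.
            buf.formattedWrite!"%s,%(%s,%)\n"(id, values);
        }
    }
    return buf[];
}
```

Calling `file.write(recordsToText(records))` would then replace the body of the loop and the final write.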

// ---
January 21, 2022
On Friday, 21 January 2022 at 04:08:33 UTC, forkit wrote:
>
> // --
>
> void ProcessRecords
> (in int[][int][] recArray, const(string) fname)
> {
>     auto file = File(fname, "w");
>     scope(exit) file.close;
>
>     Appender!string bigString = appender!string;
>     bigString.reserve(recArray.length);
>     debug { writefln("bigString.capacity is %s", bigString.capacity); }
>
>     void processRecord(const(int) id, const(int)[] values)
>     {
>         bigString ~= id.to!string ~ values.format!"%(%s,%)" ~ "\n";
>     }
>
>     foreach(ref const record; recArray)
>     {
>         foreach (ref rp; record.byPair)
>         {
>             processRecord(rp.expand);
>         }
>     }
>
>     file.write(bigString[]);
> }
>
> // ---

actually something not right with Appender I think...

100_000 records took 20sec (ok)


1_000_000 records never finished - after 1hr/45min I cancelled the process.

??
January 21, 2022
On Friday, 21 January 2022 at 03:50:37 UTC, forkit wrote:

> I might have to use a kindof stringbuilder instead, then write a massive string once to the file.

You're using writeln, which goes through C I/O buffered writes. Whether you make one call or several is of little consequence - you're limited by buffer size and options.
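
If the file's own buffering is enough, the big intermediate string can be skipped entirely. A rough sketch, assuming the record layout used elsewhere in this thread; `writeRecords` is a hypothetical name:

```d
import std.format : formattedWrite;
import std.stdio : File;

// Hypothetical variant of ProcessRecords: streams each record to the file
// and relies on the buffering that File already provides.
void writeRecords(const int[][int][] recArray, string fname)
{
    auto file = File(fname, "w");
    // Take the stream lock once instead of once per write call.
    auto writer = file.lockingTextWriter();
    foreach (ref const record; recArray)
    {
        foreach (id, values; record)
        {
            writer.formattedWrite!"%s,%(%s,%)\n"(id, values);
        }
    }
    // The file is flushed and closed once writer and file go out of scope.
}
```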
January 21, 2022
On Friday, 21 January 2022 at 08:53:26 UTC, Stanislav Blinov wrote:
>

turns out the problem has nothing to do with appender...

It's actually this line:

if (!idArray.canFind(x)):

when i comment this out in the function below, the program does what I want in seconds.

only problem is, the ids are no longer unique (in the file)

// ---
void createUniqueIDArray
(ref int[] idArray, const(int) recordsNeeded)
{
    idArray.reserve(recordsNeeded);
    debug { writefln("idArray.capacity is %s", idArray.capacity); }

    int i = 0;
    int x;
    while(i != recordsNeeded)
    {
       // id needs to be 9 digits, and needs to start with 999
       x = uniform(999*10^^6, 10^^9); // thanks Stanislav

       // ensure every id added is unique.
       //if (!idArray.canFind(x))
       //{
           idArray ~= x; // NOTE: does NOT appear to register with -profile=gc
           i++;
       //}
    }

    debug { writefln("idArray.length = %s", idArray.length); }
}

// ----
January 21, 2022
On Friday, 21 January 2022 at 09:10:56 UTC, forkit wrote:
>

ok... in the interest of correcting the code I posted previously...

... here is a version that actually works in secs (for a million records), as opposed to hours!


// ---------------

/+
  =====================================================================
   This program creates a sample dataset consisting of 'random' records,
   and then outputs that dataset to a file.

   Arguments can be passed on the command line,
   or otherwise default values are used instead.

   Example of that output can be seen at the end of this code.
   =====================================================================
+/

module test;
@safe:
import std.stdio : write, writef, writeln, writefln;
import std.range : iota, takeExactly;
import std.array : array, byPair, Appender, appender;
import std.random : Random, unpredictableSeed, dice, choice, uniform;
import std.algorithm : map, uniq, canFind, among;
import std.conv : to;
import std.format;
import std.stdio : File;
import std.file : exists;
import std.exception : enforce;

debug { import std; }

Random rnd;
static this() {  rnd = Random(unpredictableSeed); } // thanks Ali

void main(string[] args)
{
    int recordsNeeded, valuesPerRecord;
    string fname;

    if(args.length < 4)
    {
        //recordsNeeded = 1_000_000;
        //recordsNeeded = 100_000;
        recordsNeeded = 10;

        valuesPerRecord= 8;

        //fname = "D:/rnd_records.txt";
        fname = "./rnd_records.txt";
    }
    else
    {
        // assumes valid values being passed in ;-)
        recordsNeeded = to!int(args[1]);
        valuesPerRecord = to!int(args[2]);
        fname = args[3];
    }

    debug
        { writefln("%s records, %s values per record, will be written to file: %s", recordsNeeded, valuesPerRecord, fname); }
    else
        { enforce(!exists(fname), "Oops! That file already exists!"); }

    // id needs to be 9 digits, and needs to start with 999
    int[] idArray = takeExactly(iota(999*10^^6, 10^^9), recordsNeeded).array;
    debug { writefln("idArray.length = %s", idArray.length); }

    int[][] valuesArray;
    createValuesArray(valuesArray, recordsNeeded, valuesPerRecord);

    int[][int][] records = CreateDataSet(idArray, valuesArray, recordsNeeded);

    ProcessRecords(records, fname);

    writefln("All done. Check if records written to %s", fname);
}

void createValuesArray
(ref int[][] valuesArray, const(int) recordsNeeded, const(int) valuesPerRecord)
{
    valuesArray = iota(recordsNeeded)
            .map!(i => iota(valuesPerRecord)
            .map!(valuesPerRecord => cast(int)rnd.dice(0.6, 1.4))
            .array).array;  // NOTE: does register with -profile=gc

    debug { writefln("valuesArray.length = %s", valuesArray.length); }

}

int[][int][] CreateDataSet
(const(int)[] idArray, int[][] valuesArray, const(int) numRecords)
{
    int[][int][] records;
    records.reserve(numRecords);
    debug { writefln("records.capacity is %s", records.capacity); }

    foreach(i, const id; idArray)
    {
        // NOTE: below does register with -profile=gc
        records ~= [ idArray[i] : valuesArray[i] ];
    }

    debug { writefln("records.length = %s", records.length); }

    return records.dup;
}

void ProcessRecords
(in int[][int][] recArray, const(string) fname)
{
    auto file = File(fname, "w");
    scope(exit) file.close;

    Appender!string bigString = appender!string;
    bigString.reserve(recArray.length);
    debug { writefln("bigString.capacity is %s", bigString.capacity); }

    // NOTE: a nested function must be declared before the code that calls it
    void processRecord(const(int) id, const(int)[] values)
    {
        // NOTE: below does register with -profile=gc
        bigString ~= id.to!string ~ "," ~ values.format!"%(%s,%)" ~ "\n";
    }

    foreach(ref const record; recArray)
    {
        foreach (ref rp; record.byPair)
        {
            processRecord(rp.expand);
        }
    }

    file.write(bigString[]);
}

/+
sample file output:

9992511730,1,0,1,0,1,0,1
9995369731,1,1,1,1,1,1,1
9993136031,1,0,0,0,1,0,0
9998979051,1,1,1,1,0,1,1
9998438090,1,1,0,1,1,0,0
9995132750,0,0,1,0,1,1,1
9997123630,0,1,1,1,0,1,1
9998351590,1,0,0,1,1,1,1
9991454121,1,1,1,1,1,0,1
9997673520,1,1,1,1,1,1,1

+/

// ---------------

January 21, 2022
On Fri, Jan 21, 2022 at 09:10:56AM +0000, forkit via Digitalmars-d-learn wrote: [...]
> turns out the problem has nothing to do with appender...
> 
> It's actually this line:
> 
> if (!idArray.canFind(x)):
> 
> when i comment this out in the function below, the program does what I want in seconds.
> 
> only problem is, the ids are no longer unique (in the file)
[...]

Ah yes, the good ole O(N²) trap that new programmers often fall into.
:-)

Using .canFind on an array of generated IDs means scanning the entire array every time you find a non-colliding ID. As the array grows, the cost of doing this increases. The overall effect is O(N²) time complexity, because you're continually scanning the array every time you generate a new ID.
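
To put a number on that growth, here is a small sketch that counts the element comparisons, assuming every generated ID is new so each `canFind` call scans the whole array built so far. `linearScanComparisons` is an illustrative name only:

```d
// Number of element comparisons made when canFind is called once per new ID
// while the array grows from 0 to n entries: 0 + 1 + ... + (n - 1).
ulong linearScanComparisons(ulong n)
{
    return n * (n - 1) / 2;
}
```

For 100_000 records this gives about 5 * 10^9 comparisons, and for 1_000_000 records about 5 * 10^11, i.e. ten times the records costs roughly a hundred times the work.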

Use an AA instead, and performance should dramatically increase. I.e., instead of:

	size_t[] idArray;
	...
	if (!idArray.canFind(x)) ...	// O(N) cost to scan array

write:

	bool[size_t] idAA;
	...
	if (x !in idAA) ...	// O(1) average cost to look up an ID
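
Put together, the ID generator from earlier in the thread might look something like this. It is only a sketch: `createUniqueIDs` and `seen` are hypothetical names, `int` keys are used to match the original `int[]` array, and the default RNG is assumed instead of the module-level `rnd`:

```d
import std.exception : enforce;
import std.random : uniform;

// Hypothetical rewrite of createUniqueIDArray using an AA as a "seen" set.
int[] createUniqueIDs(size_t recordsNeeded)
{
    // Only 1_000_000 distinct 9-digit IDs start with 999, so asking for
    // more would loop forever.
    enforce(recordsNeeded <= 1_000_000, "not enough distinct IDs in range");

    bool[int] seen;     // keys are the IDs generated so far
    int[] ids;
    ids.reserve(recordsNeeded);

    while (ids.length < recordsNeeded)
    {
        // id needs to be 9 digits, and needs to start with 999
        immutable x = uniform(999 * 10 ^^ 6, 10 ^^ 9);
        if (x !in seen) // average O(1) lookup instead of an O(N) scan
        {
            seen[x] = true;
            ids ~= x;
        }
    }
    return ids;
}
```

Note that as `recordsNeeded` approaches the size of the ID range, collisions become frequent and the loop slows down again, which is a separate issue from the lookup cost.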


T

-- 
VI = Visual Irritation
January 21, 2022
On Fri, Jan 21, 2022 at 10:12:42AM +0000, forkit via Digitalmars-d-learn wrote: [...]
> Random rnd;
> static this() {  rnd = Random(unpredictableSeed); } // thanks Ali

Actually you don't even need to do this, unless you want precise control over the initialization of your RNG.  If you don't specify the RNG parameter in the calls to std.random functions, they will use the default RNG, which is already initialized with unpredictableSeed.
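
For illustration, the two kinds of random calls in the posted program could be written without any `Random` variable at all. A minimal sketch; `randomId` and `randomBit` are made-up wrapper names:

```d
import std.random : dice, uniform;

// No Random instance is declared here; both calls fall back to the
// thread-local default generator.
int randomId()
{
    // id needs to be 9 digits, and needs to start with 999
    return uniform(999 * 10 ^^ 6, 10 ^^ 9);
}

int randomBit()
{
    // dice picks index 0 with weight 0.6 and index 1 with weight 1.4
    return cast(int) dice(0.6, 1.4);
}
```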


[...]
>     // id needs to be 9 digits, and needs to start with 999
>     int[] idArray = takeExactly(iota(999*10^^6, 10^^9),
> recordsNeeded).array;
[...]

This is wasteful if you're not planning to use every ID in this million-entry long array.  Much better to just use an AA to keep track of which IDs have already been generated instead.  Of course, if you plan to use most of the array, then the AA may wind up using more memory than the array. So it depends on your use case.


T

-- 
Never wrestle a pig. You both get covered in mud, and the pig likes it.
January 21, 2022

On 1/21/22 1:36 PM, H. S. Teoh wrote:
> On Fri, Jan 21, 2022 at 10:12:42AM +0000, forkit via Digitalmars-d-learn wrote:
> [...]
> >     // id needs to be 9 digits, and needs to start with 999
> >     int[] idArray = takeExactly(iota(999*10^^6, 10^^9), recordsNeeded).array;
> [...]
>
> This is wasteful if you're not planning to use every ID in this million-entry long array.  Much better to just use an AA to keep track of which IDs have already been generated instead.  Of course, if you plan to use most of the array, then the AA may wind up using more memory than the array. So it depends on your use case.

Yeah, iota is a random-access range, so you can just pass it directly, and not allocate anything.

Looking at the usage, it doesn't need to be an array at all. But modifying the code to properly accept the range might prove difficult for someone not used to it.
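
One way the function could accept a range is sketched below. This is an assumption about what such a change might look like, not code from the thread: `createDataSet` is a hypothetical templated counterpart of `CreateDataSet`, and the constraint only requires an input range whose elements convert to `int`.

```d
import std.range : ElementType, iota, isInputRange, zip;

// Hypothetical generic version of CreateDataSet: ids may be an array or a
// lazy range such as iota, so no ID array has to be allocated up front.
int[][int][] createDataSet(R)(R ids, int[][] valuesArray)
if (isInputRange!R && is(ElementType!R : int))
{
    int[][int][] records;
    records.reserve(valuesArray.length);

    // zip pairs each id with its values and stops at the shorter range.
    foreach (id, values; zip(ids, valuesArray))
    {
        int key = id;   // plain int key, whatever qualifiers R's elements have
        records ~= [key : values];
    }
    return records;
}
```

A call such as `createDataSet(iota(start, start + n), valuesArray)` then passes the `iota` directly, where `start` and `n` stand for whatever bounds the caller uses.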

-Steve

January 21, 2022
On Friday, 21 January 2022 at 18:36:42 UTC, H. S. Teoh wrote:
>
> This is wasteful if you're not planning to use every ID in this million-entry long array.  Much better to just use an AA to keep track of which IDs have already been generated instead.  Of course, if you plan to use most of the array, then the AA may wind up using more memory than the array. So it depends on your use case.
>
>
> T

yes, I was thinking this over as I was waking up this morning, and thought... what the hell am I doing generating all those numbers that might never get used.

better to do:

const int iotaStartNum = 100_000_000;
int[] idArray = iota(iotaStartNum, iotaStartNum + recordsNeeded).array;


January 21, 2022
On Friday, 21 January 2022 at 18:50:46 UTC, Steven Schveighoffer wrote:
>
> Yeah, iota is a random-access range, so you can just pass it directly, and not allocate anything.
>
> Looking at the usage, it doesn't need to be an array at all. But modifying the code to properly accept the range might prove difficult for someone not used to it.
>
> -Steve

thanks. that makes more sense actually ;-)

now i can get rid of the idArray completely, and just do:

foreach(i, id; enumerate(iota(iotaStartNum, iotaStartNum + recordsNeeded)))
{
    records ~= [ id: valuesArray[i] ];
}