How come a count of a range becomes 0 before a foreach?
April 09, 2023

Hi,

I've written a program that converts an Rmd (R Markdown) file to a Markdown file.

All works well if and only if line 73 is commented out.

I marked line 72 in the code below.

If line 73 is not commented out, foreach does not execute, because rmdFiles.walkLength in line 82 becomes 0.

How does rmdFiles.walkLength become 0 before the foreach? I'm a bit confused.

Can someone clarify? Thank you.

module rmd2md;

import std.algorithm;
import std.stdio;
import file = std.file;
import std.conv;
import std.regex;
import std.getopt;
import std.path;
import std.datetime;
import std.parallelism;
import std.range;

void main(string[] args)
{
    // Set variables for the main program
    string programName = "Rmd2md";

    // Setup Regex for capturing Rmd code snippet header
    Regex!char re = regex(r"`{3}\{r[a-zA-Z0-9= ]*\}", "g");

    // Set default values for the arguments
    string inputPath = file.getcwd();
    string fileEndsWith = ".Rmd";
    string outputPath = file.getcwd();

    // Set GetOpt variables
    auto helpInformation = getopt(
        args,
        "path|p", "Path of Rmd files. Default: current working directory.", &inputPath,
        "fext|e", "Extension of Rmd files. Default: `.Rmd`", &fileEndsWith,
        "fout|o", "Output folder to save the MD files. Default: current working directory.", &outputPath
    );

    if (helpInformation.helpWanted)
    {
        defaultGetoptPrinter("Rmd to Markdown (md) file converter.",
            helpInformation.options);
        return;
    }

    // is the path valid?
    if (!std.path.isValidPath(inputPath))
    {
        writeln(programName ~ ": invalid input path");
        return;
    }

    // is output path valid?
    if (!std.path.isValidPath(outputPath))
    {
        writeln(programName ~ ": invalid output path");
        return;
    }

    // is file extension valid?
    if (!startsWith(fileEndsWith, "."))
    {
        writeln(programName ~ ": invalid extension given");
        return;
    }

    writeln(programName ~ ": input directory is " ~ inputPath);
    writeln(programName ~ ": output directory is " ~ outputPath);
    writeln(programName ~ ": ...");

    // Get files in specified inputPath variable with a specific extension
    auto rmdFiles = file.dirEntries(inputPath, file.SpanMode.shallow)
        .filter!(f => f.isFile)
        .filter!(f => f.name.endsWith(fileEndsWith));

    // LINE 72 -- WARNING -- If we count the range here, later it will become 0 in line 82
    writeln(programName ~ ": number of files found " ~ to!string(rmdFiles.walkLength));

    // Get start time
    auto stattime = Clock.currTime();

    // Process each Rmd file
    int fileWrittenCount = 0;

    // LINE 81 -- WARNING -- if line 73 is not commented out, the walkLength returns 0
    writeln(programName ~ ": number of files found " ~ to!string(rmdFiles.walkLength));

    foreach (file.DirEntry item; parallel(rmdFiles))
    {
        writeln(programName ~ ": processing " ~ item.name);

        try
        {
            // Read content as string
            string content = file.readText(item.name);
            // Replace ```{r} or ```{r option1=value} with ```R
            string modified = replaceAll(content, re, "```R");
            // Set the Markdown output file name by swapping the extension for .md
            string outputFile = setExtension(baseName(item.name), ".md");
            // Build an output path, using output path and baseName(item.name)
            string outputFilenamePath = buildPath(outputPath, outputFile);
            // Save output Markdown file
            file.write(outputFilenamePath, modified);
            writeln(programName ~ ": written " ~ outputFilenamePath);
            // Increase counter to indicate number of files processed
            fileWrittenCount++;
        }
        catch (file.FileException e)
        {
            writeln(programName ~ ": " ~ e.msg);
        }
    }

    writeln(programName ~ ": ...");

    // Get end time
    auto endttime = Clock.currTime();
    auto duration = endttime - stattime;
    writeln("Duration: ", duration);

    // Console output a summary
    writeln(programName ~ ": written " ~ to!string(fileWrittenCount) ~ " files");
}

For testing, create a text file with the extension .Rmd in the same folder as the D file, then run the script with:

rdmd rmd2md.d

It will find the .Rmd file in the current directory and save the converted .md file there as well.

April 08, 2023

On 4/8/23 9:38 PM, ikelaiah wrote:

>     // Get files in specified inputPath variable with a specific extension
>     auto rmdFiles = file.dirEntries(inputPath, file.SpanMode.shallow)
>         .filter!(f => f.isFile)
>         .filter!(f => f.name.endsWith(fileEndsWith));
>
>     // LINE 72 -- WARNING -- If we count the range here, later it will become 0 in line 82
>     writeln(programName ~ ": number of files found " ~ to!string(rmdFiles.walkLength));

dirEntries returns an input range, not a forward range. This means that once it's iterated, it's done.

If you want to iterate it twice, you'll have to construct it twice.

-Steve
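
A minimal sketch of the behaviour described above, assuming a scratch directory under `tempDir` (the directory name and the helper `countTwice` are made up for illustration). It counts the same filtered `dirEntries` range twice, and the static asserts check that the range is an input range but not a forward range.

```d
import std.algorithm : endsWith, filter;
import std.file : dirEntries, mkdirRecurse, rmdirRecurse, SpanMode, tempDir, write;
import std.path : buildPath;
import std.range : walkLength;
import std.range.primitives : isForwardRange, isInputRange;
import std.stdio : writeln;

// Hypothetical helper: counts the SAME range twice and returns both counts.
size_t[2] countTwice(string dir)
{
    auto rmdFiles = dirEntries(dir, SpanMode.shallow)
        .filter!(f => f.name.endsWith(".Rmd"));

    // The filtered range can be iterated, but it has no save(),
    // so there is no way to rewind it.
    static assert(isInputRange!(typeof(rmdFiles)));
    static assert(!isForwardRange!(typeof(rmdFiles)));

    size_t[2] counts;
    counts[0] = rmdFiles.walkLength; // consumes the underlying directory iterator
    counts[1] = rmdFiles.walkLength; // nothing is left to walk
    return counts;
}

void main()
{
    // Scratch directory with two .Rmd files (names are assumptions).
    auto dir = buildPath(tempDir, "rmd2md_input_range_demo");
    mkdirRecurse(dir);
    scope (exit) rmdirRecurse(dir);
    write(buildPath(dir, "a.Rmd"), "first");
    write(buildPath(dir, "b.Rmd"), "second");

    auto counts = countTwice(dir);
    writeln("first count:  ", counts[0]);
    writeln("second count: ", counts[1]);
}
```

With two matching files in the directory, the first count should be 2 and the second 0, which is the symptom reported in the original post.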

April 09, 2023

On Sunday, 9 April 2023 at 03:39:52 UTC, Steven Schveighoffer wrote:
> On 4/8/23 9:38 PM, ikelaiah wrote:
>
> >     // Get files in specified inputPath variable with a specific extension
> >     auto rmdFiles = file.dirEntries(inputPath, file.SpanMode.shallow)
> >         .filter!(f => f.isFile)
> >         .filter!(f => f.name.endsWith(fileEndsWith));
> >
> >     // LINE 72 -- WARNING -- If we count the range here, later it will become 0 in line 82
> >     writeln(programName ~ ": number of files found " ~ to!string(rmdFiles.walkLength));
>
> dirEntries returns an input range, not a forward range. This means that once it's iterated, it's done.
>
> If you want to iterate it twice, you'll have to construct it twice.
>
> -Steve

Steve,

You're absolutely right. I did not read the manual carefully.

It is clearly written in the documentation that dirEntries returns an input range.

I will modify the code to construct it twice.
Many thanks!

-ikelaiah
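
One way to "construct it twice" is to wrap the range construction in a small function and call it whenever a fresh pass is needed. The sketch below is only an illustration: the helper `findRmdFiles` and the scratch directory name are assumptions, not part of the posted program.

```d
import std.algorithm : endsWith, filter;
import std.file : dirEntries, mkdirRecurse, rmdirRecurse, SpanMode, tempDir, write;
import std.path : buildPath;
import std.range : walkLength;
import std.stdio : writeln;

// Hypothetical helper: builds a brand-new lazy range on every call.
auto findRmdFiles(string dir, string ext)
{
    return dirEntries(dir, SpanMode.shallow)
        .filter!(f => f.isFile)
        .filter!(f => f.name.endsWith(ext));
}

void main()
{
    // Scratch directory with one matching and one non-matching file.
    auto dir = buildPath(tempDir, "rmd2md_construct_twice_demo");
    mkdirRecurse(dir);
    scope (exit) rmdirRecurse(dir);
    write(buildPath(dir, "a.Rmd"), "first");
    write(buildPath(dir, "notes.txt"), "ignored");

    // First pass: count only. This range is thrown away afterwards.
    writeln("number of files found ", findRmdFiles(dir, ".Rmd").walkLength);

    // Second pass: a new range, so the loop still sees every file.
    foreach (item; findRmdFiles(dir, ".Rmd"))
        writeln("processing ", item.name);
}
```

Each call reads the directory again, so the two passes can disagree if files are added or removed in between.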

April 09, 2023
On 4/8/23 21:38, ikelaiah wrote:

> I will modify the code to construct it twice.

Multiple iterations of dirEntries can produce different results, which may or may not be what your program will be happy with.

Sticking an .array at the end will iterate a single time and maintain the list forever because .array returns an array. :)

  auto entries = dirEntries(/* ... */).array;

Ali
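
A hedged sketch of the `.array` approach applied to the filtered range from the original post. The helper `snapshotRmdFiles` and the scratch directory name are hypothetical, and the sketch assumes that each stored `DirEntry` stays valid after the directory iteration has moved on, which is not verified here.

```d
import std.algorithm : endsWith, filter;
import std.array : array;
import std.file : DirEntry, dirEntries, mkdirRecurse, rmdirRecurse, SpanMode, tempDir, write;
import std.path : buildPath;
import std.stdio : writeln;

// Hypothetical helper: walks the directory once and stores the matches.
DirEntry[] snapshotRmdFiles(string dir, string ext)
{
    return dirEntries(dir, SpanMode.shallow)
        .filter!(f => f.isFile)
        .filter!(f => f.name.endsWith(ext))
        .array; // eager: the lazy range is consumed exactly once, here
}

void main()
{
    // Scratch directory with two .Rmd files (names are assumptions).
    auto dir = buildPath(tempDir, "rmd2md_array_demo");
    mkdirRecurse(dir);
    scope (exit) rmdirRecurse(dir);
    write(buildPath(dir, "a.Rmd"), "first");
    write(buildPath(dir, "b.Rmd"), "second");

    auto entries = snapshotRmdFiles(dir, ".Rmd");

    // An array knows its length, so counting does not consume anything.
    writeln("number of files found ", entries.length);

    // The same array can be iterated as often as needed.
    foreach (item; entries)
        writeln("processing ", item.name);
}
```

The trade-off is memory: the whole list is held at once instead of being produced lazily.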

April 09, 2023

On 4/9/23 9:16 AM, Ali Çehreli wrote:
> On 4/8/23 21:38, ikelaiah wrote:
>
> > I will modify the code to construct it twice.
>
> Multiple iterations of dirEntries can produce different results, which may or may not be what your program will be happy with.
>
> Sticking an .array at the end will iterate a single time and maintain the list forever because .array returns an array. :)
>
>   auto entries = dirEntries(/* ... */).array;

I'd be cautious of that. I don't know what the underlying code uses, it may reuse buffers for e.g. filenames to avoid allocation.

If you are confident the directory contents won't change in that split-second, then I think iterating twice is fine.

-Steve

April 10, 2023
On Sunday, 9 April 2023 at 13:16:51 UTC, Ali Çehreli wrote:
> On 4/8/23 21:38, ikelaiah wrote:
>
> > I will modify the code to construct it twice.
>
> Multiple iterations of dirEntries can produce different results, which may or may not be what your program will be happy with.
>
> Sticking an .array at the end will iterate a single time and maintain the list forever because .array returns an array. :)
>
>   auto entries = dirEntries(/* ... */).array;
>
> Ali


Ali,

I didn't think about converting the `dirEntries` result to an array.
Thanks for the gems (and your online book too).

Regards,
ikelaiah


April 10, 2023

On Monday, 10 April 2023 at 01:01:59 UTC, Steven Schveighoffer wrote:
> On 4/9/23 9:16 AM, Ali Çehreli wrote:
>
> > On 4/8/23 21:38, ikelaiah wrote:
> >
> > > I will modify the code to construct it twice.
> >
> > Multiple iterations of dirEntries can produce different results, which may or may not be what your program will be happy with.
> >
> > Sticking an .array at the end will iterate a single time and maintain the list forever because .array returns an array. :)
> >
> >   auto entries = dirEntries(/* ... */).array;
>
> I'd be cautious of that. I don't know what the underlying code uses, it may reuse buffers for e.g. filenames to avoid allocation.
>
> If you are confident the directory contents won't change in that split-second, then I think iterating twice is fine.
>
> -Steve

Steve,

The Rmd files are not on a network drive, but saved locally.
So, I'm confident, the files won't change in a split-second.

-ikelaiah.

April 10, 2023
On 4/10/23 6:43 PM, ikelaiah wrote:
> On Monday, 10 April 2023 at 01:01:59 UTC, Steven Schveighoffer wrote:
>> On 4/9/23 9:16 AM, Ali Çehreli wrote:

>>>    auto entries = dirEntries(/* ... */).array;
>>
>> I'd be cautious of that. I don't know what the underlying code uses, it may reuse buffers for e.g. filenames to avoid allocation.
>>
>> If you are confident the directory contents won't change in that split-second, then I think iterating twice is fine.
>>
> 
> Steve,
> 
> The Rmd files are not on a network drive, but saved locally.
> So, I'm confident, the files won't change in a split-second.

That is not what I meant.

What I mean is that `array` is going to copy whatever values the range gives it, which might be later *overwritten* depending on how `dirEntries` is implemented.

e.g. the following code is broken:

```d
auto lines = File("foo.txt").byLine.array;
```

But the following is correct:

```d
auto lines = File("foo.txt").byLineCopy.array;
```

Why? Because `byLine` reuses its line buffer to save on allocations. Each element that `array` stores is a slice of that same buffer, so earlier elements may contain garbage once later lines overwrite it.

I'm not saying it's wrong for `dirEntries`, I haven't looked. But you may want to be cautious about just using `array` to get you out of trouble, especially for lazy input ranges.

-Steve
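
The `byLine` versus `byLineCopy` difference can be tried out with a small, self-contained sketch. The file name and the helper `readLinesSafely` are made up for illustration, and what the `byLine` variant prints is deliberately not stated, since it depends on how the buffer happened to be reused.

```d
import std.array : array;
import std.file : remove, tempDir, write;
import std.path : buildPath;
import std.stdio : File, writeln;

// Hypothetical helper: byLineCopy hands out a freshly allocated string
// per line, so the collected array stays valid.
string[] readLinesSafely(string path)
{
    return File(path).byLineCopy.array;
}

void main()
{
    // Scratch file with three lines (name is an assumption).
    auto path = buildPath(tempDir, "byline_demo.txt");
    write(path, "one\ntwo\nthree\n");
    scope (exit) remove(path);

    // Broken pattern: every stored element is a slice of byLine's
    // internal buffer, which later reads may overwrite.
    char[][] aliased = File(path).byLine.array;
    writeln("byLine.array:     ", aliased);

    // Safe pattern: each element owns its own memory.
    writeln("byLineCopy.array: ", readLinesSafely(path));
}
```

Whether `dirEntries` has a similar pitfall is left open in the thread, so this only illustrates the general risk of calling `array` on a lazy input range that reuses a buffer.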