Thread overview
Newbie: Error parsing csv file with very long lines
Apr 23, 2016
salvari
Apr 23, 2016
rikki cattermole
Apr 23, 2016
salvari
Apr 23, 2016
Nicholas Wilson
Apr 23, 2016
salvari
Apr 23, 2016
rikki cattermole
Apr 23, 2016
salvari
Apr 23, 2016
Ivan Kazmenko
April 23, 2016
Hello all!

I'm trying to read a csv file (';' as separator) with very long lines.

It seems to be really simple: I read the column names with no problem. But as soon as the program parses the first line of data, the array containing the column names seems to be overwritten.

I'm using dmd: DMD64 D Compiler v2.071.0

My code:

import std.stdio;
import std.algorithm;
import std.array;

char[][] columns;


void main() {
 LINE:foreach(line; stdin.byLine()){
    if(line.startsWith("Interfaz")){
      writeln("IN HERE");
      columns = line.split(";");
      writeln(columns);               // Everything seems to be ok
      continue;
    } else{
      auto linedata = line.split(";");
      writefln("My line: %s", line);        // Fine.
      writefln("LineData: %s", linedata);   // Fine. Line data is ok
      writefln("Columns: %s", columns);     // Wrong!!! columns array
                                            // contains garbage data
                                            // from linedata
    }
  }
}

April 23, 2016
On 23/04/2016 10:40 PM, salvari wrote:
> Hello all!
>
> I'm trying to read a csv file (';' as separator) with very long lines.
>
> It seems to be really simple: I read the column names with no problem.
> But as soon as the program parses the first line of data, the array
> containing the column names seems to be overwritten.
>
> I'm using dmd: DMD64 D Compiler v2.071.0
>
> My code:
>
> import std.stdio;
> import std.algorithm;
> import std.array;
>
> char[][] columns;
>
>
> void main() {
>   LINE:foreach(line; stdin.byLine()){
>      if(line.startsWith("Interfaz")){
>        writeln("IN HERE");
>        columns = line.split(";");
>        writeln(columns);               // Everything seems to be ok
>        continue;
>      } else{
>        auto linedata = line.split(";");
>        writefln("My line: %s", line);        // Fine.
>        writefln("LineData: %s", linedata);   // Fine. Line data is ok
>        writefln("Columns: %s", columns);     // Wrong!!! columns array
>                                              // contains garbage data
>                                              // from linedata
>      }
>    }
> }

It's probably using a buffer.
columns = line.dup.split(";");
should fix it.
April 23, 2016
Fixed!!!

Thanks a lot. :-)


But I have to think about this. I don't understand the failure.


April 23, 2016
On Saturday, 23 April 2016 at 10:57:04 UTC, salvari wrote:
> Fixed!!!
>
> Thanks a lot. :-)
>
>
> But I have to think about this. I don't understand the failure.

stdin.byLine() reuses its buffer, so the old arrays in columns point to the data in byLine's buffer and get overwritten by subsequent calls.

Also, if you're trying to parse CSV, check out std.csv.

From the docs:

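import std.csv;
import std.stdio;

// Describes one row; the field types drive the conversion of each column.
struct Layout
{
    string name;
    int value;
    double other;
}

void main()
{
    string str = "Hello;65;63.63\nWorld;123;3673.562";

    // ';' is passed as the field delimiter instead of the default ','.
    auto records = csvReader!Layout(str, ';');

    foreach (record; records)
    {
        writeln(record.name);
        writeln(record.value);
        writeln(record.other);
    }
}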
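
Since the file in the original question starts with a row of column names, here is a minimal sketch of letting std.csv read that header itself. It assumes that passing null as the header argument makes csvReader take the names from the first row, and that each record is then available as a string[string] keyed by column name. The sample data and the helper columnNames are made up for illustration; only "Interfaz" comes from the thread.

```d
import std.csv : csvReader;
import std.stdio : writeln;

// Hypothetical helper: returns the column names found in the first row of `text`.
string[] columnNames(string text)
{
    // null header: take the column names from the first row; ';' is the delimiter.
    auto records = csvReader!(string[string])(text, null, ';');
    return records.header;
}

void main()
{
    // Made-up sample data standing in for the real file.
    string text = "Interfaz;Estado\neth0;up\neth1;down";

    writeln(columnNames(text));

    // Each record is an associative array indexed by column name.
    foreach (record; csvReader!(string[string])(text, null, ';'))
        writeln(record["Interfaz"], " is ", record["Estado"]);
}
```

With this approach there is no need to keep a separate columns array alive across lines, which sidesteps the buffer reuse problem entirely.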
April 23, 2016
On 23/04/2016 10:57 PM, salvari wrote:
> Fixed!!!
>
> Thanks a lot. :-)
>
>
> But I have to think about this. I don't understand the failure.

.dup duplicates memory.
That is, it allocates a new block of memory and copies the values across.

What byLine does is read up to \n and copy the result into a buffer.
You then get access to that buffer as line.
It reuses the same memory for every line, so there are no allocations beyond the first one and any growth of the buffer.
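
To make the effect visible without reading from stdin, here is a small sketch that simulates it. A plain char[] named buffer stands in for byLine's internal buffer (an assumption for illustration, not how std.stdio is actually implemented), and the helpers splitInPlace and splitCopy are hypothetical names.

```d
import std.array : split;
import std.stdio : writeln;

// Splits without copying: the returned slices point into `buffer`.
char[][] splitInPlace(char[] buffer)
{
    return buffer.split(";");
}

// Copies first with .dup, so the returned slices refer to their own memory.
char[][] splitCopy(char[] buffer)
{
    return buffer.dup.split(";");
}

void main()
{
    char[] buffer = "a;b;c".dup;   // stands in for the reused line buffer

    auto aliased = splitInPlace(buffer);
    auto copied = splitCopy(buffer);

    // Simulate the next line being read into the same memory.
    buffer[] = "x;y;z";

    writeln(aliased); // sees the new contents of the buffer
    writeln(copied);  // still holds the original fields
}
```

The slices in aliased never held data of their own, which is why the column names in the original program appeared to turn into garbage once the next line was read.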
April 23, 2016
On Saturday, 23 April 2016 at 11:18:08 UTC, rikki cattermole wrote:
> On 23/04/2016 10:57 PM, salvari wrote:
>> Fixed!!!
>>
>> Thanks a lot. :-)
>>
>>
>> But I have to think about this. I don't understand the failure.
>
> .dup duplicates memory.
> That is, it allocates a new block of memory and copies the values across.
>
> What byLine does is read up to \n and copy the result into a buffer.
> You then get access to that buffer as line.
> It reuses the same memory for every line, so there are no allocations beyond the first one and any growth of the buffer.

Now I understand. Slices are still biting me every now and then.
April 23, 2016
On Saturday, 23 April 2016 at 11:13:19 UTC, Nicholas Wilson wrote:
> On Saturday, 23 April 2016 at 10:57:04 UTC, salvari wrote:
>> Fixed!!!
>>
>> Thanks a lot. :-)
>>
>>
>> But I have to think about this. I don't understand the failure.
>
> stdin.byLine() reuses its buffer, so the old arrays in columns point to the data in byLine's buffer and get overwritten by subsequent calls.
>
> Also, if you're trying to parse CSV, check out std.csv.
>
> From the docs:
>
> string str = "Hello;65;63.63\nWorld;123;3673.562";
> struct Layout
> {
>     string name;
>     int value;
>     double other;
> }
>
> auto records = csvReader!Layout(str,';');
>
> foreach(record; records)
> {
>     writeln(record.name);
>     writeln(record.value);
>     writeln(record.other);
> }

Thanks for the pointer to std.csv!

I think I will use it a lot. I totally missed it.
April 23, 2016
On Saturday, 23 April 2016 at 10:40:13 UTC, salvari wrote:
> It seems to be really simple: I read the column names with no problem. But as soon as the program parses the first line of data, the array containing the column names seems to be overwritten.

Another possibility not yet mentioned is to change
foreach(line; stdin.byLine())
into
foreach(line; stdin.byLineCopy())
so that the contents of earlier lines remain available after you read the next line.
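
A sketch of how the original program might look with byLineCopy. One thing to watch for: byLineCopy yields string (immutable characters) by default, so the column array is declared as string[] here rather than char[][]. The Parsed struct and the parse helper are hypothetical names introduced so the logic can be exercised without stdin.

```d
import std.algorithm : startsWith;
import std.array : split;
import std.stdio;

// Hypothetical container for the header and the data rows.
struct Parsed
{
    string[] columns;
    string[][] rows;
}

// Works on any range of strings, e.g. stdin.byLineCopy() or a string[].
Parsed parse(Lines)(Lines lines)
{
    Parsed result;
    foreach (line; lines)
    {
        if (line.startsWith("Interfaz"))
            result.columns = line.split(";");   // safe: each line is its own copy
        else
            result.rows ~= line.split(";");
    }
    return result;
}

void main()
{
    auto parsed = parse(stdin.byLineCopy());
    writefln("Columns: %s", parsed.columns);
    foreach (row; parsed.rows)
        writefln("LineData: %s", row);
}
```

This costs one allocation per line, which is the trade-off byLine avoids by reusing its buffer.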