August 09, 2010
On 08/08/2010 02:32 PM, bearophile wrote:
> Walter Bright:
>> bearophile wrote:
>>> In the D code I have added an idup to make the comparison more fair, because
>>> in the Python code the "line" is a true newly allocated line, you can safely
>>> use it as dictionary key.
>>
>> So it is with byLine, too. You've burdened D with double the amount of allocations.
>
> I think you are wrong on two counts:
> 
> 1) byLine() doesn't return a newly allocated line; you can see it with this small program:
>
> import std.stdio: File, writeln;
>
> void main(string[] args) {
>      char[][] lines;
>      auto file = File(args[1]);
>      foreach (rawLine; file.byLine()) {
>          writeln(rawLine.ptr);
>          lines ~= rawLine;
>      }
>      file.close();
> }
>
>
> Its output shows that all "strings" (char[]) share the same pointer:
>
> 14E5E00
> 14E5E00
> 14E5E00
> 14E5E00
> 14E5E00
> 14E5E00
> 14E5E00
> ...
>
>
> 2) You can't use the rawLine yielded by byLine() as a string key for an associative array, the way I have said you can in Python. Currently you can, but according to Andrei this is a bug. And if it's not a bug then I'll reopen the closed bug 4474:
>
> http://d.puremagic.com/issues/show_bug.cgi?id=4474
>
>
>> Also, I object in general to this method of making things "more fair". Using a
>> less efficient approach in X because Y cannot use such an approach is not a
>> legitimate comparison.
>
> I generally agree, but this is not the case.
> In some situations you indeed don't need a newly allocated string for each loop iteration, because for example you just want to read and process the lines, not change or store them. You can't do this in Python, but that is not what I wanted to test. As I have explained in bug 4474 this behaviour is useful, but it is acceptable only if explicitly requested by the programmer, not as the default. The language is safe, as Andrei explains there, because you are supposed to idup the char[] to use it as a key for an associative array (if your associative array is declared as int[char[]] then it can accept such a rawLine as a key, but then you can clearly see those aren't strings. This is why I have closed bug 4474).
>
> Bye,
> bearophile
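A small Python counterpart of the D pointer demo above may make the contrast concrete. This is a minimal sketch (the temporary file and its contents are invented for illustration): every line Python reads from a file is a freshly allocated, immutable str, so it is always safe to use as a dict key, unlike the reused char[] buffer that byLine yields.

```python
import os
import tempfile

# Create a throwaway file with a repeated line (contents are illustrative).
with tempfile.NamedTemporaryFile("w+", delete=False) as f:
    f.write("alpha\nbeta\nalpha\n")
    path = f.name

counts = {}
lines = []
with open(path) as f:
    for line in f:
        lines.append(line)                    # keep every line object alive
        counts[line] = counts.get(line, 0) + 1

# Every retained line is a distinct object, even the two equal "alpha\n"
# lines -- the opposite of D's byLine, which reuses one buffer.
ids = {id(l) for l in lines}
print(len(ids) == len(lines))   # True
print(counts["alpha\n"])        # 2

os.remove(path)
```

Because the str objects are immutable and independently allocated, using them directly as dict keys cannot corrupt the dictionary, which is the safety property bearophile says D's idup is needed to recover.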

I think at the end of the day, regardless of the relative capabilities of file reading in the two languages, we should be faster than Python when allocating one new string per line.

Andrei
August 09, 2010
Andrei Alexandrescu:
> I think at the end of the day, regardless of the relative capabilities of file reading in the two languages, we should be faster than Python when allocating one new string per line.

For now I suggest you aim to be just about as fast as Python in this task :-) Beating Python significantly here is probably not easy.
(Later someday I'd also like D AAs to become about as fast as Python dicts.)

Bye,
bearophile
August 09, 2010
On 08/08/2010 10:29 PM, bearophile wrote:
> Andrei Alexandrescu:
>> I think at the end of the day, regardless of the relative
>> capabilities of file reading in the two languages, we should be
>> faster than Python when allocating one new string per line.
>
> For now I suggest you aim to be just about as fast as Python in
> this task :-) Beating Python significantly here is probably
> not easy.

Why?

Andrei
August 09, 2010
"Leandro Lucarella" <luca@llucax.com.ar> wrote in message news:20100808212859.GL3360@llucax.com.ar...
> Nick Sabalausky, el  8 de agosto a las 13:31 me escribiste:
>> "Norbert Nemec" <Norbert@Nemec-online.de> wrote in message news:i3lq17$99u$1@digitalmars.com...
>> >I usually do the same thing with a shell pipe
>> > expand | sed 's/ *$//;s/\r$//;s/\r/\n/'
>> >
>>
>> Filed under "Why I don't like regex for non-trivial things" ;)
>
> Those regex are non-trivial?
>

IMHO, a task has to be REALLY trivial to be trivial in regex ;)

> Maybe you're confusing sed statements with regexes; in that sed program there are 3 trivial regexes:
>

Ahh, I see. I'm not familiar with sed, so when my eyes got to the part after "sed" they began bleeding, and I figured it had to be one of three things:

- Encrypted data
- Hardware crash
- Regex

;)

Insert your favorite joke about "read-only languages" or "languages that look the same before and after RSA encryption" here.

(I'm not genuinely complaining about regexes. They can be very useful. They just tend to get real ugly real fast.)


August 09, 2010
Nick Sabalausky wrote:
> (I'm not genuinely complaining about regexes. They can be very useful. They just tend to get real ugly real fast.)

Regexes are like flying airplanes. You have to do them often or you get "rusty" real fast. (Flying is not a natural behavior, it's not like riding a bike.)
August 09, 2010
bearophile Wrote:

> I think it minimizes heap allocations; the performance is tuned for a line length found to be the "average" one for normal files. So I presume that if your text file has very short lines (like 5 chars each) or very long ones (like 1000 chars each) it becomes less efficient.
> 
> So it's probably a matter of good usage of the C I/O functions, plus more efficient memory management by the GC.
> 
Don't you minimize heap allocation etc. by reading the whole file in one I/O call?

August 09, 2010
Kagamin:

> Don't you minimize heap allocation etc. by reading the whole file in one I/O call?

The whole thread was about lazy reading of file lines. If the file is very large it's not wise to load it all into RAM at once.
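The lazy iteration under discussion can be sketched in a few lines of Python (a minimal example; the temporary file and its contents are invented for illustration). Iterating a file object yields one line at a time, so peak memory is bounded by the longest line rather than the file size:

```python
import os
import tempfile

# Throwaway input file: five 5-character lines (illustrative contents).
with tempfile.NamedTemporaryFile("w+", delete=False) as f:
    f.write("line\n" * 5)
    path = f.name

total = 0
with open(path) as f:
    for line in f:          # lazy: only one line is held in memory at a time
        total += len(line)

print(total)  # 25 characters processed without loading the file whole
os.remove(path)
```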

Bye,
bearophile
August 09, 2010
Andrei Alexandrescu:

> > For now I suggest you aim to be just about as fast as Python in this task :-) Beating Python significantly here is probably not easy.
> 
> Why?

Because it's core functionality for Python, so its devs have probably optimized it well; it's written in C, and in this case there is very little interpreter overhead.

Bye,
bearophile
August 09, 2010
On 2010-08-09 07:12:38 -0400, bearophile <bearophileHUGS@lycos.com> said:

> Kagamin:
> 
>> Don't you minimize heap allocation etc. by reading the whole file in one I/O call?
> 
> The whole thread was about lazy reading of file lines. If the file is very large it's not wise to load it all into RAM at once.

For non-huge files that fit in the memory space, I'd just memory-map the whole file and treat it as a giant string that I could then slice, keeping the slices around (yeah!). The virtual memory system takes care of loading the file's contents as you read from its memory space, so the file isn't loaded all at once.
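A rough Python sketch of this memory-mapping idea (a minimal example using the standard mmap module; the temporary file and its contents are invented for illustration): map the file once, then take slices through a zero-copy view while the OS pages data in on demand.

```python
import mmap
import os
import tempfile

# Throwaway input file (illustrative contents).
with tempfile.NamedTemporaryFile("w+b", delete=False) as f:
    f.write(b"first line\nsecond line\n")
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm)          # zero-copy view over the whole mapping
    first = bytes(view[0:10])      # materialize one slice: "first line"
    print(first)                   # b'first line'
    view.release()                 # must release exported buffers
    mm.close()                     # before closing the mapping

os.remove(path)
```

The slices stay cheap as long as they remain memoryviews; materializing one with bytes() copies it out, which is what you'd do for a slice that must outlive the mapping.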

But that's not compatible with the C file I/O functions. Does Python use C file I/O calls when reading from a file? If not, perhaps that's why it's faster.

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

August 09, 2010
On Monday, August 09, 2010 05:30:33 Michel Fortin wrote:
> On 2010-08-09 07:12:38 -0400, bearophile <bearophileHUGS@lycos.com> said:
> > Kagamin:
> >> Don't you minimize heap allocation etc by reading whole file in one io call?
> > 
> > The whole thread was about lazy read of file lines. If the file is very large it's not wise to load it all in RAM at once.
> 
> For non-huge files that fit in the memory space, I'd just memory-map the whole file and treat it as a giant string that I could then slice, keeping the slices around (yeah!). The virtual memory system takes care of loading the file's contents as you read from its memory space, so the file isn't loaded all at once.
> 
> But that's not compatible with the C file I/O functions. Does Python use C file I/O calls when reading from a file? If not, perhaps that's why it's faster.

Well, you can just read the whole file in as a string with readText(), and any slices of it could stick around, but presumably that's using the C file I/O calls underneath.

- Jonathan M Davis