A safer File.readln
January 22, 2017
It's pretty easy to DoS a D program that uses File.readln or File.byLine:

msl@james:~/d$ prlimit --as=4000000000 time ./tinycat.d tinycat.d
#!/usr/bin/rdmd

import std.stdio;

void main(in string[] argv) {
    foreach (const filename; argv[1..$])
        foreach (line; File(filename).byLine)
            writeln(line);
}

0.00user 0.00system 0:00.00elapsed 66%CPU (0avgtext+0avgdata 4280maxresident)k
0inputs+0outputs (0major+292minor)pagefaults 0swaps
msl@james:~/d$ prlimit --as=4000000000 time ./tinycat.d /dev/zero
0.87user 1.45system 0:02.51elapsed 92%CPU (0avgtext+0avgdata 2100168maxresident)k
0inputs+0outputs (0major+524721minor)pagefaults 0swaps
msl@james:~/d$

This trivial program, which runs in about 4MiB when asked to print itself, chewed up 2GiB of memory in about three seconds when handed an infinitely long input line, and would have kept going if prlimit hadn't killed it.

D is in good company: C++'s getline() and Perl's diamond operator have the same vulnerability.

msl@james:~/d$ prlimit --as=4000000000 time ./a.out tinycat.cpp
#include <fstream>
#include <iostream>
#include <string>

int main(int const argc, char const *argv[]) {
    for (auto i = 1;  i < argc;  ++i) {
        std::ifstream fh {argv[i]};
        for (std::string line;  getline(fh, line, '\n');  )
            std::cout << line << '\n';
    }

    return 0;
}

0.00user 0.00system 0:00.00elapsed 0%CPU (0avgtext+0avgdata 2652maxresident)k
0inputs+0outputs (0major+113minor)pagefaults 0swaps
msl@james:~/d$ prlimit --as=4000000000 time ./a.out /dev/zero
1.12user 1.76system 0:02.92elapsed 98%CPU (0avgtext+0avgdata 1575276maxresident)k
0inputs+0outputs (0major+786530minor)pagefaults 0swaps
msl@james:~/d$ prlimit --as=4000000000 time perl -wpe '' tinycat.d
#!/usr/bin/rdmd

import std.stdio;

void main(in string[] argv) {
    foreach (const filename; argv[1..$])
        foreach (line; File(filename).byLine)
            writeln(line);
}

0.00user 0.00system 0:00.00elapsed 0%CPU (0avgtext+0avgdata 3908maxresident)k
0inputs+0outputs (0major+192minor)pagefaults 0swaps
msl@james:~/d$ prlimit --as=4000000000 time perl -wpe '' /dev/zero
Out of memory!
Command exited with non-zero status 1
4.82user 2.34system 0:07.43elapsed 96%CPU (0avgtext+0avgdata 3681400maxresident)k
0inputs+0outputs (0major+919578minor)pagefaults 0swaps
msl@james:~/d$

But I digress.

What would a safer API look like?  Perhaps we'd slip in a maximum line length as an optional argument to readln, byLine and friends:

enum size_t MaxLength = 1 << 20;    // 1MiB
fh.readln(buf, MaxLength);
buf = fh.readln(MaxLength);
auto range = fh.byLine(MaxLength);

Obviously, we wouldn't want to break compatibility with existing code by demanding a maximum line length at every call site.  Perhaps the default maximum length should change from its current value -- infinity -- to something like 4MiB: longer than lines in most text files, but still affordably small on most modern machines.

What should happen if readln encountered an excessively long line?  Throw an exception?
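
One possible shape for that check, as a rough sketch only: readlnUpTo and LineTooLongException are names invented here for illustration, not existing Phobos API, and a real implementation obviously wouldn't read a byte at a time.

import core.stdc.stdio : EOF, fgetc;
import std.stdio : File;

class LineTooLongException : Exception {
    this(string msg) { super(msg); }
}

// Read one '\n'-terminated line, but give up once it exceeds maxLength bytes.
string readlnUpTo(File fh, size_t maxLength) {
    char[] buf;
    for (;;) {
        immutable c = fgetc(fh.getFP());
        if (c == EOF)
            break;
        buf ~= cast(char) c;
        if (c == '\n')
            break;
        if (buf.length > maxLength)
            throw new LineTooLongException("input line exceeds maximum length");
    }
    return buf.idup;
}

With a generous default such as 4MiB, existing callers would keep working and only pathological input would throw.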

Markus

January 23, 2017
On Sun, 22 Jan 2017 21:29:39 +0000, Markus Laker wrote:

> It's pretty easy to DoS a D program that uses File.readln or File.byLine:

The /dev/zero version at least could be solved by calling stat on the file and limiting reads to the reported size.
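
Something along these lines, as a sketch only (catAtMostReportedSize is an invented name, and this illustrates just the size cap, not byLine's line splitting):

import std.file : getSize;          // stat()-based file size query
import std.stdio : File, stdout;

// Never read more than stat() says the file holds.
void catAtMostReportedSize(string filename) {
    ulong remaining = getSize(filename);
    auto fh = File(filename);
    ubyte[4096] buf;
    while (remaining > 0) {
        immutable want = remaining < buf.length ? cast(size_t) remaining : buf.length;
        auto chunk = fh.rawRead(buf[0 .. want]);
        if (chunk.length == 0)
            break;
        stdout.rawWrite(chunk);
        remaining -= chunk.length;
    }
}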
January 22, 2017
On 1/22/17 8:52 PM, Chris Wright wrote:
> The /dev/zero version at least could be solved by calling stat on the file
> and limiting reads to the reported size.

I recall the reported size for special files is zero. We special-case std.file.read for those. -- Andrei
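
(Easy to confirm; a quick check along these lines prints 0 on Linux:)

import std.file : getSize;
import std.stdio : writeln;

void main() {
    writeln(getSize("/dev/zero"));   // special files report a size of 0
}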
January 23, 2017
On Monday, 23 January 2017 at 01:55:59 UTC, Andrei Alexandrescu wrote:
> I recall reported size for special files is zero. We special case std.file.read for those. -- Andrei

Special files are a bit of a distraction, in any case, because it's easy to create a large disk file full of zeroes:

msl@james:~/d$ dd if=/dev/zero of=zeroes count=2048 bs=1048576
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 11.1792 s, 192 MB/s
msl@james:~/d$ /usr/bin/time ./tinycat.d zeroes > /dev/null
1.67user 3.14system 0:04.83elapsed 99%CPU (0avgtext+0avgdata 4199944maxresident)k
0inputs+0outputs (0major+1049634minor)pagefaults 0swaps
msl@james:~/d$

A 2GiB disk file caused tinycat.d to use over 4GiB of memory.

Cheers,

Markus

January 23, 2017
On 23/01/17 11:15, Markus Laker wrote:

> A 2GiB disk file caused tinycat.d to use over 4GiB of memory.
>

When extending arrays, a common approach is to double the size every time you run out of space. This guarantees an amortized O(1) cost of append.

Unfortunately, this also guarantees that we will never have enough space freed by previous copies to reuse existing memory:

100 byte array

increase

100 bytes free
200 bytes array

increase

300 free
400 array

etc. The array will always be bigger than the total amount of space we freed.

If, instead of increasing its size by 100%, we increase it by a smaller percentage of its previous size, we still maintain the amortized O(1) cost (with a constant factor that might be a little higher; that's the trade-off). On the other hand, we can now reuse memory. Let's say we increase by 50% each time:

100 array

increase

100 free
150 array

increase

250 free
225 array

increase

475 free
338 array

increase

813 free
507 array

The next increase will require 761 bytes, but we already have 813 free, so we can allocate the new array over the memory already freed from before, reducing the heap size.
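
For anyone who wants to reproduce that arithmetic, here is a small, purely illustrative simulation; it reports the first reallocation whose new size fits into the space freed so far:

import std.stdio : writefln;

// For growth factor `factor`, find the first reallocation whose new size
// fits into the total space freed by the earlier, smaller copies.
void firstReuse(double factor, double size = 100, int maxSteps = 50) {
    double freed = 0;
    foreach (step; 1 .. maxSteps + 1) {
        immutable next = size * factor;
        if (freed >= next) {
            writefln("factor %.2f: reuse possible at step %s (%.0f freed, %.0f needed)",
                     factor, step, freed, next);
            return;
        }
        freed += size;      // the old copy is released after the move
        size = next;
    }
    writefln("factor %.2f: no reuse within %s steps", factor, maxSteps);
}

void main() {
    firstReuse(2.0);        // doubling: the freed space never catches up
    firstReuse(1.5);        // 50% growth: reuse after a handful of steps
}

With a factor of 1.5 it reports reuse at the fifth step, roughly matching the 813/761 figures above; with a factor of 2 it never does.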

Of course, if we integrated the allocation and the move, we could have reused previously used memory starting from allocation 3; but that would be equally possible when increasing by 100%, so I'm assuming we can't do that.

Of course, if, instead of 50% we increase by less (say, 20%), we could reuse previously used memory even sooner.

I am assuming that this is the problem that manifests itself in this use scenario. I would suggest solving it at the language level, rather than the library level.

Shachar
January 23, 2017
On Monday, 23 January 2017 at 10:44:50 UTC, Shachar Shemesh wrote:
> Of course, if, instead of 50% we increase by less (say, 20%), we could reuse previously used memory even sooner.

Yes, you're right, of course: expansion of strings and other arrays is a classic time-versus-space trade-off.  However, expanding strings more slowly is a much bigger change than I have the D experience or credentials to suggest.  And I don't think it really solves the problem: it just requires the attacker to wait another few seconds for /dev/zero to deliver enough data to fill up memory.  A simple length-check in readln, in contrast, would prevent an attacker from flooding us with data in the first place.

Markus
January 23, 2017
On 23/01/17 13:05, Markus Laker wrote:
> On Monday, 23 January 2017 at 10:44:50 UTC, Shachar Shemesh wrote:
>> Of course, if, instead of 50% we increase by less (say, 20%), we could
>> reuse previously used memory even sooner.
>
> Yes, you're right, of course: expansion of strings and other arrays is a
> classic time-versus-space trade-off.  However, expanding strings more
> slowly is a much bigger change than I have the D experience or
> credentials to suggest.  And I don't think it really solves the problem:
> it just requires the attacker to wait another few seconds for /dev/zero
> to deliver enough data to fill up memory.  A simple length-check in
> readln, in contrast, would prevent an attacker from flooding us with
> data in the first place.
>
> Markus

It would mean we consume memory on the same order as the amount the "attacker" sends, rather than a multiple of it.

There is a huge difference between "I send an unterminated string 2GB long, and it takes 2GB of memory" and "I send an unterminated string 2GB long, and it takes 4GB of memory, causing trouble".

The second is a problem. The first might be obvious and/or benign, depending on the use case.

Shachar
January 23, 2017
One more thing.

It is possible to tweak the numbers based on the array's size. For example, add 100% for arrays smaller than 1MB, 50% for arrays between 1MB and 100MB, and 20% for arrays larger than 100MB. This would eliminate the performance degradation for almost all users.
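
As a sketch of what such a policy might look like (the thresholds are just the example numbers above, not a tuned proposal):

// Size-dependent growth policy using the example thresholds above.
size_t grownCapacity(size_t current) {
    enum size_t MB = 1 << 20;
    if (current < 1 * MB)
        return current * 2;              // +100% while the array is small
    if (current < 100 * MB)
        return current + current / 2;    // +50% in the middle range
    return current + current / 5;        // +20% for very large arrays
}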

Shachar

On 23/01/17 12:44, Shachar Shemesh wrote:
> On 23/01/17 11:15, Markus Laker wrote:
>
>> A 2GiB disk file caused tinycat.d to use over 4GiB of memory.
>>
>
> When extending arrays, a common approach is to double the size every
> time you run out of space. This guarantees an amortized O(1) cost of
> append.
>
> Unfortunately, this also guarantees that we will never have enough space
> freed by previous copies to reuse existing memory:
>
> 100 byte array
>
> increase
>
> 100 bytes free
> 200 bytes array
>
> increase
>
> 300 free
> 400 array
>
> etc. The array will always be bigger than the total amount of space we
> freed.
>
> If, instead of increasing its size by 100%, we increase it by a smaller
> percentage of its previous size, we still maintain the amortized O(1)
> cost (with a multiplier that might be a little higher, but see the trade
> off). On the other hand, we can now reuse memory. Let's say we increase
> by 50% each time:
>
> 100 array
>
> increase
>
> 100 free
> 150 array
>
> increase
>
> 250 free
> 225 array
>
> increase
>
> 475 free
> 338 array
>
> 813 free
> 507 array
>
> The next increase will require 761 bytes, but we already have 813 free,
> so we can allocate the new array over the memory already freed from
> before, reducing the heap size.
>
> Of course, if we integrate the allocating and the move, we could have
> reused previously used memory starting from allocation 3, but I'm
> assuming that would also be possible when increasing by 100%, so I'm
> assuming we can't do that.
>
> Of course, if, instead of 50% we increase by less (say, 20%), we could
> reuse previously used memory even sooner.
>
> I am assuming that this is the problem that manifests itself in this use
> scenario. I would suggest solving it at the language level, rather than
> the library level.
>
> Shachar

January 23, 2017
On Monday, 23 January 2017 at 11:30:35 UTC, Shachar Shemesh wrote:
> It is possible to tweak the numbers based on the overall use. For example, add 100% for arrays smaller than 1MB, 50% for 1MB - 100MB, and 20% for arrays above 100MB big. This would eliminate the performance degradation for almost all users.

I'm going to bow out of the discussion about array-expansion, because I think it's a separate topic from readln and I don't know enough about D's internals to contribute meaningfully.  It might be worth raising your proposal in a separate thread in order to ensure visibility.

Cheers,

Markus

January 23, 2017
On 1/23/17 5:44 AM, Shachar Shemesh wrote:
> If, instead of increasing its size by 100%, we increase it by a smaller
> percentage of its previous size, we still maintain the amortized O(1)
> cost (with a multiplier that might be a little higher, but see the trade
> off). On the other hand, we can now reuse memory.

Heh, I have a talk about it. The limit is the golden cut, 1.6180339887498948482... The proof is fun. Anything larger prevents you from reusing previously used space. -- Andrei
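
For the curious, a back-of-the-envelope sketch of why the golden ratio is the threshold (a reconstruction, not Andrei's actual proof): with growth factor $r > 1$ and initial size $c$, just before the $n$-th reallocation the live block has size $c\,r^{n-1}$ and the freed blocks total

$$c\,(1 + r + \dots + r^{n-2}) = c\,\frac{r^{n-1} - 1}{r - 1}.$$

The new block of size $c\,r^{n}$ fits into that freed space only if

$$\frac{r^{n-1} - 1}{r - 1} \ge r^{n},$$

and since the left-hand side is smaller than $r^{n-1}/(r-1)$, this requires $\frac{1}{r-1} > r$, i.e. $r^{2} - r - 1 < 0$, i.e. $r < \frac{1+\sqrt{5}}{2} \approx 1.618$. Conversely, for any $r$ below the golden ratio the inequality holds once $n$ is large enough, so the freed space eventually catches up.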
