Jump to page: 1 2
Thread overview
[OT] parsing with sscanf is accidentally quadratic due to strlen
Mar 03, 2021
Kagamin
Mar 03, 2021
Patrick Schluter
Mar 03, 2021
Kagamin
Mar 03, 2021
Patrick Schluter
Mar 03, 2021
Simen Kjærås
Mar 03, 2021
Nick Treleaven
Mar 03, 2021
H. S. Teoh
Mar 04, 2021
Kagamin
Mar 03, 2021
deadalnix
Mar 04, 2021
user1234
Mar 04, 2021
user1234
March 03, 2021
Parsers based on sscanf choke on big strings: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
Source: https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47
March 03, 2021
On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote:
> Parsers based on sscanf choke on big strings: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
> Source: https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47

Yes, sscanf() calls strlen(). I got bitten by it also some years ago when I memory mapped some log files to parse and had the my program hog the CPU when I went in production. On my test files that were not that big, memory mapping and changing fscanf() to sscanf() was a no brainer. When it went in production and started to map megabytes or gigabyte sized files, I rediscovered what a O(n²) algorithm looked like...
March 03, 2021
On Wednesday, 3 March 2021 at 11:09:04 UTC, Patrick Schluter wrote:
> Yes, sscanf() calls strlen(). I got bitten by it also some years ago when I memory mapped some log files to parse

I think you had another bug there? Memory mapped files are not null terminated with probability 1/4096.
March 03, 2021
On Wednesday, 3 March 2021 at 11:36:54 UTC, Kagamin wrote:
> On Wednesday, 3 March 2021 at 11:09:04 UTC, Patrick Schluter wrote:
>> Yes, sscanf() calls strlen(). I got bitten by it also some years ago when I memory mapped some log files to parse
>
> I think you had another bug there? Memory mapped files are not null terminated with probability 1/4096.

No, that bug was tested for. If the file size was a multiple of pagesize (it was on SPARC so 8192 not 4096) I added 1 byte to it which forced mmap to add an empty page at the end => guaranteed 0 at the end.
If you're guaranteed to run on Linux you can force the adding of a supplemental page just by passing a size bigger than the file size to mmap. This was not possible on newer versions of Solaris.
This said, I had a library to handle these details, also by replacing mmap by a malloc/read when the file is below a certain size as mmap has a non negligible overhead.

March 03, 2021
On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote:
> Parsers based on sscanf choke on big strings: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
> Source: https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47

An interesting follow up:
https://www.mattkeeter.com/blog/2021-03-01-happen/
March 03, 2021
On Wednesday, 3 March 2021 at 11:09:04 UTC, Patrick Schluter wrote:
> On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote:
>> Parsers based on sscanf choke on big strings: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
>> Source: https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47
>
> Yes, sscanf() calls strlen(). I got bitten by it also some years ago when I memory mapped some log files to parse and had the my program hog the CPU when I went in production. On my test files that were not that big, memory mapping and changing fscanf() to sscanf() was a no brainer. When it went in production and started to map megabytes or gigabyte sized files, I rediscovered what a O(n²) algorithm looked like...

Dawson’s first law of computing: O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.

(https://twitter.com/BruceDawson0xB/status/1120381406700429312)

--
  Simen
March 03, 2021
On Wednesday, 3 March 2021 at 15:16:03 UTC, Simen Kjærås wrote:
> Dawson’s first law of computing: O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.

I think some years back, Andrei proposed using a user-defined attribute to tag functions with their big-O time complexity and maybe even use those attributes to calculate complexity of chained function calls.
March 03, 2021
On Wed, Mar 03, 2021 at 11:09:04AM +0000, Patrick Schluter via Digitalmars-d wrote:
> On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote:
> > Parsers based on sscanf choke on big strings:
> > https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
> > Source: https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47
> 
> Yes, sscanf() calls strlen(). I got bitten by it also some years ago
> when I memory mapped some log files to parse and had the my program
> hog the CPU when I went in production. On my test files that were not
> that big, memory mapping and changing fscanf() to sscanf() was a no
> brainer. When it went in production and started to map megabytes or
> gigabyte sized files, I rediscovered what a O(n²) algorithm looked
> like...

Ouch.

I myself am no fan of sscanf: too limited and hard to fine-tune parsing behaviour. If it were up to me, I wouldn't run any large data sets through sscanf.  Now this adds one more reason for not using sscanf.

Fortunately in D slices eliminate the strlen problem, and slice-based std.array.split, et al, are generally better for simple parsing tasks IMO than *scanf functions.


T

-- 
Heads I win, tails you lose.
March 04, 2021
On Wednesday, 3 March 2021 at 16:58:27 UTC, H. S. Teoh wrote:
> Fortunately in D slices eliminate the strlen problem, and slice-based std.array.split, et al, are generally better for simple parsing tasks IMO than *scanf functions.

It's used for number parsing, which isn't easy to replace in C: https://github.com/biojppm/rapidyaml/issues/40
March 04, 2021
On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote:
> Parsers based on sscanf choke on big strings: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
> Source: https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47

The JSON part makes me think to D-YAML which had a similar issue ([1], [2]], i.e absurd double checks on AA insertion.

[1]: https://github.com/dlang-community/D-YAML/issues/78
[2]: https://github.com/dlang-community/D-YAML/pull/112
« First   ‹ Prev
1 2