[OT] parsing with sscanf is accidentally quadratic due to strlen

Mar 03, 2021

Kagamin

Mar 03, 2021

Mar 03, 2021

Mar 03, 2021

Mar 03, 2021

Mar 03, 2021

Mar 03, 2021

Mar 04, 2021

Mar 03, 2021

Mar 04, 2021

Mar 04, 2021

Parsers based on sscanf choke on big strings: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/ Source: https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47

On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote: > Parsers based on sscanf choke on big strings: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/ > Source: https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47 Yes, sscanf() calls strlen(). I got bitten by it also some years ago when I memory mapped some log files to parse and had the my program hog the CPU when I went in production. On my test files that were not that big, memory mapping and changing fscanf() to sscanf() was a no brainer. When it went in production and started to map megabytes or gigabyte sized files, I rediscovered what a O(n²) algorithm looked like...

On Wednesday, 3 March 2021 at 11:09:04 UTC, Patrick Schluter wrote: > Yes, sscanf() calls strlen(). I got bitten by it also some years ago when I memory mapped some log files to parse I think you had another bug there? Memory mapped files are not null terminated with probability 1/4096.

On Wednesday, 3 March 2021 at 11:36:54 UTC, Kagamin wrote: > On Wednesday, 3 March 2021 at 11:09:04 UTC, Patrick Schluter wrote: >> Yes, sscanf() calls strlen(). I got bitten by it also some years ago when I memory mapped some log files to parse > > I think you had another bug there? Memory mapped files are not null terminated with probability 1/4096. No, that bug was tested for. If the file size was a multiple of pagesize (it was on SPARC so 8192 not 4096) I added 1 byte to it which forced mmap to add an empty page at the end => guaranteed 0 at the end. If you're guaranteed to run on Linux you can force the adding of a supplemental page just by passing a size bigger than the file size to mmap. This was not possible on newer versions of Solaris. This said, I had a library to handle these details, also by replacing mmap by a malloc/read when the file is below a certain size as mmap has a non negligible overhead.

On Wednesday, 3 March 2021 at 11:09:04 UTC, Patrick Schluter wrote: > On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote: >> Parsers based on sscanf choke on big strings: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/ >> Source: https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47 > > Yes, sscanf() calls strlen(). I got bitten by it also some years ago when I memory mapped some log files to parse and had the my program hog the CPU when I went in production. On my test files that were not that big, memory mapping and changing fscanf() to sscanf() was a no brainer. When it went in production and started to map megabytes or gigabyte sized files, I rediscovered what a O(n²) algorithm looked like... Dawson’s first law of computing: O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there. (https://twitter.com/BruceDawson0xB/status/1120381406700429312) -- Simen

On Wednesday, 3 March 2021 at 15:16:03 UTC, Simen Kjærås wrote: > Dawson’s first law of computing: O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there. I think some years back, Andrei proposed using a user-defined attribute to tag functions with their big-O time complexity and maybe even use those attributes to calculate complexity of chained function calls.

March 03, 2021

Re: [OT] parsing with sscanf is accidentally quadratic due to strlen

Posted by H. S. Teoh
in reply to Patrick Schluter

Permalink

H. S. Teoh

Posted in reply to Patrick Schluter

Permalink

On Wed, Mar 03, 2021 at 11:09:04AM +0000, Patrick Schluter via Digitalmars-d wrote:
> On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote:
> > Parsers based on sscanf choke on big strings:
> > https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
> > Source: https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47
> 
> Yes, sscanf() calls strlen(). I got bitten by it also some years ago
> when I memory mapped some log files to parse and had the my program
> hog the CPU when I went in production. On my test files that were not
> that big, memory mapping and changing fscanf() to sscanf() was a no
> brainer. When it went in production and started to map megabytes or
> gigabyte sized files, I rediscovered what a O(n²) algorithm looked
> like...

Ouch.

I myself am no fan of sscanf: too limited and hard to fine-tune parsing behaviour. If it were up to me, I wouldn't run any large data sets through sscanf.  Now this adds one more reason for not using sscanf.

Fortunately in D slices eliminate the strlen problem, and slice-based std.array.split, et al, are generally better for simple parsing tasks IMO than *scanf functions.

T

-- 
Heads I win, tails you lose.

On Wednesday, 3 March 2021 at 16:58:27 UTC, H. S. Teoh wrote: > Fortunately in D slices eliminate the strlen problem, and slice-based std.array.split, et al, are generally better for simple parsing tasks IMO than *scanf functions. It's used for number parsing, which isn't easy to replace in C: https://github.com/biojppm/rapidyaml/issues/40

On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote: > Parsers based on sscanf choke on big strings: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/ > Source: https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47 The JSON part makes me think to D-YAML which had a similar issue ([1], [2]], i.e absurd double checks on AA insertion. [1]: https://github.com/dlang-community/D-YAML/issues/78 [2]: https://github.com/dlang-community/D-YAML/pull/112

Forums