Thread overview | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
March 03, 2021 [OT] parsing with sscanf is accidentally quadratic due to strlen | ||||
---|---|---|---|---|
| ||||
Parsers based on sscanf choke on big strings: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/ Source: https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47 |
March 03, 2021 Re: [OT] parsing with sscanf is accidentally quadratic due to strlen | ||||
---|---|---|---|---|
| ||||
Posted in reply to Kagamin | On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote:
> Parsers based on sscanf choke on big strings: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
> Source: https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47
Yes, sscanf() calls strlen(). I got bitten by it also some years ago when I memory mapped some log files to parse and had the my program hog the CPU when I went in production. On my test files that were not that big, memory mapping and changing fscanf() to sscanf() was a no brainer. When it went in production and started to map megabytes or gigabyte sized files, I rediscovered what a O(n²) algorithm looked like...
|
March 03, 2021 Re: [OT] parsing with sscanf is accidentally quadratic due to strlen | ||||
---|---|---|---|---|
| ||||
Posted in reply to Patrick Schluter | On Wednesday, 3 March 2021 at 11:09:04 UTC, Patrick Schluter wrote:
> Yes, sscanf() calls strlen(). I got bitten by it also some years ago when I memory mapped some log files to parse
I think you had another bug there? Memory mapped files are not null terminated with probability 1/4096.
|
March 03, 2021 Re: [OT] parsing with sscanf is accidentally quadratic due to strlen | ||||
---|---|---|---|---|
| ||||
Posted in reply to Kagamin | On Wednesday, 3 March 2021 at 11:36:54 UTC, Kagamin wrote:
> On Wednesday, 3 March 2021 at 11:09:04 UTC, Patrick Schluter wrote:
>> Yes, sscanf() calls strlen(). I got bitten by it also some years ago when I memory mapped some log files to parse
>
> I think you had another bug there? Memory mapped files are not null terminated with probability 1/4096.
No, that bug was tested for. If the file size was a multiple of pagesize (it was on SPARC so 8192 not 4096) I added 1 byte to it which forced mmap to add an empty page at the end => guaranteed 0 at the end.
If you're guaranteed to run on Linux you can force the adding of a supplemental page just by passing a size bigger than the file size to mmap. This was not possible on newer versions of Solaris.
This said, I had a library to handle these details, also by replacing mmap by a malloc/read when the file is below a certain size as mmap has a non negligible overhead.
|
March 03, 2021 Re: [OT] parsing with sscanf is accidentally quadratic due to strlen | ||||
---|---|---|---|---|
| ||||
Posted in reply to Kagamin | On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote: > Parsers based on sscanf choke on big strings: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/ > Source: https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47 An interesting follow up: https://www.mattkeeter.com/blog/2021-03-01-happen/ |
March 03, 2021 Re: [OT] parsing with sscanf is accidentally quadratic due to strlen | ||||
---|---|---|---|---|
| ||||
Posted in reply to Patrick Schluter | On Wednesday, 3 March 2021 at 11:09:04 UTC, Patrick Schluter wrote: > On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote: >> Parsers based on sscanf choke on big strings: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/ >> Source: https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47 > > Yes, sscanf() calls strlen(). I got bitten by it also some years ago when I memory mapped some log files to parse and had the my program hog the CPU when I went in production. On my test files that were not that big, memory mapping and changing fscanf() to sscanf() was a no brainer. When it went in production and started to map megabytes or gigabyte sized files, I rediscovered what a O(n²) algorithm looked like... Dawson’s first law of computing: O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there. (https://twitter.com/BruceDawson0xB/status/1120381406700429312) -- Simen |
March 03, 2021 Re: [OT] parsing with sscanf is accidentally quadratic due to strlen | ||||
---|---|---|---|---|
| ||||
Posted in reply to Simen Kjærås | On Wednesday, 3 March 2021 at 15:16:03 UTC, Simen Kjærås wrote:
> Dawson’s first law of computing: O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.
I think some years back, Andrei proposed using a user-defined attribute to tag functions with their big-O time complexity and maybe even use those attributes to calculate complexity of chained function calls.
|
March 03, 2021 Re: [OT] parsing with sscanf is accidentally quadratic due to strlen | ||||
---|---|---|---|---|
| ||||
Posted in reply to Patrick Schluter | On Wed, Mar 03, 2021 at 11:09:04AM +0000, Patrick Schluter via Digitalmars-d wrote: > On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote: > > Parsers based on sscanf choke on big strings: > > https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/ > > Source: https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47 > > Yes, sscanf() calls strlen(). I got bitten by it also some years ago > when I memory mapped some log files to parse and had the my program > hog the CPU when I went in production. On my test files that were not > that big, memory mapping and changing fscanf() to sscanf() was a no > brainer. When it went in production and started to map megabytes or > gigabyte sized files, I rediscovered what a O(n²) algorithm looked > like... Ouch. I myself am no fan of sscanf: too limited and hard to fine-tune parsing behaviour. If it were up to me, I wouldn't run any large data sets through sscanf. Now this adds one more reason for not using sscanf. Fortunately in D slices eliminate the strlen problem, and slice-based std.array.split, et al, are generally better for simple parsing tasks IMO than *scanf functions. T -- Heads I win, tails you lose. |
March 04, 2021 Re: [OT] parsing with sscanf is accidentally quadratic due to strlen | ||||
---|---|---|---|---|
| ||||
Posted in reply to H. S. Teoh | On Wednesday, 3 March 2021 at 16:58:27 UTC, H. S. Teoh wrote: > Fortunately in D slices eliminate the strlen problem, and slice-based std.array.split, et al, are generally better for simple parsing tasks IMO than *scanf functions. It's used for number parsing, which isn't easy to replace in C: https://github.com/biojppm/rapidyaml/issues/40 |
March 04, 2021 Re: [OT] parsing with sscanf is accidentally quadratic due to strlen | ||||
---|---|---|---|---|
| ||||
Posted in reply to Kagamin | On Wednesday, 3 March 2021 at 09:12:19 UTC, Kagamin wrote: > Parsers based on sscanf choke on big strings: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/ > Source: https://github.com/chakra-core/ChakraCore/blob/master/pal/src/safecrt/sscanf.c#L47 The JSON part makes me think to D-YAML which had a similar issue ([1], [2]], i.e absurd double checks on AA insertion. [1]: https://github.com/dlang-community/D-YAML/issues/78 [2]: https://github.com/dlang-community/D-YAML/pull/112 |
Copyright © 1999-2021 by the D Language Foundation