Thread overview | ||||||
---|---|---|---|---|---|---|
|
August 23, 2017 (SIMD) Optimized multi-byte chunk scanning | ||||
---|---|---|---|---|
| ||||
I recall seeing some C/C++/D code that optimizes the comment- and whitespace-skipping parts (tokens) of lexers by operating on 2, 4 or 8-byte chunks instead of single-byte chunks. This in the case when token-terminators are expressed as sets of (alternative) ASCII-characters. For instance, when searching for the end of a line comment, I would like to speed up the while-loop in size_t offset; string input = "// \n"; // a line-comment string import std.algorithm : among; // until end-of-line or file terminator while (!input[offset].among!('\0', '\n', '\r') { ++offset; } by taking `offset`-steps larger than one. Note that my file reading function that creates the real `input`, appends a '\0' at the end to enable sentinel-based search as shown in the call to `among` above. I further recall that there are x86_64 intrinsics that can be used here for further speedups. Refs, anyone? |
August 25, 2017 Re: (SIMD) Optimized multi-byte chunk scanning | ||||
---|---|---|---|---|
| ||||
Posted in reply to Nordlöw | On Wednesday, 23 August 2017 at 22:07:30 UTC, Nordlöw wrote: > I recall seeing some C/C++/D code that optimizes the comment- and whitespace-skipping parts (tokens) of lexers by operating on 2, 4 or 8-byte chunks instead of single-byte chunks. This in the case when token-terminators are expressed as sets of (alternative) ASCII-characters. > > For instance, when searching for the end of a line comment, I would like to speed up the while-loop in > > size_t offset; > string input = "// \n"; // a line-comment string > import std.algorithm : among; > // until end-of-line or file terminator > while (!input[offset].among!('\0', '\n', '\r') > { > ++offset; > } > > by taking `offset`-steps larger than one. > > Note that my file reading function that creates the real `input`, appends a '\0' at the end to enable sentinel-based search as shown in the call to `among` above. > > I further recall that there are x86_64 intrinsics that can be used here for further speedups. > > Refs, anyone? On line comments it doesn't sound like it will pay off since you would have to do extra work to make sure you work on 16 byte aligned memory. For multi-line comments maybe. As for a nice reference of intel intrinsics: https://software.intel.com/sites/landingpage/IntrinsicsGuide/ |
August 25, 2017 Re: (SIMD) Optimized multi-byte chunk scanning | ||||
---|---|---|---|---|
| ||||
Posted in reply to Igor | On Friday, 25 August 2017 at 09:40:28 UTC, Igor wrote:
> As for a nice reference of intel intrinsics: https://software.intel.com/sites/landingpage/IntrinsicsGuide/
Wow, what a fabulous UX!
|
August 26, 2017 Re: (SIMD) Optimized multi-byte chunk scanning | ||||
---|---|---|---|---|
| ||||
Posted in reply to Nordlöw | On Friday, 25 August 2017 at 18:52:57 UTC, Nordlöw wrote: > On Friday, 25 August 2017 at 09:40:28 UTC, Igor wrote: >> As for a nice reference of intel intrinsics: https://software.intel.com/sites/landingpage/IntrinsicsGuide/ > > Wow, what a fabulous UX! The pcmpestri instruction is probably what you are looking for? There is a useful resource in the Intel optimisation guide. There is also an Intel article about speeding up XML parsing with this instruction, but if memory serves it's really messy - a right palaver. A lot depends on how much control, if any you have over the input data, typically none I quite understand. Based on this article, https://graphics.stanford.edu/~seander/bithacks.html I wrote a short d routine to help me learn the language as I was thinking about faster strlen using larger-sized gulps. The above article has a good test for whether or not a wide word contains a particular byte value somewhere in it. I wrote a bool hasZeroByte( in uint64_t x ) function based on that method. I'm intending to write a friendlier d convenience routine to give access to inline pcmpestri code generation in GDC when I get round to it (one instruction all fully inlined and flexibly optimised at compile-time, with no subroutine call to an instruction). Agner Fog's libraries and articles are superb, take a look. He must have published code to deal with these C standard library byte string processing functions efficiently with wide aligned machine words, unless I'm getting very forgetful. A bit of googling? |
Copyright © 1999-2021 by the D Language Foundation