May 16, 2018
On Wednesday, 16 May 2018 at 16:54:19 UTC, Walter Bright wrote:
> I used to do things like that a simpler way. 3 functions would be created:
>
>   void FeatureInHardware();
>   void EmulateFeature();
>   void Select();
>   void function() doIt = &Select;
>
> I.e. the first time doIt is called, it calls the Select function which then resets doIt to either FeatureInHardware() or EmulateFeature().
>
> It costs an indirect call, but if you move it up the call hierarchy a bit so it isn't in the hot loops, the indirect function call cost is negligible.
>
> The advantage is there was only one binary.
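
For reference, a minimal sketch of that pattern in D (the Select body is illustrative, and the core.cpuid.sse42 query is an assumption for the runtime check, not Walter's actual code):

    import core.cpuid : sse42;

    void FeatureInHardware() { /* SSE 4.2 path */ }
    void EmulateFeature()    { /* scalar fallback */ }

    // The first call lands in Select, which picks an implementation once
    // and forwards the call; every later call goes straight to the choice.
    void Select()
    {
        doIt = sse42 ? &FeatureInHardware : &EmulateFeature;
        doIt();
    }

    void function() doIt = &Select;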

It certainly sounds reasonable enough for 99% of use cases. But I'm definitely the 1% here ;-)

Indirect calls invoke the wrath of the branch predictor on XB1/PS4 (ie an AMD Jaguar processor). But there's certainly some more interesting non-processor behaviour, at least with MSVC toolchains. The provided auto-DLL loading in that environment means a call to your DLL-boundary-crossing function actually lands in a jump table, which then performs another jump to get to your DLL code. I suspect that, at a "write a basic test" level, this is more costly than the indirect call. And doing an indirect call as the only action in a for-loop is guaranteed to bring out the costly branch predictor on the Jaguar. Without getting in and profiling a bunch of things, I'm not entirely sure which option I'd prefer as a general approach.
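
For illustration, a "write a basic test" comparison might look like this in D. It's a sketch only: the loop count is arbitrary, and real conclusions need profiling on the target hardware rather than wall-clock timing.

    import std.datetime.stopwatch : benchmark;
    import std.stdio : writeln;

    int direct(int x) { return x + 1; }
    int function(int) indirect = &direct;

    void main()
    {
        enum N = 100_000_000;
        // Time a direct call against an indirect call in a tight loop.
        auto results = benchmark!(
            { int s; foreach (i; 0 .. N) s += direct(s); },
            { int s; foreach (i; 0 .. N) s += indirect(s); }
        )(1);
        writeln("direct:   ", results[0]);
        writeln("indirect: ", results[1]);
    }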

Certainly, as far as this particular thread goes, every general-purpose function of a few lines that I write using intrinsics is forced inline. No function calls, indirect or otherwise. And on top of that, the inlined code usually pushes the branches out across the byte-boundary lines just far enough that only the simple branch predictor is ever invoked.
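
A sketch of what one of those forced-inline wrappers looks like in D, with a plain loop standing in for the real SSE 4.2 intrinsics (the intrinsic API differs per compiler, so the body here is only illustrative):

    // pragma(inline, true) tells the compiler to always inline this wrapper.
    pragma(inline, true)
    bool containsByte(const(ubyte)[] haystack, ubyte needle)
    {
        // The real version is a handful of SSE 4.2 instructions (e.g. PCMPESTRI);
        // the scalar loop keeps the sketch portable.
        foreach (b; haystack)
            if (b == needle)
                return true;
        return false;
    }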

(Related: one feature I'd really really really love for linkers to implement is the ability to mark up certain functions to only ever be linked at a certain byte boundary. And that's purely because Jaguar branch prediction often made my profiling tests non-deterministic between compiles. A NOP is a legit optimisation on those processors.)
May 16, 2018
On Wednesday, 16 May 2018 at 16:54:19 UTC, Walter Bright wrote:
> I used to do things like that a simpler way. 3 functions would be created:
>
>   void FeatureInHardware();
>   void EmulateFeature();
>   void Select();
>   void function() doIt = &Select;
>
> I.e. the first time doIt is called, it calls the Select function which then resets doIt to either FeatureInHardware() or EmulateFeature().
>
> It costs an indirect call [...]

Is this basically the same as Function MultiVersioning [1]?

I've never had a need to use it and always wondered how it works out in real life.
From the description, it seems this would incur an indirection:

"To keep the cost of dispatching low, the IFUNC [2] mechanism is used for dispatching. This makes the call to the dispatcher a one-time thing during startup and a call to a function version is a single jump indirect instruction."

In the linked article [2], Ian Lance Taylor says glibc uses this for memcpy(), so it should be pretty efficient (but then again, one doesn't call memcpy() in hot loops too often).

[1] https://gcc.gnu.org/wiki/FunctionMultiVersioning
[2] https://www.airs.com/blog/archives/403

--
Alexander
May 16, 2018
On Wednesday, 16 May 2018 at 17:18:06 UTC, Joakim wrote:
> I think you know what I'm referring to, which is that UTF-8 is a badly designed format, not that input validation shouldn't be done.

UTF-8 seems like the best option available given the problem space.

Junk data is going to be a problem with any possible string format given that encoding translations and programmer error will always be prevalent.
May 16, 2018
On 5/16/18 1:18 PM, Joakim wrote:
> On Wednesday, 16 May 2018 at 16:48:28 UTC, Dmitry Olshansky wrote:
>> On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
>>> On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
>>>> https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/ 
>>>>
>>>
>>> Sigh, this reminds me of the old quote about people spending a bunch of time making more efficient what shouldn't be done at all.
>>
>> Validating UTF-8 is super common, most text protocols and files these days would use it, others would have an option to do so.
>>
>> I’d like our validateUtf to be fast, since right now we do validation every time we decode a string. And THAT is slow. Trying to not validate on decode means most things should be validated on input...
> 
> I think you know what I'm referring to, which is that UTF-8 is a badly designed format, not that input validation shouldn't be done.

I find this an interesting minority opinion, at least from the perspective of the circles I frequent, where UTF8 is unanimously heralded as a great design. Only a couple of weeks ago I saw Dylan Beattie give a very entertaining talk on exactly this topic: https://dotnext-piter.ru/en/2018/spb/talks/2rioyakmuakcak0euk0ww8/

If you could share some details on why you think UTF8 is badly designed and how you believe it could be/have been better, I'd be in your debt!


Andrei
May 16, 2018
On 5/16/2018 10:28 AM, Ethan wrote:
> (Related: one feature I'd really really really love for linkers to implement is the ability to mark up certain functions to only ever be linked at a certain byte boundary. And that's purely because Jaguar branch prediction often made my profiling tests non-deterministic between compiles. A NOP is a legit optimisation on those processors.)

Linkers already do that. Alignment is specified on all symbols emitted by the compiler, and the linker uses that info.
May 16, 2018
On 5/16/2018 5:47 AM, Ethan Watson wrote:
> I re-implemented some common string functionality at Remedy using SSE 4.2 instructions. Pretty handy. Except we had to turn that code off for released products since nowhere near enough people are running SSE 4.2 capable hardware.

It would be nice to get this technique put into std.algorithm!
May 17, 2018
On 17/05/2018 8:34 AM, Walter Bright wrote:
> On 5/16/2018 10:28 AM, Ethan wrote:
>> (Related: one feature I'd really really really love for linkers to implement is the ability to mark up certain functions to only ever be linked at a certain byte boundary. And that's purely because Jaguar branch prediction often made my profiling tests non-deterministic between compiles. A NOP is a legit optimisation on those processors.)
> 
> Linkers already do that. Alignment is specified on all symbols emitted by the compiler, and the linker uses that info.

Would allowing the align attribute on functions make sense here for Ethan?
May 16, 2018
On 5/16/2018 1:11 PM, Andrei Alexandrescu wrote:
> If you could share some details on why you think UTF8 is badly designed and how you believe it could be/have been better, I'd be in your debt!

Me too. I think UTF-8 is brilliant (and I suffered for years under the lash of other multibyte encodings prior to UTF-8). Shift-JIS: shudder!

Perhaps you're referring to the redundancy in UTF-8 - though illegal encodings are made possible by such redundancy.
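
For a concrete example of that redundancy: 0xC0 0xAF is an overlong encoding of '/' and must be rejected. A small sketch using Phobos' std.utf.validate, which throws UTFException on invalid input:

    import std.utf : validate, UTFException;
    import std.exception : assertThrown;

    void main()
    {
        validate("valid UTF-8 text");   // passes

        // Overlong encoding of '/': it would decode to the same code point
        // if allowed, so a validator has to reject the byte sequence itself.
        immutable ubyte[] overlong = [0xC0, 0xAF];
        assertThrown!UTFException(validate(cast(string) overlong));
    }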
May 16, 2018
On Wednesday, May 16, 2018 13:42:11 Walter Bright via Digitalmars-d wrote:
> On 5/16/2018 1:11 PM, Andrei Alexandrescu wrote:
> > If you could share some details on why you think UTF8 is badly designed and how you believe it could be/have been better, I'd be in your debt!
>
> Me too. I think UTF-8 is brilliant (and I suffered for years under the lash of other multibyte encodings prior to UTF-8). Shift-JIS: shudder!
>
> Perhaps you're referring to the redundancy in UTF-8 - though illegal encodings are made possible by such redundancy.

I'm inclined to think that the redundancy is a serious flaw. I'd argue that if it were truly well-designed, there would be exactly one way to represent every character - including clear up to grapheme clusters where multiple code points are involved (i.e. there would be no normalization issues in valid Unicode, because there would be only one valid normalization). But there may be some technical issues that I'm not aware of that would make that problematic. Either way, the issues that I have with UTF-8 are issues that UTF-16 and UTF-32 have as well, since they're really issues relating to code points.
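
To make the normalization point concrete: "é" can be spelled as one code point (U+00E9) or as 'e' plus a combining acute accent (U+0301), and the two spellings compare unequal until normalized. A sketch using std.uni, whose normalize defaults to NFC:

    import std.uni : normalize;

    void main()
    {
        string precomposed = "\u00E9";   // é as a single code point
        string combining   = "e\u0301";  // 'e' followed by a combining acute accent

        assert(precomposed != combining);                       // different code units
        assert(normalize(precomposed) == normalize(combining)); // equal after NFC
    }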

Overall, I think that UTF-8 is by far the best encoding that we have, and I don't think that we're going to get anything better, but I'm also definitely inclined to think that it's still flawed - just far less flawed than the alternatives.

And in general, I have to wonder if there would be a way to make Unicode less complicated if we could do it from scratch without worrying about any kind of compatibility, since what we have is complicated enough that most programmers don't come close to understanding it, and it's just way too hard to get right. But I suspect that if efficiency matters, there's enough inherent complexity that we'd just be screwed on that front even if we could do a better job than was done with Unicode as we know it.

- Jonathan M Davis

May 17, 2018
On Wednesday, 16 May 2018 at 20:11:35 UTC, Andrei Alexandrescu wrote:
> On 5/16/18 1:18 PM, Joakim wrote:
>> On Wednesday, 16 May 2018 at 16:48:28 UTC, Dmitry Olshansky wrote:
>>> On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
>>>> On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
>>>>> https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
>>>>>
>>>>
>>>> Sigh, this reminds me of the old quote about people spending a bunch of time making more efficient what shouldn't be done at all.
>>>
>>> Validating UTF-8 is super common, most text protocols and files these days would use it, others would have an option to do so.
>>>
>>> I’d like our validateUtf to be fast, since right now we do validation every time we decode a string. And THAT is slow. Trying to not validate on decode means most things should be validated on input...
>> 
>> I think you know what I'm referring to, which is that UTF-8 is a badly designed format, not that input validation shouldn't be done.
>
> I find this an interesting minority opinion, at least from the perspective of the circles I frequent, where UTF8 is unanimously heralded as a great design. Only a couple of weeks ago I saw Dylan Beattie give a very entertaining talk on exactly this topic: https://dotnext-piter.ru/en/2018/spb/talks/2rioyakmuakcak0euk0ww8/

Thanks for the link, skipped to the part about text encodings, should be fun to read the rest later.

> If you could share some details on why you think UTF8 is badly designed and how you believe it could be/have been better, I'd be in your debt!

Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought they'd have been better off going with a header-based format that kept most languages in a single-byte scheme, as most of them already were, with the obvious exception of the Asian CJK languages. That way, you optimize for the common string, ie one that contains a single language or at least no CJK, rather than pessimizing every non-ASCII language by doubling its character width, as UTF-8 does. This UTF-8 issue is one of the first topics I raised in this forum, but as you noted at the time nobody agreed, and I don't want to dredge that all up again.

I have been researching this a bit since then, and the stated goals for UTF-8 at inception were that it _could not overlap with ASCII anywhere for other languages_, to avoid issues with legacy software wrongly processing other languages as ASCII, and to allow seeking from an arbitrary location within a byte stream:

https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
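
The seeking goal is what the lead/continuation byte split buys: continuation bytes always match 10xxxxxx, so from any offset you can resynchronise to the start of a code point. A small sketch:

    bool isContinuationByte(ubyte b) { return (b & 0xC0) == 0x80; }

    // Back up from an arbitrary offset to the first byte of the code point it falls in.
    size_t syncToCodePointStart(const(ubyte)[] data, size_t pos)
    {
        while (pos > 0 && isContinuationByte(data[pos]))
            --pos;
        return pos;
    }

    unittest
    {
        auto s = cast(immutable(ubyte)[]) "héllo";   // 'é' occupies bytes 1 and 2
        assert(syncToCodePointStart(s, 2) == 1);
    }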

I have no dispute with these priorities at the time, as they were optimizing for the institutional and tech realities of 1992 as Dylan also notes, and UTF-8 is actually a nice hack given those constraints. What I question is whether those priorities are at all relevant today, when billions of smartphone users are regularly not using ASCII, and these tech companies are the largest private organizations on the planet, ie they have the resources to design a new transfer format. I see basically no relevance for the streaming requirement today, as I noted in this forum years ago, but I can see why it might have been considered important in the early '90s, before packet-based networking protocols had won.

I think a header-based scheme would be _much_ better today, and the reason I know Dmitry knows that is that I've discussed it privately with him over email: I plan to prototype a format like that in D. Even if UTF-8 is already fairly widespread, something like that could be useful as a better intermediate format for string processing, and maybe someday it could replace UTF-8 too.
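
A purely hypothetical illustration of what such a header-based layout could look like, just to make the idea concrete (the names and layout are placeholders, not the actual prototype):

    // Hypothetical only: a small header picks a single-byte language table
    // (or flags the payload as needing a wider encoding, e.g. for CJK), and
    // single-language text then stores one byte per character.
    struct HeaderString
    {
        ushort table;                // which single-byte table the payload uses
        immutable(ubyte)[] payload;  // one byte per character when a table applies
    }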