May 17, 2018
And at the risk of getting this topic back on track:

On Wednesday, 16 May 2018 at 20:34:26 UTC, Walter Bright wrote:
> Linkers already do that. Alignment is specified on all symbols emitted by the compiler, and the linker uses that info.

Mea culpa. Upon further thinking, two things strike me:

1) As suggested, there's no way to instruct the front-end to align functions to byte boundaries outside of the "optimise for speed" command-line flags.

2) I would have heavily relied on incremental linking to iterate on these tests when trying to work out how the processor behaved. I expect MSVC's incremental linker would turn out to be just rubbish enough to not care about how those flags originally behaved.

On Wednesday, 16 May 2018 at 20:36:10 UTC, Walter Bright wrote:
> It would be nice to get this technique put into std.algorithm!

The code I wrote originally was C++ code with intrinsics. But I can certainly look at adapting it to DMD/LDC. The DMD frontend providing natural mappings for Intel's published intrinsics would be massively beneficial here.
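
Something like the following is the kind of mapping I mean. To be clear, __m128 and _mm_mul_ps here are hypothetical D declarations mirroring Intel's names on top of the existing core.simd types, not anything that ships today:

import core.simd;

alias __m128 = float4;                  // core.simd's 128-bit float vector

// Hypothetical wrapper with Intel's name and shape.
__m128 _mm_mul_ps(__m128 a, __m128 b)
{
    return a * b;                       // element-wise multiply
}

void main()
{
    __m128 x = [2.0f, 2.0f, 2.0f, 2.0f];
    __m128 y = [3.0f, 3.0f, 3.0f, 3.0f];
    auto z = _mm_mul_ps(x, y);          // every lane is 6.0f
    assert(z.array[0] == 6.0f);
}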
May 17, 2018
On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via Digitalmars-d wrote: [...]
> - the auto-synchronization and the statelessness are big deals.

Yes.  Imagine if we standardized on a header-based string encoding, and we wanted to implement a substring function over a string that contains multiple segments of different languages. Instead of a cheap slicing over the string, you'd need to scan the string or otherwise keep track of which segment the start/end of the substring lies in, allocate memory to insert headers so that the segments are properly interpreted, etc. It would be an implementation nightmare, and an unavoidable performance hit (you'd have to copy data every time you take a substring), and the @nogc guys would be up in arms.
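
For contrast, this is roughly all a substring costs today with a flat encoding; a D slice is just a (pointer, length) pair into the original bytes (illustration only, the names are mine):

@nogc nothrow pure
string takeSub(string s, size_t lo, size_t hi)
{
    return s[lo .. hi];    // O(1): just a pointer and a length, nothing copied
}

void main()
{
    string s = "résumé, CV, 履歴書";    // one flat UTF-8 buffer
    auto sub = takeSub(s, 0, 8);        // "résumé" is 8 bytes of UTF-8
    assert(sub.ptr is s.ptr);           // the substring shares s's memory
}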

And that's assuming we have a sane header-based encoding for strings that contain segments in multiple languages in the first place. Linguistic analysis articles, for example, would easily contain many such segments within a paragraph, or perhaps in the same sentence. How would a header-based encoding work for such documents?  Nevermind the recent trend of liberally sprinkling emojis all over regular text. If every emoticon embedded in a string requires splitting the string into 3 segments complete with their own headers, I dare not imagine what the code that manipulates such strings would look like.


T

-- 
Famous last words: I *think* this will work...
May 18, 2018
On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
> On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via Digitalmars-d wrote: [...]
>> [...]
>
> Yes.  Imagine if we standardized on a header-based string encoding, and we wanted to implement a substring function over a string that contains multiple segments of different languages. Instead of a cheap slicing over the string, you'd need to scan the string or otherwise keep track of which segment the start/end of the substring lies in, allocate memory to insert headers so that the segments are properly interpreted, etc. It would be an implementation nightmare, and an unavoidable performance hit (you'd have to copy data every time you take a substring), and the @nogc guys would be up in arms.
>
> [...]
That's essentially what RTF with code pages was. I'm happy that we got rid of it and that it was replaced by XML; even if Microsoft's document XML is a bloated, ridiculous mess, it's still an order of magnitude less problematic than RTF (at the text encoding level, I mean).
May 18, 2018
On Thursday, 17 May 2018 at 23:11:22 UTC, Ethan wrote:
> On Thursday, 17 May 2018 at 17:26:04 UTC, Dmitry Olshansky wrote:
>> TCP being reliable just plain doesn’t cut it. Corruption of a single bit is very real.
>
> Quoting to highlight and agree.
>
> TCP is reliable because it resends dropped packets and delivers them in order.
>
> I don't write TCP packets to my long-term storage medium.
>
> UTF as a transportation protocol? Unicode is *far* more useful than just for sending across a network.

The point wasn't that TCP is handling all the errors, it was a throwaway example of one other layer of the system, the network transport layer, that actually has a checksum that will detect a single bitflip, which UTF-8 will not usually detect. I mentioned that the filesystem and several other layers have their own such error detection, yet you guys crazily latch on to the TCP example alone.
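
To make the bitflip point concrete, here is a rough sketch of my own (std.utf.validate standing in for whatever checking a program does): a flip in a continuation byte can yield another valid sequence and go unnoticed, while a flip in a lead byte happens to get caught.

import std.exception : assertNotThrown, assertThrown;
import std.utf : UTFException, validate;

void main()
{
    // "é" is 0xC3 0xA9 in UTF-8. Flip the low bit of the continuation byte
    // (0xA9 -> 0xA8) and you get "è": still valid UTF-8, so the corruption
    // sails through undetected.
    ubyte[] silent = [0xC3, 0xA8];
    assertNotThrown!UTFException(validate(cast(string) silent));

    // Flip a bit in the lead byte instead (0xC3 -> 0x43, i.e. 'C'), leaving a
    // stray continuation byte behind; that kind of flip validation does catch.
    ubyte[] caught = [0x43, 0xA9];
    assertThrown!UTFException(validate(cast(string) caught));
}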

On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
> On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via Digitalmars-d wrote: [...]
>> - the auto-synchronization and the statelessness are big deals.
>
> Yes.  Imagine if we standardized on a header-based string encoding, and we wanted to implement a substring function over a string that contains multiple segments of different languages. Instead of a cheap slicing over the string, you'd need to scan the string or otherwise keep track of which segment the start/end of the substring lies in, allocate memory to insert headers so that the segments are properly interpreted, etc. It would be an implementation nightmare, and an unavoidable performance hit (you'd have to copy data every time you take a substring), and the @nogc guys would be up in arms.

As we discussed when I first raised this header scheme years ago, you're right that slicing could be more expensive, depending on whether you choose to allocate a new header for the substring or not. The question is whether the optimizations such a header enables, by telling you where all the language substrings are in a multi-language string, make up for having to expensively scan the entire UTF-8 string to extract that or other data. I think it's fairly obvious the header's design tradeoff would beat out UTF-8 for all but a few degenerate cases, but maybe you don't see it.
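
To give a flavour of the kind of optimization I mean, here is a purely hypothetical sketch (none of these names or layouts come from an actual spec): if each segment records its start and a fixed character width, a question like "how many characters?" becomes arithmetic over the header, whereas UTF-8 has to walk every byte of the payload.

struct Segment { size_t start; ubyte charWidth; /* language/encoding id, ... */ }

struct HeaderString
{
    Segment[] segments;             // sorted by start offset
    immutable(ubyte)[] payload;

    // Character count is arithmetic over the header; no scan of the payload.
    size_t charCount()
    {
        size_t total;
        foreach (i, seg; segments)
        {
            immutable end = i + 1 < segments.length ? segments[i + 1].start
                                                    : payload.length;
            total += (end - seg.start) / seg.charWidth;
        }
        return total;
    }
}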

> And that's assuming we have a sane header-based encoding for strings that contain segments in multiple languages in the first place. Linguistic analysis articles, for example, would easily contain many such segments within a paragraph, or perhaps in the same sentence. How would a header-based encoding work for such documents?

It would bloat the header to some extent, but less so than a UTF-8 string. You may want to use special header encodings for such edge cases too, if you want to maintain the same large performance lead over UTF-8 that you'd have for the common case.

> Nevermind the recent trend of
> liberally sprinkling emojis all over regular text. If every emoticon embedded in a string requires splitting the string into 3 segments complete with their own headers, I dare not imagine what the code that manipulates such strings would look like.

Personally, I don't consider emojis worth implementing :) they shouldn't be part of Unicode. But since they are, I'm fairly certain header-based text messages with emojis would be significantly smaller than they are in UTF-8/16.

I was surprised to see that adding an emoji to a text message I sent last year cut my message character quota in half.  I googled why this was and it turns out that when you add an emoji, the text messaging client actually changes your message encoding from UTF-8 to UTF-16! I don't know if this is a limitation of the default Android messaging client, my telco carrier, or SMS, but I strongly suspect this is widespread.

Anyway, I can see the arguments about UTF-8 this time around are as bad as the first time I raised it five years back, so I'll leave this thread here.
May 18, 2018
On Friday, 18 May 2018 at 08:44:41 UTC, Joakim wrote:
>
> I was surprised to see that adding an emoji to a text message I sent last year cut my message character quota in half.  I googled why this was and it turns out that when you add an emoji, the text messaging client actually changes your message encoding from UTF-8 to UTF-16! I don't know if this is a limitation of the default Android messaging client, my telco carrier, or SMS, but I strongly suspect this is widespread.
>

Welcome to my world (and probably the world of most Europeans), where I haven't typed ć, č, ž and other non-ASCII letters since the early 2000s, even though SMS is mostly flat rate these days and people chat via WhatsApp anyway.
May 18, 2018
On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
> On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via Digitalmars-d wrote: [...]
>> - the auto-synchronization and the statelessness are big deals.
>
> Yes.  Imagine if we standardized on a header-based string encoding, and we wanted to implement a substring function over a string that contains multiple segments of different languages. Instead of a cheap slicing over the string, you'd need to scan the string or otherwise keep track of which segment the start/end of the substring lies in, allocate memory to insert headers so that the segments are properly interpreted, etc. It would be an implementation nightmare, and an unavoidable performance hit (you'd have to copy data every time you take a substring), and the @nogc guys would be up in arms.

You'd have three data structures: Strand, Rope, and Slice.

A Strand is a series of bytes with an encoding. A Rope is a series of Strands. A Slice is a pair of location references within a Rope. You probably want a special data structure to name a location within a Rope: Strand offset, then byte offset. Total of five words instead of two to pass a Slice, but zero dynamic allocations.
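
A minimal D sketch of those types (the Encoding enum and the field names are my own assumptions):

enum Encoding : ushort { ascii, latin1, utf8 /* ... */ }

struct Strand
{
    Encoding encoding;
    immutable(ubyte)[] bytes;       // a series of bytes, one encoding
}

struct Rope
{
    Strand[] strands;               // a series of Strands
}

struct RopeLocation
{
    size_t strand;                  // Strand offset within the Rope
    size_t offset;                  // byte offset within that Strand
}

struct Slice
{
    Rope* rope;                     // five words in total, no dynamic allocation
    RopeLocation begin;
    RopeLocation end;
}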

This would be a problem for data locality. However, rope-style data structures are handy for some types of string manipulation.

As an alternative, you might have a separate document specifying what encodings apply to what byte ranges. Slices would then be three words long (pointer to the string struct, start offset, end offset). Iterating would cost O(log(S) + M), where S is the number of encoded segments and M is the number of bytes in the slice.
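
A sketch of that variant, with the binary search that gives the O(log(S)) term (again, every name here is an assumption):

struct EncodedRange { size_t start; ushort encoding; }

struct Document
{
    EncodedRange[] ranges;          // sorted by start
    immutable(ubyte)[] bytes;

    // Which encoding applies at byte offset pos? Binary search: O(log(S)).
    ushort encodingAt(size_t pos)
    {
        assert(ranges.length && ranges[0].start == 0);
        size_t lo = 0, hi = ranges.length;
        while (hi - lo > 1)
        {
            immutable mid = (lo + hi) / 2;
            if (ranges[mid].start <= pos)
                lo = mid;
            else
                hi = mid;
        }
        return ranges[lo].encoding;
    }
}

struct DocSlice { Document* doc; size_t start, end; }   // the three-word slice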

Anyway, you either get a more complex data structure, or you have terrible time complexity, but you don't have both.

> And that's assuming we have a sane header-based encoding for strings that contain segments in multiple languages in the first place. Linguistic analysis articles, for example, would easily contain many such segments within a paragraph, or perhaps in the same sentence. How would a header-based encoding work for such documents?  Nevermind the recent trend of liberally sprinkling emojis all over regular text. If every emoticon embedded in a string requires splitting the string into 3 segments complete with their own headers, I dare not imagine what the code that manipulates such strings would look like.

"Header" implies that all encoding data appears at the start of the document, or in a separate metadata segment. (Call it a start index and two bytes to specify the encoding; reserve the first few bits of the encoding to specify the width.) It also brings to mind HTTP, and reminds me that most documents are either mostly ASCII or a heavy mix of ASCII and something else (HTML and XML being the forerunners).

If the encoding succeeded at making most scripts single-byte, then, testing with https://ar.wikipedia.org/wiki/Main_Page, you might get within 15% of UTF-8's efficiency. And then a simple sentence like "Ĉu ĝi ŝajnas ankaŭ esti ŝafo?" is 2.64 times as long in this encoding as UTF-8, since it has ten encoded segments, each with overhead. (Assuming the header supports strings up to 2^32 bytes long.)

If it didn't succeed at making Latin and Arabic single-byte scripts (and Latin contains over 800 characters in Unicode, while Arabic has over three hundred), it would be worse than UTF-16.
May 19, 2018
On Wednesday, 16 May 2018 at 14:48:54 UTC, Ethan Watson wrote:
> And even better - LDC doesn't support core.simd and has its own intrinsics that don't match the SSE/AVX intrinsics API published by Intel.

To provide some context here: LDC only supports the types from core.simd, but not the __simd "assembler macro" that DMD uses to more or less directly emit the corresponding x86 opcodes.
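
Roughly, the DMD-style version of a simple packed add looks like this (written from memory of the core.simd docs, so treat the details as approximate):

import core.simd;

// DMD-only: emit ADDPS directly through the __simd pseudo-function.
float4 addps(float4 a, float4 b)
{
    return cast(float4) __simd(XMM.ADDPS, cast(void16) a, cast(void16) b);
}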

LDC does support most of the GCC-style SIMD builtins for the respective target (x86, ARM, …), but there are two problems with this:

 1) As Ethan pointed out, the GCC API does not match Intel's intrinsics; for example, it is `__builtin_ia32_vfnmsubpd256_mask3` instead of `_mm256_mask_fnmsub_pd`, and the argument orders differ as well.

 2) The functions that LDC exposes as intrinsics are those that are intrinsics on the LLVM IR level. However, some operations can be directly represented in normal, instruction-set-independent LLVM IR – no explicit intrinsics are provided for these.
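
To illustrate the second point, here is the same packed add written portably; it needs no intrinsic at all:

import core.simd;

// No intrinsic involved: LDC lowers this to a plain `fadd <4 x float>` in the
// IR, and the backend picks addps/vaddps (or whatever fits) on its own.
// Compile with `ldc2 -O -output-ll` to inspect the generated IR.
float4 addps(float4 a, float4 b)
{
    return a + b;
}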

Unfortunately, LLVM doesn't seem to provide any particularly helpful tools for implementing Intel's intrinsics API. x86intrin.h is manually implemented for Clang as a collection of various macros and functions.

It would be seriously cool if someone could write a small tool to parse those headers, (semi-)automatically convert them to D, and generate tests for comparing the emitted IR against Clang. I'm happy to help with the LDC side of things.

 — David