May 17, 2018
On Thursday, 17 May 2018 at 05:01:54 UTC, Joakim wrote:
> On Wednesday, 16 May 2018 at 20:11:35 UTC, Andrei Alexandrescu wrote:
>> On 5/16/18 1:18 PM, Joakim wrote:
>>> On Wednesday, 16 May 2018 at 16:48:28 UTC, Dmitry Olshansky wrote:
>>>> On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
>>>>> On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
>>>>>> https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
>>>>>>
>>>>>
>>>>> Sigh, this reminds me of the old quote about people spending a bunch of time making more efficient what shouldn't be done at all.
>>>>
>>>> Validating UTF-8 is super common; most text protocols and files these days use it, and the others have an option to do so.
>>>>
>>>> I’d like our validateUtf to be fast, since right now we do validation every time we decode a string. And THAT is slow. Trying to not validate on decode means most things should be validated on input...
>>> 
>>> I think you know what I'm referring to, which is that UTF-8 is a badly designed format, not that input validation shouldn't be done.
>>
>> I find this an interesting minority opinion, at least from the perspective of the circles I frequent, where UTF8 is unanimously heralded as a great design. Only a couple of weeks ago I saw Dylan Beattie give a very entertaining talk on exactly this topic: https://dotnext-piter.ru/en/2018/spb/talks/2rioyakmuakcak0euk0ww8/
>
> Thanks for the link, skipped to the part about text encodings, should be fun to read the rest later.
>
>> If you could share some details on why you think UTF8 is badly designed and how you believe it could be/have been better, I'd be in your debt!
>
> Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme,

This is not practical, sorry. What happens when your message loses the header? Exactly: the rest of the message is garbled. That's exactly what happened with code-page-based texts when you didn't know which code page they were encoded in. It has the additional inconvenience that mixing languages becomes impossible, or at least very cumbersome.
UTF-8 has several properties that are difficult to get with other schemes.
- It is stateless: any byte in a stream always means the same thing, and its meaning does not depend on external state or on a previous byte (a short D sketch after this list illustrates the point).
- It can mix any language in the same stream without acrobatics, and anyone who thinks that mixing languages doesn't happen often should get his head extracted from his rear, because it is very common (check Wikipedia's front page, for example).
- The multi-byte nature of other alphabets is not as bad as people think, because texts in a computer do not live on their own: they are generally embedded inside file formats, which more often than not are extremely bloated (XML, HTML, XLIFF, Akoma Ntoso, RTF etc.). The few extra bytes in the text do not weigh that much.
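For instance, here is a quick D sketch of that first property (illustrative only): every byte classifies itself by its top bits, so a decoder can resynchronise from any offset without a header or any prior state, and several scripts can sit side by side in one byte stream.

import std.stdio;

// Classify a single UTF-8 byte using only its own top bits.
string kind(ubyte b)
{
    if (b < 0x80)           return "ASCII";
    if ((b & 0xC0) == 0x80) return "continuation byte";
    if ((b & 0xE0) == 0xC0) return "lead byte of a 2-byte sequence";
    if ((b & 0xF0) == 0xE0) return "lead byte of a 3-byte sequence";
    if ((b & 0xF8) == 0xF0) return "lead byte of a 4-byte sequence";
    return "invalid in UTF-8";
}

void main()
{
    // English, Greek and Chinese mixed in one string, no header or escape needed.
    auto mixed = "EFTA / ΕΖΕΣ / 亿";
    foreach (ubyte b; cast(const(ubyte)[]) mixed)
        writefln("0x%02X  %s", b, kind(b));
}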

I'm in charge at the European Commission of the biggest translation memory in the world. It currently handles 30 languages, and without UTF-8 and UTF-16 it would be unmanageable. I still remember when I started there in 2002: we handled only 11 languages, of which only one used another alphabet (Greek). Everything was based on RTF with code pages and it was a braindead mess. My first job in 2003 was to extend the system to handle the 8 newcomer languages, and with ASCII-based encodings it was completely unmanageable, because every document processed mixes languages and alphabets freely (addresses and names are often written in their original form, for instance).
Two years ago we also implemented support for Chinese. The nice thing was that we didn't have to change much to do that, thanks to Unicode. The second surprise was the file sizes: Chinese documents were generally smaller than their European counterparts. Yes, CJK requires 3 bytes for each ideogram, but generally one ideogram replaces many letters. The ideogram 亿 replaces "one hundred million", for example; which of them takes more bytes? So if CJK indeed requires more bytes to encode, it is first of all because it NEEDS many more bits in the first place (there are around 30000 CJK codepoints in the BMP alone; add to them the 60000 that are in the SIP and we already need 17 bits just to encode them).
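A quick sanity check of that bit count, as a standalone D snippet (the constants are the rough figures above, not exact Unicode statistics):

enum cjkBMP = 30_000;   // approx. CJK codepoints in the Basic Multilingual Plane
enum cjkSIP = 60_000;   // approx. CJK codepoints in the Supplementary Ideographic Plane

static assert(cjkBMP + cjkSIP > 1 << 16);   // 90_000 does not fit in 16 bits (65_536)
static assert(cjkBMP + cjkSIP <= 1 << 17);  // but fits comfortably in 17 bits (131_072)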


> as they mostly were except for obviously the Asian CJK languages. That way, you optimize for the common string, ie one that contains a single language or at least no CJK, rather than pessimizing every non-ASCII language by doubling its character width, as UTF-8 does. This UTF-8 issue is one of the first topics I raised in this forum, but as you noted at the time nobody agreed and I don't want to dredge that all up again.
>
> I have been researching this a bit since then, and the stated goals for UTF-8 at inception were that it _could not overlap with ASCII anywhere for other languages_, to avoid issues with legacy software wrongly processing other languages as ASCII, and to allow seeking from an arbitrary location within a byte stream:
>
> https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
>
> I have no dispute with these priorities at the time, as they were optimizing for the institutional and tech realities of 1992 as Dylan also notes, and UTF-8 is actually a nice hack given those constraints. What I question is that those priorities are at all relevant today, when billions of smartphone users are regularly not using ASCII, and these tech companies are the largest private organizations on the planet, ie they have the resources to design a new transfer format. I see basically no relevance for the streaming requirement today, as I noted in this forum years ago, but I can see why it might have been considered important in the early '90s, before packet-based networking protocols had won.
>
> I think a header-based scheme would be _much_ better today and the reason I know Dmitry knows that is that I have discussed privately with him over email that I plan to prototype a format like that in D. Even if UTF-8 is already fairly widespread, something like that could be useful as a better intermediate format for string processing, and maybe someday could replace UTF-8 too.


May 17, 2018
On Thursday, 17 May 2018 at 13:14:46 UTC, Patrick Schluter wrote:
> This is not practical, sorry. What happens when your message loses the header? Exactly, the rest of the message is garbled.

Why would it lose the header? TCP guarantees delivery and checksums the data, that's effective enough at the transport layer.

I agree that UTF-8 is a more redundant format, as others have mentioned earlier, and is thus more robust to certain types of data loss than a header-based scheme. However, I don't consider that the job of the text format; it's better done by other layers, like transport protocols or filesystems, which guard against such losses much more reliably and efficiently.

For example, a random bitflip somewhere in the middle of a UTF-8 string will not be detectable most of the time. However, more robust error-correcting schemes at other layers of the system will easily catch that.
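Here is a small D sketch of what I mean (the string and the flipped bit are just an arbitrary example): flip one low bit inside a continuation byte and the result is still perfectly valid UTF-8, it merely decodes to a different character, while a checksum at another layer notices immediately.

import std.stdio, std.utf, std.digest.crc;

void main()
{
    auto bytes = cast(ubyte[]) "café".dup;   // 'é' is 0xC3 0xA9 in UTF-8
    auto crcBefore = crc32Of(bytes);

    bytes[$ - 1] ^= 0x01;                    // flip the lowest bit: 0xA9 -> 0xA8
    auto corrupted = cast(const(char)[]) bytes;

    validate(corrupted);                     // no exception: the result is still valid UTF-8
    writeln("still valid UTF-8 after the flip: ", corrupted);   // prints "cafè"
    writeln("CRC32 at another layer notices:   ", crc32Of(bytes) != crcBefore);
}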

> That's exactly what happened with code page based texts when you don't know in which code page it is encoded. It has the supplemental inconvenience that mixing languages becomes impossible or at least very cumbersome.
> UTF-8 has several properties that are difficult to have with other schemes.
> - It is state-less, means any byte in a stream always means the same thing. Its meaning  does not depend on external or a previous byte.

I realize this was considered important at one time, but I think it has proven to be a bad design decision, for HTTP too. There are some advantages when building rudimentary systems with crude hardware and lots of noise, as was the case back then, but that's not the tech world we live in today. That's why almost every HTTP request today is part of a stateful session that explicitly keeps track of the connection, whether through cookies, https encryption, or HTTP/2.

> - It can mix any language in the same stream without acrobatics and if one thinks that mixing languages doesn't happen often should get his head extracted from his rear, because it is very common (check wikipedia's front page for example).

I question that almost anybody needs to mix "streams." As for messages or files, headers handle multiple language mixing easily, as noted in that earlier thread.

> - The multi byte nature of other alphabets is not as bad as people think because texts in computer do not live on their own, meaning that they are generally embedded inside file formats, which more often than not are extremely bloated (xml, html, xliff, akoma ntoso, rtf etc.). The few bytes more in the text do not weigh that much.

Heh, the other parts of the tech stack are much more bloated, so this bloat is okay? A unique argument, but I'd argue that's why those bloated formats you mention are largely dying off too.

> I'm in charge at the European Commission of the biggest translation memory in the world. It handles currently 30 languages and without UTF-8 and UTF-16 it would be unmanageable. I still remember when I started there in 2002 when we handled only 11 languages of which only 1 was of another alphabet (Greek). Everything was based on RTF with codepages and it was a braindead mess. My first job in 2003 was to extend the system to handle the 8 newcomer languages and with ASCII based encodings it was completely unmanageable because every document processed mixes languages and alphabets freely (addresses and names are often written in their original form for instance).

I have no idea what a "translation memory" is. I don't doubt that dealing with non-standard codepages or layouts was difficult, and that a standard like Unicode made your life easier. But the question isn't whether standards would clean things up, of course they would, the question is whether a hypothetical header-based standard would be better than the current continuation byte standard, UTF-8. I think your life would've been even easier with the former, though depending on your usage, maybe the main gain for you would be just from standardization.

> 2 years ago we implemented also support for Chinese. The nice thing was that we didn't have to change much to do that thanks to Unicode. The second surprise was with the file sizes, Chinese documents were generally smaller than their European counterparts. Yes CJK requires 3 bytes for each ideogram, but generally 1 ideogram replaces many letters. The ideogram 亿 replaces "One hundred million" for example, which of them take more bytes? So if CJK indeed requires more bytes to encode, it is firstly because they NEED many more bits in the first place (there are around 30000 CJK codepoints in the BMP alone, add to it the 60000 that are in the SIP and we have a need of 17 bits only to encode them.

That's not the relevant criterion: nobody cares whether the CJK documents were smaller than their European counterparts. What they care about is that, given a different transfer format, the CJK document could have been significantly smaller still. Almost nobody cares about which translation version is smaller; they care that the text they sent in Chinese or Korean is as small as it can be.

Anyway, I didn't mean to restart this debate, so I'll leave it here.
May 17, 2018
On 05/17/2018 09:14 AM, Patrick Schluter wrote:
> I'm in charge at the European Commission of the biggest translation memory in the world.

Impressive! Is that the Europarl?
May 17, 2018
On 5/16/2018 10:01 PM, Joakim wrote:
> Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme, as they mostly were except for obviously the Asian CJK languages. That way, you optimize for the common string, ie one that contains a single language or at least no CJK, rather than pessimizing every non-ASCII language by doubling its character width, as UTF-8 does. This UTF-8 issue is one of the first topics I raised in this forum, but as you noted at the time nobody agreed and I don't want to dredge that all up again.

It sounds like the main issue is that a header based encoding would take less size?

If that's correct, then I hypothesize that adding an LZW compression layer would achieve the same or better result.
May 17, 2018
On Thursday, 17 May 2018 at 17:16:03 UTC, Walter Bright wrote:
> On 5/16/2018 10:01 PM, Joakim wrote:
>> Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme, as they mostly were except for obviously the Asian CJK languages. That way, you optimize for the common string, ie one that contains a single language or at least no CJK, rather than pessimizing every non-ASCII language by doubling its character width, as UTF-8 does. This UTF-8 issue is one of the first topics I raised in this forum, but as you noted at the time nobody agreed and I don't want to dredge that all up again.
>
> It sounds like the main issue is that a header based encoding would take less size?
>
> If that's correct, then I hypothesize that adding an LZW compression layer would achieve the same or better result.

Indeed, and there are other compression/deduplication options that would allow limited random access / slicing (by decoding a single “block” to access an element, for instance).

Anything that depends on external information and is not self-synchronizing is awful for interchange. Internally the application can do some smarts, though, but even then things like interning (partial interning) might be a more valuable approach. TCP being reliable just plain doesn’t cut it. Corruption of a single bit is very real.
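Very roughly, something along these lines (block size and container layout are made up for the sketch; std.zlib's deflate stands in for whatever codec one would actually pick):

import std.algorithm : min;
import std.stdio, std.utf, std.zlib;

struct BlockedText
{
    enum blockChars = 64;    // codepoints per block; an arbitrary choice
    ubyte[][] blocks;        // each block compressed on its own

    this(string s)
    {
        dstring d = s.toUTF32;
        for (size_t i = 0; i < d.length; i += blockChars)
            blocks ~= cast(ubyte[]) compress(d[i .. min(i + blockChars, d.length)]);
    }

    dchar opIndex(size_t i)
    {
        // random access decodes a single block, not the whole text
        auto chunk = cast(const(dchar)[]) uncompress(blocks[i / blockChars]);
        return chunk[i % blockChars];
    }
}

void main()
{
    auto t = BlockedText("Décision de l'Autorité de surveillance AELE");
    writeln(t[10]);   // decompresses only the block containing index 10
}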

May 17, 2018
On Thursday, 17 May 2018 at 15:37:01 UTC, Andrei Alexandrescu wrote:
> On 05/17/2018 09:14 AM, Patrick Schluter wrote:
>> I'm in charge at the European Commission of the biggest translation memory in the world.
>
> Impressive! Is that the Europarl?

No, Euramis, the central translation memory developed by the Commission and used also by the other institutions. The database contains more than a billion segments from parallel texts and is afaik the biggest of its kind. One of the big strengths of the Euramis TM is its multi-target-language store: it allows fuzzy searches in all combinations, including indirect translations (i.e. if a document written in English was translated into Romanian and into Maltese, it is then possible to search for alignments between ro and mt). It's not the only system to do that, but at that volume it is quite unique.
Every year we also publish an extract of it covering the published legislation [1] from the Official Journal, so that it can be used by the research community. All the machine translation engines use it. It is one of the most accessed data collections on the European Open Data Portal [2].
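Conceptually it is something like this (a D sketch with invented field names; the real schema is of course far richer): because all target languages of a segment are stored together, any pair of them can be aligned directly, even if both were translated from the same source.

import std.stdio;

struct Segment
{
    string source;            // e.g. the English original
    string[string] targets;   // language code -> translated segment
}

// An "indirect" alignment between two target languages of the same segment.
string[] alignPair(Segment seg, string a, string b)
{
    return [seg.targets.get(a, ""), seg.targets.get(b, "")];
}

void main()
{
    Segment seg;
    seg.source  = "EFTA Surveillance Authority Decision";
    seg.targets = [
        "ro": "(the Romanian target segment)",
        "mt": "Deċiżjoni tal-Awtorità tas-Sorveljanza tal-EFTA",
    ];
    writeln(alignPair(seg, "ro", "mt"));   // ro <-> mt without an English pivot
}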

The very uncommon thing about the backend software of EURAMIS is that it is written in C. Pure unadulterated C. I'm trying to introduce D, but with the strange (to say it politely) configurations our servers have, it is quite challenging.

[1]: https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory
[2]: http://data.europa.eu/euodp/fr/data
May 17, 2018
On Thursday, 17 May 2018 at 17:16:03 UTC, Walter Bright wrote:
> On 5/16/2018 10:01 PM, Joakim wrote:
>> Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme, as they mostly were except for obviously the Asian CJK languages. That way, you optimize for the common string, ie one that contains a single language or at least no CJK, rather than pessimizing every non-ASCII language by doubling its character width, as UTF-8 does. This UTF-8 issue is one of the first topics I raised in this forum, but as you noted at the time nobody agreed and I don't want to dredge that all up again.
>
> It sounds like the main issue is that a header based encoding would take less size?

Yes, and be easier to process.

> If that's correct, then I hypothesize that adding an LZW compression layer would achieve the same or better result.

In general, you would be wrong: a carefully designed binary format will usually beat the pants off general-purpose compression:

https://www.w3.org/TR/2009/WD-exi-evaluation-20090407/#compactness-results

Of course, that's because you can tailor your binary format for specific types of data, text in this case, and take advantage of patterns in that subset, such as specialized image compression formats do. In this case though, I haven't compared this scheme to general compression of UTF-8 strings, so I don't know which would compress better.

However, that would mostly matter for network transmission; another big gain of a header-based scheme that doesn't use compression is much faster string processing in memory. Yes, the average end user doesn't care about this, but giant consumers of text data, like search engines, would benefit greatly.
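To make that concrete, here is a deliberately crude D sketch of the kind of header-based scheme I mean (a toy for size comparison only, not the format I plan to prototype): two header bytes pick a 128-codepoint page, ASCII passes through, and everything else takes a single byte. It obviously cannot represent arbitrarily mixed scripts, which is exactly the limitation being debated here.

import std.stdio, std.utf;

ubyte[] encodePaged(dstring s, dchar base)
{
    ubyte[] buf = [cast(ubyte)(base >> 8), cast(ubyte)(base & 0xFF)];  // toy header
    foreach (dchar c; s)
    {
        if (c < 0x80)
            buf ~= cast(ubyte) c;                       // ASCII passes through
        else
        {
            assert(c >= base && c < base + 0x80, "character outside the declared page");
            buf ~= cast(ubyte)(0x80 + (c - base));      // one byte per non-ASCII character
        }
    }
    return buf;
}

void main()
{
    dstring ru = "Пример текста на русском языке"d;     // Cyrillic sample text
    writefln("UTF-8: %s bytes, toy paged scheme: %s bytes",
             ru.toUTF8.length, encodePaged(ru, '\u0400').length);
}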

On Thursday, 17 May 2018 at 17:26:04 UTC, Dmitry Olshansky wrote:
> Indeed, and some other compression/deduplication options that would allow limited random access / slicing (by decoding a single “block” to access an element for instance).

Possibly competitive on compression for transmission over the network, but unlikely to help for processing, as noted for Walter's idea.

> Anything that depends on external information and is not self-sync is awful for interchange.

You are describing the vast majority of all formats and protocols; amazing how we got by with them all this time.

> Internally the application can do some smarts though, but even then things like interning (partial interning) might be more valuable approach. TCP being reliable just plain doesn’t cut it. Corruption of single bit is very real.

You seem to have missed my point entirely: UTF-8 will not catch most bit flips either. It only catches a flip that happens to corrupt certain key bits in a certain way, a minority of the possibilities. Nobody is arguing that data corruption doesn't happen or that error correction shouldn't be done somewhere.

The question is whether the extremely limited robustness that UTF-8 buys with its significant redundancy is a good tradeoff. I think it's obvious that it isn't, and I posit that anybody who knows anything about error-correcting codes would agree with that assessment. You would be much better off with a more compact header-based transfer format, layering on whatever error correction you need elsewhere, which as I noted is already done at the link and transport layers and in various other parts of the system.

If you need more error-correction than that, do it right, not in a broken way as UTF-8 does. Honestly, error detection/correction is the most laughably broken part of UTF-8, it is amazing that people even bring that up as a benefit.
May 17, 2018
On Thu, May 17, 2018 at 10:16:03AM -0700, Walter Bright via Digitalmars-d wrote:
> On 5/16/2018 10:01 PM, Joakim wrote:
> > Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme, as they mostly were except for obviously the Asian CJK languages. That way, you optimize for the common string, ie one that contains a single language or at least no CJK, rather than pessimizing every non-ASCII language by doubling its character width, as UTF-8 does. This UTF-8 issue is one of the first topics I raised in this forum, but as you noted at the time nobody agreed and I don't want to dredge that all up again.
> 
> It sounds like the main issue is that a header based encoding would take less size?
> 
> If that's correct, then I hypothesize that adding an LZW compression layer would achieve the same or better result.

My bet is on the LZW being *far* better than a header-based encoding. Natural language, which a large part of textual data consists of, tends to have a lot of built-in redundancy, and therefore is highly compressible.  A proper compression algorithm will beat any header-based size reduction scheme, while still maintaining the context-free nature of UTF-8.
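A quick illustration in D (std.zlib is deflate rather than LZW, and repeating one segment makes this an artificially easy case, but it shows how much redundancy a general-purpose compressor already removes from UTF-8 text):

import std.stdio, std.zlib;

void main()
{
    // Greek text is two bytes per letter in UTF-8.
    string seg = "Απόφαση της Εποπτεύουσας Αρχής της ΕΖΕΣ. ";
    string doc;
    foreach (i; 0 .. 200) doc ~= seg;   // a document full of near-identical segments

    auto packed = compress(doc);
    writefln("UTF-8: %s bytes, deflate-compressed: %s bytes", doc.length, packed.length);
}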


T

-- 
In a world without fences, who needs Windows and Gates? -- Christian Surchi
May 17, 2018
On Thursday, 17 May 2018 at 15:16:19 UTC, Joakim wrote:
> On Thursday, 17 May 2018 at 13:14:46 UTC, Patrick Schluter wrote:
>> This is not practical, sorry. What happens when your message loses the header? Exactly, the rest of the message is garbled.
>
> Why would it lose the header? TCP guarantees delivery and checksums the data, that's effective enough at the transport layer.

What does TCP/IP have to do with anything in this discussion? UTF-8 (or UTF-16 or UTF-32) has nothing to do with network protocols; that's completely unrelated. A file encoded on a disk may never leave the machine it was written on and may never see a wire in its lifetime, and its encoding is still of vital importance. That's why a header-based encoding is too restrictive.

>
> I agree that UTF-8 is a more redundant format, as others have mentioned earlier, and is thus more robust to certain types of data loss than a header-based scheme. However, I don't consider that the job of the text format, it's better done by other layers, like transport protocols or filesystems, which will guard against such losses much more reliably and efficiently.

No. A text format cannot depend on a network protocol. It would be as if you could only listen to music or watch a video by streaming, and never save it to an offline file, because nothing in the file itself would say what that blob of bytes represents. It doesn't make any sense.

>
> For example, a random bitflip somewhere in the middle of a UTF-8 string will not be detectable most of the time. However, more robust error-correcting schemes at other layers of the system will easily catch that.

That's the job of the other layers. Any other file would have the same problem. At least with UTF-8 there will only ever be at most one codepoint lost or changed; any other encoding would fare worse. That said, if a checksum for your document is important, you can add it externally anyway.


>
>> That's exactly what happened with code page based texts when you don't know in which code page it is encoded. It has the supplemental inconvenience that mixing languages becomes impossible or at least very cumbersome.
>> UTF-8 has several properties that are difficult to have with other schemes.
>> - It is state-less, means any byte in a stream always means the same thing. Its meaning  does not depend on external or a previous byte.
>
> I realize this was considered important at one time, but I think it has proven to be a bad design decision, for HTTP too. There are some advantages when building rudimentary systems with crude hardware and lots of noise, as was the case back then, but that's not the tech world we live in today. That's why almost every HTTP request today is part of a stateful session that explicitly keeps track of the connection, whether through cookies, https encryption, or HTTP/2.

Again, orthogonal to UTF-8. When I speak above of streams, that is not limited to sockets; files are also read as streams. So stop equating UTF-8 with the Internet; these are two different domains. The Internet and its protocols were defined and invented long before Unicode, and Unicode is also very useful offline.

>> - It can mix any language in the same stream without acrobatics and if one thinks that mixing languages doesn't happen often should get his head extracted from his rear, because it is very common (check wikipedia's front page for example).
>
> I question that almost anybody needs to mix "streams." As for messages or files, headers handle multiple language mixing easily, as noted in that earlier thread.

Ok, show me how you transmit that, I'm curious:

<prop type="Txt::Doc. No.">E2010C0002</prop>
<tuv lang="EN-GB">
<seg>EFTA Surveillance Authority Decision</seg>
</tuv>
<tuv lang="DE-DE">
<seg>Beschluss der EFTA-Überwachungsbehörde</seg>
</tuv>
<tuv lang="DA-01">
<seg>EFTA-Tilsynsmyndighedens beslutning</seg>
</tuv>
<tuv lang="EL-01">
<seg>Απόφαση της Εποπτεύουσας Αρχής της ΕΖΕΣ</seg>
</tuv>
<tuv lang="ES-ES">
<seg>Decisión del Órgano de Vigilancia de la AELC</seg>
</tuv>
<tuv lang="FI-01">
<seg>EFTAn valvontaviranomaisen päätös</seg>
</tuv>
<tuv lang="FR-FR">
<seg>Décision de l'Autorité de surveillance AELE</seg>
</tuv>
<tuv lang="IT-IT">
<seg>Decisione dell’Autorità di vigilanza EFTA</seg>
</tuv>
<tuv lang="NL-NL">
<seg>Besluit van de Toezichthoudende Autoriteit van de EVA</seg>
</tuv>
<tuv lang="PT-PT">
<seg>Decisão do Órgão de Fiscalização da EFTA</seg>
</tuv>
<tuv lang="SV-SE">
<seg>Beslut av Eftas övervakningsmyndighet</seg>
</tuv>
<tuv lang="LV-01">
<seg>EBTA Uzraudzības iestādes Lēmums</seg>
</tuv>
<tuv lang="CS-01">
<seg>Rozhodnutí Kontrolního úřadu ESVO</seg>
</tuv>
<tuv lang="ET-01">
<seg>EFTA järelevalveameti otsus</seg>
</tuv>
<tuv lang="PL-01">
<seg>Decyzja Urzędu Nadzoru EFTA</seg>
</tuv>
<tuv lang="SL-01">
<seg>Odločba Nadzornega organa EFTE</seg>
</tuv>
<tuv lang="LT-01">
<seg>ELPA priežiūros institucijos sprendimas</seg>
</tuv>
<tuv lang="MT-01">
<seg>Deċiżjoni tal-Awtorità tas-Sorveljanza tal-EFTA</seg>
</tuv>
<tuv lang="SK-01">
<seg>Rozhodnutie Dozorného orgánu EZVO</seg>
</tuv>
<tuv lang="BG-01">
<seg>Решение на Надзорния орган на ЕАСТ</seg>
</tuv>
</tu>
<tu>


>
>> - The multi byte nature of other alphabets is not as bad as people think because texts in computer do not live on their own, meaning that they are generally embedded inside file formats, which more often than not are extremely bloated (xml, html, xliff, akoma ntoso, rtf etc.). The few bytes more in the text do not weigh that much.
>
> Heh, the other parts of the tech stack are much more bloated, so this bloat is okay? A unique argument, but I'd argue that's why those bloated formats you mention are largely dying off too.

They aren't dying off; it's getting worse by the day. That's why I mentioned Akoma Ntoso and XLIFF: they will be used more and more. The world is not limited to webshit (see n-gate.com for the reference).

>
>> I'm in charge at the European Commission of the biggest translation memory in the world. It handles currently 30 languages and without UTF-8 and UTF-16 it would be unmanageable. I still remember when I started there in 2002 when we handled only 11 languages of which only 1 was of another alphabet (Greek). Everything was based on RTF with codepages and it was a braindead mess. My first job in 2003 was to extend the system to handle the 8 newcomer languages and with ASCII based encodings it was completely unmanageable because every document processed mixes languages and alphabets freely (addresses and names are often written in their original form for instance).
>
> I have no idea what a "translation memory" is. I don't doubt that dealing with non-standard codepages or layouts was difficult, and that a standard like Unicode made your life easier. But the question isn't whether standards would clean things up, of course they would, the question is whether a hypothetical header-based standard would be better than the current continuation byte standard, UTF-8. I think your life would've been even easier with the former, though depending on your usage, maybe the main gain for you would be just from standardization.

I doubt it, because the issue has nothing to do with network protocols, as you seem to imply. It is about the data format, i.e. content that may be shuffled over a network but can also stay on a disk, be printed on paper (gasp, such old tech) or be used interactively in a GUI.


>
>> 2 years ago we implemented also support for Chinese. The nice thing was that we didn't have to change much to do that thanks to Unicode. The second surprise was with the file sizes, Chinese documents were generally smaller than their European counterparts. Yes CJK requires 3 bytes for each ideogram, but generally 1 ideogram replaces many letters. The ideogram 亿 replaces "One hundred million" for example, which of them take more bytes? So if CJK indeed requires more bytes to encode, it is firstly because they NEED many more bits in the first place (there are around 30000 CJK codepoints in the BMP alone, add to it the 60000 that are in the SIP and we have a need of 17 bits only to encode them.
>
> That's not the relevant criteria: nobody cares if the CJK documents were smaller than their European counterparts. What they care about is that, given a different transfer format, the CJK document could have been significantly smaller still. Because almost nobody cares about which translation version is smaller, they care that the text they sent in Chinese or Korean is as small as it can be.

At most 50% more, but if the size is really that important one can use UTF-16, which is the same size as Big5 or Shift-JIS, or, as Walter suggested, simply compress the file in that case.

>
> Anyway, I didn't mean to restart this debate, so I'll leave it here.

- the auto-synchronization and the statelessness are big deals.

May 17, 2018
On Thursday, 17 May 2018 at 17:26:04 UTC, Dmitry Olshansky wrote:
> TCP being  reliable just plain doesn’t cut it. Corruption of
> single bit is very real.

Quoting to highlight and agree.

TCP is reliable because it resends dropped packets and delivers them in order.

I don't write TCP packets to my long-term storage medium.

UTF-8 is more than a transport protocol; Unicode is *far* more useful than just sending text across a network.