May 25, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Joakim

On Saturday, 25 May 2013 at 19:51:43 UTC, Joakim wrote:
> On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:
>> You can map a codepage to a subset of UCS :)
>> That's what they do internally anyway.
>> If I take you right you propose to define string as a header that denotes a set of windows in code space? I still fail to see how that would scale see below.
> Something like that. For a multi-language string encoding, the header would contain a single byte for every language used in the string, along with multiple index bytes to signify the start and finish of every run of single-language characters in the string. So, a list of languages and a list of pure single-language substrings. This is just off the top of my head, I'm not suggesting it is definitive.
>
You obviously are not thinking it through. Such an encoding would have O(n^2) complexity for appending a character/symbol in a different language to the string, since you would have to update the header at the beginning of the string and move the contents forward to make room. Not to mention that it wouldn't be backwards compatible with ASCII routines, and the complexity of such a header would have to be carried all the way to the font rendering routines in the OS.
Mixing multiple languages/symbols in one string is a blessing of modern humane computing. It is the norm more than the exception in most of the world.
--jm
May 25, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Joakim

On Saturday, 25 May 2013 at 14:58:02 UTC, Joakim wrote:
> On Saturday, 25 May 2013 at 14:16:21 UTC, Peter Alexander wrote:
>> I suggest you read up on UTF-8. You really don't understand it. There is no need to decode, you just treat the UTF-8 string as if it is an ASCII string.
> Not being aware of this shortcut doesn't mean not understanding UTF-8.
It's not just a shortcut, it is absolutely fundamental to the design of UTF-8. It's like saying you understand Lisp without being aware that everything is a list.
Also, you keep stating disadvantages of UTF-8 that are completely false, like "slicing does require decoding". Again, this completely misses the point of UTF-8. I cannot conceive how you can claim to understand how UTF-8 works yet repeatedly demonstrate that you do not.
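To make that point concrete, here is a minimal C sketch (the strings are made up, and it assumes the source file and its literals are UTF-8) that slices at an ASCII delimiter without ever decoding; no byte of a multi-byte UTF-8 sequence is below 0x80, so the byte scan can never split a character:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* UTF-8 text: 'é' is the two-byte sequence 0xC3 0xA9. */
    const char *s = "café,crème";
    const char *comma = strchr(s, ',');   /* plain byte scan for the ASCII comma */
    if (comma) {
        /* The slice boundary cannot fall inside a character, because every
           byte of a multi-byte sequence has its high bit set. */
        printf("first field is %d bytes; rest: \"%s\"\n", (int)(comma - s), comma + 1);
    }
    return 0;
}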
You are either ignorant or a successful troll. In either case, I'm done here.
May 25, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Joakim

On Sat, May 25, 2013 at 09:51:42PM +0200, Joakim wrote:
> On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:
>> If I take you right you propose to define string as a header that denotes a set of windows in code space? I still fail to see how that would scale see below.
>
> Something like that. For a multi-language string encoding, the header would contain a single byte for every language used in the string, along with multiple index bytes to signify the start and finish of every run of single-language characters in the string. So, a list of languages and a list of pure single-language substrings. This is just off the top of my head, I'm not suggesting it is definitive.
[...]

And just how exactly does that help with slicing? If anything, it makes slicing way hairier and error-prone than UTF-8. In fact, this one point alone already defeated any performance gains you may have had with a single-byte encoding. Now you can't do *any* slicing at all without convoluted algorithms to determine what encoding is where at the endpoints of your slice, and the resulting slice must have new headers to indicate the start/end of every different-language substring. By the time you're done with all that, you're going way slower than processing UTF-8.

Again I say, I'm not 100% sold on UTF-8, but what you're proposing here is far worse.

T

--
The best compiler is between your ears. -- Michael Abrash
May 25, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Joakim

On 5/25/2013 1:03 PM, Joakim wrote:
> On Saturday, 25 May 2013 at 19:30:25 UTC, Walter Bright wrote:
>> On the other hand, Joakim even admits his single byte encoding is variable length, as otherwise he simply dismisses the rarely used (!) Chinese, Japanese, and Korean languages, as well as any text that contains words from more than one language.
> I have noted from the beginning that these large alphabets have to be encoded to two bytes, so it is not a true constant-width encoding if you are mixing one of those languages into a single-byte encoded string. But this "variable length" encoding is so much simpler than UTF-8, there's no comparison.

If it's one byte sometimes, or two bytes sometimes, it's variable length. You overlook that I've had to deal with this. It isn't "simpler", there's actually more work to write code that adapts to one or two byte encodings.

>> I suspect he's trolling us, and quite successfully.
> Ha, I wondered who would pull out this insult, quite surprised to see it's Walter. It seems to be the trend on the internet to accuse anybody you disagree with of trolling, I am honestly surprised to see Walter stoop so low. Considering I'm the only one making any cogent arguments here, perhaps I should wonder if you're all trolling me. ;)
>
> On Saturday, 25 May 2013 at 19:35:42 UTC, Walter Bright wrote:
>> I suspect the Chinese, Koreans, and Japanese would take exception to being called irrelevant.
> Irrelevant only because they are a small subset of the UCS. I have noted that they would also be handled by a two-byte encoding.
>
>> Good luck with your scheme that can't handle languages written by billions of people!
> So let's see: first you say that my scheme has to be variable length because I am using two bytes to handle these languages,

Well, it *is* variable length or you have to disregard Chinese. You cannot have it both ways. Code to deal with two bytes is significantly different than code to deal with one. That means you've got a conditional in your generic code - that isn't going to be faster than the conditional for UTF-8.

> then you claim I don't handle these languages. This kind of blatant contradiction within two posts can only be called... trolling!

You gave some vague handwaving about it, and then dismissed it as irrelevant, along with more handwaving about what to do with text that has embedded words in multiple languages.

Worse, there are going to be more than 256 of these encodings - you can't even have a byte to specify them. Remember, Unicode has approximately 256,000 characters in it. How many code pages is that?

I was being kind saying you were trolling, as otherwise I'd be saying your scheme was, to be blunt, absurd.

---------------------------------------

I'll be the first to admit that a lot of great ideas have been initially dismissed by the experts as absurd. If you really believe in this, I recommend that you write it up as a real article, taking care to fill in all the handwaving with something specific, and include some benchmarks to prove your performance claims. Post your article on reddit, stackoverflow, hackernews, etc., and look for fertile ground for it. I'm sorry you're not finding fertile ground here (so far, nobody has agreed with any of your points), and this is the wrong place for such proposals anyway, as D is simply not going to switch over to it.

Remember, extraordinary claims require extraordinary evidence, not handwaving and assumptions disguised as bold assertions.
May 25, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Joakim

On 5/25/2013 12:51 PM, Joakim wrote:
> For a multi-language string encoding, the header would contain a single byte for every language used in the string, along with multiple index bytes to signify the start and finish of every run of single-language characters in the string. So, a list of languages and a list of pure single-language substrings.

Please implement the simple C function strstr() with this simple scheme, and post it here.

http://www.digitalmars.com/rtl/string.html#strstr
May 25, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Walter Bright

On 5/25/2013 2:51 PM, Walter Bright wrote:
> On 5/25/2013 12:51 PM, Joakim wrote:
>> For a multi-language string encoding, the header would contain a single byte for every language used in the string, along with multiple index bytes to signify the start and finish of every run of single-language characters in the string. So, a list of languages and a list of pure single-language substrings.
>
> Please implement the simple C function strstr() with this simple scheme, and post it here.
>
> http://www.digitalmars.com/rtl/string.html#strstr

I'll go first. Here's a simple UTF-8 version in C. It's not the fastest way to do it, but at least it is correct:
----------------------------------
#include <string.h>   /* strlen, memcmp */

char *strstr(const char *s1, const char *s2)
{
    size_t len1 = strlen(s1);
    size_t len2 = strlen(s2);
    if (!len2)
        return (char *) s1;               /* empty needle matches at the start */
    char c2 = *s2;
    while (len2 <= len1)
    {
        if (c2 == *s1)                    /* cheap first-byte check before memcmp */
            if (memcmp(s2, s1, len2) == 0)
                return (char *) s1;
        s1++;
        len1--;
    }
    return NULL;
}
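As a quick usage sketch (the strings below are made up, and the program assumes the source file is saved as UTF-8): a needle containing multi-byte characters only matches the identical byte sequence, and its bytes can never be confused with ASCII bytes, so the byte-wise strstr above finds it correctly with no decoding.

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Both haystack and needle are UTF-8; 'ü' is the two-byte sequence 0xC3 0xBC. */
    const char *haystack = "an email to Rainer Schütze in English";
    const char *needle   = "Schütze";

    const char *hit = strstr(haystack, needle);   /* plain byte-wise search */
    if (hit)
        printf("found at byte offset %d\n", (int)(hit - haystack));
    return 0;
}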
May 25, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Joakim

On Saturday, 25 May 2013 at 20:03:59 UTC, Joakim wrote:
> I have noted from the beginning that these large alphabets have to be encoded to two bytes, so it is not a true constant-width encoding if you are mixing one of those languages into a single-byte encoded string. But this "variable length" encoding is so much simpler than UTF-8, there's no comparison.
All I can say is if you think that is simpler than UTF-8 then you have completely the wrong idea about UTF-8.
Let me explain:
1) Take the byte at a particular offset in the string
2) If it is ASCII then we're done
3) Otherwise count the number of '1's at the start of the byte - this is how many bytes make up the character (there's even an ASM instruction to do this)
4) This first byte will look like '1110xxxx' for a 3 byte character, '11110xxx' for a 4 byte character, etc.
5) All following bytes are of the form '10xxxxxx'
6) Now just concatenate all the 'x's together; the result is the code point
Note that this is CONSTANT TIME, O(1), with minimal branching, so it is well suited to pipelining (after the initial byte the other bytes can all be processed in parallel by the CPU). It uses only sequential memory access, so no cache misses, and it has zero additional memory requirements.
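For concreteness, here is a minimal C sketch of steps 1-6 (the function name decode_utf8 is just illustrative; it assumes well-formed input and does no validation of truncated or overlong sequences):

#include <stdint.h>
#include <stdio.h>

/* Decode one code point from a well-formed UTF-8 string.
   Stores the length of the byte sequence in *len. */
uint32_t decode_utf8(const unsigned char *s, int *len)
{
    if (s[0] < 0x80) {                     /* 0xxxxxxx: ASCII, we're done */
        *len = 1;
        return s[0];
    }
    int n = 0;                             /* portable stand-in for the count-leading-ones instruction */
    unsigned char b = s[0];
    while (b & 0x80) { n++; b <<= 1; }     /* n is 2, 3 or 4 for valid multi-byte input */
    uint32_t cp = s[0] & (0x7F >> n);      /* the 'x' bits of the lead byte */
    for (int i = 1; i < n; i++)
        cp = (cp << 6) | (s[i] & 0x3F);    /* append the 'x' bits of each 10xxxxxx byte */
    *len = n;
    return cp;
}

int main(void)
{
    const unsigned char euro[] = { 0xE2, 0x82, 0xAC, 0 };   /* U+20AC, EURO SIGN */
    int len;
    printf("U+%04X (%d bytes)\n", (unsigned)decode_utf8(euro, &len), len);
    return 0;
}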
Now compare your encoding:
1) Look up the offset in the header using binary search: O(log N), with lots of branching
2) Look up the code page ID in a massive array of code pages to work out how many bytes per character
3) Hope this array hasn't been paged out and is still in the cache
4) Extract that many bytes from the string and combine them into a number
5) Look up this new number in yet another large array specific to the code page
6) Hope this array hasn't been paged out and is still in the cache too
This is O(log N), with lots of branching so no pipelining (every stage depends on the result of the stage before), lots of random memory access so lots of cache misses, lots of additional memory to store all those tables, and an algorithm that isn't even any easier to understand.
Plus every other algorithm to operate on it except for decoding is insanely complicated.
May 25, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to Walter Bright

On Saturday, May 25, 2013 01:42:20 Walter Bright wrote:
> On 5/25/2013 12:33 AM, Joakim wrote:
> > At what cost? Most programmers completely punt on unicode, because they just don't want to deal with the complexity. Perhaps you can deal with it and don't mind the performance loss, but I suspect you're in the minority.
>
> I think you stand alone in your desire to return to code pages. I have years of experience with code pages and the unfixable misery they produce. This has disappeared with Unicode. I find your arguments unpersuasive when stacked against my experience. And yes, I have made a living writing high performance code that deals with characters, and you are quite off base with claims that UTF-8 has inevitable bad performance - though there is inefficient code in Phobos for it, to be sure.
>
> My grandfather wrote a book that consists of mixed German, French, and Latin words, using special characters unique to those languages. Another failing of code pages is it fails miserably at any such mixed language text. Unicode handles it with aplomb.
>
> I can't even write an email to Rainer Schütze in English under your scheme.
>
> Code pages simply are no longer practical nor acceptable for a global community. D is never going to convert to a code page system, and even if it did, there's no way D will ever convince the world to abandon Unicode, and so D would be as useless as EBCDIC.
>
> I'm afraid your quest is quixotic.
All I've got to say on this subject is "Thank you Walter Bright for building Unicode into D!"
- Jonathan M Davis
May 26, 2013 Re: Why UTF-8/16 character encodings?
On Sat, May 25, 2013 at 04:14:34PM -0700, Jonathan M Davis wrote:
> On Saturday, May 25, 2013 01:42:20 Walter Bright wrote:
> > On 5/25/2013 12:33 AM, Joakim wrote:
> > > At what cost? Most programmers completely punt on unicode, because they just don't want to deal with the complexity. Perhaps you can deal with it and don't mind the performance loss, but I suspect you're in the minority.
> >
> > I think you stand alone in your desire to return to code pages. I have years of experience with code pages and the unfixable misery they produce. This has disappeared with Unicode. I find your arguments unpersuasive when stacked against my experience. And yes, I have made a living writing high performance code that deals with characters, and you are quite off base with claims that UTF-8 has inevitable bad performance - though there is inefficient code in Phobos for it, to be sure.
> >
> > My grandfather wrote a book that consists of mixed German, French, and Latin words, using special characters unique to those languages. Another failing of code pages is it fails miserably at any such mixed language text. Unicode handles it with aplomb.
> >
> > I can't even write an email to Rainer Schütze in English under your scheme.
> >
> > Code pages simply are no longer practical nor acceptable for a global community. D is never going to convert to a code page system, and even if it did, there's no way D will ever convince the world to abandon Unicode, and so D would be as useless as EBCDIC.
> >
> > I'm afraid your quest is quixotic.
>
> All I've got to say on this subject is "Thank you Walter Bright for building Unicode into D!"
[...]

Ditto here! In fact, Unicode support in D (esp. UTF-8) was one of the major factors that convinced me to adopt D.

I had been trying to write language-agnostic programs in C/C++, and ... let's just say that it was one gigantic hairy mess, and required lots of system-dependent hacks and unfounded assumptions ("it appears to work so I think the code's correct even though according to spec it shouldn't have worked"). I18n support in libc was spotty and incomplete, with many common functions breaking in unexpected ways once you step outside ASCII, and libraries like gettext address some of the issues but not all. Getting *real* i18n support required using a full-fledged i18n library like libicu, which required using custom string types. The whole experience was so painful I've since avoided doing any i18n in C/C++ at all.

Then came along D with native Unicode support built right into the language. And not just UTF-16 shoved down your throat like Java does (or was it UTF-32?); UTF-8, UTF-16, and UTF-32 are all equally supported. You cannot imagine what a happy camper I was since then!! Yes, Phobos still has a ways to go in terms of performance w.r.t. UTF-8 strings, but what we have right now is already far, far superior to the situation in C/C++, and things can only get better.

T

--
Freedom of speech: the whole world has no right *not* to hear my spouting off!
May 26, 2013 Re: Why UTF-8/16 character encodings?
Posted in reply to H. S. Teoh

On 5/25/2013 9:48 PM, H. S. Teoh wrote:
> Then came along D with native Unicode support built right into the
> language. And not just UTF-16 shoved down your throat like Java does (or
> was it UTF-32?); UTF-8, UTF-16, and UTF-32 are all equally supported.
> You cannot imagine what a happy camper I was since then!! Yes, Phobos
> still has a ways to go in terms of performance w.r.t. UTF-8 strings, but
> what we have right now is already far, far, superior to the situation in
> C/C++, and things can only get better.
Many moons ago, when the earth was young and I had a few strands of hair left, a C++ programmer challenged me to a "bakeoff", D vs C++. I wrote the program in D (a string processing program). He said "ahaaaa!" and wrote the C++ one. They were fairly comparable.
I then suggested we do the internationalized version. I resubmitted exactly the same program. He threw in the towel.