May 25, 2013
░░░░░░░░░ⓌⓉⒻ░
╔╗░╔╗░╔╗╔════╗╔════╗░░
║║░║║░║║╚═╗╔═╝║╔═══╝░░
║║░║║░║║░░║║░░║╚═╗░░░░
║╚═╝╚═╝║╔╗║║╔╗║╔═╝╔╗░░
╚══════╝╚╝╚╝╚╝╚╝░░╚╝░░

░░░░░░░░░░░░░░░░░░░░░░░░
█░█░█░░░░░░▐░░░░░░░░░░▐░
█░█░█▐▀█▐▀█▐░█▐▀█▐▀█▐▀█░
█░█░█▐▄█▐▄█▐▄▀▐▄█▐░█▐░█░
█▄█▄█▐▄▄▐▄▄▐░█▐▄▄▐░█▐▄█░
░░░░░░░░░░░░░░░░░░░░░░░░


--jm

May 25, 2013
"limited success of UTF-8"

Becoming the de facto standard encoding EVERYWHERE except for Windows, which uses UTF-16, is hardly a failure...

I really don't understand your hatred for UTF-8 - it's simple to encode and decode, fast, and space-efficient. Fixed-width encodings are not inherently fast; the only thing they are faster at is random access to the Nth character instead of the Nth byte. In the rare cases where you need to do a lot of that kind of random access, there is always UTF-32...

Any fixed-width encoding which can encode every Unicode character must use at least 3 bytes (there are over a million code points, so 2 bytes aren't enough), and using 4 bytes is probably going to be faster because of alignment, so I don't see what the great improvement over UTF-32 is going to be.

> slicing does require decoding
Nope.

> I didn't mean that people are literally keeping code pages.  I meant that there's not much of a difference between code pages with 2 bytes per char and the language character sets in UCS.

Unicode doesn't have "language character sets". The different planes exist only for organisational purposes; they don't affect how characters are encoded.

> ?!  It's okay because you deem it "coherent in its scheme?"  I deem headers much more coherent. :)

Sure, if you change the word "coherent" to mean something completely different... Coherent means that you store related things together, i.e. everything you need to decode a character is in the same place, not spread out between part of a character and a header.

> but I suspect substring search not requiring decoding is the exception for UTF-8 algorithms, not the rule.
The only time you need to decode is when you need to do some transformation that depends on the code point, such as converting case or identifying which character class a particular character belongs to. Appending, slicing, copying, searching, replacing, etc. - basically all the most common text operations - can be done without any encoding or decoding.
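
For example, here's a rough, untested sketch in D of a byte-level substring search - because UTF-8 is self-synchronizing, a byte-for-byte match of a valid needle can only start at a character boundary, so no decoding happens at any point:

import std.stdio : writeln;

// Naive byte-level search over UTF-8 data; no attempt at efficiency.
ptrdiff_t findUtf8(string haystack, string needle)
{
    if (needle.length == 0)
        return 0;
    if (needle.length > haystack.length)
        return -1;
    foreach (i; 0 .. haystack.length - needle.length + 1)
        if (haystack[i .. i + needle.length] == needle)
            return i;                        // byte offset of the match
    return -1;
}

void main()
{
    string text = "héllo wörld";             // stored as UTF-8 bytes
    writeln(findUtf8(text, "wörld"));        // prints the byte offset 7
}
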
May 25, 2013
On Saturday, 25 May 2013 at 18:09:26 UTC, Diggory wrote:
> On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote:
>> On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
>>> I think you are a little confused about what unicode actually is... Unicode has nothing to do with code pages and nobody uses code pages any more except for compatibility with legacy applications (with good reason!).
>> Incorrect.
>>
>> "Unicode is an effort to include all characters from previous code pages into a single character enumeration that can be used with a number of encoding schemes... In practice the various Unicode character set encodings have simply been assigned their own code page numbers, and all the other code pages have been technically redefined as encodings for various subsets of Unicode."
>> http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode
>>
>
> That confirms exactly what I just said...
No, that directly _contradicts_ what you said about Unicode having "nothing to do with code pages."  All UCS did was take a bunch of existing code pages and standardize them into one massive character set.  For example, ISCII was a pre-existing single-byte encoding, and Unicode "largely preserves the ISCII layout within each block."
http://en.wikipedia.org/wiki/ISCII

All a code page is is a table of mappings; UCS is just a much larger, standardized table of such mappings.

>>> You said that phobos converts UTF-8 strings to UTF-32 before operating on them but that's not true. As it iterates over UTF-8 strings it iterates over dchars rather than chars, but that's not in any way inefficient so I don't really see the problem.
>> And what's a dchar?  Let's check:
>>
>> dchar : unsigned 32 bit UTF-32
>> http://dlang.org/type.html
>>
>> Of course that's inefficient, you are translating your whole encoding over to a 32-bit encoding every time you need to process it.  Walter as much as said so up above.
>
> Given that all the machine registers are at least 32-bits already it doesn't make the slightest difference. The only additional operations on top of ascii are when it's a multi-byte character, and even then it's some simple bit manipulation which is as fast as any variable width encoding is going to get.
I see you've abandoned without note your claim that phobos doesn't convert UTF-8 to UTF-32 internally.  Perhaps converting to UTF-32 is "as fast as any variable width encoding is going to get" but my claim is that single-byte encodings will be faster.

> The only alternatives to a variable width encoding I can see are:
> - Single code page per string
> This is completely useless because now you can't concatenate strings of different code pages.
I wouldn't be so fast to ditch this.  There is a real argument to be made that strings of different languages are sufficiently different that there should be no multi-language strings.  Is this the best route?  I'm not sure, but I certainly wouldn't dismiss it out of hand.

> - Multiple code pages per string
> This just makes everything overly complicated and is far slower to decode what the actual character is than UTF-8.
I disagree, this would still be far faster than UTF-8, particularly if you designed your header right.

> - String with escape sequences to change code page
> Can no longer access characters in the middle or end of the string, you have to parse the entire string every time which completely negates the benefit of a fixed width encoding.
I didn't think of this possibility, but you may be right that it's sub-optimal.

>>> Also your complaint that UTF-8 reserves the short characters for the english alphabet is not really relevant - the characters with longer encodings tend to be rarer (such as special symbols) or carry more information (such as chinese characters where the same sentence takes only about 1/3 the number of characters).
>> The vast majority of non-english alphabets in UCS can be encoded in a single byte.  It is your exceptions that are not relevant.
>
> Well obviously... That's like saying "if you know what the exact contents of a file are going to be anyway you can compress it to a single byte!"
>
> ie. It's possible to devise an encoding which will encode any given string to an arbitrarily small size. It's still completely useless because you'd have to know the string in advance...
No, it's not the same at all.  The contents of an arbitrary-length file cannot be compressed to a single byte; you would have collisions galore.  But since most non-English alphabets have fewer than 256 characters, they can all be uniquely encoded in a single byte per character, with the header determining which language's code page to use.  I don't understand your analogy whatsoever.
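
To make that concrete, here is a purely hypothetical sketch in D of what I mean by a header plus one byte per character - the names and layout are invented on the spot, not a definitive format:

// Hypothetical single-language string: a one-byte language tag selects
// which 256-entry character table the payload bytes index into.
struct TaggedString
{
    ubyte language;    // which language's 256-character code page to use
    ubyte[] payload;   // exactly one byte per character

    size_t length() const { return payload.length; }      // O(1), no decoding
    ubyte opIndex(size_t i) const { return payload[i]; }   // true random access
}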

> - A useful encoding has to be able to handle every unicode character
> - As I've shown the only space-efficient way to do this is using a variable length encoding like UTF-8
You haven't shown this.

> - Given the frequency distribution of unicode characters, UTF-8 does a pretty good job at encoding higher frequency characters in fewer bytes.
No, it does a very bad job of this.  Every non-ASCII character takes at least two bytes to encode, whereas my single-byte encoding scheme would encode every alphabet with less than 256 characters in a single byte.

> - Yes you COULD encode non-english alphabets in a single byte but doing so would be inefficient because it would mean the more frequently used characters take more bytes to encode.
Not sure what you mean by this.
May 25, 2013
25-May-2013 22:26, Joakim wrote:
> On Saturday, 25 May 2013 at 17:03:43 UTC, Dmitry Olshansky wrote:
>> 25-May-2013 10:44, Joakim wrote:
>>> Yes, on the encoding, if it's a variable-length encoding like UTF-8, no,
>>> on the code space.  I was originally going to title my post, "Why
>>> Unicode?" but I have no real problem with UCS, which merely standardized
>>> a bunch of pre-existing code pages.  Perhaps there are a lot of problems
>>> with UCS also, I just haven't delved into it enough to know.
>>
>> UCS is dead and gone. Next in line to "640K is enough for everyone".
> I think you are confused.  UCS refers to the Universal Character Set,
> which is the backbone of Unicode:
>
> http://en.wikipedia.org/wiki/Universal_Character_Set
>
> You might be thinking of the unpopular UCS-2 and UCS-4 encodings, which
> I have never referred to.

Yeah got confused. So sorry about that.

>
>>>> Separate code spaces were the case before Unicode (and utf-8). The
>>>> problem is not only that without header text is meaningless (no easy
>>>> slicing) but the fact that encoding of data after header strongly
>>>> depends on a variety of factors - a list of encodings, actually. Now
>>>> everybody has to keep a (code) page per language to at least know if
>>>> it's 2 bytes per char or 1 byte per char or whatever. And you still
>>>> work on a basis that there is no combining marks and regional specific
>>>> stuff :)
>>> Everybody is still keeping code pages, UTF-8 hasn't changed that.
>>
>> Legacy. Hard to switch overnight. There are graphs that indicate that
>> few years from now you might never encounter a legacy encoding
>> anymore, only UTF-8/UTF-16.
> I didn't mean that people are literally keeping code pages.  I meant
> that there's not much of a difference between code pages with 2 bytes
> per char and the language character sets in UCS.

You can map a codepage to a subset of UCS :)
That's what they do internally anyway.
If I take you right, you propose to define a string as a header that denotes a set of windows in the code space? I still fail to see how that would scale - see below.

>>> It has to do that also. Everyone keeps talking about
>>> "easy slicing" as though UTF-8 provides it, but it doesn't. Phobos
>>> turns UTF-8 into UTF-32 internally for all that ease of use, at least
>>> doubling your string size in the process.  Correct me if I'm wrong, that
>>> was what I read on the newsgroup sometime back.
>>
>> Indeed you are - searching for UTF-8 substring in UTF-8 string doesn't
>> do any decoding and it does return you a slice of a balance of original.
> Perhaps substring search doesn't strictly require decoding but you have
> changed the subject: slicing does require decoding and that's the use
> case you brought up to begin with.  I haven't looked into it, but I
> suspect substring search not requiring decoding is the exception for
> UTF-8 algorithms, not the rule.

Mm... strictly speaking (let's turn that argument backwards) - what algorithms require slicing, say [5..$], of a string without ever scanning it left to right, searching, etc.?

>> ??? Simply makes no sense. There is no intersection between some
>> legacy encodings as of now. Or do you want to add N*(N-1)
>> cross-encodings for any combination of 2? What about 3 in one string?
> I sketched two possible encodings above, none of which would require
> "cross-encodings."
>
>>>> We want monoculture! That is to understand each without all these
>>>> "par-le-vu-france?" and codepages of various complexity(insanity).
>>> I hate monoculture, but then I haven't had to decipher some screwed-up
>>> codepage in the middle of the night. ;)
>>
>> So you never had trouble of internationalization? What languages do
>> you use (read/speak/etc.)?
> This was meant as a point in your favor, conceding that I haven't had to
> code with the terrible code pages system from the past.  I can read and
> speak multiple languages, but I don't use anything other than English text.

Okay then.

>>> That said, you could standardize
>>> on UCS for your code space without using a bad encoding like UTF-8, as I
>>> said above.
>>
>> UCS is a myth as of ~5 years ago. Early adopters of Unicode fell into
>> that trap (Java, Windows NT). You shouldn't.
> UCS, the character set, as noted above.  If that's a myth, Unicode is a
> myth. :)

Yeah, that was a mishap on my part. I think I've seen your 2-byte argument way too often and it got concatenated to UCS, forming UCS-2 :)

>
>> This is it, but it's far more flexible in the sense that it allows
>> multilingual strings just fine, and lone full-width Unicode
>> codepoints as well.
> That's only because it uses a more complex header than a single byte for
> the language, which I noted could be done with my scheme, by adding a
> more complex header,

What would it look like? And how would the processing go?

> long before you mentioned this unicode compression
> scheme.

It uses inline headers, or rather tags, that hop between fixed char windows. It's not random access, nor does it claim to be.

>
>>> But I get the impression that it's only for sending over
>>> the wire, ie transmision, so all the processing issues that UTF-8
>>> introduces would still be there.
>>
>> Use mime-type etc. Standards are always a bit stringy and suboptimal,
>> their acceptance rate is one of chief advantages they have. Unicode
>> has horrifically large momentum now and not a single organization
>> aside from them tries to do this dirty work (=i18n).
> You misunderstand.  I was saying that this unicode compression scheme
> doesn't help you with string processing, it is only for transmission and
> is probably fine for that, precisely because it seems to implement some
> version of my single-byte encoding scheme!  You do raise a good point:
> the only reason why we're likely using such a bad encoding in UTF-8 is
> that nobody else wants to tackle this hairy problem.

Yup, where were you, say, almost 10 years ago? :)

>> Consider adding another encoding for "Tuva" for instance. Now you have
>> to add 2*n conversion routines to match it to other codepages/locales.
> Not sure what you're referring to here.
>
If you adopt the "map to UCS policy" then nothing.

>> Beyond that - there are many things to consider in
>> internationalization and you would have to special case them all by
>> codepage.
> Not necessarily.  But that is actually one of the advantages of
> single-byte encodings, as I have noted above.  toUpper is a NOP for a
> single-byte encoding string with an Asian script, you can't do that with
> a UTF-8 string.

But you have to check what encoding it's in, and given that not all codepages are that simple to uppercase, some generic algorithm is required.

>>> If they're screwing up something so simple,
>>> imagine how much worse everyone is screwing up something complex like
>>> UTF-8?
>>
>> UTF-8 is pretty darn simple. BTW all it does is map [0..10FFFF] to a
>> sequence of octets. It does it pretty well and compatible with ASCII,
>> even the little rant you posted acknowledged that. Now you are either
>> against Unicode as whole or what?
> The BOM link I gave notes that UTF-8 isn't always ASCII-compatible.
>
> There are two parts to Unicode.  I don't know enough about UCS, the
> character set, ;) to be for it or against it, but I acknowledge that a
> standardized character set may make sense.  I am dead set against the
> UTF-8 variable-width encoding, for all the reasons listed above.

Okay we are getting somewhere, now that I understand your position and got myself confused in the midway there.

> On Saturday, 25 May 2013 at 17:13:41 UTC, Dmitry Olshansky wrote:
>> 25-May-2013 13:05, Joakim wrote:
>>> Nobody is talking about going back to code pages.  I'm talking about
>>> going to single-byte encodings, which do not imply the problems that you
>>> had with code pages way back when.
>>
>> Problem is what you outline is isomorphic with code-pages. Hence the
>> grief of accumulated experience against them.
> They may seem superficially similar but they're not.  For example, from
> the beginning, I have suggested a more complex header that can enable
> multi-language strings, as one possible solution.  I don't think code
> pages provided that.

The problem is: how would you define an uppercase algorithm for a multilingual string with 3 distinct 256-character codespaces (windows)? I bet it won't be pretty.

>> Well if somebody get a quest to redefine UTF-8 they *might* come up
>> with something that is a bit faster to decode but shares the same
>> properties. Hardly a life saver anyway.
> Perhaps not, but I suspect programmers will flock to a constant-width
> encoding that is much simpler and more efficient than UTF-8.  Programmer
> productivity is the biggest loss from the complexity of UTF-8, as I've
> noted before.

I still don't see how your solution scales to beyond 256 different codepoints per string (= multiple pages/parts of UCS ;) ).

-- 
Dmitry Olshansky
May 25, 2013
On 5/25/2013 5:43 AM, Andrei Alexandrescu wrote:
> On 5/25/13 3:33 AM, Joakim wrote:
>> On Saturday, 25 May 2013 at 01:58:41 UTC, Walter Bright wrote:
>>> This is more a problem with the algorithms taking the easy way than a
>>> problem with UTF-8. You can do all the string algorithms, including
>>> regex, by working with the UTF-8 directly rather than converting to
>>> UTF-32. Then the algorithms work at full speed.
>> I call BS on this. There's no way working on a variable-width encoding
>> can be as "full speed" as a constant-width encoding. Perhaps you mean
>> that the slowdown is minimal, but I doubt that also.
>
> You mentioned this a couple of times, and I wonder what makes you so sure. On
> contemporary architectures small is fast and large is slow; betting on replacing
> larger data with more computation is quite often a win.

On the other hand, Joakim even admits his single byte encoding is variable length, as otherwise he simply dismisses the rarely used (!) Chinese, Japanese, and Korean languages, as well as any text that contains words from more than one language.

I suspect he's trolling us, and quite successfully.

May 25, 2013
On 5/25/2013 1:07 AM, Joakim wrote:
> The vast majority of non-english alphabets in UCS can be encoded in a single
> byte.  It is your exceptions that are not relevant.

I suspect the Chinese, Koreans, and Japanese would take exception to being called irrelevant.

Good luck with your scheme that can't handle languages written by billions of people!
May 25, 2013
On Saturday, 25 May 2013 at 19:02:43 UTC, Joakim wrote:
> On Saturday, 25 May 2013 at 18:09:26 UTC, Diggory wrote:
>> On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote:
>>> On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
>>>> I think you are a little confused about what unicode actually is... Unicode has nothing to do with code pages and nobody uses code pages any more except for compatibility with legacy applications (with good reason!).
>>> Incorrect.
>>>
>>> "Unicode is an effort to include all characters from previous code pages into a single character enumeration that can be used with a number of encoding schemes... In practice the various Unicode character set encodings have simply been assigned their own code page numbers, and all the other code pages have been technically redefined as encodings for various subsets of Unicode."
>>> http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode
>>>
>>
>> That confirms exactly what I just said...
> No, that directly _contradicts_ what you said about Unicode having "nothing to do with code pages."  All UCS did is take a bunch of existing code pages and standardize them into one massive character set.  For example, ISCII was a pre-existing single-byte encoding and Unicode "largely preserves the ISCII layout within each block."
> http://en.wikipedia.org/wiki/ISCII
>
> All a code page is is a table of mappings, UCS is just a much larger, standardized table of such mappings.

UCS has nothing to do with code pages; it was designed as a replacement for them. A code page is a strict subset of the possible characters; UCS is the entire set of possible characters.
>
>>>> You said that phobos converts UTF-8 strings to UTF-32 before operating on them but that's not true. As it iterates over UTF-8 strings it iterates over dchars rather than chars, but that's not in any way inefficient so I don't really see the problem.
>>> And what's a dchar?  Let's check:
>>>
>>> dchar : unsigned 32 bit UTF-32
>>> http://dlang.org/type.html
>>>
>>> Of course that's inefficient, you are translating your whole encoding over to a 32-bit encoding every time you need to process it.  Walter as much as said so up above.
>>
>> Given that all the machine registers are at least 32-bits already it doesn't make the slightest difference. The only additional operations on top of ascii are when it's a multi-byte character, and even then it's some simple bit manipulation which is as fast as any variable width encoding is going to get.
> I see you've abandoned without note your claim that phobos doesn't convert UTF-8 to UTF-32 internally.  Perhaps converting to UTF-32 is "as fast as any variable width encoding is going to get" but my claim is that single-byte encodings will be faster.

I haven't "abandoned my claim". It's a simple fact that phobos does not convert UTF-8 string to UTF-32 strings before it uses them.

i.e. the difference between this:
// to!dstring (std.conv) allocates and converts the whole string to UTF-32 up front
string mystr = ...;
dstring temp = mystr.to!dstring;
for (size_t i = 0; i < temp.length; ++i)
    process(temp[i]);

and this:
// decode (std.utf) reads one code point at a time in place, advancing i
string mystr = ...;
size_t i = 0;
while (i < mystr.length) {
    dchar current = decode(mystr, i);
    process(current);
}

And if you can't see why the latter example is far more efficient I give up...

>
>> The only alternatives to a variable width encoding I can see are:
>> - Single code page per string
>> This is completely useless because now you can't concatenate strings of different code pages.
> I wouldn't be so fast to ditch this.  There is a real argument to be made that strings of different languages are sufficiently different that there should be no multi-language strings.  Is this the best route?  I'm not sure, but I certainly wouldn't dismiss it out of hand.
>
>> - Multiple code pages per string
>> This just makes everything overly complicated and is far slower to decode what the actual character is than UTF-8.
> I disagree, this would still be far faster than UTF-8, particularly if you designed your header right.

The cache misses alone caused by simply accessing the separate headers would be a larger overhead than decoding UTF-8, which takes a few assembly instructions, has perfect locality, and can be efficiently pipelined by the CPU.
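
For reference, the "few assembly instructions" I'm talking about amount to roughly this (a simplified sketch that assumes valid input and skips all validation - not Phobos's actual std.utf.decode):

// Decode one code point starting at byte i of a valid UTF-8 string and
// advance i past it.
dchar decodeOne(string s, ref size_t i)
{
    uint b = cast(ubyte) s[i++];
    if (b < 0x80)                        // 1 byte:  0xxxxxxx
        return cast(dchar) b;
    int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;
    uint c = b & (0x3F >> extra);        // payload bits of the lead byte
    foreach (_; 0 .. extra)
        c = (c << 6) | (s[i++] & 0x3F);  // 10xxxxxx continuation bytes
    return cast(dchar) c;
}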

Then there's all the extra processing involved in combining the headers when you concatenate strings. Plus you lose the one benefit a fixed-width encoding has, because random access is no longer possible without first finding out which header governs the location you want to access.

>
>> - String with escape sequences to change code page
>> Can no longer access characters in the middle or end of the string, you have to parse the entire string every time which completely negates the benefit of a fixed width encoding.
> I didn't think of this possibility, but you may be right that it's sub-optimal.
>
>>>> Also your complaint that UTF-8 reserves the short characters for the english alphabet is not really relevant - the characters with longer encodings tend to be rarer (such as special symbols) or carry more information (such as chinese characters where the same sentence takes only about 1/3 the number of characters).
>>> The vast majority of non-english alphabets in UCS can be encoded in a single byte.  It is your exceptions that are not relevant.
>>
>> Well obviously... That's like saying "if you know what the exact contents of a file are going to be anyway you can compress it to a single byte!"
>>
>> ie. It's possible to devise an encoding which will encode any given string to an arbitrarily small size. It's still completely useless because you'd have to know the string in advance...
> No, it's not the same at all.  The contents of an arbitrary-length file cannot be compressed to a single byte, you would have collisions galore.  But since most non-english alphabets are less than 256 characters, they can all be uniquely encoded in a single byte per character, with the header determining what language's code page to use.  I don't understand your analogy whatsoever.

It's very simple - the more information you have about the type of data you are compressing at the time of writing the algorithm, the better the compression ratio you can get, to the point that if you know exactly what the file is going to contain you can compress it to nothing. This is why there are specialised compression algorithms for images, video, audio, etc.

It doesn't matter how few characters non-English alphabets have - unless you know WHICH alphabet it is beforehand, you can't store it in a single byte. Since any given character could be from any alphabet, the best you can do is look at the probabilities of different characters appearing and use shorter representations for the more common ones. (This is the basis for all lossless compression.) The English alphabet plus 0-9 and basic punctuation are by far the most common characters used on computers, so it makes sense to use one byte for those and multiple bytes for rarer characters.
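
Concretely, the byte counts UTF-8 assigns by code point range look like this (boundaries taken straight from the UTF-8 definition):

// How many bytes UTF-8 spends on a code point: the lower ranges, which in
// aggregate are the most common in computer text, get the shortest forms.
size_t utf8Length(dchar c)
{
    if (c < 0x80)     return 1;   // ASCII letters, digits, punctuation
    if (c < 0x800)    return 2;   // most other alphabetic scripts
    if (c < 0x10000)  return 3;   // CJK, most symbols, rest of the BMP
    return 4;                     // supplementary planes
}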

>
>> - A useful encoding has to be able to handle every unicode character
>> - As I've shown the only space-efficient way to do this is using a variable length encoding like UTF-8
> You haven't shown this.
If you had thought through your suggestion of multiple code pages per string, you would see that I had.

>
>> - Given the frequency distribution of unicode characters, UTF-8 does a pretty good job at encoding higher frequency characters in fewer bytes.
> No, it does a very bad job of this.  Every non-ASCII character takes at least two bytes to encode, whereas my single-byte encoding scheme would encode every alphabet with less than 256 characters in a single byte.

And strings with mixed characters would use lots of memory and be extremely slow - a common situation when using proper names, quotes, inline translations, graphical characters, etc. Not to mention the added complexity of actually implementing the algorithms.

May 25, 2013
On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:
> You can map a codepage to a subset of UCS :)
> That's what they do internally anyway.
> If I take you right you propose to define string as a header that denotes a set of windows in code space? I still fail to see how that would scale see below.
Something like that.  For a multi-language string encoding, the header would contain a single byte for every language used in the string, along with multiple index bytes to signify the start and finish of every run of single-language characters in the string.  So, a list of languages and a list of pure single-language substrings.  This is just off the top of my head, I'm not suggesting it is definitive.
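
Spelled out as code - again purely hypothetical, just the idea above written down, not a real or proposed format - it might look something like:

struct Run
{
    ubyte language;  // index into the string's language list below
    uint  start;     // first byte of this run within the payload
    uint  length;    // number of bytes (= characters) in this run
}

struct MultiLangString
{
    ubyte[] languages;  // one byte per language used in the string
    Run[]   runs;       // ordered runs of single-language characters
    ubyte[] payload;    // one byte per character
}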

> Mm... strictly speaking (let's turn that argument backwards) - what are algorithms that require slicing say [5..$] of string without ever looking at it left to right, searching etc.?
Don't know, I was just pointing out that all the claims of easy slicing with UTF-8 are wrong.  But a single-byte encoding would also be scanned much faster, as I've noted above: no decoding necessary, and single bytes will always be faster than multiple bytes, even without decoding.

> How would it look like? Or how the processing will go?
Detailed a bit above.  As I mentioned earlier in this thread, functions like toUpper would execute much faster because you wouldn't have to scan substrings containing languages that don't have uppercase, which you have to scan in UTF-8.

>> long before you mentioned this unicode compression
>> scheme.
>
> It uses inline headers, or rather tags, that hop between fixed char windows. It's not random access, nor does it claim to be.
I wasn't criticizing it, just saying that it seems to be superficially similar to my scheme. :)

>> version of my single-byte encoding scheme!  You do raise a good point:
>> the only reason why we're likely using such a bad encoding in UTF-8 is
>> that nobody else wants to tackle this hairy problem.
>
> Yup, where have you been say almost 10 years ago? :)
I was in grad school, avoiding writing my thesis. :) I'd never have thought I'd be discussing Unicode today, didn't even know what it was back then.

>> Not necessarily.  But that is actually one of the advantages of
>> single-byte encodings, as I have noted above.  toUpper is a NOP for a
>> single-byte encoding string with an Asian script, you can't do that with
>> a UTF-8 string.
>
> But you have to check what encoding it's in and given that not all codepages are that simple to upper case some generic algorithm is required.
You have to check the language, but my point is that you can look at the header and know that toUpper has to do nothing for a single-byte-encoded string of an Asian script which doesn't have uppercase characters.  With UTF-8, you have to decode the entire string to find that out.


>> They may seem superficially similar but they're not.  For example, from
>> the beginning, I have suggested a more complex header that can enable
>> multi-language strings, as one possible solution.  I don't think code
>> pages provided that.
>
> The problem is: how would you define an uppercase algorithm for a multilingual string with 3 distinct 256-character codespaces (windows)? I bet it won't be pretty.
How is it done now?  It isn't pretty with UTF-8 now either, as some languages have uppercase characters and others don't.  The version of toUpper for my encoding will be similar, but it will do less work, because it doesn't have to be invoked for every character in the string.

> I still don't see how your solution scales to beyond 256 different codepoints per string (= multiple pages/parts of UCS ;) ).
I assume you're talking about Chinese, Korean, etc. alphabets?  I mentioned those to Walter earlier, they would have a two-byte encoding.  No way around that, but they would still be easier to deal with than UTF-8, because of the header.
May 25, 2013
25-May-2013 23:51, Joakim wrote:
> On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:
>> You can map a codepage to a subset of UCS :)
>> That's what they do internally anyway.
>> If I take you right you propose to define string as a header that
>> denotes a set of windows in code space? I still fail to see how that
>> would scale see below.
> Something like that.  For a multi-language string encoding, the header
> would contain a single byte for every language used in the string, along
> with multiple index bytes to signify the start and finish of every run
> of single-language characters in the string. So, a list of languages and
> a list of pure single-language substrings.  This is just off the top of
> my head, I'm not suggesting it is definitive.
>

Runs away in horror :) It's a mess even before you've got to the details.

Another point about sometimes using a 2-byte encoding - welcome to the nice world of BigEndian/LittleEndian, i.e. the very trap UTF-16 stepped into.
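
E.g. the same 16-bit code unit serializes to two different byte orders, so you need a BOM or some out-of-band convention just to read it back (quick sketch):

import std.stdio : writefln;

void main()
{
    ushort unit = 0x0410;               // say, Cyrillic 'А' as one 16-bit code unit
    ubyte lo = unit & 0xFF;             // 0x10
    ubyte hi = unit >> 8;               // 0x04
    writefln("LE: %02X %02X", lo, hi);  // LE: 10 04
    writefln("BE: %02X %02X", hi, lo);  // BE: 04 10
}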

-- 
Dmitry Olshansky
May 25, 2013
On Saturday, 25 May 2013 at 19:30:25 UTC, Walter Bright wrote:
> On the other hand, Joakim even admits his single byte encoding is variable length, as otherwise he simply dismisses the rarely used (!) Chinese, Japanese, and Korean languages, as well as any text that contains words from more than one language.
I have noted from the beginning that these large alphabets have to be encoded in two bytes, so it is not a true constant-width encoding if you are mixing one of those languages into a single-byte-encoded string.  But this "variable length" encoding is so much simpler than UTF-8, there's no comparison.

> I suspect he's trolling us, and quite successfully.
Ha, I wondered who would pull out this insult; I'm quite surprised to see it's Walter.  It seems to be the trend on the internet to accuse anybody you disagree with of trolling, and I am honestly surprised to see Walter stoop so low.  Considering I'm the only one making any cogent arguments here, perhaps I should wonder if you're all trolling me. ;)

On Saturday, 25 May 2013 at 19:35:42 UTC, Walter Bright wrote:
> I suspect the Chinese, Koreans, and Japanese would take exception to being called irrelevant.
Irrelevant only because they are a small subset of the UCS.  I have noted that they would also be handled by a two-byte encoding.

> Good luck with your scheme that can't handle languages written by billions of people!
So let's see: first you say that my scheme has to be variable length because I am using two bytes to handle these languages, then you claim I don't handle these languages.  This kind of blatant contradiction within two posts can only be called... trolling!